performing food chunk extraction from each individual text, while another person cross-referenced those manually obtained chunks with the ones obtained from FoodIE. Using this method, counts of true positives (TPs), false negatives (FNs) and false positives (FPs) were obtained, while the true negative category was deemed not applicable to the nature of the problem and its evaluation. Additionally, it was decided that a “partial (inconclusive)” category was necessary, as some of the food chunks were incomplete but nevertheless caught, and thus still carried significant information. This category encompasses all extracted food chunks that were caught but missed at least one token. An example would be “bell pepper”, where FoodIE would only catch “pepper”.
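The bookkeeping behind this four-way classification can be sketched as follows. The token-overlap test is an illustrative assumption (the paper's comparison was carried out by hand), and the chunk lists are invented for the example:

```python
def classify_chunks(gold, predicted):
    """Compare manually annotated food chunks with extracted ones.

    A prediction whose tokens exactly match a gold chunk counts as a TP;
    one that covers only some tokens of a gold chunk is "partial";
    unmatched gold chunks become FNs and unmatched predictions FPs.
    (Illustrative token-overlap matching, not the authors' actual tooling.)
    """
    tp, partial, fp = [], [], []
    matched_gold = set()
    for pred in predicted:
        pred_tokens = set(pred.lower().split())
        hit = None
        for i, g in enumerate(gold):
            g_tokens = set(g.lower().split())
            if i not in matched_gold and pred_tokens & g_tokens:
                hit = (i, g_tokens)
                break
        if hit is None:
            fp.append(pred)
        else:
            matched_gold.add(hit[0])
            if pred_tokens == hit[1]:
                tp.append(pred)
            else:
                partial.append(pred)  # caught, but missed at least one token
    fn = [g for i, g in enumerate(gold) if i not in matched_gold]
    return tp, partial, fn, fp

# The paper's example: FoodIE catches only "pepper" for "bell pepper",
# which lands in the partial (inconclusive) category rather than TP or FN.
tp, partial, fn, fp = classify_chunks(
    gold=["bell pepper", "olive oil"],
    predicted=["pepper", "olive oil"])
# partial == ["pepper"], tp == ["olive oil"]
```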
We would have liked to compare the results with the model presented in (Chen, 2017), but we were unable to obtain the model and corpus upon request. We provide a small example comparing FoodIE with drNER (Eftimov et al., 2017), in order to show that they extract food entities at different levels of granularity, so a fair comparison cannot be made.
While the evaluation was being carried out, we kept track of all the false negative instances and constructed a resource set that will improve the performance of FoodIE in future implementations.
4.1 Data
Firstly, a total of 200 recipes were processed and eval-
uated. The original 100 recipes, which were analyzed
and upon which the rule engine was built, were taken
into consideration, as well as 100 new recipes which
had not been analyzed beforehand. The recipes were
taken from two separate user-based sites, Allrecipes
(https://www.allrecipes.com/) and MyRecipes (https:
//www.myrecipes.com/), where there is no standard-
ized format for the recipe description. These sources were chosen to ensure that the linguistic constructs used in each written piece varied and followed no fixed pattern. The texts were chosen from a variety of topics, so as to provide further diversity.
Secondly, we selected 1,000 independently ob-
tained recipes from Allrecipes (Groves, 2013), which
is the largest food-focused social network, where everyone plays a part in helping cooks discover and share home cooking. We selected Allrecipes because there is no limitation as to who can post recipes, so there is variability in how users express themselves.
The recipes were selected from five recipe categories:
Appetizers and snacks, Breakfast and Lunch, Dessert,
Dinner, and Drinks. From each category, 200 recipes were included in the evaluation set.
The evaluation datasets, including the obtained
results, are publicly available at http://cs.ijs.si/
repository/FoodIE/FoodIE datasets.zip.
4.2 Results and Discussion
The results for TPs, FPs, and FNs of evaluating FoodIE on the dataset of 200 recipes are presented in Table 5. The group “Partial (Inconclusive)” was left out of these evaluations, as some would argue that these instances should be counted as TPs, while others would include them among the FNs. Some examples included here are “empty passion fruit juice”, “cinnamon” and “soda”, where the actual food entity chunks would be “passion fruit juice”, “cinnamon sticks” and “club soda”, respectively. These errors mostly stem from the dual nature of some words: a word may function both as a noun and a verb, or as an adjective and a verb, and for such words the tagger sometimes classifies the tokens incorrectly. In these examples, “empty” is tagged as an adjective, whereas in context it is in fact a verb. The same explanation holds for the other two examples. For these reasons, this category was simply omitted when the evaluation metrics were calculated. Moreover, even if these instances are grouped with either the TPs or the FNs, the results are not significantly affected.
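The claim that omitting the partial category is harmless can be checked by computing precision, recall and F1 under both bordering assumptions. The counts below are hypothetical placeholders for illustration, not the figures from Table 5:

```python
def prf(tp, fp, fn):
    """Standard precision, recall and F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only (not the paper's results):
# partials are few relative to clear-cut TPs, as in the evaluation.
tp, fp, fn, partial = 950, 30, 40, 20

as_tp = prf(tp + partial, fp, fn)    # partials counted as true positives
as_fn = prf(tp, fp, fn + partial)    # partials counted as false negatives
omitted = prf(tp, fp, fn)            # partials left out, as in the paper
# The omitted-category F1 lies between the two extremes, and the
# extremes themselves differ only marginally when partials are rare.
```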
Regarding the FN category (type II error), there
were some specific patterns that produced the most
instances. One very simple type of FN instance is where the author of the text refers to a specific food by its brand name, such as “allspice” or “Jägermeister”. These are difficult to catch if no additional information follows the brand name. However, if the user includes the general classification of the branded food, FoodIE will catch it; an example would be simply writing “Jägermeister liqueur”. Another instance of a type II
error is when the POS taggers give incorrect tags, as
was the case with some “Partial (Inconclusive)” in-
stances. An example of this is when the tagger misses
chunks such as “mint leaves” and “sweet glazes”,
where both “leaves” and “glazes” are incorrectly clas-
sified as verbs when in this context they should be
tagged as nouns. Another example would be when
the semantic tagger incorrectly classifies some token
within the given context, such as “date” being classified as a noun meaning a day of the year, as opposed to the fruit. Furthermore, there exist FNs
which are simply due to the rarity of the food, such
as “kefir”, “couscous” or “stevia”, the last one being
of immense importance to people suffering from dia-
betes, as it is a safe sugar substitute. Another category
of type II errors arises because some foods are often referred to by their colloquial name, such as “half-
ICPRAM 2019 - 8th International Conference on Pattern Recognition Applications and Methods