POS Tagging-probability Weighted Method for Matching the Internet Recipe Ingredients with Food Composition Data

Tome Eftimov 1, 2 and Barbara Koroušić Seljak

1 Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia

2 Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia

Keywords:

Part of Speech Tagging, Probability Model, Information Retrieval, Food Composition Databases, Ingredient

Matching.

Abstract:

In this paper, we present a new method that can be used for matching recipe ingredients extracted from the

Internet to nutritional data from food composition databases (FCDBs). The method uses part of speech tagging

(POS tagging) to capture the information from the names of the ingredients and the names of the food analyses

from FCDBs. A probability weighted model is then presented, which uses the information from
POS tagging to assign a weight to each match; the match with the highest weight is taken as the most
relevant one and can be used for further analyses. We evaluated our method using a collection of 721 lunch

recipes, from which we extracted 1,615 different ingredients and the result showed that our method can match

91.82% of the ingredients with the FCDB.

1 INTRODUCTION

There is strong evidence that a healthier diet is required to prevent diet-related chronic diseases and to increase the quality of life. However, to assess the quality of a diet, advanced approaches still need to be developed. A lot of information about healthier diets and nutrition principles is available in different forms: in books, magazines, television programs and on the Internet. On the other hand, people lack knowledge of all the nutrition principles, and they also lack the time and motivation to explore the resources where this kind of information is presented.

A lot of free data sources that contain recipe

databases exist and can be used for nutritional assis-

tance or recommendation systems. For this purpose,

it is important to have accurate nutritional data for

recipes, but most of the recipes have no such data

available or have data of suspect quality. Most importantly, people need to understand the nutritional value of individual meals and how these meals meet their nutritional needs with respect to their lifestyle.

In the past, different technological solutions were presented for assessing and improving diets. They used the information from the

recipes and food composition data. Food composition

databases (FCDBs) provide detailed information on

nutritional composition of foods, usually from a par-

ticular country. They contain information for a huge

number of components including: energy, macronu-

trients and their components, minerals and vitamins.

Food composition data is used for planning diets with

speciﬁc nutrient composition in clinical practice and

for assessment of the nutritional value of the food

consumed by individuals and populations (H. Green-

ﬁeld and D. Southgate, 2003).

Using all this information, it is useful to build a system that automatically calculates the nutritional value of a recipe, so that the recipe can then be used in planning the diet for individuals or populations. The main problem is that the information on the Internet is incomplete, while FCDBs lack recipes. As chemical analysis is costly, we need to find a way of calculating nutritional values for recipes from the Internet from the food composition data of the recipe ingredients. One of the key problems is the lack of structure in the names of the ingredients used in the recipes and in the names of the food analyses in the FCDB.

To calculate the nutritional value of a recipe, we need to find, for each of its ingredients, the perfect or the most relevant match in
the FCDB. For example, we can find “chicken breast,

raw” in a recipe, and several food analyses in the


Eftimov, T. and Seljak, B..

POS Tagging-probability Weighted Method for Matching the Internet Recipe Ingredients with Food Composition Data.

In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 1: KDIR, pages 330-336

ISBN: 978-989-758-158-8

FCDB, such as “chicken breast, cooked, salted”, “raw chicken breast” or other names that contain “chicken breast”. The accuracy of the nutritional value calculated for the recipe from the food composition data in the FCDBs depends on how well this problem is solved.

In this paper, we present an information retrieval

method, which is a probability weighted method that

enables us to perform a search and ﬁnd the most rel-

evant match for each recipe ingredient in the FCDB.

After having the most relevant match, we can use the

data from the FCDB to calculate the nutritional value

of the recipe.

In Section 2, we review related work. Section 3 describes the problem in depth. In Section 4, we present our solution, and in Section 5 the evaluation and results on real data are presented. Section 6 discusses the results, the benefits of our method and a comparison with other approaches presented in the literature. In Section 7, we conclude the paper by summarising the proposed method and our plans for future work.

2 RELATED WORK

The task of matching text concepts to an entry in a

knowledge base is a very popular one and has been

addressed in many ways. In 1988, term-weighting

approaches in automatic text retrieval systems were

presented, which could be designed on a comparison

between the stored text and users’ information queries

(Salton and Buckley, 1988). Another approach is POS

tagging, which means automatic assignment of de-

scriptors, or tags, to input tokens, where the tags are

the appropriate grammatical descriptors to words in

text. POS taggers can be used for several purposes,

and one of them is for text indexing and retrieval,

which can beneﬁt from POS information (Schmid,

1994; Tian and Lo, 2015). The task of matching con-

cepts in text has progressed a lot and there are dif-

ferent methods for automatic text retrieval systems.

In (Mihalcea and Csomai, 2007), an automatic text

annotation system was presented, which combines

keyword extraction and word-sense disambiguation

to identify relevant links to Wikipedia pages. The

system is known as ”Wikify”, and involves automati-

cally extracting the most important words and phrases

in the document (keywords) and identifying for each

keyword the appropriate link to a Wikipedia article.

Another approach is entity linking (EL), the task of linking name mentions in text with their referent entities in a knowledge base. One method for EL is presented in (Han et al., 2011): a graph-based collective EL method that can model and exploit the global interdependence between different EL decisions. There are also approaches to automatic ontology-based knowledge extraction; for example, the Artequakt project presented in (Alani et al., 2003) links a knowledge extraction tool with an ontology to achieve continuous knowledge support and guide information extraction.

Technological solutions have been proposed to

improve recipe recommendations. The idea is to de-

sign systems that are able to provide meal recom-

mendations for individuals based on their nutritional

needs and lifestyle. One approach, presented in (J. Freyne and S. Berkovsky, 2010), gives recommendations of healthy recipes. In order to give recommendations, we need to calculate the nutritional content of a recipe, which can be done using chemical analyses of final cooked dishes (Y. Picó, 2012) or with a system that automatically calculates the nutritional content of a recipe. In (M. Muller et al., 2012), the authors presented a system that automatically calculates the nutritional content of recipes sourced from the Internet. To match the ingredient to an appropriate

entry from the ofﬁcial nutritional table of the Ger-

man ministry for nutrition, agriculture and consumer

protection, the ingredient name is preprocessed by

removing the punctuations and converting to lower

case. Because the database search can return numerous results and only a single item can be chosen, they presented a system that ranks the list, and the top-ranked item is used as the appropriate match. To learn the ranking function, they treated the problem as a two-class classification task where the negative class is poor choices and the positive class is the correct choice. To obtain the data, they asked 6 researchers to manually evaluate lists of ingredients for ambiguous ingredient descriptions. To learn from the data, they extracted a number of features from the original ingredient name and the selected ingredients from the database. Finally, they fitted a penalised regression model whose output lies between -1 and 1, indicating the expected relevance of the ingredient to the name. Using this method, 91.1% of the recipes they used were matched completely and less than 1% had more than one unmatched ingredient.

3 PROBLEM DEFINITION

The problem we want to address is to ﬁnd the most

relevant match for the ingredients used in the recipes

using their name and the names of the food analyses

that are presented in the FCDBs.


There are several issues that we need to consider when solving this problem. As noted above, many online data sources provide recipes. Some of them allow users to log into the system and submit their own recipes for everyone to use. The first issue is that people use natural language, the main vehicle through which humans transmit and exchange information, and write the names of the ingredients in an unstructured form. For example, in different recipes we can find “salt, iodised”, “iodised salt” or “salt-iodised”. The names therefore lack a structured representation, because different people express the same ingredient in different ways. Another issue is ingredient synonymy: synonyms need to be matched to the single term that is used in the database. Some ingredients can also have multiple matches in the food composition database. For example, if we are looking for “salt”, we can find “salt”, “salt, table”, “salt, iodised” and many more. These matches may have very different nutritional properties, so we need to choose the most relevant one. Finally, a very important factor in calculating the nutritional properties is the preparation method of the ingredient. A cooked ingredient differs from a raw one, for example “smoked ham” and “non-smoked ham”, or “chicken breast, raw” and “chicken breast, cooked”, because they have different nutritional properties.

All of these issues need to be considered and solved when we want to find the relevant ingredient matches, which can then be used to calculate the nutritional value of the recipes.

4 POS TAGGING-PROBABILITY WEIGHTED METHOD

One method for ingredient matching is presented in (M. Muller et al., 2012). The method treats the problem as a two-class classification problem, which required evaluation by nutrition experts, after which a linear regression model is used to match the ingredients.

To solve the problem of matching ingredients with food composition data, we looked at the existing ontologies in this domain (LIRMM, 2015; Ontology, 2015). We found that they focus on food recipes, ingredients and nutrients, but information about the structure of the ingredient name is still missing. An ingredient name is represented by a noun, which can be further described by the form of the ingredient (adjective) and the cooking process (verb); both are very important and need to be considered when we want to calculate the nutritional value. Given the importance of the nouns, adjectives and verbs in the ingredient name, Part of Speech tagging (POS tagging) is one technique that can be used for matching ingredients with food composition data (A. Voutilainen, 2003).

Our method is a probabilistic method that assigns a weight to each match and considers the match with the highest weight to be the most relevant one. First, for each ingredient from the recipe, we use POS tagging, also called grammatical tagging or word-category disambiguation, to identify the nouns, verbs and adjectives. The nouns carry most of the information in the name, the adjectives describe the ingredient in a more specific form, for example “frozen” or “fresh”, and the verbs are in most cases related to the preparation method, for example “cooked” or “drained”. Then, we search the FCDB for the ingredient with a simple SQL search using the nouns from the ingredient name in the recipe. For each name returned by the SQL search, we also perform POS tagging to identify the nouns, verbs and adjectives. Next, we define an event (X), which is the similarity between the ingredient name from the recipe and each of the food names returned by the SQL search of the FCDB. Finally, the weight we assign to each matching pair is the probability of this event.
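The noun-based candidate search described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the query itself, so the table name `foods`, the column `name`, the use of `LIKE`, and the choice to require every noun (rather than any noun) are all hypothetical, and an in-memory SQLite database stands in for the FCDB.

```python
import sqlite3

def search_fcdb_by_nouns(conn, nouns):
    """Return food names that contain every given noun (case-insensitive).

    The table and column names are assumptions made for this sketch.
    """
    if not nouns:
        return []  # no noun was tagged, so there is nothing to search for
    where = " AND ".join("lower(name) LIKE ?" for _ in nouns)
    params = [f"%{noun.lower()}%" for noun in nouns]
    rows = conn.execute(
        f"SELECT name FROM foods WHERE {where} ORDER BY name", params
    )
    return [row[0] for row in rows]

# Tiny stand-in for the FCDB, using food names mentioned in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foods (name TEXT)")
conn.executemany(
    "INSERT INTO foods VALUES (?)",
    [
        ("chicken breast, cooked, salted",),
        ("raw chicken breast",),
        ("chicken leg, raw",),
        ("salt, iodised",),
    ],
)

candidates = search_fcdb_by_nouns(conn, ["chicken", "breast"])
print(candidates)
```

Each returned name would then be POS tagged and weighted against the recipe ingredient; an empty noun list yields no candidates at all.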

Let $D_1$ be the name of a single ingredient from the recipe, and $D_2$ the single food name which is a result from the SQL search of the FCDB. Let us define

$$N_i = \{\text{nouns extracted from } D_i\}, \quad V_i = \{\text{verbs extracted from } D_i\}, \quad A_i = \{\text{adjectives extracted from } D_i\}, \qquad (1)$$

where $i = 1, 2$.

To find the probability of the similarity between the ingredient name from the recipe and the food name from the FCDB, we represent the event as a product of three other events,

$$X = N \cdot V \cdot A, \qquad (2)$$

where $N$ is the similarity between the nouns in $N_1$ and $N_2$, $V$ is the similarity between the verbs in $V_1$ and $V_2$, and $A$ is the similarity between the adjectives in $A_1$ and $A_2$. Because all these events are independent, the probability of the event $X$ can be found as

$$P(X) = P(N) \cdot P(V) \cdot P(A). \qquad (3)$$

Now, we need to define the probabilities of each of the events, N, V and A. Because we want to find the similarity between two sets, it is natural to use the Jaccard index, J, which is used in statistics for comparing

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval


the similarity and diversity of sample sets (R. Real and J. M. Vargas, 1996). For this purpose, we use a modification of the Jaccard index in combination with the Laplace probability estimate. We do this because in some ingredient descriptions the additional information provided by the adjectives or verbs can be missing, yet a relevant match can still be found in the FCDB, so we need non-zero probabilities. The probabilities of the events can be found as

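$$P(N) = \frac{|N_1 \cap N_2| + 1}{|N_1 \cup N_2| + 2} = \frac{J(N_1, N_2) + \frac{1}{|N_1 \cup N_2|}}{1 + \frac{2}{|N_1 \cup N_2|}},$$

$$P(V) = \frac{|V_1 \cap V_2| + 1}{|V_1 \cup V_2| + 2} = \frac{J(V_1, V_2) + \frac{1}{|V_1 \cup V_2|}}{1 + \frac{2}{|V_1 \cup V_2|}},$$

$$P(A) = \frac{|A_1 \cap A_2| + 1}{|A_1 \cup A_2| + 2} = \frac{J(A_1, A_2) + \frac{1}{|A_1 \cup A_2|}}{1 + \frac{2}{|A_1 \cup A_2|}}. \qquad (4)$$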

We obtain the probability of the event X by substituting the relations (4) into the relation (3). This is the weight we assign to each matching pair, and the pair with the highest weight is the most relevant match found.
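As a worked illustration of relations (3) and (4), the sketch below computes the weight for the “chicken breast, raw” example from the introduction. The noun, verb and adjective sets are assigned by hand here (an assumption; a real POS tagger may tag these words differently), and the function names are hypothetical. Exact fractions are used so that the two forms of relation (4) can be compared directly.

```python
from fractions import Fraction

def laplace_similarity(a, b):
    """First form of relation (4): (|a ∩ b| + 1) / (|a ∪ b| + 2)."""
    return Fraction(len(a & b) + 1, len(a | b) + 2)

def jaccard_form(a, b):
    """Second form of relation (4); only defined for a non-empty union."""
    union = len(a | b)
    jaccard = Fraction(len(a & b), union)
    return (jaccard + Fraction(1, union)) / (1 + Fraction(2, union))

def match_weight(ingredient, food):
    """Relation (3): product of the noun, verb and adjective similarities."""
    weight = Fraction(1)
    for pos in ("nouns", "verbs", "adjectives"):
        weight *= laplace_similarity(ingredient[pos], food[pos])
    return weight

# Hand-tagged sets for "chicken breast, raw" and two FCDB candidates.
recipe = {"nouns": {"chicken", "breast"}, "verbs": set(), "adjectives": {"raw"}}
food_a = {"nouns": {"chicken", "breast"}, "verbs": {"cooked", "salted"}, "adjectives": set()}
food_b = {"nouns": {"chicken", "breast"}, "verbs": set(), "adjectives": {"raw"}}

weight_a = match_weight(recipe, food_a)  # 3/4 * 1/4 * 1/3 = 1/16
weight_b = match_weight(recipe, food_b)  # 3/4 * 1/2 * 2/3 = 1/4
print(weight_a, weight_b)
```

With these sets, “raw chicken breast” receives the higher weight. Note that when both sets are empty, the first form still gives 1/2, while the second form divides by the size of an empty union, so the first form is the safer one to implement.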

Pre-processing is also an important aspect of ingredient matching. First, each ingredient name, regardless of its source, is converted to lower-case letters and its punctuation is removed. For the nouns, we use lemmatisation to avoid the difference between the singular and the plural form of the noun (J. Plisson et al., 2004). Because some names contain “without skin” and others “skinless”, or “with salt” and “salted”, we mapped all such phrases using manually created rules that are specific to this domain. In Figure 1, the architecture of the proposed method is presented.
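A minimal sketch of these pre-processing steps is given below. It covers lower-casing, punctuation removal and phrase mapping only; lemmatisation is left out because it needs a lemmatiser, and the rule table contains just the two example phrases from the text (the authors' full rule set is not given, so `PHRASE_RULES` and `preprocess` are hypothetical names).

```python
import re
import string

# Hypothetical rule table holding only the two mappings named in the text.
PHRASE_RULES = {"without skin": "skinless", "with salt": "salted"}

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def preprocess(name):
    """Lower-case a name, strip punctuation and apply the phrase rules."""
    cleaned = name.lower().translate(_PUNCT_TO_SPACE)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    for phrase, replacement in PHRASE_RULES.items():
        cleaned = re.sub(rf"\b{re.escape(phrase)}\b", replacement, cleaned)
    return cleaned

print(preprocess("Chicken breast, WITHOUT skin"))
print(preprocess("salt-iodised"))
```

After this step, the variants “salt, iodised”, “iodised salt” and “salt-iodised” all reduce to the same set of tokens, which is what the set-based similarity needs.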

5 EVALUATION AND RESULTS

We evaluated the method with two experiments. The first experiment is not a proper evaluation of the method but an illustration of the problem that we are trying to solve, while the second one is the matching between the ingredients extracted from the Internet and the food composition data.

The data we used for evaluation is a collection of 721 recipes written in English, from which we extracted 1,615 different names of ingredients. We collected it from a free recipe web site (AllRecipes, 2015) using an HTML parser. For each recipe, we considered only the names of the ingredients, while the quantity-unit pair associated with each ingredient was ignored, as our goal was to find the ingredient matches.

Algorithm 1: POS tagging-probability weighted method.

for each ingredient name in recipe do
    set matching_pairs = null
    set counter = 1
    pre-process the ingredient name
    extract the sets of nouns N_1, verbs V_1 and adjectives A_1 using POS tagging
    query the FCDB using the set of nouns N_1
    for each food name in the result of searching the FCDB do
        pre-process the food name
        extract the sets of nouns N_2, verbs V_2 and adjectives A_2 using POS tagging
        calculate P(X) = P(N) P(V) P(A)
        matching_pairs[counter] = P(X)
        counter = counter + 1
    end for
    return the most relevant match, max(matching_pairs)
end for
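The loop of Algorithm 1 can be sketched end to end as follows. This is a toy illustration, not the authors' implementation: a small hand-written lexicon (`LEXICON`, hypothetical) stands in for the POS tagger, a Python list stands in for the FCDB, and a noun-subset test stands in for the SQL search.

```python
# Toy POS lexicon standing in for a real tagger (an assumption of this sketch).
LEXICON = {
    "chicken": "noun", "breast": "noun", "salt": "noun", "ham": "noun",
    "raw": "adj", "fresh": "adj", "iodised": "adj",
    "cooked": "verb", "salted": "verb", "smoked": "verb",
}

def tag_sets(name):
    """Split a name into noun, verb and adjective sets using the toy lexicon."""
    sets = {"noun": set(), "verb": set(), "adj": set()}
    for token in name.lower().replace(",", " ").split():
        pos = LEXICON.get(token)
        if pos is not None:
            sets[pos].add(token)
    return sets

def smoothed(a, b):
    """Laplace-smoothed set similarity: (|a ∩ b| + 1) / (|a ∪ b| + 2)."""
    return (len(a & b) + 1) / (len(a | b) + 2)

def best_match(ingredient, fcdb_names):
    """Return the highest-weighted FCDB name, or None if no search is possible."""
    ing = tag_sets(ingredient)
    if not ing["noun"]:
        return None  # no noun tagged, so the FCDB cannot be queried
    scored = []
    for food in fcdb_names:
        tags = tag_sets(food)
        if not ing["noun"] <= tags["noun"]:  # stand-in for the SQL noun search
            continue
        weight = (smoothed(ing["noun"], tags["noun"])
                  * smoothed(ing["verb"], tags["verb"])
                  * smoothed(ing["adj"], tags["adj"]))
        scored.append((weight, food))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]

fcdb = ["chicken breast, cooked, salted", "raw chicken breast", "salt, iodised"]
print(best_match("chicken breast, raw", fcdb))
print(best_match("italian seasoning", fcdb))
```

The second call returns `None` because the toy lexicon tags no noun in “italian seasoning”, which mirrors the kind of tagging failure that prevents the FCDB search from being run.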

We used the EuroFIR FCDB as our database. EuroFIR AISBL is an international, non-profit association under Belgian law (EuroFIR, 2015). Its purpose is to develop, publish and exploit food composition information and to promote international standards to improve data quality, storage and access. EuroFIR introduced a data model for food composition data management and data interchange. The EuroFIR FCDB contains analyses from several European countries.

We extracted 44,033 English names of food analyses that exist in the EuroFIR database. Before starting the evaluation, we preprocessed the ingredient names from the recipes and the food names from the EuroFIR FCDB. First, we removed the punctuation from them, and then we converted them to lower-case letters.

5.1 Experiment 1

The first experiment is the ingredient matching for one recipe; it is not a proper evaluation of the method, but an illustration of the problem that we are trying to solve. We used the recipe for “World’s Best Lasagna”, extracted from (AllRecipes, 2015). The result of the ingredient matching is presented in Figure 2. As Figure 2 shows, for this recipe, which contains 20 ingredients, we were unable to find a match for only


Figure 1: Architecture of the method.

Figure 2: Ingredients matching for “World’s Best Lasagna”.

one ingredient, “Italian seasoning”. The problem occurred because “seasoning” is not annotated as a noun here, so we cannot continue with the search in the FCDB. We used a POS tagging implementation available in the R programming language. This kind of problem can be solved using another implementation of POS tagging or some post-processing methods. For the other ingredients, we found 18 perfect matches, and for one ingredient, “lasagna noodles”, we found the most similar match, which is the closest available one, because “lasagna noodles” is not present in the FCDB.

We should mention that this experiment was carried out without resolving synonyms and without applying the manually defined mapping rules.

5.2 Experiment 2

Using the 721 lunch recipes, we extracted 1,615 different names of the ingredients that appear in these recipes. In Figure 3, the word cloud of the names of the ingredients found in the recipes is presented. For each ingredient name, the probability weighted model returned a match in the FCDB, and we manually assigned each matching pair to one of four categories used for the evaluation: perfect match, very similar match, similar match, and incorrect match. A perfect match has the same meaning as the ingredient name. A very similar match is the most similar name and is strongly related to the ingredient name. A similar match is weakly related to the ingredient name. An incorrect match is wrong and cannot be used for further analyses. The last two categories arise from ingredients that are typical for particular cultures and from the limited coverage of the FCDB.

In Figure 4, the pie chart of matching the Inter-

net recipe ingredients with food composition data is

presented.

Using the probability weighted model for matching the ingredients, we found 1,210 perfect matches (74.92%), 273 very similar matches (16.90%), 78 similar matches (4.84%) and 54 incorrect matches (3.34%). Let us use the pair ($D_{ingredient}$; $D_{FCDB}$) to describe a match we found, where $D_{ingredient}$ is the ingredient name from the recipe and $D_{FCDB}$ is the name from the FCDB. For example, some perfect matches are (black olives; olives black) and (fresh ginger; ginger, fresh); very similar matches are (fresh cilantro; spices, coriander seed (cilantro)) and (uncooked egg noodles; egg noodles); a similar match is (dry penne pasta; pasta, without egg, dry); and incorrect matches are (angel hair pasta; cake, angelfood,


Figure 3: Word cloud of the ingredients (the cloud shows ingredient names such as salt, garlic, onion, ground black pepper and olive oil).

Figure 4: Pie chart of ingredients matching.

commercially prepared) and (dried onion flakes; cereal flakes with dried fruits, type Muesli). Together, the perfect and very similar matches account for 91.82% of the ingredients. They can be used to calculate the nutritional properties of a recipe.

The experiment was carried out on the preprocessed data.
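The category shares can be recomputed from the reported counts with a few lines of arithmetic. The sketch below uses only the counts given above; the variable names are hypothetical.

```python
# Match counts per category, as reported for Experiment 2.
counts = {"perfect": 1210, "very similar": 273, "similar": 78, "incorrect": 54}

total = sum(counts.values())

# Share of each category in percent, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Combined share of the matches usable for nutritional calculations.
usable = round(100 * (counts["perfect"] + counts["very similar"]) / total, 2)

print(total, shares, usable)
```

The counts add up to the 1,615 extracted ingredient names. Recomputed this way, two figures differ from the reported ones in the last decimal: the similar matches come to 4.83% and the combined perfect and very similar share to 91.83%. One possible explanation, offered here only as an assumption, is that the reported percentages were adjusted to sum to exactly 100 and the combined figure was obtained by adding the rounded shares.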

6 DISCUSSION

Our method has several benefits compared with the method used to find the most relevant match in (M. Muller et al., 2012). There, the problem was treated as a two-class classification problem, and to obtain labeled data, 6 human assessors were asked to manually evaluate lists of candidate ingredients for ambiguous ingredient names. This process produced 1,515 positively classified instances, to which the same number of negatively classified instances was added. Instead of manually collecting the lists of positively classified ingredients, our method can be used as a pre-processing step: for each ingredient, it returns the most relevant ingredient, or a list of relevant ingredients if several matching pairs share the maximum weight. This data can then be used for building models, starting with feature selection and continuing with two-class or multi-class classification. Our method therefore complements the method proposed in (M. Muller et al., 2012), since it can produce the list of candidate ingredients for each ingredient without manual evaluation by human assessors.

Another benefit is that our method returns the most similar ingredient that exists in the FCDB and does not require labeled data for supervised learning. The poor matches that appeared are a consequence of ingredients that are typical of a particular culture or of chemical analyses that are missing from the FCDB.

In addition, many websites allow users to find recipes by the ingredients they have (MyFridgeFood, 2015; RecipeMatcher, 2015; Supercook, 2015). However, on these websites users can only select from a predefined list of ingredients, and in most cases the list contains only the basic name of each ingredient, without the possibility of using additional information (the form of the ingredient or the cooking process). The results are therefore more general, whereas using our method to search the recipe database would return only the most specific recipes.
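The selection step discussed in this section, where the method returns a single ingredient or a list of ingredients that share the maximum weight, can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the function name best_matches, the tolerance tol, and the example weights are hypothetical, and the weights are assumed to have been computed beforehand by the probability weighted model.

```python
from typing import Dict, List


def best_matches(weights: Dict[str, float], tol: float = 1e-9) -> List[str]:
    """Return every candidate FCDB food name whose weight equals the maximum.

    `weights` maps candidate FCDB food names to match weights that are
    assumed to be precomputed. Ties are kept, so a list with several names
    is returned when several candidates share the maximum weight.
    """
    if not weights:
        return []
    top = max(weights.values())
    # Keep all candidates within `tol` of the maximum, in a stable order.
    return sorted(name for name, w in weights.items() if top - w <= tol)


# Hypothetical weights for one recipe ingredient (illustrative values only).
candidates = {
    "onion, raw": 0.75,
    "onion, dried flakes": 0.75,
    "leek, raw": 0.25,
}
print(best_matches(candidates))
```

Returning a list rather than a single name keeps ties visible, so that they can be resolved later, for example by a classifier trained on this data.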

We are also working on food image recognition for identifying the ingredients in recipes, which is a more realistic and challenging task, but that approach is beyond the scope of this paper.

7 CONCLUSION

We presented a method that can be used for matching recipe ingredients with food composition data. Using this method, we weight each match between an ingredient name from a recipe and the food analysis names from FCDBs, and the match with the highest weight is used as the most relevant one. With this information, we will be able to calculate the nutritional value of each recipe, because for each ingredient used in the recipe we can find its nutritional properties in an FCDB. The weighted ingredient matches can also serve as input to other models obtained using data mining approaches. Finally, the method can be used to explore what is missing in the FCDBs, and this information can be passed to chemical laboratories so that they can perform the corresponding food composition analyses.
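As an illustration of the recipe-level calculation, the sketch below sums the nutrient contributions of the matched ingredients and collects the ingredients for which no FCDB entry is available. It assumes, hypothetically, that every ingredient quantity is already expressed in grams and that the FCDB reports nutrient values per 100 g. The function name, the data layout, and all numbers are illustrative placeholders and are not taken from a real FCDB.

```python
from typing import Dict, List, Tuple


def recipe_nutrients(
    ingredients: List[Tuple[str, float]],
    fcdb_per_100g: Dict[str, Dict[str, float]],
) -> Tuple[Dict[str, float], List[str]]:
    """Sum nutrient values over the matched ingredients of one recipe.

    `ingredients` holds (matched FCDB food name, quantity in grams) pairs.
    `fcdb_per_100g` maps a food name to its nutrient values per 100 g.
    Returns the nutrient totals and the names missing from the FCDB.
    """
    totals: Dict[str, float] = {}
    unmatched: List[str] = []
    for food, grams in ingredients:
        profile = fcdb_per_100g.get(food)
        if profile is None:
            # No FCDB entry: record the gap instead of guessing values.
            unmatched.append(food)
            continue
        for nutrient, per_100g in profile.items():
            totals[nutrient] = totals.get(nutrient, 0.0) + per_100g * grams / 100.0
    return totals, unmatched


# Placeholder data (not real FCDB values).
fcdb = {
    "rice": {"energy_kcal": 100.0, "protein_g": 2.0},
    "milk": {"energy_kcal": 50.0, "protein_g": 4.0},
}
recipe = [("rice", 200.0), ("milk", 50.0), ("surimi", 100.0)]
totals, unmatched = recipe_nutrients(recipe, fcdb)
print(totals, unmatched)
```

The list of unmatched names is the kind of information that could be forwarded to chemical laboratories as candidates for new analyses.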


We plan to implement this method in a system for computing the nutritional value of recipes, and to assess the accuracy of the computed values by comparing them with values obtained by chemical analyses of dishes prepared using the same recipes.
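One simple way to carry out such a comparison is the relative error per nutrient, sketched below. The choice of relative error as the accuracy measure is an assumption made for illustration only, since no measure has been fixed yet; the function name and the numbers are hypothetical.

```python
from typing import Dict


def relative_errors(
    computed: Dict[str, float], measured: Dict[str, float]
) -> Dict[str, float]:
    """Relative error |computed - measured| / |measured| per shared nutrient.

    Nutrients whose measured value is zero are skipped, because the
    relative error is undefined for them.
    """
    errors: Dict[str, float] = {}
    for nutrient, reference in measured.items():
        if nutrient in computed and reference != 0:
            errors[nutrient] = abs(computed[nutrient] - reference) / abs(reference)
    return errors


# Placeholder values: computed from the FCDB vs. measured in a laboratory.
computed = {"energy_kcal": 225.0, "protein_g": 6.0}
measured = {"energy_kcal": 250.0, "protein_g": 6.0, "fat_g": 0.0}
print(relative_errors(computed, measured))
```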

ACKNOWLEDGEMENTS

This work was supported by the project ISO-FOOD, which received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 621329 (2014-2019).

REFERENCES

Alani, H., Kim, S., Millard, D. E., Weal, M. J., Hall, W., Lewis, P. H., and Shadbolt, N. R. (2003). Automatic ontology-based knowledge extraction from web documents. Intelligent Systems, IEEE, 18(1):14–21.

AllRecipes. Allrecipes website. http://allrecipes.com/. Accessed: 2015-05-04.

BBC-Food Ontology. BBC - Food Ontology. http://www.bbc.co.uk/ontologies/fo/. Accessed: 2015-05-04.

EuroFIR. EuroFIR website. http://www.eurofir.org/. Accessed: 2015-05-04.

Freyne, J. and Berkovsky, S. (2010). Intelligent food planning: personalized recipe recommendation. In Proceedings of the 15th international conference on Intelligent user interfaces, pages 321–324. ACM.

Greenfield, H. and Southgate, D. (2003). Food composition data: production, management, and use. Food & Agriculture Org.

Han, X., Sun, L., and Zhao, J. (2011). Collective entity linking in web text: a graph-based method. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 765–774. ACM.

LIRMM. LIRMM. http://data.lirmm.fr/ontologies/food/. Accessed: 2015-05-04.

Mihalcea, R. and Csomai, A. (2007). Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 233–242. ACM.

Muller, M., Harvey, M., Elsweiler, D., and Mika, S. (2012). Ingredient matching to determine the nutritional properties of internet-sourced recipes. In Pervasive Computing Technologies for Healthcare (PervasiveHealth), 2012 6th International Conference on, pages 73–80. IEEE.

MyFridgeFood. MyFridgeFood website. http://myfridgefood.com/. Accessed: 2015-08-20.

Picó, Y. (2012). Chemical analysis of food: Techniques and applications. Academic Press.

Plisson, J., Lavrac, N., and Mladenic, D. (2004). A rule based approach to word lemmatization. Proceedings of IS-2004, pages 83–86.

Real, R. and Vargas, J. M. (1996). The probabilistic basis of Jaccard’s index of similarity. Systematic biology, pages 380–385.

RecipeMatcher. RecipeMatcher website. http://www.recipematcher.com/. Accessed: 2015-08-20.

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing, volume 12, pages 44–49. Citeseer.

Supercook. Supercook website. http://www.supercook.com/. Accessed: 2015-08-20.

Tian, Y. and Lo, D. (2015). A comparative study on the effectiveness of part-of-speech tagging techniques on bug reports. In Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 570–574. IEEE.

Voutilainen, A. (2003). Part-of-speech tagging. The Oxford handbook of computational linguistics, pages 219–232.

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval
