Several issues must be considered when solving this
problem. As noted earlier, many online data sources
provide recipes. Some of them allow users to log in and
submit their own recipes, which then become available to
everyone. The first issue is that people write ingredient
names in natural language, the main vehicle through which
humans transmit and exchange information, and therefore
in an unstructured form. For example, different recipes
may contain "salt, iodised", "iodised salt" or
"salt-iodised". Ingredient names thus lack a structured
representation, because people express the same thing in
different ways. Another issue is ingredient synonymy:
synonyms need to be matched to the single term that is
used in the databases. Some ingredients can also have
multiple matches in the food composition database. For
example, a search for "salt" can return "salt",
"salt, table", "salt, iodised" and many more. These
matches may have very different nutritional properties,
so we need to choose the most relevant one. Finally, a
very important factor in calculating nutritional
properties is the preparation method of the ingredient.
A cooked ingredient differs from a raw one, for example
"smoked ham" versus "non-smoked ham", or
"chicken breast, raw" versus "chicken breast, cooked",
because they have different nutritional properties.
All of these issues need to be addressed in order to
find relevant ingredient matches, which can then be used
to calculate the nutritional value of the recipes.
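As a small illustration of the first issue, the three spellings of iodised salt quoted above can be reduced to the same token set by lowercasing and splitting on punctuation. This is only a minimal sketch: the function name `name_tokens` is hypothetical and the normalisation shown is not part of the method described in this paper.

```python
import re

def name_tokens(name):
    """Lowercase an ingredient name and split it on anything that is not a letter."""
    return frozenset(t for t in re.split(r"[^a-z]+", name.lower()) if t)

variants = ["salt, iodised", "iodised salt", "salt-iodised"]

# All three spellings collapse to one and the same token set.
token_sets = {name_tokens(v) for v in variants}
print(len(token_sets))
```

Such a normalisation removes differences in word order and punctuation only; it does not address synonymy or the choice among several database matches.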
4 POS TAGGING-PROBABILITY
WEIGHTED METHOD
One method for ingredient matching is presented in
(M. Muller et al., 2012). The method treats the problem
as a two-class classification problem, which required
evaluation by nutrition experts, and then uses a linear
regression model to match the ingredients.
To solve the problem of matching ingredients with food
composition data, we looked at existing ontologies in
this domain (LIRMM, 2015; Ontology, 2015). We found that
they focus on food recipes, ingredients and nutrients,
but that information about the structure of the
ingredient name is still missing. An ingredient name is
represented by a noun, which can be further described by
the form of the ingredient (an adjective) and the cooking
process (a verb). Both are very important and need to be
considered when calculating the nutritional value. Given
the importance of the nouns, adjectives and verbs in the
ingredient name, Part-Of-Speech tagging (POS tagging) is
one technique that can be used for matching ingredients
with food composition data (A. Voutilainen, 2003).
Our method is a probabilistic method in which we assign
a weight to each match and consider the match with the
highest weight to be the most relevant one. First, for
each ingredient from the recipe, we use POS tagging, also
called grammatical tagging or word-category
disambiguation, to identify the nouns, verbs and
adjectives. The nouns carry most of the information in
the name; the adjectives describe the ingredient in a
more specific form, for example "frozen" or "fresh"; and
the verbs are in most cases related to the preparation
method, for example "cooked" or "drained". Then, we
search the food composition database (FCDB) for the
ingredient with a simple SQL search using the nouns
extracted from the ingredient name in the recipe. For
each food name returned by the SQL search, we also
perform POS tagging to identify the nouns, verbs and
adjectives. Next, we define an event (X), which is the
similarity between the ingredient name from the recipe
and each of the food names returned by the SQL search of
the FCDB. Finally, the weight we assign to each matching
pair is the probability of this event.
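The SQL search step might look as follows. This is a sketch under stated assumptions: the text does not give the FCDB schema or the query itself, so the table `food(name)`, the sample rows, the function name `search_fcdb`, and the choice of one `LIKE` clause per noun joined with `AND` are all hypothetical.

```python
import sqlite3

def search_fcdb(conn, nouns):
    """Return food names that contain every noun (case-insensitive substring match)."""
    if not nouns:
        return []
    # Assumption: one LIKE clause per noun, all of which must match.
    where = " AND ".join("name LIKE ?" for _ in nouns)
    params = [f"%{n}%" for n in nouns]
    rows = conn.execute(f"SELECT name FROM food WHERE {where} ORDER BY name", params)
    return [r[0] for r in rows]

# Hypothetical in-memory stand-in for a food composition database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE food (name TEXT)")
conn.executemany(
    "INSERT INTO food VALUES (?)",
    [("salt",), ("salt, table",), ("salt, iodised",),
     ("chicken breast, raw",), ("chicken breast, cooked",)],
)

candidates = search_fcdb(conn, ["salt"])
print(candidates)
```

A substring match of this kind is deliberately permissive: it returns every food name that mentions the noun, and the weighting described in this section is what then ranks the candidates.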
Let $D_1$ be the name of a single ingredient from the
recipe, and let $D_2$ be a single food name returned by
the SQL search of the FCDB. We define
$$
\begin{aligned}
N_i &= \{\text{nouns extracted from } D_i\},\\
V_i &= \{\text{verbs extracted from } D_i\},\\
A_i &= \{\text{adjectives extracted from } D_i\},
\end{aligned}
\tag{1}
$$
where $i = 1, 2$.
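To make definition (1) concrete, the sketch below builds the three sets for a pair of names. A real system would use a trained POS tagger; here a tiny hand-written lexicon (`TOY_LEXICON`, hypothetical) stands in for it so that the example is self-contained. The tags in the lexicon and the rule that unknown words count as nouns are assumptions made for illustration only.

```python
import re

# Hypothetical word-to-tag lexicon standing in for a trained POS tagger.
TOY_LEXICON = {
    "chicken": "NOUN", "breast": "NOUN", "ham": "NOUN", "salt": "NOUN",
    "cooked": "VERB", "drained": "VERB", "smoked": "VERB",
    "fresh": "ADJ", "frozen": "ADJ", "raw": "ADJ", "iodised": "ADJ",
}

def extract_sets(name):
    """Return (N, V, A): the nouns, verbs and adjectives found in a name."""
    sets = {"NOUN": set(), "VERB": set(), "ADJ": set()}
    for token in re.findall(r"[a-z]+", name.lower()):
        tag = TOY_LEXICON.get(token, "NOUN")  # assumption: unknown words are nouns
        sets[tag].add(token)
    return sets["NOUN"], sets["VERB"], sets["ADJ"]

N1, V1, A1 = extract_sets("fresh chicken breast, cooked")  # D_1, from the recipe
N2, V2, A2 = extract_sets("chicken breast, raw")           # D_2, from the FCDB
print(N1, V1, A1)
print(N2, V2, A2)
```

In this pair the noun sets coincide while the verb and adjective sets differ, which is the situation the three separate similarity events are meant to capture.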
To find the probability of the similarity between the
ingredient name from the recipe and the food name from
the FCDB, we express the event as a product of three
other events:
$$
X = N \cdot V \cdot A, \tag{2}
$$
where $N$ is the similarity between the nouns in $N_1$
and $N_2$, $V$ is the similarity between the verbs in
$V_1$ and $V_2$, and $A$ is the similarity between the
adjectives in $A_1$ and $A_2$.
Because all these events are independent, the probability
of the event $X$ can be found as
$$
P(X) = P(N) \cdot P(V) \cdot P(A). \tag{3}
$$
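Equation (3), together with the rule that the match with the highest weight is the most relevant one, can be sketched as follows. The component probabilities have not been defined at this point in the text, so the numbers below are invented purely for illustration, and the names `match_weight` and `best_match` are hypothetical.

```python
def match_weight(p_n, p_v, p_a):
    """Weight of a candidate match: P(X) = P(N) * P(V) * P(A), as in Eq. (3)."""
    return p_n * p_v * p_a

def best_match(candidates):
    """Return the (name, weight) pair with the highest weight."""
    weighted = [(name, match_weight(*probs)) for name, probs in candidates.items()]
    return max(weighted, key=lambda pair: pair[1])

# Invented component probabilities (P(N), P(V), P(A)) for three FCDB names.
candidates = {
    "salt": (0.75, 0.5, 0.25),
    "salt, table": (0.5, 0.5, 0.25),
    "salt, iodised": (0.75, 0.5, 0.5),
}

name, weight = best_match(candidates)
print(name, weight)
```

Because the weight is a product, a single component probability of zero would remove a candidate regardless of the other two components, so the way each component is defined matters.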
Now we need to define the probabilities of the events
$N$, $V$ and $A$. Because we want to measure the
similarity between two sets, it is natural to use the
Jaccard index, $J$, which is used in statistics for comparing
KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval