Mapping Food Composition Data from Various Data Sources to a
Domain-Specific Ontology
Gordana Ispirova
1,2
, Tome Eftimov
1,2
, Barbara Koroušić Seljak
1
and Peter Korošec
1,3
1
Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
2
Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
3
Faculty of Mathematics, Natural Science and Information Technologies, Glagoljaška ulica 8, 6000 Koper, Slovenia
Keywords: Semantic Web, Food Domain Ontology, Food Composition Data, Text Similarity, Text Normalization.
Abstract: Food composition data are detailed sets of information on food components, providing values for energy and
nutrients, food classifiers and descriptors. The data of this kind is presented in food composition databases,
which are a powerful source of knowledge. Food composition databases may differ in their structure between
countries, which makes it difficult to connect them and preferably compare them in order to borrow missing
values. In this paper, we present a method for mapping food composition data from various sources to a
terminological resource-a food domain ontology. An existing ontology used for the mapping was extended
and modelled to cover a larger portion of the food domain. The method was evaluated on two food
composition databases: EuroFIR and USDA.
1 INTRODUCTION
Food composition data (FCD) are detailed sets of
information on the nutritional components of foods,
providing values for energy and nutrients, food
classifiers and descriptors. This type of data is
presented in Food Composition Databases (FCDBs)
(Greenfield, Southgate, 2003). Nowadays, FCDBs
tend to be compiled using a variety of methods,
including: chemical analysis of food samples carried
out in analytical laboratories, imputing and
calculating values from data already within the
database and estimating values from other sources,
including manufacturers food labels, scientific
literature and FCDBs from other countries.
The three main limitations of FCDBs are:
variability in the composition of foods between
countries, age of data (limited resources mean that,
inevitably, some values are not current) and
incomplete coverage of foods or nutrients leading to
missing values. Foods, being biological materials,
exhibit variations in composition. Therefore, a
database cannot accurately predict the composition of
any given single sample of a food. Further, FCDBs
cannot predict accurately the nutrient levels in any
food and the composition of a given food may change
with time. Predictive accuracy is also constrained by
the ways in which data are maintained in a database
(as averages, for example). FCDBs frequently cannot
be used as literature sources for comparison with
values obtained for the food elsewhere. Values from
one country should be compared with values obtained
in other countries by reference to the original
literature. Despite major efforts on harmonizing food
descriptions, nutrient terminology, analytical
methods, calculation and compilation methods,
values from existing food composition tables and
databases are not readily comparable across
countries. The description of food composition data-
nutrient terminology in different FCDBs can differ
(e.g. beta carotene, carotene-beta). To harmonize
them there is a need of text normalization methods.
Normalizing text means converting it to a more
convenient, standard form. Text normalization is the
process of transforming text into a single canonical
form. This process requires awareness of the type of
text being normalized and how it will be processed
afterwards, there is no all-purpose normalization
procedure. The main idea of normalizing a text is to
map a text description to an already existing
description contained in a domain specific
terminological resource, which can be a classical
dictionary, thesauri or a domain specific ontology.
A domain ontology represents concepts which
belong to a specific domain. Food domain ontologies
Ispirova G., Eftimov T., KorouÅ ˛aiÄ
˘
G Seljak B. and KoroÅ ˛aec P.
Mapping Food Composition Data from Various Data Sources to a Domain-Specific Ontology.
DOI: 10.5220/0006504302030210
In Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KEOD 2017), pages 203-210
ISBN: 978-989-758-272-1
Copyright
c
2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
model and represent the domain of food. Having
different food related datasets, it is possible to match
an entity mention from the datasets to a concept in the
terminological resource i.e. the domain specific
ontology. Being more specific, the data from a FCDB
can be matched to a food domain ontology, and to
each of the entities an ontology tag can be assigned,
thus linking the dataset to the ontology. This type of
data mapping is an ontology-based data integration
(Leida, Ceravolo, Damiani, Cui, Gusmini, 2010;
Kerzazi, Navas-Delgado, F.Aldana-Montes, 2009).
This linking opens up a whole window of new
opportunities. Major problem in FCDBs are missing
values of components. One of the solutions to this
problem is borrowing data from other FCDBs. This
can be accomplished from this type of linking. By
linking datasets to a domain specific ontology, the
linked datasets can borrow missing values
interchangeably.
In this paper we compare the results for text
normalization of short text segments, specifically
names or descriptions of nutrients, obtained using two
different approaches: standard text similarity
measures and a modified version of Part of Speech
(POS) tagging probability weighted method
(Eftimov, Koroušić-Seljak, 2017). Starting with an
overview of related work concerning text
normalization methods and food domain ontologies
in Section 2, we continue with explaining the two
datasets and the methods used in our experiments
(Section 3 and Section 4). In Section 5 we give the
results obtained from the experimental work, and a
comparison of the methods. The last section is an
overall discussion of the problem, the method used
and the obtained results as a conclusion to our work.
2 RELATED WORK
In this section an overview of related work is
presented. Starting from existing text normalization
methods and food domain ontologies. To the best of
our knowledge there is no text normalization method
specifically developed for the food domain.
2.1 Text Normalization
The aim of text normalization methods is mapping
same concepts coming from different sources,
described in different ways to a concept from
terminological resource, which will imply that the
information contained in these concepts is the same.
The majority of normalization methods are based on
matching entity mentions to concept synonyms listed
in a terminological resource (Aronson, 2001; Savova,
Masanz, Ogren, Zheng, Shon, Kipper-Schuler, et al.,
2010; Friedman, Shagina, Socratous, Zeng, 1996;
Friedman, 2000; Garla, Brandt, 2013). More
sophisticated methods combine or rank the results
obtained using a number of different terminological
resources (Collier, Oellrich, Groza, 2015; Fu, Batista-
Navarro, Rak, Ananiadou, 2014). Pattern-matching
or regular expressions approaches (Fan, Sood, Huang,
2013; Ramanan, Broido, Nathan, 2013; Wang,
Akella, 2013) can account for frequently occurring
variations not listed in the terminological resource.
Methods based on machine learning, or hybrid
methods combining rules and machine learning, have
also been proposed (Goudey, Stokes, Martinez, 2007;
Leaman, Doğan, Lu, 2013).
String similarity methods have been employed in
a number of normalization efforts (Doğan, Lu, 2012;
Kate, 2015). These methods assign a numerical score
representing the degree of similarity between an
entity mention and a concept synonym, which means
that, unlike the limited types of variations that can be
handled by rules or regular expressions, string
similarity methods can handle a virtually unlimited
range of variations.
Character-level methods consider the number of
edits (e.g., insertions, deletions or substitutions)
required to transform one phrase into another (Jaro,
1995), or look at the proportion and/or ordering of
characters that are shared between the phrases being
compared (Jaro, 1995; Winkler, 1999; Kondrak,
2005). This can help to account for the fact that
concepts may be mentioned in text using words that
have the same basic root but many different forms,
including different inflections (e.g., reduced vs.
reduce), alternative spellings (e.g. fiber vs. fibre) and
nominal vs. verbal forms (e.g., reduce vs. reduction).
Word-level similarity metrics (Jaccard, 1912) can
be more appropriate when the phrases to be compared
consist of multiple words. Such metrics make it
possible to ensure that a match is only considered if a
certain proportion of words is shared. Weights may
be applied to the individual words (as is the case for
TF-IDF (Term Frequency-Inverse Document
Frequency) (Moreau, Yvon, Cappe, 2008)), to ensure
that greater importance is placed on matching words
with high relevance to the domain, than function
words like: the, of, etc.
Hybrid methods (e.g. SoftTFIDF (Cohen,
Ravikumar, Fienberg, 2003)) also operate at word
level, but use a character based similarity method to
allow matches between words that closely resemble
each other, even if they do not match exactly. This
helps to account for the fact that concepts may be
mentioned in text using multi-word terms whose
exact forms may vary from synonyms listed in the
terminological resource. Such methods can also help
to address the problem of normalizing entity mentions
containing spelling errors. The accuracy of string
similarity methods could be improved by integrating
semantic-level information.
In the paper (Alnazzawi, Thompson, Ananiadou,
2016) the authors present a method, called
PhenoNorm. It was developed using the PhenoCHF
corpus, which is a collection of literature articles and
narratives in Electronic Health Records, annotated for
phenotypic information relating to congestive heart
failure (CHF). This method links CHF-related
phenotype mentions to appropriate concepts in the
UMLS Metathesaurus, using a version of PhenoCHF.
However, in the food domain, where we
concentrate our research, this type of work has not
been previously done.
2.2 Food Domain Ontologies
There are several food ontologies: FoodWiki (Celik,
2015), AGROVOC (Caracciolo et al., 2012), Open
Food Facts (Open food facts ontology, 2017), Food
Product Ontology (Kolchin, Zamula, 2013), FOODS
(Diabetics Edition) (Snae, Bruckner, 2008) and
FoodOn Ontology (FoodOn Ontology, 2017). In the
paper (Boulos, Yassine, Shirmohammadi, Namahoot,
Bruuckner, 2015), the authors provided a review of
the mentioned food ontologies.
Despite the attempts of building an ontology for
wider uses, thus the attempt of FoodOn for
generalized ontology for the food domain, all of the
mentioned ontologies are developed for very specific
uses. To overcome the limited scope of food
ontologies an ontology that covers wider domain is
needed.
Not scientifically validated data in the systems
providing data about food and nutrition is the main
cause of having invalid data in FCDBs. This problem
has been mostly solved with the project QuaLiFY
(Qualify, 2017). In this project a new food ontology
for harmonization of food-and nutrition-related data
and knowledge, called Quisper (Eftimov, Koroušić-
Seljak, 2015), has been developed. We have updated
and extended this ontology with additional concepts.
In Figure 1 the updated structure is shown. In the
Figure 1: Updated structure of Quisper ontology.
developing process of this ontology first, using the
POS tagging-probability weighted method, the
similar terms provided from the web services are
extracted, then an initial taxonomy with the similar
terms is created, to which terms that are typical only
for one web service are added. At the end, using the
extracted terms in the taxonomy and the relations
between the terms in the web services, an ontology
from scratch is created using the software Protégé
(Protégé, 2016), which is available to the human
experts. The proposed approach could be also used in
other domains simply by modifying the probability
model in order to fit the purposes of the domain of
interest.
3 DATA
In this section we describe the data used in our
experiments, which comes from two sources. The
purpose is to link the nutrient information from both
sources presented on a different way to a food domain
ontology. For example: “fatty acids 18:1-11 t (18:1t
n-7)” needs to be linked to “fatty acid 18:1 n-7 trans”;
“Tocotrienol, gamma” needs to be linked to “gamma-
tocotrienol”; etc.
3.1 EuroFIR Dataset
European Food Information Resource Network
(EuroFIR) AISBL is an international, non-profit
association under the Belgian law. As an organization
its purpose is developing, publishing and exploiting
food composition information and promoting
international cooperation and harmonization of
standards to improve data quality, storage and access
(European Food Information Resource Network,
2017). The EuroFIR data interchange uses files in
XML format, which follow a nested structure.
EuroFIR presents a data model for FCD management
and data interchange. The EuroFIR format for FCD
starts with “Foods” element which holds separate
“Food” elements that report the data for each
individual food item. Within each “Food” element,
together with the elements describing the food, are
nested collections of “Component” records, each with
its set of “Value” records. For the purposes of this
project, an XML file from EuroFIR Component
Thesaurus version 1.3 is extracted. There are 997
components in total and for each component a short
abbreviation, the full name/description of the
component, the date when it was added to the
database and the date when it was last updated are
listed.
3.2 USDA Dataset
The United States Department of Agriculture
(USDA), is the federal executive department of the
U.S., whose responsibility is developing and
executing federal laws related to farming, agriculture,
forestry, and food. Its aims are meeting the needs of
farmers and ranchers, promoting agricultural trade
and production, assuring food safety, protecting
natural resources, foster rural communities and
ending hunger in the United States and internationally
(USDA, Food Composition Database, 2017).
This department has produced the USDA
National Nutrient Database, which is a database that
provides the nutritional content of many generic and
proprietary-branded foods. New releases occur about
once per year. The database can be searched online,
queried through a REST API (NDB API, 2017), or
downloaded. For the needs of this project we accessed
the nutrients list from the USDA FCDB through the
REST API.
The obtained file is in XML format, following a
nested structure. There are 190 nutrients in the list,
and for each nutrient an identification number and the
nutrient’s name is listed.
4 METHOD
In this section, we describe the pre-processing of the
datasets and the two approaches of text normalization
used in our experiments.
4.1 Pre-processing
After obtaining the XML files from both datasets the
relevant information is extracted. From the EuroFIR
dataset for each component we extracted the short
abbreviation and the full name/description of the
component in a CSV file. The same is applied for the
USDA dataset, where the names of the nutrients
alongside with their identification numbers are
extracted in a CSV file. From the OWL file of the
extended Quisper ontology, all the sub-concepts of
the concept “Component” and their corresponding tag
from the ontology are extracted. The conversion from
XML to CSV is made using simple parsing in R
(Development Core Team, 2008).
4.2 Normalization of FCD
Having the CSV files from the three sources the next
step was to match the names of the nutrients from
both food composition databases to the names of the
nutrients from the ontology. The matching is made by
using two different methods: using text similarity
measures and using a modified version of POS
tagging combined with probability theory.
4.2.1 Normalization using Text Similarity
Measures
The first method of normalization is performed in
RStudio IDE (RStudio Team, 2015), using the
package ‘stringdist’ (Van der Loo, Van der Laan,
Logan, 2016). A total of eight text similarity
measures are applied:
1. Optimal string alignment (OSA), (restricted
Damerau-Levenshtein distance) - Levenshtein
distance is the number of deletions, insertions
and substitutions necessary to turn string into
string . OSA is like the Levenshtein distance but
also allows transposition of adjacent characters.
Here, each substring may be edited only once.
2. Full Damerau-Levenshtein distance is like the
OSA distance except that it allows multiple edits
on substrings.
3. Longest common substring distance is defined as
the longest string that can be obtained by pairing
characters from string and string while
keeping the order of characters intact. This
distance is defined as the number of unpaired
characters, and it is equivalent to the edit distance
allowing only deletions and insertions, each with
weight one.
4. -gram distance - A -gram is a subsequence of
consecutive characters of a string. If () is
the vector of counts of -gram occurrences in
string (), the -gram distance is given by the
sum over the absolute differences |
−
|.
The computation is aborted when is larger than
the length of any of the strings. In that case
is returned.
5. Cosine distance between -gram profiles is
computed as:
1−∙/(
|
|

|
|
) (1)
Where and were defined above.
6. Jaccard distance between -gram profiles - Let
be the set of unique -grams in and the set of
unique -grams in . The Jaccard distance is
given by:
1−
|
∩
|
/
|
∪
|
(2)
7. Jaro, or Jaro-Winker distance - The Jaro distance,
is a number between 0(exact match) and 1
(completely dissimilar) measuring dissimilarity
between strings. It is defined to be 0 when both
strings have length 0, and 1when there are no
character matches between and . Otherwise,
the Jaro distance is defined as:
1 (1/3)(
/||+
/||+

( )/) (3)
Here, || indicates the number of characters in ,
is the number of character matches and t the
number of transpositions of the matching
characters. The
are weights associated with
the characters in , characters in and with
transpositions. Two matching characters are
transposed when they are matched but they occur
in different order in string and . The Jaro-
Winkler distance adds a correction term to the
Jaro-distance. It is defined as:
−·· (4)
Where is the Jaro-distance. Here, is obtained
by counting, from the start of the input strings,
after how many characters the first character
mismatch between the two strings occurs, with a
maximum of four. The factor is a penalty
factor, which in the work of Winkler is often
chosen 0.1.
8. Distance based on soundex encoding - This text
similarity measure translates each string to a
soundex code. The distance between strings is 0
when they have the same soundex code,
otherwise 1.
All eight text similarity measures are applied two
times on the data, without any previous pre-
processing, and with an additional pre-processing
step. The pre-processing step is removing the
punctuation from the names of the nutrients from both
datasets and from the names of the nutrients from the
ontology.
4.2.2 Normalization using POS and
Probability Theory
The second method of normalization is also
performed in RStudio IDE and it includes using POS
tagging combined with probability theory. This
particular method has been previously used (Eftimov,
Koroušić-Seljak, Korošec, 2017; Eftimov, Korošec,
Koroušić-Seljak, 2017; Eftimov, Koroušić-Seljak,
2017) and for the requirements of this project it is
modified.
Because we are working with terms related to
chemical names, on each description of the nutrient,
using POS tagging, nouns, adjectives and numbers
are extracted. These three morphological tags or
categories are selected with previously examining the
datasets. The descriptions of the nutrients can be
consisted of:
only nouns (Example: copper; gluten;
sodium; …)
nouns and adjectives (Example: acetic acid;
amino acids, total aromatic; pentoses in
dietary fibre …)
nouns and numbers (Menaquinone-10;
vitamin_B1; …)
nouns, adjectives and numbers (Example:
10-formylfolic acid; fatty acid 10:0 (capric
acid; starch, resistant_RS4; …)
The nouns carry the most information about the
term’s description, the adjectives explain the terms in
most specific form and the numbers are in most cases
related to the chemical nomenclature.
If:

= {nouns extracted from the -th
dataset}

= {adjectives extracted from the -th
dataset}

= {numbers extracted from the -th
dataset}
Correspondingly  is the similarity between the
nouns extracted in one of the nutrient datasets
(EuroFIR or USDA) and the nouns extracted from the
names of the nutrients from the ontology. Same
implies for  and . Having that these events are
independent from each other:
(
)
=
(

)
×
(

)
×
(

)
(5)
The formula for calculating each of the
probabilities is:
() = (
∩
+1) (
∪
+2) (6)
Where ={,,}. The probability that
two strings match is obtained with replacing, equation
(6) for each of the probabilities in equation (5).
5 RESULTS
The results from both methods are exported in CSV
format files.
In order to compare the methods, we manually
label all the instances in the result files. The labels
assigned are the following:
0 -Either no match is found (which is a case
only with the second method) or the match
found is not the correct one.
1 - A match is found.
2- Multiple matches are found, one of them
being the correct (most suitable) one.
3 - Multiple matches are found with none of
them being suitable or correct.
After labelling the instances, simple statistics are
performed, counting the instances from each
category. Because the Quisper ontology is
constructed based on an ontology-learning method
where one of the initial sets is the EuroFIR dataset,
when matching the nutrient names from EuroFIR and
the nutrient names from the ontology we obtained
perfect matches and 100% accuracy, and with this the
goal of assigning an ontology label to each nutrient is
met. However, for the nutrient names from USDA
dataset we obtained different results.
Table 1: Results from text normalization on USDA dataset.
Measure
Label
‘0’
Label
‘1’
Label
‘2’
Label
‘3’
Optimal
string
alignment
20 114 22 34
Full
Damerau -
Levenshtein
21 113 24 32
Longest
common
substring
33 127 13 17
- gram
36 112 24 18
Cosine
56 127 3 4
Jaccard
45 94 34 17
Jaro -
Winker
52 131 1 6
Soundex
2 48 110 30
POS
tagging
with
probability
theory
23 161 6 0
Judging from the results shown in Table 1, the
POS tagging method with probability theory gives the
best results. For a total of 167 instances it gives the
correct matches and only for 23 instances either it
cannot find a match or it returns the incorrect match,
which makes it have an accuracy of 87.9%. There are
no instances that belong to label′3′, and only 6
instances that belong to label ‘2’, which implies that
this method gives multiples choices only for a few
instances. By writing a simple code in R we
determined the threshold of the probability:
if P < 0.067, then label ′0′ is assigned
The second best is Soundex text similarity
measure, with 158 instances with correct matches
and 32 instances with incorrect matches. It is clear
that this method works with giving a lot of options,
thus the number of instances with label′2′ is much
larger than the number of instances with label′1′, and
the number of instances with label ′3′ is also larger
than the number of instances with label′0′.
From further observation of the results we are
able to see that 19 out of those 23 instances for which
the best method is giving either no matches or the
incorrect matches the other eight measures also do not
give correct matches. From this we've come to the
conclusion that the other eight measures cannot be
applied to these instances in a cascade type of method
for improvement. For 4out of the 23 instances for
which this method doesn't give matches: “Ash”,
“Fiber, total dietary”, “Lutein + zeaxanthin” and
“Thiamin”, the other eight methods give correct
matches: “ash, total”, “fibre, total dietary”, “lutein
plus zeaxanthine” and “thiamine”. From looking into
this problem we have come to the conclusion that this
is because of the fact that the POS tagging method
does not recognize them as part of any morphological
class and cannot assign morphological tags to them.
In order to improve this results the second best
(Soundex) and third best measure (Longest common
substring) are applied separately on these instances
and the correct matches are obtained, with the
difference that the Soundex measure, again, for some
of the instances, gives more than one option, but the
Longest common substring measure gives only one
option for each, thus making it the better measure in
this case. After this step the total number of correctly
matched instances is 171 which is an
accuracyof90%.
6 CONCLUSIONS
This paper focuses on the problem of mapping food-
related information from different FCDBs to a
domain specific ontology that covers a large portion
of the food domain. Our work focuses on using text
normalization methods for linking short text
segments, in this case nutrient terminology, to a
concept from a domain specific terminological
resource, in this case a food domain ontology. The
implementation of this work allows the same nutrient
data represented on different ways in various data
sources to be linked to a concept from food domain
ontology, which makes sharing, combining and
reusing this kind of data easier. So far, we have linked
the largest two FCDBs to the Quisper ontology on
nutrient level. This concept can be modified and
further extended on food level.
With this work a certain level of harmonizing
FCD is achieved. If the principles of this work are
further followed by existing and newly constructed
FCDBs, the quality of the data and the database will
improve significantly.
ACKNOWLEDGMENTS
This work is supported by the project RICHFIELDS,
which received funding from the European Union’s
Horizon 2020 research and innovation programme
under grant number 654280.
REFERENCES
Greenfield, H., Southgate, D. A. T., 2003. Food
Composition Data: Production, Management, and Use,
Food & Agriculture Org, 2
nd
edition.
Celik, D., 2015. Foodwiki: Ontology-driven mobile safe
food consumption system, The Scientic World Journal.
Caracciolo, C., et al., 2012. Thesaurus maintenance,
alignment and publication as linked data: the agrovoc
use case. In International Journal of Metadata, Seman-
tics and Ontologies. Inderscience Enterprises Ltd.
Open food facts, accessed April 2017. http://world.open
foodfacts.org/who-we-are.
Kolchin, M., Zamula, D., 2013. Food product ontology:
Initial implementation of a vocabulary for describing
food products. In: Proceeding of the 14th Conference of
Open Innovations Association FRUCT.
Snae, C., Bruckner, M., 2008. Foods: a food-oriented
ontology-driven system. In Second IEEE International
Conference on Digital Ecosystems and Technologies.
FoodOn Ontology, accessed April 2017. http://food
ontology.github.io/foodon/.
Qualify, accessed April 2017. http://quisper.eu.
European Food Information Resource Network-EuroFIR,
accessed April 2017. http://www.eurofir.org/.
Eftimov, T., Koroušić-Seljak, B., 2015. QOL - Quisper
Ontology Learning using personalized dietary services.
In IJS delovno poročilo, 11985, confidential.
Boulos, M. N. K., Yassine, A., Shirmohammadi, S.,
Namahoot, C. S., Bruuckner, M., 2015. Towards an
internet of food: Food ontologies for the internet of
things. In Future Internet 7.
Protégé, accessed March 2017. http://protege.stanford.edu/.
Aronson, A., 2001. Effective Mapping of Biomedical Text
to the UMLS Metathesaurus: The MetaMap Program.
In Proceedings of the AMIA Annual Symposium.
Savova, G., Masanz, J., Ogren, P., Zheng, J., Shon, S.,
Kipper-Schuler, K., et al., 2010. Mayo clinical Text
Analysis and Knowledge Extraction System
(cTAKES): architecture, component evaluation and
applications. In Journal of the American Medical
Association.
Friedman, C., Shagina, L., Socratous, S. A., Zeng, X., 1996.
A WEB-based version of MedLEE: A medical
language extraction and encoding system. In:
Proceedings of the AMIA Annual Fall Symposium.
Friedman, C., 2000. A broad-coverage natural language
processing system. In Proceedings of the AMIA
Symposium. American Medical Informatics
Association.
Garla, V. N., Brandt, C., 2013. Knowledge-based
biomedical word sense disambiguation: an evaluation
and application to clinical document classification. In
Journal of the American Medical Informatics
Association.
Collier, N., Oellrich, A., Groza, T., 2015. Concept selection
for phenotypes and diseases using learn to rank. In
Journal of Biomedical Semantics.
Fu, X., Batista-Navarro, R, Rak, R, Ananiadou, S., 2014. A
strategy for annotating clinical records with phenotypic
information relating to the chronic obstructive
pulmonary disease. In Proceedings of Phenotype Day
at ISMB.
Fan, J., Sood, N., Huang, Y., 2013. Disorder concept
identification from clinical notes an experience with the
ShARe/CLEF 2013 challenge. In Proceedings of the
ShARe/CLEF Evaluation Lab.
Ramanan, S., Broido, S., Nathan, P. S., 2013. Performance
of a Multi-class Biomedical Tagger on Clinical
Records. In Proceedings of the ShARe/CLEF
Evaluation Lab.
Wang, C., Akella, R., 2013. UCSC's System for CLEF
eHealth. In: Proceedings of the ShARe/CLEF
Evaluation Lab.
Goudey, B., Stokes, N, Martinez, D., 2007. Exploring
Extensions to Machine Learning-based Gene
Normalisation. In Proceedings of the Australasian
Language Technology Workshop.
Leaman, R., Doğan, R. I., Lu, Z., 2013. DNorm: disease
name normalization with pairwise learning to rank. In
Bioinformatics.
Doğan, R. I., Lu, Z., 2012. An inference method for disease
name normalization. In Proceedings of the 2012 AAAI
Fall Symposium Series.
Kate, R. J., 2015. Normalizing clinical terms using learned
edit distance patterns. In Journal of the American
Medical Informatics Association.
Jaro, M. A., 1995. Probabilistic linkage of large public
health data files. In Statistics in medicine.
Winkler, W. E., 1999. The state of record linkage and
current research problems. In Statistical Research
Division, US Census Bureau.
Kondrak, G., 2005. N-gram similarity and distance. In
String Processing and Information Retrieval. Springer,
Berlin, Heidelberg.
Jaccard, P., 1912. The distribution of the flora in the alpine
zone. In New Phytologist.
Moreau, E., Yvon, F., Cappe, O., 2008. Robust similarity
measures for named entities matching. In Proceedings
of the 22nd International Conference on Computational
Linguistics.
Cohen, W., Ravikumar, P, Fienberg S., 2003. A comparison
of string metrics for matching names and records. In
Proceedings of the KDD workshop on data cleaning
and object consolidation.
Alnazzawi, N., Thompson, P., Ananiadou, S., 2016.
Mapping Phenotypic Information in Heterogeneous
Textual Sources to a Domain-Specific Terminological
Resource. In PLOS ONE 11.
USDA, Food Composition Database, accessed April 2017.
https://ndb.nal.usda.gov/ndb/.
NDB API, accessed April 2017. https://ndb.nal.usda.gov/
ndb/doc/index.
Development Core Team, 2008. R: A language and
environment for statistical computing. http://www.R-
project.org.
RStudio Team, 2015. RStudio: Integrated Development for
R. https://www.rstudio.com/. RStudio Inc., Boston. R
Foundation for Statistical Computing, Vienna.
Eftimov, T., Koroušić-Seljak, B., Korošec, P., 2017. A rule-
based Named-entity Recognition Method for
Knowledge Extraction of Evidence-based Dietary
Recommendations. In PLOS ONE.
Eftimov, T., Korošec, P., Koroušić-Seljak, B., 2017.
StandFood: Standardization of Foods Using a Semi-
Automatic System for Classifying and Describing
Foods According to FoodEx2. In Nutrients.
Eftimov, T., Koroušić-Seljak, B., 2017. POS Tagging-
probability Weighted Method for Matching the Internet
Recipe Ingredients with Food Composition Data. In
Proceedings of the 7th International Joint Conference
on Knowledge Discovery, Knowledge Engineering and
Knowledge Management (IC3K 2015).
Van der Loo, M., Van der Laan, J., Logan, N., 2016.
Approximate String Matching and String Distance
Functions.
Leida, M., Ceravolo, P., Damiani, E., Cui, Z., Gusmini, A.,
2010. Semantics-aware matching strategy (SAMS) for
the Ontology meDiated Data Integration (ODDI). In
Int. J. Knowledge Engineering and Soft Data
Paradigms, Vol. 2, No. 1.
Kerzazi, A., Navas-Delgado, I., F.Aldana-Montes, J., 2009.
Towards an Ontology-based Mediation Framework for
Integrating Biological Data. In SWAT4LS.