How to Extract Unit of Measure in Scientific Documents?
Soumia Lilia Berrahou¹,², Patrice Buche¹,², Juliette Dibie-Barthelemy³ and Mathieu Roche¹
¹LIRMM – CNRS – Univ. Montpellier 2, 34095 Montpellier Cedex 5, France
²INRA – UMR IATE, INRIA – GraphIK, 2 place Pierre Viala, 34060 Montpellier Cedex 2, France
³INRA – Mét@risk & AgroParisTech, 75231 Paris Cedex 5, France
Keywords: Information Retrieval, Unit of Measure Extraction, Ontological and Terminological Resource, Machine Learning.
Abstract: A large amount of quantitative data related to experimental results is reported in scientific documents as free-form text. Each quantitative result is characterized by a numerical value, often followed by a unit of measure. Automatically extracting quantitative data is a painstaking process because units are written in many different ways within documents. In this paper, we propose to focus on the extraction and identification of variant units, in order to iteratively enrich the terminological part of an Ontological and Terminological Resource (OTR) and, in the end, to allow the extraction of quantitative data. Focusing on unit extraction involves two main steps. Since we work on unstructured documents, units are completely drowned in textual information. In the first step, our method handles the crucial, time-consuming process of unit location using supervised learning methods. Once the units have been located in the text, the second step of our method consists in extracting and identifying candidate units in order to enrich the OTR. The extracted candidates are compared to units already known in the OTR using a new string distance measure to validate whether or not they are relevant variants. We have conducted conclusive experiments on our two-step method on a set of more than 35,000 sentences.
1 INTRODUCTION
Discovering and extracting information reported in published scientific documents is a crucial stake in several scientific domains, as it enables the information they contain to be reused, managed, exploited and analyzed. Nevertheless, a large amount of data in every scientific field is still published on today's web in the form of unstructured text rather than as structured semantic information. Indeed, published scientific articles use natural language combined with domain-specific terminology that is extremely tedious to extract from free-form text.
Linguistic methods for information extraction (IE), such as the ones proposed in GATE (Cunningham et al., 2002), are mostly rule-based techniques, which usually combine gazetteer lists and pattern-matching rules in order to determine whether a candidate extracted term is valid. Those techniques are then included in an overall architecture providing various language processing tasks, such as building and annotating corpora. Such a system involves semantic analysis at the simplest level of the process, with named entities. Other systems directly integrate formal ontologies into the natural language processing (NLP) environment, an approach called ontology-based information extraction (OBIE), and incorporate semantic knowledge about terms in a context involving reasoning (Maynard et al., 2008; Wimalasuriya and Dou, 2010).
Our work deals with unit recognition and extraction, widely known as a challenging issue in IE, as described in the Related Work section below. Indeed, most techniques used in classical IE systems, based on linguistic methods, cannot handle quantitative data extraction issues, especially units of measure, because units rely on completely different syntactic definitions. New information extraction methods and rules must be provided that take into account particularities such as simple or complex units, prefixes denoting multiples or submultiples of units, dimensions, the presence or absence of special characters, superscript numbers and so on. Currently, none of the existing state-of-the-art NLP methods efficiently handles the extraction of units of measure.
However, it remains of paramount importance in many quantitative research fields, such as physics, chemistry, biology and food science, where the difficulty of extracting quantitative data from experimental results is
yet to be overcome.
Our work aims at leveraging knowledge in the food science domain, relying on an Ontological and Terminological Resource (OTR). The concepts related to units of measure in this OTR have been defined from the most extensive ontology of units, the Ontology of units of Measure and related concepts (OM) (Rijgersberg et al., 2011), (Rijgersberg et al., 2013). OM respects the International System of Units (SI) (Thompson and Taylor, 2008), which organizes quantities and units of measure. The definitions of the base and derived units of SI have been translated into a formal representation in OM. Our work focuses on unit recognition and extraction issues in order to extract typographic variants, synonyms, abbreviations and new units of measure, which are then used to populate the OTR. This is a crucial step towards building robust add-in semantic tagging and extraction tools for units of measure that could be integrated into OBIE or IE systems. These tools facilitate domain expert participation in the task of annotating textual corpora and data tables for gold standard corpus construction. Moreover, the OTR thus populated is used as a basis for reasoning and for multi-criteria analysis in decision-making systems (Hignette et al., 2008).
We propose a two-step approach to perform unit recognition and extraction, with the following contributions: (i) the first step of our work demonstrates how supervised learning methods can be successfully used to meet the challenge of reducing the time-consuming process of unit location in free-form text; (ii) the second step of our work deals with unit extraction and identification using a new string distance measure in order to overcome the scaling issue.
The paper is structured as follows. In section 2, we point out related work on unit recognition. In section 3, we describe our two-step approach, based on locating and extracting variant terms of units. In sections 4 and 5, each step is detailed. In section 6, we deliver our results on real-world data and, finally, discuss and conclude on our work.
2 RELATED WORK
Recent work (Touhami et al., 2011), (Jessop et al., 2011a) has revealed that most attempts at extracting experimental data fail because units suffer from a large variability of writing within scientific documents. Despite efforts to produce a proper formal standard (Thompson and Taylor, 2008), (Rijgersberg et al., 2013) for exchanging and processing quantitative information, if a unit of measure is written differently in the text, the identification process fails on even the smallest differences (e.g. mol written as mole).
Other work in related fields, such as chemistry, focuses not only on the identification and annotation of chemical entities (Jessop et al., 2011b) but also on the relationships linking these entities to each other (Hawizy et al., 2011) using ChemicalTagger, the best performing state-of-the-art tool for text-mining in chemistry, a phrase-based semantic natural language processing tool for the language of chemical experiments. However, in (Jessop et al., 2011a), the observation is similar: the authors note that ChemicalTagger (Hawizy et al., 2011) fails to recognize chemical names as reagents because of the variability of unit writing in the text. It perfectly matches units of measure written in accordance with the SI (Thompson and Taylor, 2008) but fails as soon as small changes occur in the text. The authors present this issue as a challenge to be met in future work.
In (Willems et al., 2012), which follows the same line of work as ours, the authors present a method for annotating and extracting quantities and units. Their approach relies on the semantic markup available in LaTeX files, whereas our approach must address full-text unstructured documents.
In (Van Assem et al., 2010), the authors propose to overcome the unit identification issue in data tables by using a string distance metric called Jaro-Winkler-TFIDF (Cohen et al., 2003). We use a new string distance measure, adapted from the Damerau-Levenshtein distance (Damerau, 1964), in order to tackle the unit identification issue when units are drowned in textual information. Moreover, unlike our work, they do not need to consider unit location in order to overcome the scalability issue, since they work on structured data tables.
The current state-of-the-art methods of related domains are not sufficient to efficiently handle the identification of variant terms of units. Our approach aims at addressing this concern with a novel method that extracts unit information in a two-step extraction process.
3 GLOBAL APPROACH
In this section, we present our two-step approach to extract units in order to enrich a domain OTR. We assume that the more units we extract from text, the more we will succeed in the experimental data identification process. Indeed, variant terms of units encountered in scientific documents, due to the variability
KDIR2013-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
250
of writing, are seen as potential synonyms in the terminological part of the OTR. Once referenced in the OTR, they will increase the positive results of the identification and extraction process of experimental data. Thus, our work on the unit identification task is a sub-step of an overall process of experimental result identification and extraction, but remains a crucial stake.
Before presenting our global approach, let us briefly present the unit concepts of the considered OTR. As shown in Figure 1, units are specifically organized around the Unit concept and divided into 4 classes of units (units being considered as instances):
Singular Unit, such as day or meter;
Unit Exponentiation, such as Square Meter;
Unit Division Or Multiplication, such as Milliliter Per Square Meter Per Day;
Unit Multiple Or Submultiple, such as Micrometer.
Figure 1: Excerpt of the OTR: Unit concept.
Each unit has several possible labels, called terms,
which are defined in the terminological part of the
OTR and denote unit instances, called units in the fol-
lowing, defined in the conceptual part of the OTR.
In Figure 2, we give an example that clearly illustrates the main problem of unit recognition. The terms amol.m-1.s-1.Pa-1 and amol/m/s/Pa, referenced in the terminological part of the OTR, denote the unit AttoMole Per Meter Per Second Per Pascal, an instance of the Unit Division Or Multiplication concept. Those terms are not sufficient to recognize variant terms of units such as those suggested in Figure 2. Our method aims at locating those variant terms and extracting them to enrich the OTR with new terms (e.g. amol x m-1 x s-1 x Pa-1, where a multiplication sign has been added).
Figure 2: Relevant variant terms of units.

The method must face several challenges: (i) the question of locating units within the text: do we need to skim the whole text, or can we predict the best information location without any assumption? This question matters because the major limitation of classical string distance metrics is the scalability issue: currently, these metrics cannot be applied to a huge corpus, so the first step is a major concern in order to significantly reduce the search space to the relevant sentences where string distance metrics can be applied; (ii) the question of extracting variant terms of units: once the best location has been identified, how do we extract variant terms of units from that part of the text; and (iii) once they are extracted, the question of semantically identifying these variant terms of units and finding which unit instances they can denote in the OTR. Our approach tackles these key points and aims at discovering more variant terms of units in order to enrich the terminological part of the OTR in an iterative process.
The objective of the first step is to significantly reduce the size of the search space and overcome the painstaking process of unit location. As a matter of fact, it is often assumed that experimental data are located in specific paragraphs of the text, e.g. Experiments or Results and discussion in (Hawizy et al., 2011). We aim at locating them without any such assumption. However, we are aware that going through all documents is a time-consuming process. The method to reduce the search space involves text mining and supervised machine learning approaches. Our aim in this first step is to obtain, from machine learning algorithms, the most efficient model, which could then be applied to any corpus of a domain where extracting units is a major concern. In section 4, we give more details on the different experiments that have been conducted with the purpose of obtaining this best model.
The second step, detailed in section 5, works on the textual fragments obtained from the learning model and focuses on the extraction of terms of units. As soon as the optimal location for discovering variant terms of units is defined, we first highlight in this location the ones to be extracted. After extracting potential candidates, we compute a new string distance measure in order to determine which of the known units in the OTR can match these variant terms and are thus potential synonyms to enrich the OTR.
HowtoExtractUnitofMeasureinScientificDocuments?
251
4 LOCATING UNITS
In the first step of the approach, our objective is to locate relevant data concerning units of measure. In the work of (Hawizy et al., 2011), the authors use a Regex-tagger based on regular expressions in order to capture the sentences where quantitative data, chemical entities and units appear. However, when the tagger is tested in (Jessop et al., 2011a), the recognition process fails because of typographic variations of units in the text. This observation has motivated our choice of supervised learning methods over regular expressions. Hence, our process consists in using a supervised classifier to determine whether a part of text is classified as containing "units" or "non-units". In order to obtain the best model to be applied on the corpus, we work on several training sets, each corresponding to a specific textual context. Thus, as soon as the process recognizes a unit annotation in a sentence of the text, we observe the context around it. We consider different case studies to build these training sets for supervised learning methods (a sketch of the context construction follows the list below):
the first case study takes into account only the sentence where at least one unit appears and is recognized according to the units defined in the OTR;
the second and third case studies take into account, respectively, one sentence before and one sentence after the one where at least one unit appears;
the fourth and fifth case studies take into account, respectively, two sentences before and two sentences after the one where at least one unit appears.
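To make the construction of these training contexts concrete, the following is a minimal Python sketch, under the assumption that each document is available as an ordered list of sentences; the names (context_windows, has_unit) are illustrative and not taken from our implementation.

```python
# A minimal sketch of building the training contexts of the case studies,
# assuming `sentences` is the ordered list of sentences of a document and
# `has_unit(s)` flags sentences where at least one OTR unit term occurs.
def context_windows(sentences, has_unit, before=0, after=0):
    """Return the fragments kept for one case study: each sentence
    containing a unit, extended by `before`/`after` neighbour sentences."""
    fragments = []
    for i, sentence in enumerate(sentences):
        if has_unit(sentence):
            start = max(0, i - before)
            end = min(len(sentences), i + after + 1)
            fragments.append(" ".join(sentences[start:end]))
    return fragments

# First case study: the unit sentence alone.
# xp1 = context_windows(sentences, has_unit)
# Second and third case studies: one sentence before / after, and so on.
# xp2 = context_windows(sentences, has_unit, before=1)
# xp3 = context_windows(sentences, has_unit, after=1)
```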
The next subsections describe the tasks of the data preparation and transformation process and explain our choices in order to obtain the model.
4.1 Preparing Data
Data preparation involves text processing and text transformation tasks. The text processing sub-step consists of:
text segmentation, in order to generate a set of sentences, which must deal with our specific issues, e.g. not starting a new sentence when units containing dot separators are encountered;
text cleanup, which removes punctuation and spe-
cial characters from text, except those involved in
units;
text tokenization, which splits up a string of char-
acters into a set of tokens;
text reducing, which prunes away tokens contain-
ing junk words according to a list of stop words;
text tagging, which annotates the units in the text.
Text tagging relies on comparing the terms denoting the referenced units in the OTR with the tokens of each sentence. As soon as the process recognizes a unit or a part of a unit, it is tagged; a minimal sketch of this tagging sub-step is given below. Our main purpose is to obtain tagged data used to test our different case studies with learning algorithms.
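The tagging sub-step can be sketched as follows, assuming the OTR terms are available as a set of strings and that matching is done on whole tokens; the <unit> tag format and the function name are illustrative only.

```python
# A minimal sketch of text tagging: every token matching a term denoting
# a unit referenced in the OTR is tagged. `otr_terms` stands for the set
# of unit terms of the OTR.
def tag_units(sentence, otr_terms):
    tagged = []
    for token in sentence.split():
        if token in otr_terms:
            tagged.append("<unit>" + token + "</unit>")
        else:
            tagged.append(token)
    return " ".join(tagged)

print(tag_units("permeability was 3 amol/(m.s.Pa) at ambient conditions",
                {"amol/(m.s.Pa)", "Pa", "mol"}))
# -> permeability was 3 <unit>amol/(m.s.Pa)</unit> at ambient conditions
```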
4.2 Transforming Data
As soon as we have obtained our training data, we can proceed with text transformation, which consists in representing a text according to the words it contains and their occurrences in the sentences. The representative words (features) form the bag-of-words according to a pre-selected threshold (e.g. words appearing more than once), and we compute their occurrences in each sentence according to three word weighting measures:
three word weighting measures:
Term Frequency (TF), which considers that the
word is more important if it appears several times
in sentences;
Term Frequency-Inverse Document Frequency
(TF.IDF (Hiemstra, 2000)), which considers that
the word is more important if it appears in less
sentences;
Okapi BM25 (Jones et al., 2000), which takes also
into consideration the sentences length where the
word appears to define its relevancy.
Sentences become vectors that constitute our training corpus, used to define, with supervised learning methods, whether we are in a "unit" or "non-unit" context.
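As an illustration, the TF and TF.IDF representations can be obtained with off-the-shelf vectorizers, for instance in scikit-learn (an assumption of this sketch; Okapi BM25 has no standard vectorizer there and is therefore omitted):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["water permeability measured in amol/(m.s.Pa)",
             "samples were stored at constant relative humidity",
             "film thickness was measured for each sample"]

# min_df sets the bag-of-words threshold (e.g. min_df=2 keeps only words
# appearing more than once); min_df=1 here because the toy corpus is tiny.
tf = CountVectorizer(min_df=1).fit_transform(sentences)     # TF weights
tfidf = TfidfVectorizer(min_df=1).fit_transform(sentences)  # TF.IDF weights
print(tf.shape, tfidf.shape)  # each sentence is now a feature vector
```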
4.3 Learning Model
Our five case studies, performed under those three weight-based measures, are run with several learning algorithms. Our aim is to obtain exhaustive experiments and results in order to conclude about the best model for our issue. Various learning algorithms have been tested in our experiments:
the Naive Bayes classifier and the Discriminative Multinomial Naive Bayes (DMNB) classifier;
the J48 decision tree classifier;
Sequential Minimal Optimization (SMO), which is a Support Vector Machine (SVM) classifier.
Our choices are motivated by comparing the behavior of these widely known learning algorithms on a corpus containing many quantitative data.
KDIR2013-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
252
Naive Bayes (John and Langley, 1995) is competitive in computational efficiency. Decision tree classifiers (Kohavi and Quinlan, 2002) are known to obtain good classification results but are less competitive in speed of execution. SMO (Platt, 1999) is a discriminative classifier known to behave efficiently on text classification, and DMNB is a recent text classification algorithm (Su et al., 2008) which performs competitively with discriminative classifiers in terms of accuracy without losing the computational efficiency of Naive Bayes classifiers.
The conclusion on the best model is drawn after paying particular attention to several key points:
training time and best classification balance;
Precision, Recall and F-measure results; we focus on recall, which means we want to maximize the fraction of relevant sentences that are retrieved, without losing too much precision in the results;
the confusion matrix, which compares the results of the classifier under test with trusted external judgements.
As we want to estimate how accurately the model of each classifier will perform in practice, we use 10-fold cross-validation: the original sample is randomly partitioned into 10 equal-size subsamples. One subsample is used as the validation data for testing the model while the other subsamples are used as training data. This process is repeated 10 times, with each subsample used exactly once as the validation data. The averaged result produces the estimate of the model. Using cross-validation is a crucial task in order to avoid the "overfitting" effect of the model.
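The evaluation protocol can be sketched as follows with scikit-learn, under two assumptions: DMNB is a Weka classifier with no scikit-learn counterpart, so MultinomialNB stands in for the Bayesian family here, and texts/labels are placeholders for our real training instances.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder training data: 1 = "unit" context, 0 = "non-unit" context.
texts = ["measured at 3 amol/(m.s.Pa)", "the film was cast on glass"] * 50
labels = [1, 0] * 50

X = TfidfVectorizer().fit_transform(texts)
for clf in (MultinomialNB(), DecisionTreeClassifier(), LinearSVC()):
    # 10-fold cross-validation, scored on the recall of the "unit" class.
    scores = cross_val_score(clf, X, labels, cv=10, scoring="recall")
    print(type(clf).__name__, round(scores.mean(), 2))
```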
5 EXTRACTING UNITS
In the second step of the approach, we work on tex-
tual fragments obtained from the first step. We have
significantly reduced the search space and need now
to detect, extract and identify variant terms of units or
new units. It involves working on textual fragments
as follows:
detecting terms of units inside textual fragments:
the tags used in the annotation process of the first
step are our entry point;
extracting terms (or fragment of terms of units)
from the entry point: both sides of the tag are
scanned and the process stops when it encounters
a common word of the dictionary;
identifying variant terms of units or new units us-
ing the new string distance measure.
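The extraction scan can be sketched as follows, assuming the sentence is already tokenized, the index of the tagged token is known and a set of common dictionary words is available; all names are illustrative.

```python
# Expand left and right from the tagged token (index i) and stop on each
# side at the first common dictionary word; the tokens in between form
# the candidate term of unit.
def extract_candidate(tokens, i, dictionary):
    left = i
    while left > 0 and tokens[left - 1].lower() not in dictionary:
        left -= 1
    right = i
    while right + 1 < len(tokens) and tokens[right + 1].lower() not in dictionary:
        right += 1
    return tokens[left:right + 1]

tokens = "permeability was 10e-14 cm3/m.s.Pa at ambient conditions".split()
print(extract_candidate(tokens, 3, {"permeability", "was", "at", "ambient"}))
# -> ['10e-14', 'cm3/m.s.Pa']
```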
The identification process relies on a new string distance measure, adapted from the Damerau-Levenshtein (DL) distance (Damerau, 1964). The DL distance between two strings is defined as the minimum number of edits needed to transform one string into the other, the edit operations being insertion, deletion or substitution of a single character, and transposition of two adjacent characters. The distance can then be normalized using the approach of (Maedche and Staab, 2002):
$$SM_{DL}(u_1, u_2) = \max\left[0;\ \frac{\min(|u_1|,|u_2|) - DL(u_1,u_2)}{\min(|u_1|,|u_2|)}\right] \in [0;1]$$
The similarity measure is computed and the higher it is, the closer the variant term of units is to the units of the OTR.
Example 1. Let us consider the similarity between amol/(m.s.Pa) and amol/m.sec.Pa according to the classical DL distance: 2 characters are removed ("(" and ")") and 2 characters are inserted ("e" and "c"). The distance between those units, $D_c$, is therefore 4, and since both terms contain 13 characters, the similarity normalized according to $D_c$ is:

$$SM_{D_c}(\text{amol/(m.s.Pa)}, \text{amol/m.sec.Pa}) = \max\left[0;\ \frac{13-4}{13}\right] = 0.69$$
We adapted this distance by considering the blocks of characters forming the terms of units rather than single characters. Thus, in our previous example, the process considers the blocks amol, m, s and Pa and compares each of them with each block of characters from the units referenced in the OTR. The edit distance adapted to blocks becomes $D_b = 1$, since only the block sec varies, and:

$$SM_{D_b}(\text{amol/(m.s.Pa)}, \text{amol/m.sec.Pa}) = \max\left[0;\ \frac{4-1}{4}\right] = 0.75$$
This new string distance measure is more accurate and more relevant for assessing the similarity of units of measure.
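The following sketch implements both the character-level and the block-based similarity, under the assumption that a block is a maximal run of alphanumeric characters (separators such as /, . and parentheses are ignored, as in our process); it reproduces the values 0.69 and 0.75 of the example above.

```python
import re

def damerau_levenshtein(a, b):
    """DL edit distance over any two sequences (strings or lists of
    blocks): insertion, deletion, substitution and transposition of
    adjacent elements each cost 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def similarity(a, b):
    """Normalized measure SM = max(0, (min(|a|,|b|) - DL(a,b)) / min(|a|,|b|))."""
    shortest = min(len(a), len(b))
    return max(0.0, (shortest - damerau_levenshtein(a, b)) / shortest)

def blocks(term):
    """Split a term of unit into blocks of alphanumeric characters."""
    return re.findall(r"[A-Za-z0-9-]+", term)

u1, u2 = "amol/(m.s.Pa)", "amol/m.sec.Pa"
print(round(similarity(u1, u2), 2))                  # character level: 0.69
print(round(similarity(blocks(u1), blocks(u2)), 2))  # block level: 0.75
```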
6 EXPERIMENTS AND RESULTS
6.1 Data Description
Experiments have been conducted on a real-world dataset composed of 115 scientific documents in the food science domain, which represents a set of more than 35,000 sentences. The positive corpus, where at least one unit appears, represents more than 5,000 sentences. We
HowtoExtractUnitofMeasureinScientificDocuments?
253
use a list of 211 terms denoting units of the OTR in order to conduct unit identification. During the experiments, we can set the number of instances that constitute our training data. The results delivered in this paper rely on a training set of 2,000 instances randomly chosen from the corpus, balanced between positive and negative instances. Depending on the case study, the bag-of-words varies from 3,000 to 4,800 features.
6.2 Learning the Best Location
6.2.1 Learning under Various Supervised
Algorithms
We have tested our five case studies. Only the 3 most relevant cases are reported in Table 1. Each case is represented by a symbol in order to keep the table readable:
XP1: only the sentence where at least one unit
appears;
XP2: two sentences after the one where at least
one unit appears;
XP3: two sentences before the one where at least
one unit appears.
Table 1 reports the results, all weight-based measures included, according to the case studies. This first table helps us determine which case provides the most relevant context.
Table 1: Results of "Unit" instances: Precision (P), Recall (R) and F-measure (F) are given for each case study.

     | Dec. Tree J48  | Naive Bayes    | DMNB           | SMO
     | P    R    F    | P    R    F    | P    R    F    | P    R    F
XP1  | 0.99 0.87 0.93 | 0.83 0.93 0.88 | 0.95 0.96 0.95 | 0.99 0.99 0.99
XP2  | 0.99 0.92 0.96 | 0.95 0.77 0.85 | 0.93 0.96 0.95 | 0.99 0.97 0.99
XP3  | 0.99 0.92 0.95 | 0.77 0.98 0.86 | 0.94 0.96 0.95 | 0.99 0.97 0.98
Table 2: Results of "Unit" instances: Precision (P), Recall (R) and F-measure (F) are given for each weight-based measure and for the boolean matrix.

        | Dec. Tree J48  | Naive Bayes    | DMNB           | SMO
        | P    R    F    | P    R    F    | P    R    F    | P    R    F
Boolean | 0.99 0.87 0.93 | 0.83 0.93 0.88 | 0.95 0.96 0.95 | 0.99 0.99 0.99
TF      | 0.99 0.86 0.92 | 0.69 0.85 0.76 | 0.95 0.96 0.95 | 0.84 0.90 0.87
TF.IDF  | 0.99 0.86 0.92 | 0.69 0.85 0.76 | 0.95 0.96 0.95 | 0.84 0.90 0.87
Okapi   | 0.99 0.86 0.92 | 0.69 0.86 0.76 | 0.95 0.96 0.95 | 0.77 0.88 0.82
We take several parameters into consideration in order to conclude about the most relevant context: recall, overall classification, F-measure, the stability of the results across the weight-based measures presented in section 6.2.2, and the confusion matrix results. We are particularly interested in recall, since our aim is to retrieve the most relevant instances of the "unit" class without losing too much precision, which is reflected in the F-measure. Firstly, we can say that Naive Bayes returns F-measure rates ranging from 0.85 to 0.88. The decision tree (i.e. J48) returns better rates, from 0.93 to 0.96. DMNB and SMO return even better values (0.95 to 0.99).
Secondly, we can observe that a larger context (i.e. composed of two sentences, XP2 and XP3) does not improve the results. We can conclude that the smallest context, based on one sentence (i.e. XP1), is enough for unit extraction. This allows us to significantly reduce the search space while remaining in an optimal context of discovery.
6.2.2 Learning under Various Weight-based Measures
Table 2 reports the results on the XP1 context, highlighted above, according to the three weight-based measures and the boolean matrix. This second table allows us to conclude whether the learning algorithms are stable across the weight-based measures.
As in the previous analysis, we take recall, overall classification and F-measure into consideration. Firstly, we can say that, all weight-based measures included, Naive Bayes returns rates declining from 86.9% (boolean matrix) to 73.4% (Okapi measure), and the overall classification of the decision tree J48 falls slightly, to around 92-93%. SMO loses almost 20%, with a rate declining from 99.6% to 80.9%. DMNB stays constant at 95.5% regardless of the weight-based measure. The DMNB model stays constant on precision, recall and F-measure and achieves a very good recall of 0.96.
6.2.3 Discussion
In this first step, we have studied how supervised learning methods can be successfully used in order to optimize the process of locating units drowned in unstructured scientific documents. For this, we have compared several algorithms and weight-based measures before concluding that the best context for discovering variant terms of units, and even new units (cf. next sub-section), is the sentence where at least one unit appears. This result is consistently confirmed, regardless of the weight-based measure, with the DMNB algorithm, which achieves a recall of 0.96 and a precision of 0.95. Thus, we have successfully overcome the time-consuming location issue.
KDIR2013-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
254
Table 4: Identifying variant terms of units.

Variant term               | Reference                  | SM_Dc | SM_Db
10e10 (cm3.m-1.sec-1.Pa-1) | 10e10.cm3.m-1.sec-1.Pa-1   | 0.87  | 1
10e-14(cm3/m.s.Pa)         | 10e-14.cm3/(m.s.Pa)        | 0.89  | 1
10e-16cm3.cm/cm.cm2.s.Pa   | (10e-16cm3.cm)/(cm2.s.Pa)  | 0.76  | 0.8
10e18 (mol.m/Pa.sec.m2)    | 10e18.mol.m/(Pa.sec.m2)    | 0.87  | 1
amol.m-1.s-1.Pa-1          | amol.s-1.m-1.Pa-1          | 0.88  | 0.75
amol/m.s.Pa                | amol/(m.s.Pa)              | 0.84  | 1
amol/m.sec.Pa              | amol/(m.s.Pa)              | 0.69  | 0.75
cm3.um/m2.d.kPa            | cm3.µm/(m2.d.kPa)          | 0.77  | 0.8
Table 3: Discovering new terms of units.

Class                      | Nb | Relevancy | Terms of units
Singular Unit              |  9 | 55.5%     | mol, s, m, g, d, etc.
Unit Exponentiation        |  7 | 85.7%     | m2, m-1, d-1, cm3, s-1, m-2, etc.
Unit Division or Multi.    | 39 | 43.6%     | mol.L^-1, mol/L, mL/kg/h, etc.
Unit Multiple or Submulti. |  4 | 50.0%     | amol, mL, etc.
Total                      | 59 | 50.8%     | All previous units
Besides that, we have performed some extra experiments in order to know what kind of features are involved in the learning methods. The question asked is: are the terms of units referenced in the OTR the best features used in the learning step? To answer this question, we have removed the "unit" features from the bag-of-words and relaunched the DMNB learning algorithm in the same context. The recall and the precision drop slightly but remain quite honorable, at 0.81 and 0.76 respectively. This result shows that other features are involved in the learning process. We have then conducted a feature classification according to their different weight-based measures. The top-k classification with k=10 shows that the top features are not only units but also nouns, verbs or adverbs describing a context where units may appear (e.g. water, temperature, RH for relative humidity, thickness, modified, solutions, samples, concentration, min, permeability). We can say that units and other types of terms are closely related in the overall expression of experimental data.
6.3 Extracting Variant Terms of Units
and Discovering New Ones
From the results obtained in step 1, we conclude that the best location to work in is the sentence where at least one unit appears. We apply the process according to different thresholds of the new similarity measure adapted to blocks, and compare those experiments to the results obtained with the classical string distance measure described in section 5.
In our experiments, we assume that a similarity measure between 0.2 and 0.4 is favourable for discovering new terms of units. Indeed, these values represent the case where at least one part of the blocks, a subset of a unit, is recognized by the new string distance measure. Let us consider the derived unit g.m-1.d-1.Pa-1. This unit is referenced in the OTR and used in the recognition process. In our approach, we consider the 4 blocks g, m-1, d-1 and Pa-1. In the recognition process, the base unit g, found in the text and not yet referenced in the OTR, is recognized against the derived unit with a similarity measure equal to 0.2. The g block is identified. It is then relevant to propose this new base unit for introduction into the OTR. Considering this assumption, we obtained 59 possible new units for the four classes of the Unit concept, organized as described in Table 3.
A similarity measure between 0.5 and 1 is favourable for identifying variant terms of units. Indeed, the closer the similarity measure is to 1, the more alike the terms of units are. We obtained 17 variant terms of units, 8 of which are relevant variants that can be introduced in the OTR. The relevant variant terms of units are described in Table 4.
In this second step, we have successfully detected and extracted both relevant variant terms of units and new units. We can potentially enrich the OTR with 38 terms of units, i.e. an enrichment of 18% (the current OTR references 211 units). We have designed a new string distance measure in order to compare variant and new terms of units to the ones referenced in the OTR. We compare blocks of units rather than characters, which proves to be more efficient for our issue. We have noticed that this process is quite relevant for identifying new terms of units but falls short of our expectations concerning the identification of variant terms of units.
In our process of identifying blocks, we currently ignore all separators between those blocks, which are very specific in the case of units (e.g. /, ., (, ), x). Those characters have semantic meanings that need to be taken into account in the extraction and identification process in order to improve the results of the similarity measure. We intend to carry out this work given these promising results. Another future work is to complete the process with a new step in order to identify what
HowtoExtractUnitofMeasureinScientificDocuments?
255
kind of terms of units have been discovered (e.g. a temperature unit, a relative humidity unit, a permeability unit) and to allow the automatic integration of variant or new units directly into the OTR.
7 CONCLUSIONS AND FUTURE
WORK
This paper presents our overall two-step approach to identify variant terms of units or new units in unstructured scientific documents. We have successfully demonstrated how supervised learning methods can help reduce the search space of terms of units drowned in textual information: we have reduced the search space by almost 86%. For this purpose, we have used many algorithms and weight-based measures to prove the relevance of our approach. We have also presented our first results on detecting and extracting variant terms and new units, using a new string distance measure adapted to our unit identification issue. As presented in the paper, this second step needs to be deepened. Our approach addresses the entire issue: location, extraction and identification of units. Further work on the semantic identification of variant terms and new units should be carried out in order to automatically populate the OTR.
REFERENCES
Cohen, W. W., Ravikumar, P., and Fienberg, S. E. (2003).
A comparison of string distance metrics for name-
matching tasks. In Proc. of the Workshop on Informa-
tion Integration on the Web (IIWeb-03), volume 47.
Cunningham, H., Maynard, D., Bontcheva, K., and Tablan,
V. (2002). GATE: A framework and graphical de-
velopment environment for robust NLP tools and ap-
plications. In Proceedings of the 40th Annual Meet-
ing of the Association for Computational Linguistics,
Philadelphia, PA, USA.
Damerau, F. J. (1964). A technique for computer detec-
tion and correction of spelling errors. Commun. ACM,
7(3):171–176.
Hawizy, L., Jessop, D., Adams, N., and Murray-Rust, P.
(2011). ChemicalTagger: a tool for semantic text-
mining in chemistry. Journal of cheminformatics,
3(1):17.
Hiemstra, D. (2000). A probabilistic justification for using
tf x idf term weighting in information retrieval. Int. J.
on Digital Libraries, 3(2):131–139.
Hignette, G., Buche, P., Couvert, O., Dibie-Barthélemy, J., Doussot, D., Haemmerlé, O., Mettler, E., and Soler, L. (2008). Semantic annotation of Web data applied to risk in food. International Journal of Food Microbiology, 128(1):174–180.
Jessop, D. M., Adams, S. E., and Murray-Rust, P. (2011a).
Mining chemical information from open patents.
Journal of cheminformatics, 3(1):40.
Jessop, D. M., Adams, S. E., Willighagen, E. L., Hawizy,
L., and Murray-Rust, P. (2011b). OSCAR4: a flexi-
ble architecture for chemical text-mining. Journal of
cheminformatics, 3(1):1–12.
John, G. H. and Langley, P. (1995). Estimating continuous
distributions in bayesian classifiers. In Proc. of the
conf. on Uncertainty in artificial intelligence, pages
338–345.
Jones, K. S., Walker, S., and Robertson, S. E. (2000). A
probabilistic model of information retrieval: develop-
ment and comparative experiments - part 1. Inf. Pro-
cess. Manage., 36(6):779–808.
Kohavi, R. and Quinlan, J. R. (2002). Data mining tasks and
methods: Classification: decision-tree discovery. In
Handbook of data mining and knowledge discovery,
pages 267–276. Oxford University Press, Inc.
Maedche, A. and Staab, S. (2002). Measuring similar-
ity between ontologies. In Knowledge Engineering
and Knowledge Management: Ontologies and the Se-
mantic Web, volume 2473 of LNCS, pages 251–263.
Springer.
Maynard, D., Li, Y., and Peters, W. (2008). NLP techniques for term extraction and ontology population. In Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pages 107–127.
Platt, J. C. (1999). Fast training of support vector machines
using sequential minimal optimization. In Advances
in kernel methods, pages 185–208. MIT Press.
Rijgersberg, H., van Assem, M., and Top, J. (2013). Ontol-
ogy of units of measure and related concepts. Seman-
tic Web.
Rijgersberg, H., Wigham, M., and Top, J. (2011). How se-
mantics can improve engineering processes: A case of
units of measure and quantities. Advanced Engineer-
ing Informatics, 25(2):276–287.
Su, J., Zhang, H., Ling, C. X., and Matwin, S. (2008). Dis-
criminative parameter learning for bayesian networks.
In Proc. of the int. conf. on Machine learning, pages
1016–1023.
Thompson, A. and Taylor, B. N. (2008). Guide for the use
of the international system of units (SI).
Touhami, R., Buche, P., Dibie-Barthélemy, J., and Ibanescu, L. (2011). An ontological and terminological resource for n-ary relation annotation in web data tables. On the Move to Meaningful Internet Systems: OTM 2011, pages 662–679.
Van Assem, M., Rijgersberg, H., Wigham, M., and Top, J.
(2010). Converting and annotating quantitative data
tables. The Semantic Web–ISWC 2010, pages 16–31.
Willems, D. J., Rijgersberg, H., and Top, J. (2012). Identify-
ing and extracting quantitative data in annotated text.
SWAIE.
Wimalasuriya, D. C. and Dou, D. (2010). Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science, 36(3):306–323.
KDIR2013-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
256