The GENIE Project
A Semantic Pipeline for Automatic Document Categorisation
Angel L. Garrido, Maria G. Buey, Sandra Escudero, Alvaro Peiro, Sergio Ilarri and Eduardo Mena
IIS Department, University of Zaragoza, Zaragoza, Spain
Keywords:
Knowledge Management, Text Mining, Ontologies, Linguistics.
Abstract:
Automatic text categorisation systems are receiving growing interest, not only because of their use in documentation environments but also because of their possible application to the proper tagging of documents on the Web. Many approaches have been proposed to address this problem, using statistical techniques, natural language processing tools, ontologies, and lexical databases. Nevertheless, there have not been many empirical evaluations comparing the influence of the different tools used to solve these problems, particularly in a multilingual environment. In this paper we propose a multi-language rule-based pipeline system for automatic document categorisation, and we empirically compare the results of applying techniques that rely on statistics and supervised learning with the results of applying the same techniques supported by smarter tools based on language semantics and ontologies, using several corpora of documents for this purpose. GENIE is being applied in real environments, which shows the potential of the proposal.
1 INTRODUCTION
In almost any public or private organization that man-
ages a considerable amount of information, activi-
ties related to text categorisation and document tag-
ging can be found. To do this job, large organiza-
tions have documentation departments. However, the
large amount of information in text format that organizations usually accumulate cannot be properly processed and documented by these departments. Besides, the manual labeling carried out by their staff is subject to errors due to the subjectivity of the individuals. That is why a tool that automates categorisation tasks would be very useful, and it would help to improve the quality of the searches that are later performed over the data.
To perform these tasks, software based on statistics and word usage frequencies can be used, and machine learning systems are also very common (Sebastiani, 2002). However, we think that other kinds of tools, capable of dealing with aspects related to Natural Language Processing (NLP) (Smeaton, 1999), are also necessary to complement and enhance the results provided by these techniques.
Moreover, to perform any task related to the processing of text documents, it is highly recommended to capture the know-how of the organization, so it is advisable to manage ontologies (Gruber et al., 1993) and semantic tools such as reasoners (Mishra and Kumar, 2011) to make knowledge explicit and to reason over it, respectively. Furthermore, it is very common for organizations to have their own catalog of labels, known as a thesaurus (Gilchrist, 2003), so it is important that the system is able not only to obtain keywords from the text, but also to relate them to the set of thesaurus descriptors.
The same issue is also found on the Web, where much of the information is in text format. Providing tools capable of automatically tagging web pages is very helpful to improve information search and retrieval with search engines, and it is still an open problem today (Atkinson and Van der Goot, 2009; Chau and Yeh, 2004) due, among other reasons, to the semantic and linguistic difficulties involved in processing such content.
Our purpose is to bring these techniques together into an architecture that enables the automatic classification of texts, with the particular feature that it exploits different semantic methods as a new element in text categorization to support the typical techniques that rely on statistics and supervised learning. Although there is some research in text categorization that considers Spanish texts as examples, there are no tools especially focused on the Spanish language. Moreover, the proposed system
has been implemented to be open, allowing the analysis of other languages, such as English, French, or Portuguese, to be added.
Other important characteristics of the architecture are that it has been proposed as a pipeline system and that it has been implemented with different modules. We consider these important features because a pipeline system gives us the chance to control the results at each phase of the process, and the modular structure allows us to easily upgrade individual components. For example, geographic or lexical databases change over time, and our modular architecture easily accommodates these changes.
The fact that the system is implemented as different modules is also interesting when analyzing a text. Sometimes we may not want to use all the modules that make up the architecture to achieve a desired result. For example, we may want to extract only statistical information from the words present in a text, but nothing related to their semantics. It is also possible that we need to change the order of the modules a text passes through, depending on the type of analysis we want to perform. For these reasons it is important to consider a modular architecture: it makes the system easy to use and facilitates improving it over time.
This paper provides two main contributions:
Firstly, we present a tool called GENIE, whose
general architecture is valid for text categorisation
tasks in any language. This system has been in-
stalled and tested in several real environments us-
ing different datasets. The set-up of our algorithm is rule-based, and the inference relies on the document's features as well as on the linguistic content of the text and its meaning.
Secondly, we experimentally quantify the influ-
ence of using linguistic and semantic tools when
performing the automatic classification, working
on a real case with Spanish texts previously classi-
fied by a professional documentation department.
The rest of this paper is structured as follows. Sec-
tion 2 explains the general architecture of the pro-
posed categorisation system. Section 3 discusses the
results of our experiments with real data. Section 4
analyzes other related works. Finally, Section 5 pro-
vides our conclusions and future work.
2 PROPOSED ARCHITECTURE
In this section, we explain the general architecture of
the proposed system as well as the corresponding
working methodology. The system relies on the ex-
istence of several resources. First, we will describe
these resources, and then we will explain in detail the
classification process (see Figure 1).
2.1 Resources
Regarding resources, we have to consider both static
data repositories and software tools:
Thesaurus. A thesaurus is a list of words and
a set of relations among them, used to classify
items (Gilchrist, 2003). We use its elements as
the set of tags that must be used to categorize the
set of documents. Examples of thesaurus entries
are words like HEALTH, ACCIDENT, FOOT-
BALL, BASKETBALL, REALMADRID, CIN-
EMA, JACK NICHOLSON, THEATER, etc. The
terms can be related. For example, FOOTBALL
and BASKETBALL could depend hierarchically
on SPORTS. Each document may be assigned a variable number of thesaurus terms during the categorisation process.
Gazetteer. A gazetteer is a geographic directory
containing information about places and place
names (Goodchild and Hill, 2008). In our system,
it is used to identify geographic features.
Morphological Analyzer. It is an NLP tool whose
mission is the identification, analysis and descrip-
tion of the structure of a set of given linguistic
units. This analyzer consists of a set of differ-
ent analysis libraries, which can be configured
and used depending on the working language, and
a custom middleware architecture which aims to
store all the different analysis results in structures
that represent the desired linguistic units, such as
words (with their morphological and lexical in-
formation), sentences (with their syntax and de-
pendency trees) and texts. With this approach we
can provide the same entities to the other modules
that work with NLP, resulting in an architecture
that can work with multiple analysis tools and lan-
guages.
Lexical Database. A lexical database is a lexi-
cal resource which groups words into sets of syn-
onyms called synsets, includes semantic relations
among them, and provides definitions. Examples are WordNet (Miller, 1995) and EuroWordNet (Vossen, 1998).
Stop Word List. This is a list of frequent words
that do not contain relevant semantic informa-
tion (Wilbur and Sirotkin, 1992). In this set we
may include the following types of words: arti-
cles, conjunctions, numbers, etc.
WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies
162
Figure 1: General pipeline of GENIE, the proposed text categorisation system.
Knowledge Base. This refers to the explicit repre-
sentation of knowledge related to the topics cov-
ered in the documents that have to be catalogued.
As a tool for knowledge representation in a soft-
ware system, we use ontologies. An ontology is a
formal and explicit specification of a shared con-
ceptualization (Gruber et al., 1993) that provides
a vocabulary of classes and relations to describe
a particular area, supporting automatic inferences
by using a reasoner (Mishra and Kumar, 2011).
The idea is to represent in these ontologies the
concepts that could help to label a document in a
given context, and to populate the ontologies with
as many instances as possible.
Statistical Information. This consists of a set of
files with information about the use frequency of
each word, related to the attributes of the text and
to the set of elements in the thesaurus. For example, the word “ONU” (the Spanish acronym for the United Nations) appears more frequently in documents of type “International”, and it is related to the descriptor INTERNAT in a thesaurus used in the documentation department of a newspaper we have worked with. These frequencies allow us to estimate whether a given text can be categorized with a particular element of the thesaurus.
Relationships Table. This table relates items in the
Gazetteer and knowledge base concepts with the-
saurus elements. It may be necessary in an organi-
zation because the concepts stored in the seman-
tic resources available may not match the labels
in the thesaurus that must be considered for clas-
sification. The construction of this table could be
manual or automatic, using any machine learning
method.
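To illustrate the role of this table, the following minimal sketch (in Python, with purely hypothetical concept identifiers and descriptors, not the ones actually used by GENIE) maps concepts coming from the knowledge base or the Gazetteer to thesaurus descriptors:

    # Illustrative Relationships Table: concepts detected through the knowledge
    # base or the Gazetteer are mapped to descriptors of the thesaurus.
    # All identifiers below are hypothetical examples.
    RELATIONSHIPS = {
        "onto:BasketballTeam": ["BASKETBALL", "SPORTS"],
        "onto:FilmDirector":   ["CINEMA", "CULTURE"],
        "geonames:3117735":    ["MADRID", "SPAIN"],
    }

    def thesaurus_tags(concepts):
        """Translate the concepts found in a document into thesaurus descriptors."""
        tags = set()
        for concept in concepts:
            tags.update(RELATIONSHIPS.get(concept, []))
        return sorted(tags)

    # thesaurus_tags(["onto:FilmDirector"]) returns ['CINEMA', 'CULTURE']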
As we will show in the experimental evaluation,
the use of some resources is optional, leading to dif-
ferent results in terms of the expected performance
of the system. This system could be used with dif-
ferent languages by changing the language-dependent
resources, i.e. the Gazetteer, the NLP tool, the lexical
database, and the stop word list.
2.2 Process Pipeline
We have used a pipeline scheme with separate stages. Each stage is associated with only one type of process, and the stages communicate with each other through files. Although it is a pipeline system, the process can be configured so that each task can be activated or deactivated, depending on whether we want the text document to go through certain phases or not (a minimal configuration sketch is shown after the list below). For example, we may want to use the pipeline without activating the Geographical Classifier. This choice has three purposes:
1. The early stages perform a more general classi-
fication, and later phases make more specific la-
beling that requires more precise resources. We
have verified, through experimental evaluation,
that taking advantage of a filter to select the most
appropriate resources for the later stages improves
the results.
2. Separating each stage simplifies control for eval-
uation. We know that there are certain tasks that
could be parallelized, but the aim is to analyze the
results in the best possible way, rather than to pro-
vide an optimized algorithm.
3. We have more freedom to add, delete or modify any of the stages of the pipeline if they are independent. If we want to use a different tool in any of the stages, changing it is very easy when there is minimal coupling between phases.
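As a minimal sketch of this configurability (the stage names and the configuration format below are our own illustrative assumptions, not necessarily those of the actual implementation), the activation of the stages could be expressed as follows:

    # Hypothetical configuration of the pipeline: every stage can be switched on
    # or off independently, e.g. a run without the Geographical Classifier.
    PIPELINE_CONFIG = {
        "preprocessing":           True,
        "attribute_classifier":    True,
        "statistical_classifier":  True,
        "geographical_classifier": False,   # disabled in this run
        "ontological_classifier":  True,
    }

    def run_pipeline(document, stages, config):
        """Apply, in pipeline order, only the stages enabled in the configuration."""
        for name, stage in stages:          # stages: list of (name, callable) pairs
            if config.get(name, False):
                document = stage(document)
        return document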
TheGENIEProject-ASemanticPipelineforAutomaticDocumentCategorisation
163
Our system works over a set of text documents,
but we have to note that each of them could have a
variable number of attributes (author, title, subtitle,
domain, date, section, type, extension, etc.), that we
will use during the categorisation process. These at-
tributes vary according to the origin of the document:
a digital library, a database, a website, etc. Numeric
fields, dates, strings, or even HTML tags may be per-
fectly valid attributes to the system. We could also
consider as attributes those elements that are specific
to the structure of the type of document, such as hy-
perlinks (Shen et al., 2006) in the case of web pages.
As a very first stage, the system includes specific in-
terfaces to convert the original documents into XML
files with a specific field for plain text and others for
attributes.
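The following sketch shows what this first conversion step could look like in Python; the element and field names are illustrative assumptions, since the exact XML schema is internal to the system:

    # Wrapping a source document (e.g. a database row or a scraped web page,
    # represented here as a plain dict) into an XML input file with one field
    # for the plain text and one field per attribute.
    import xml.etree.ElementTree as ET

    def to_input_xml(doc):
        root = ET.Element("document")
        attrs = ET.SubElement(root, "attributes")
        for name in ("title", "author", "date", "section", "type", "extension"):
            if name in doc:
                ET.SubElement(attrs, name).text = str(doc[name])
        ET.SubElement(root, "text").text = doc["text"]
        return ET.tostring(root, encoding="unicode")

    xml_doc = to_input_xml({"title": "Film review",
                            "section": "culture",
                            "text": "This weekend is the best film debut..."})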
The tasks for the proposed automatic text categori-
sation system are:
1. Preprocessing of the text of the document, which
consists of three steps:
(a) Lemmatization. Through this process we ob-
tain a new text consisting of a set of words
corresponding to the lemmas (canonical forms)
of the words in the initial text. This process
eliminates prepositions, articles, conjunctions
and other words included in the Stop Words
List. All the word information (Part of Speech,
gender, number) is stored in the corresponding
structure, so it can be recovered later for future
uses.
(b) Named Entities Recognition (NER). Named en-
tities are atomic elements in a text represent-
ing, for example, names of persons, organiza-
tions, locations, quantities, monetary values, or
percentages (Sekine and Ranchhod, 2009). By
using a named entity extractor, this procedure
gets a list of items identified as named entities.
This extractor can be paired with a statistical
Named Entity Classification (NEC) in a first at-
tempt to classify the named entity into a pre-
defined group (person, place, organization) or
leave it undefined so the following tasks (Geo-
graphical Classifier) can disambiguate it.
(c) Keywords Extraction. Keywords are words selected from the text that are key elements for categorizing the document. We use the lemmatized form of such words and the TF/IDF algorithm (Salton and Buckley, 1988).
These processes produce several results that are
used in subsequent stages. The resources used in
this stage are the morphological analyzer, the Stop
Word List and the statistical data.
2. Attributes-Based Classifier. Taking advantage of
the attributes of each document, this rule-based process makes a first basic and general tag-
ging. For example, if we find the words “film
review” in the “title” field the system will infer
that the thesaurus descriptor CINEMA could be
assigned to this document. At the same time, it es-
tablishes the values of the attributes to be used for
the selection of appropriate resources in the fol-
lowing steps, choosing for instance an ontology
about cinema for the Ontological Classifier stage.
3. Statistical Classifier. Using machine learning
techniques (Sebastiani, 2002), the document text
is analyzed to try to find patterns that correspond
to data in the files storing statistical information.
This step is mainly useful to try to obtain labels
that correspond to the general themes of the doc-
ument. Trying to deduce whether a document is talking about football or basketball is a good example.
4. Geographical Classifier. By using the gazetteer,
named entities (NE) corresponding to geograph-
ical locations are detected. This stage is man-
aged by a rule-based system. Besides, it can
deal with typical disambiguation problems among
locations of the same name and locations whose
names match other NE (e.g., people), by using
the well-known techniques described in (Amitay
et al., 2004): usually there is only a single sense per discourse (so an ambiguous term is likely to mean only one of its senses when it is used multiple times), and place names appearing in the same
context tend to show close locations. Other im-
portant considerations that GENIE takes into ac-
count are to look at the population of the location
candidates as an important aspect to disambiguate
places (Amitay et al., 2004) and consider the con-
text where the text is framed to establish a list of
bonuses for certain regions (Quercini et al., 2010).
Other used techniques are to construct an N-term
window on both sides of the entity considered to
be a geographic term, as some words can con-
tribute with a positive or negative modifier (Rauch
et al., 2003), or to try to find syntactic structures
like “city, country” (e.g. “Madrid, Spain”) (Li
et al., 2002). Finally, using techniques explained
in (Garrido et al., 2013b), the system uses ontolo-
gies in order to capture information about impor-
tant aspects related to certain locations. For ex-
ample: most important streets, monuments and
outstanding buildings, neighborhoods, etc. This is useful when a text has no explicitly identified location. Besides, this stage also takes advantage of the results of previous stages. For example, if in the previ-
WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies
164
ous stages we got the descriptor EUROPE we can
assign higher scores to the results related to Eu-
ropean countries and major European cities than
to results related to locations in other continents.
The geographical tagging unit is very useful because, empirically, nearly 30% of the tags in our experimental context are related to locations (a rough sketch of how these disambiguation heuristics can be combined is given at the end of this section).
5. Ontological Classifier. To perform a detailed la-
beling, the knowledge base is queried about the
named entities and keywords found in the text.
If a positive response is obtained, it means that
the main related concepts can be used to label
the text. A great advantage is that these con-
cepts need not appear explicitly in the text, as they
may be inferred from existing ontological rela-
tions. If there is an ambiguous word, it can be
disambiguated (Resnik, 1999) by using the Lexi-
cal Database resource (for a survey on word sense
disambiguation, see (Navigli, 2009)). As soon as
a concept related to the text is found, the relations
stored in the Relationships Table are considered
to obtain appropriate tags from the thesaurus. As
explained before, the fact that at this phase we
have a partially classified document allows us to
choose the most appropriate ontologies for classi-
fication using configurable rules. For example, if the statistical classifier has already detected that the text is about the American basketball league, we will use a specific ontology to classify the document more accurately, finding out for instance the teams and the players, and we will not try to use any other resource. This particular ontology could be obtained and re-used from the Web. But if we had discovered that the text is about local politics, we would use another specific ontology to deduce the most appropriate tags. This ontology would probably be hand-made, or adapted from another similar ontology, because this kind of resource is difficult or impossible to find for free on the Web. So, our system is generic enough to accommodate the most appropriate ontologies (existing or hand-made) for the different topics covered in the texts.
The way to obtain the tags is to ask the ontology about the keywords and named entities, by using SPARQL (http://www.w3.org/TR/2006/WD-rdf-sparql-query-20061004/), a set of rules, and the relationship table to deduce the most suitable tags. The ontology does not behave as a simple bag-of-words, because it can contain concepts, relations and axioms, all of them very useful to infer the implicit topics in the text.
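As an illustration of this query step, the sketch below uses the rdflib Python library (our own choice for the sketch; the system only requires SPARQL support) to retrieve the ontology classes of any individual whose label matches a keyword or named entity; the returned classes would then be translated into thesaurus descriptors through the Relationships Table. The ontology file name and the example term are hypothetical.

    # Querying a domain ontology for the classes of the individuals whose label
    # matches a term (keyword or named entity) extracted from the text.
    from rdflib import Graph, Literal

    graph = Graph()
    graph.parse("basketball_league.owl", format="xml")   # hypothetical ontology

    QUERY = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?class WHERE {
        ?individual rdfs:label ?label ;
                    a ?class .
        FILTER (lcase(str(?label)) = lcase(str(?term)))
    }
    """

    def concepts_for(term):
        """Return the ontology classes of the individuals labelled with the term."""
        rows = graph.query(QUERY, initBindings={"term": Literal(term)})
        return [str(row[0]) for row in rows]

    # concepts_for("Pau Gasol") might return ['http://example.org/onto#Player'],
    # which the Relationships Table could then map to the descriptor BASKETBALL.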
In summary, the text categorization process that GE-
NIE performs consists of following each of the pro-
posed tasks that constitute the system’s pipeline. This
process begins with the preprocessing of the input text, which involves lemmatizing the text and extracting named entities and keywords from it. Then it analyzes a set of attributes that are
given with the text that is being analyzed in order to
extract the first basic and general labels. Afterwards,
it applies a statistical classification method based on
machine learning techniques to obtain labels that cor-
respond to the general themes of the document. Then
it applies a geographic classifier for the purpose of
identifying possible geographical references included
in the text. Finally, it applies an ontological classifier
in order to carry out a more detailed classification of
the text, which performs an analysis of named enti-
ties and keywords obtained from the text, consults the
appropriate ontology, and uses a lexical database to
remove possible ambiguities.
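For the geographical step in particular, the disambiguation heuristics described above (the population of the candidate locations, bonuses for the regions favoured by the context, and descriptors already obtained in earlier stages, such as EUROPE) can be combined into a single candidate score, as in the following rough sketch; the weights and candidate fields are illustrative assumptions, not the values used by GENIE:

    # Scoring gazetteer candidates for an ambiguous place name: a population
    # prior, a bonus for countries favoured by the publication context, and a
    # bonus when a descriptor from an earlier stage already points to the
    # candidate's continent.
    def score_candidate(candidate, context_bonuses, prior_tags):
        score = candidate["population"] ** 0.5
        score *= context_bonuses.get(candidate["country"], 1.0)
        if candidate["continent"] in prior_tags:      # e.g. EUROPE found earlier
            score *= 2.0
        return score

    candidates = [
        {"name": "Paris", "country": "FR", "continent": "EUROPE", "population": 2100000},
        {"name": "Paris", "country": "US", "continent": "AMERICA", "population": 25000},
    ]
    best = max(candidates, key=lambda c: score_candidate(c, {"FR": 1.5}, {"EUROPE"}))
    # best["country"] is "FR"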
3 EXPERIMENTAL EVALUATION
We have performed a set of experiments to test and compare the performance of our architecture with other tools. For this purpose, we have tested it in a real environment using three corpora of news previously labeled by the professional documentation departments of several major Spanish media: Heraldo de Aragón (http://www.heraldo.es/), Diario de Navarra (http://www.diariodenavarra.es/), and Heraldo de Soria (http://www.heraldodesoria.es/). The corpora contain 11,275, 10,200, and 4,500 news items, respectively, and they are divided into several categories: local, national, international, sports, and culture. Each media outlet has its own professional thesaurus used to classify documents, with more than 10,000 entries
each. For classification, each document can receive
any number of descriptors belonging to the thesaurus.
The ideal situation would be for the automatic text categorization system to perform work identical to that performed by the real documentation departments.
The news items are stored in several databases, in ta-
bles where different fields are used to store the dif-
ferent attributes explained in Section 2 (title, au-
thor, date, section, type, extension, etc.). For exper-
imental evaluation, we have extracted them from the
databases and we have put each text and the data of its
fields in XML files. We have used this corpus of XML
TheGENIEProject-ASemanticPipelineforAutomaticDocumentCategorisation
165
Figure 2: GENIE control interface.
files as the input of the system, and the output is the
same set of files but with an additional field: classifi-
cation information. This new XML node contains the
set of words (descriptors) belonging to the thesaurus
used to categorize the document, i.e., this node con-
tains the different tags that describe the XML file. An
example of a node that can be contained in an XML file about cinema is:
<classify>
CULTURE CINEMA WOODY_ALLEN
</classify>
As the news in the dataset considered had been
previously manually annotated by the professionals
working in the documentation department, we can
compare the automatic categorization with that per-
formed by humans. So, we can evaluate the number
of hits, omissions and misses.
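Under one natural reading of these terms (a hit is a correctly proposed descriptor, a miss is a proposed descriptor that the documentalists did not use, and an omission is a manual descriptor that the system failed to propose), the precision, recall and F-measure reported in the next subsection can be computed per document as in the following sketch:

    # Comparing the automatic tags of a document against the manual annotation.
    def evaluate(system_tags, manual_tags):
        hits = len(system_tags & manual_tags)        # true positives
        misses = len(system_tags - manual_tags)      # false positives
        omissions = len(manual_tags - system_tags)   # false negatives
        precision = hits / (hits + misses) if hits + misses else 0.0
        recall = hits / (hits + omissions) if hits + omissions else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    # evaluate({"CULTURE", "CINEMA", "IRAQ"}, {"CULTURE", "CINEMA", "WOODY_ALLEN"})
    # returns (0.667, 0.667, 0.667), rounded to three decimals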
3.1 Experimental Settings
In the experiments, we have examined the follow-
ing measures, commonly used in the Information Re-
trieval context (Manning et al., 2008): the precision,
the recall, and the F-Measure. The dataset used ini-
tially in the experiments has been the Heraldo de Aragón corpus. We have used the information from
this dataset to define most of the rules of the vari-
ous processes associated with each of the stages of
the classification system. These rules are integrated
in a configuration file which contains all the informa-
tion necessary to lead the process and obtain the cor-
rect result. The other two datasets (Diario de Navarra
and Heraldo de Soria) have been used just to double-
check if the application of those rules also produced
the desired result; for comparison purposes, at the end
of this section we will also present some experimental
results based on them. Since the news, thesauri, ontologies and classification fields are private data of each company, they are not available on-line on the Web (however, any researcher who wants to use our corpora for experimental purposes may apply directly to the first author, and the data will be provided privately).
In a first stage, we have used MALLET (McCallum, 2002) to classify the different news corpora.
MALLET is a tool that allows the classification of
documents. A classifier is an algorithm that dis-
tinguishes between a fixed set of classes, such as
“spam” vs. “non-spam”, based on labeled training ex-
amples. MALLET includes implementations of sev-
eral classification algorithms, including Naive Bayes,
Maximum Entropy, and Decision Trees. MALLET also provides tools for evaluating classifiers. In addition to classification, it includes
tools for sequence tagging for applications such as
the extraction of named entities from text. The al-
gorithms include Hidden Markov Models, Maximum
Entropy Markov Models, and Conditional Random
Fields. These methods are implemented in an extensi-
ble system for finite state transducers. The following
classifiers were used: MaxEnt, Naive Bayes, C45 and
DecisionTree, achieving in the best case 60% in all of
the three measures (precision, recall and F-measure).
Afterwards, we have performed four experiments
with our own classifier. The appearance of the GE-
NIE control application can be seen in Figure 2. Each
stage of the pipeline can be enabled or disabled sep-
arately. Regarding the resources and tools consid-
ered, we have used Freeling (Carreras et al., 2004) as the Morphological Analyzer and Support Vector
Machines (SVM) (Joachims, 1998) to automatically
classify topics in the Statistical Classifier.
Figure 2: GENIE control interface.
We have chosen Freeling because it is the only widely used and actively maintained analysis tool suite that supports several analysis services in both Spanish and English, as well as other languages that can be incorporated into our architecture in future developments. Some implementation details of Freeling were modified in order to encapsulate it as a consistent library and incorporate it into our architecture. As Freeling outputs its analysis results in a format that does not suit our approach, we needed to construct new structures for the different linguistic units, so as to define an architecture that can support this library and other analysis tools that may be added in the future. These structures aim to group most of the characteristics shared by the Romance languages and English into a single approach, while language-specific features are handled separately.
Regarding the type of SVM used, we have used a
modified version of the Cornell SVM-Light implementa-
tion (Joachims, 2004) with a Gaussian radial basis func-
WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies
166
tion kernel and the term frequency of the keywords as fea-
tures (Leopold and Kindermann, 2002). To obtain the frequencies we have used a different corpus of 100,000 news items, in order to get realistic frequency information. Finally, we have chosen EuroWordNet as the Lexical Database and Geonames (http://www.geonames.org/) as the Gazetteer.
To train this Statistical Classifier we have used sets of 5,000 news items for each general theme associated with one descriptor (FOOTBALL, BASKET, CINEMA, HANDBALL, and MUSIC). These sets of news items are different from the datasets used in the experiments (as is obviously expected in a training phase). For each possible descriptor we have an ontology; in this case, we have designed five ontologies using OWL (McGuinness et al., 2004) with nearly a hundred concepts each.
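Since the modified SVM-Light setup is internal to the system, the following sketch reproduces an analogous configuration (an SVM with a Gaussian RBF kernel over raw term-frequency features, one descriptor per training text) using scikit-learn, purely as an illustrative substitution; the two tiny training texts stand in for the 5,000 news items used per descriptor:

    # Training a topic classifier analogous to the Statistical Classifier.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    train_texts = [
        "The team scored two goals in the second half of the match",
        "The film opens this weekend with a strong cast",
    ]
    train_labels = ["FOOTBALL", "CINEMA"]

    classifier = make_pipeline(
        CountVectorizer(),                  # raw term frequencies of the words
        SVC(kernel="rbf", gamma="scale"),   # Gaussian radial basis function kernel
    )
    classifier.fit(train_texts, train_labels)
    print(classifier.predict(["Two goals were scored before the match ended"]))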
Next, there is an example of a piece of news:
This weekend is the best film debut for the movie “In the
Valley of Elah”. The story revolves around the murder
of a young man who has just returned from the Iraq war,
whose parents try to clarify with the help of the police. As
interpreters we have: Tommy Lee Jones, Susan Sarandon
and Charlize Theron. Writer-director Paul Haggis is the
author of “Crash” and writer of “Million Dollar Baby”,
among others.
In this case, the system analyzes and classifies the text with
the descriptor CINEMA. Moreover, the news item can be tagged with tags such as C_THERON, IRAQ, TL_JONES, etc.
3.2 Experimental Results
We have compared our classification of the 11,275 news items in
the first dataset with the original classification made by pro-
fessionals. The results can be seen in Figure 3. Below we
analyze the experiments:
1. In the first experiment (Basic) we have used the pro-
cess presented in Section 2 without the Pre-Processing
step and without the Ontological Classifier. We have
trained the system with SVM to classify 100 themes. In
this case, as we do not use the steps of Pre-Processing
and the Ontological Classifier, the system has not per-
formed the lemmatization, the named entities recogni-
tion, the keywords extraction, and the detailed labeling
of the text. For this reason, the precision and the recall
are not good, as it is essential to embed semantic infor-
mation and conceptual patterns in order to enhance the
prediction capabilities of classification algorithms.
2. In the second one (Semantic), we have introduced the
complete Pre-Processing stage and its associated re-
sources, we have used the Lexical Database EuroWord-
Net (Vossen, 1998) to disambiguate keywords, and we
have introduced the Ontological Classifier, with five on-
tologies with about ten concepts and about 20 instances
each. In this experiment the precision and the recall
slightly improved because, as explained before, the step
of Pre-Processing is important to obtain a better classi-
fication.
3. In the third one (Sem + Geo) we have included the Geographical Classifier, but we have used only the Gazetteer resource. Here we have improved the recall of the labeling at the expense of a decrease in precision. By
analyzing the errors in detail, we observe that the main
cause is the presence of misclassifications performed by
the Geographical Classifier.
4. Finally, in the fourth experiment (Full Mode), we have executed the whole pipeline, exploited all the resources, and
populated the ontologies with about one hundred in-
stances, leading to an increase in both the precision and
the recall. Ontology instances added in this experiment
have been inferred from the observation of the errors
obtained in previous experiments. The motivation for adding them is that, otherwise, the texts include certain entities unknown to the system; incorporating these instances helped to improve the classification.
Figure 3: Results of the four document categorisation experiments with news items from dataset 1.
If we look at the overall results obtained in experiment 1 and experiment 2 in Figure 3, the influence of using semantic and NLP tools may appear limited (an improvement of about 20%). However, it seems clear that these tools significantly improve the quality of the labeling in terms of precision, recall and F-measure, reaching up to about 80%. Therefore, the use of semantic techniques can make a difference when deciding whether an automatic labeling is feasible.
After evaluating the results obtained in the reference dataset (Heraldo de Aragón), we repeated the same experiments with the two other datasets. These datasets were not considered while designing the ontologies, in order to maintain the independence of the tests. The results can be seen in Figure 4. The results obtained with the datasets other than the Heraldo de Aragón one, which was used to configure the rule-based system, are only slightly different
(differences smaller than 10%). In Figure 4, it can also be
seen that the trends of the results are very similar regard-
less of the data. This shows the generality of our approach,
since the behavior of the classification system has been re-
produced with several different corpora. Experimental re-
sults have shown that with our approach, in all the experi-
ments, the system has improved the results achieved by ba-
sic machine learning based systems.
TheGENIEProject-ASemanticPipelineforAutomaticDocumentCategorisation
167
4 RELATED WORK
Text categorisation represents a challenging problem for the
data mining and machine learning communities, due to the
growing demand for automatic information retrieval sys-
tems. Systems that automatically classify text documents
into predefined thematic classes, and thereby contextualize
information, offer a promising approach to tackle this com-
plexity (Sebastiani, 2002).
Document classification presents difficult challenges
due to the sparsity and the high dimensionality of text data,
and to the complex semantics of natural language. The
traditional document representation is a word-based vector
where each dimension is associated with a term of a dic-
tionary containing all the words that appear in the corpus.
The value associated with a given term reflects its frequency of
occurrence within the corresponding document and within
the entire corpus (the tf-idf metric). Although this is a rep-
resentation that is simple and commonly used, it has sev-
eral limitations. Specifically, this technique has three main
drawbacks: (1) it breaks multi-word expressions into inde-
pendent features; (2) it maps synonymous words into dif-
ferent components; and (3) it considers polysemous words
as one single component. While a traditional preprocessing
of documents, such as eliminating stop words, pruning rare
words, stemming, and normalization, can improve the rep-
resentation, its effect is also still limited. So, it is essential
to embed semantic information and conceptual patterns in
order to enhance the prediction capabilities of classification
algorithms.
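As a minimal illustration of this representation and of the first drawback (built here with scikit-learn, an assumption unrelated to the tools used by GENIE), note how the multi-word expression "machine learning" is split into two independent features:

    # Traditional tf-idf bag-of-words representation of two tiny documents.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "machine learning improves text categorisation",
        "ontologies add semantics to text categorisation",
    ]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())   # unigram features only
    print(matrix.toarray().round(2))            # one row per document, one column per term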
Research has been done to exploit ontologies for
content-based categorisation of large corpora of documents.
WordNet has been widely used, for example in (Siolas and d'Alché Buc, 2000) or (Elberrichi et al., 2008), but their
approaches only use synonyms and hyponyms, fail to han-
dle polysemy, and break multi-word concepts into single
terms. Our approach overcomes these limitations by incor-
porating background knowledge derived from ontologies.
This methodology is able to keep multi-word concepts un-
broken, it captures the semantic closeness to synonyms, and
performs word sense disambiguation for polysemous terms.
For disambiguation tasks we have taken into account
an approximation described in (Trillo et al., 2007), that is
based on a semantic relatedness computation to detect the
set of words that could induce an effective disambiguation.
That technique receives an ambiguous keyword and its con-
text words as input and provides a list of possible senses.
Other studies show how background knowledge in form of
simple ontologies can improve text classification results by
directly addressing these problems (Bloehdorn and Hotho,
2006), and others make use of this intelligence to automat-
ically generate tag suggestions based on the semantic con-
tent of texts. For example, the approach in (Lee and Chun, 2007) extracts keywords and their frequencies, uses WordNet as a source of semantics, and relies on an artificial neural network for learning.
Among other related studies that quantify the quality of
an automatic labeling performed by using ontologies, we
could mention (Maynard et al., 2006; Hovy et al., 2006),
but both are focused on a purely semantic labeling (i.e., they
do not consider statistics-based methods). More related to
our study, it is interesting to mention the work presented
in (Scharkow, 2013), although it does not include much in-
formation about the use of ontologies. Examples of hybrid
systems using both types of tools include the web service
classifier explained in (Bruno et al., 2005), the system NASS
(News Annotation Semantic System) described in (Garrido
et al., 2011; Garrido et al., 2012), which is an automatic
annotation tool for the Media, or GoNTogle (Bikakis et al.,
2010), which is a framework for general document annota-
tion and retrieval.
5 CONCLUSIONS AND FUTURE
WORK
A tool for automating categorisation tasks is very useful
nowadays, as it helps to improve the quality of searches
that are performed later over textual repositories like digital
libraries, databases or web pages. For this reason, in this
paper we have presented a pipeline architecture to help in
the study of the problem of automatic text categorisation
using specific vocabulary contained in a thesaurus. Our
main contribution is the design of a system that combines
statistics, lexical databases, NLP tools, ontologies, and
geographical databases. Its stage-based architecture easily
allows the use and exchange of different resources and
tools. We have also performed a deep study of the impact
of the semantics in a text categorisation process.
Our pipeline architecture is based on five stages:
preprocessing, attribute-based classification, statistical
classification, geographical classification, and ontological
classification. Although the experimental part has been
developed in Spanish, the tool is ready to work with any
other language. Changing the linguistic resources to ones suitable for
the intended language is enough to make the system work,
since the process is sufficiently general to be applicable
regardless of the language used. The main contribution of
our work is, apart from the useful and modular pipeline
architecture, the experimental study with real data of the
problem of categorization of natural language documents
written in Spanish. There are many studies related to
such problems in English, but it is difficult to find them
in Spanish. Besides, we have compared the impact of
applying techniques that rely on statistics and supervised
learning with the results obtained when semantic tech-
niques are also used. There are two remarkable aspects.
Firstly, enhancing the amount of knowledge available by
increasing the number of instances in the ontologies leads
to a substantial improvement in the results. Secondly, the
use of knowledge bases helps to correct many errors from a
Geographical Classifier.
Spanish vs. English Language. Our research on this
topic focuses on transfer projects related to the extraction
of information, so for us it is very important to work with
real cases. Therefore, the comparison of our work with typ-
ical benchmark data sets in English is not fundamental to
us, since they are not useful to improve the performance of
our system in Spanish, and we have seen that the contextual conditions (language, regional context, news topics, etc.) have a great influence on the outcome of the experiments. Many
researchers have already analyzed the differences between
working in NLP topics in English and in Spanish, and they
WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies
168
Figure 4: Comparative results of the automatic categorisation experiments.
have made clear the additional difficulties of the Spanish language (Carrasco and Gelbukh, 2003; Aguado de Cea
et al., 2008), which could explain the poor performance
of some software applications that work reasonably well in
English. Just to mention some of these differences: Spanish words contain much more grammatical and semantic information than English words, the subject can be omitted in many cases, and verb forms carry the conjugation implicitly, without additional words. That, coupled with the high number of meanings that the same word can have, increases the computational complexity for syntactic, semantic and morphological analyzers, which therefore behave differently in
Spanish and English. Spanish is the third language in the
world according to the number of speakers, after Mandarin
and English, but in terms of studies related to NLP we have
not found many scientific papers.
TheGENIEProject-ASemanticPipelineforAutomaticDocumentCategorisation
169
Impact of NLP and Semantics. Our experimental evalu-
ation suggests that the influence of NLP and semantic tools
is not quantitatively as important as the classic statistical
approaches, although their contribution can tip the scales
when evaluating the quality of a labeling technique, since
the difference in terms of precision and recall is sufficiently
influential (nearly 20%). So, our conclusion is that a statistical approach can be successfully complemented with semantic techniques to obtain an acceptable automatic categorisation. Our experience also shows that, when facing this issue in a real environment where professional results are needed, the typical machine learning approach is the best option but is not always enough. We have seen that it should be com-
plemented with other techniques, in our case semantic and
linguistic ones. Anyway, the main drawback of the seman-
tic techniques is that the work of searching or constructing
the ontologies for each set of tags of every topic, populat-
ing them, and building the relationship tables, is harder than
the typical training of the machine learning approaches. So,
although the results are better, the scalability could be prob-
lematic. Sometimes it can be quite costly, especially if de-
tailed knowledge of the topic to tag is required in order to
appropriately configure the system.
NLP Future Tasks. In some categorisation scenarios, such as longer texts (novels, reports, etc.) or groups of documents from the same field, it can be interesting to obtain a summary of the given inputs in order to categorise them with their general terms before entering a more detailed analysis that requires the entire texts. These summaries, alongside the previously defined tasks, can lead to a more suitable detailed labelling, providing hints about which knowledge bases might be interesting to work with. In order to achieve this, we can perform syntactic analysis to simplify the sentences of the summaries, as in works like (Silveira and Branco, 2012), and then use the obtained results to filter unnecessary information and select the most relevant sentences without compromising the text integrity. Although the required structures have been implemented and some approaches such as (Garrido et al., 2013a) are being designed and tested, they are at an early stage and require more work before being used inside the categorisation pipeline.
Open Tasks. As future work, we plan to increase the
number of methods used in the pipeline, and to test this
methodology in new contexts and languages. It is notewor-
thy that a piece of news is a very specific type of text, char-
acterized by objectivity, clarity, and the use of synonyms
and acronyms, the high presence of specific and descriptive
adjectives, the tendency to use impersonal or passive con-
structions, and the use of connectors. Therefore it is not sufficient to test only with this kind of text, and a more complete study requires working with other types. In fact, some tests have been carried out with GENIE on other types of documents very different from news, such as book reviews, business reports, lyrics, blogs, etc., and the results are very promising, but it is too early to assert the generality of the solution in different contexts because those studies are still in progress.
ACKNOWLEDGEMENTS
This research work has been supported by the CICYT
project TIN2010-21387-C02-02 and DGA-FSE. We thank the Heraldo Group and Diario de Navarra.
REFERENCES
Aguado de Cea, G., Puch, J., and Ramos, J. (2008). Tag-
ging Spanish texts: The problem of 'se'. In Sixth In-
ternational Conference on Language Resources and
Evaluation (LREC’08), pages 2321–2324.
Amitay, E., Har’El, N., Sivan, R., and Soffer, A. (2004).
Web-a-where: geotagging web content. In 27th An-
nual International ACM SIGIR Conference on Re-
search and Development in Information Retrieval,
pages 273–280. ACM.
Atkinson, M. and Van der Goot, E. (2009). Near real time
information mining in multilingual news. In Proceed-
ings of the 18th International Conference on World
Wide Web, WWW ’09, pages 1153–1154. ACM.
Bikakis, N., Giannopoulos, G., Dalamagas, T., and Sellis, T.
(2010). Integrating keywords and semantics on docu-
ment annotation and search. In On the Move to Mean-
ingful Internet Systems (OTM 2010), pages 921–938.
Springer.
Bloehdorn, S. and Hotho, A. (2006). Boosting for text
classification with semantic features. In Advances in
Web mining and Web Usage Analysis, pages 149–166.
Springer.
Bruno, M., Canfora, G., Di Penta, M., and Scognamiglio, R.
(2005). An approach to support web service classifica-
tion and annotation. In 2005 IEEE International Con-
ference on e-Technology, e-Commerce and e-Service
(EEE’05), pages 138–143. IEEE.
Carrasco, R. and Gelbukh, A. (2003). Evaluation of TnT
Tagger for Spanish. In Proceedings of ENC, Fourth
Mexican International Conference on Computer Sci-
ence, pages 18–25. IEEE.
Carreras, X., Chao, I., Padró, L., and Padró, M. (2004).
FreeLing: An open-source suite of language analyz-
ers. In Fourth International Conference on Language
Resources and Evaluation (LREC’04), pages 239–
242. European Language Resources Association.
Chau, R. and Yeh, C.-H. (2004). Filtering multilingual web
content using fuzzy logic and self-organizing maps.
Neural Computing and Applications, 13(2):140–148.
Elberrichi, Z., Rahmoun, A., and Bentaallah, M. A. (2008).
Using WordNet for text categorization. The Interna-
tional Arab Journal of Information Technology (IA-
JIT), 5(1):16–24.
Garrido, A. L., Buey, M. G., Escudero, S., Ilarri, S., Mena,
E., and Silveira, S. B. (2013a). TM-gen: A topic map
generator from text documents. In 25th IEEE Interna-
tional Conference on Tools with Artificial Intelligence
(ICTAI 2013), Washington DC (USA), pages 735–740.
IEEE Computer Society.
Garrido, A. L., Buey, M. G., Ilarri, S., and Mena, E.
(2013b). GEO-NASS: A semantic tagging experience
WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies
170
from geographical data on the media. In 17th East-
European Conference on Advances in Databases and
Information Systems (ADBIS 2013), Genoa (Italy),
volume 8133, pages 56–69. Springer.
Garrido, A. L., Gomez, O., Ilarri, S., and Mena, E. (2011).
NASS: News Annotation Semantic System. In 23rd
IEEE International Conference on Tools with Artifi-
cial Intelligence (ICTAI 2011), Boca Raton, Florida
(USA), pages 904–905. IEEE Computer Society.
Garrido, A. L., Gomez, O., Ilarri, S., and Mena, E. (2012).
An experience developing a semantic annotation sys-
tem in a media group. In Proceedings of the 17th
International Conference on Applications of Natural
Language Processing and Information Systems, pages
333–338. Springer.
Gilchrist, A. (2003). Thesauri, taxonomies and ontologies
- an etymological note. Journal of Documentation,
59(1):7–18.
Goodchild, M. F. and Hill, L. (2008). Introduction to dig-
ital gazetteer research. International Journal of Geo-
graphical Information Science, 22(10):1039–1044.
Gruber, T. R. et al. (1993). A translation approach to
portable ontology specifications. Knowledge Acqui-
sition, 5(2):199–220.
Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and
Weischedel, R. (2006). OntoNotes: The 90% solu-
tion. In Human Language Technology Conference of
the NAACL, Companion Volume: Short Papers, pages
57–60. Association for Computational Linguistics.
Joachims, T. (1998). Text categorization with support vec-
tor machines: Learning with many relevant features.
In Tenth European Conference on Machine Learning
(ECML’98), pages 137–142. Springer.
Joachims, T. (2004). SVM Light Version: 6.01.
http://svmlight.joachims.org/.
Lee, S. O. K. and Chun, A. H. W. (2007). Automatic
tag recommendation for the web 2.0 blogosphere us-
ing collaborative tagging and hybrid and semantic
structures. Sixth Conference on WSEAS International
Conference on Applied Computer Science (ACOS’07),
World Scientific and Engineering Academy and Soci-
ety (WSEAS), 7:88–93.
Leopold, E. and Kindermann, J. (2002). Text categorization
with support vector machines. How to represent texts
in input space? Machine Learning, 46:423–444.
Li, H., Srihari, R. K., Niu, C., and Li, W. (2002). Location
normalization for information extraction. In Proceed-
ings of the 19th international conference on Compu-
tational linguistics-Volume 1, pages 1–7. Association
for Computational Linguistics.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). In-
troduction to Information Retrieval, volume 1. Cam-
bridge University Press.
Maynard, D., Peters, W., and Li, Y. (2006). Metrics for
evaluation of ontology-based information extraction.
In Workshop on Evaluation of Ontologies for the Web
(EON) at the International World Wide Web Confer-
ence (WWW’06).
McGuinness, D. L., Van Harmelen, F., et al. (2004). OWL
web ontology language overview. W3C recommenda-
tion 10 February 2004.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Mishra, R. B. and Kumar, S. (2011). Semantic web rea-
soners and languages. Artificial Intelligence Review,
35(4):339–368.
Navigli, R. (2009). Word sense disambiguation: A survey.
ACM Computing Surveys, 41(2):10:1–10:69.
Quercini, G., Samet, H., Sankaranarayanan, J., and Lieber-
man, M. D. (2010). Determining the spatial reader
scopes of news sources using local lexicons. In Pro-
ceedings of the 18th SIGSPATIAL International Con-
ference on Advances in Geographic Information Sys-
tems, pages 43–52. ACM.
Rauch, E., Bukatin, M., and Baker, K. (2003). A
confidence-based framework for disambiguating ge-
ographic terms. In HLT-NAACL 2003 Workshop on
Analysis of Geographic References, vol. 1, pages 50–
54. Association for Computational Linguistics.
Resnik, P. (1999). Disambiguating noun groupings with
respect to WordNet senses. In Natural Language
Processing Using Very Large Corpora, pages 77–98.
Springer.
Salton, G. and Buckley, C. (1988). Term-weighting ap-
proaches in automatic text retrieval. Information Pro-
cessing and Management, 24(5):513–523.
Scharkow, M. (2013). Thematic content analysis using su-
pervised machine learning: An empirical evaluation
using German online news. Quality and Quantity,
47(2):761–773.
Sebastiani, F. (2002). Machine learning in automated text
categorization. ACM Computing Surveys, 34(1):1–47.
Sekine, S. and Ranchhod, E. (2009). Named Entities:
Recognition, Classification and Use. John Benjamins.
Shen, D., Sun, J.-T., Yang, Q., and Chen, Z. (2006). A
comparison of implicit and explicit links for web page
classification. In Proceedings of the 15th Interna-
tional Conference on World Wide Web, WWW ’06,
pages 643–650. ACM.
Silveira, S. B. and Branco, A. (2012). Extracting multi-
document summaries with a double clustering ap-
proach. In Natural Language Processing and Infor-
mation Systems, pages 70–81. Springer.
Siolas, G. and d'Alché Buc, F. (2000). Support vector ma-
chines based on a semantic kernel for text categoriza-
tion. In IEEE-INNS-ENNS International Joint Con-
ference on Neural Networks (IJCNN 2000), volume 5,
pages 205–209. IEEE.
Smeaton, A. F. (1999). Using NLP or NLP Resources for
Information Retrieval Tasks. Natural Language Infor-
mation Retrieval. Kluwer Academic Publishers.
Trillo, R., Gracia, J., Espinoza, M., and Mena, E. (2007).
Discovering the semantics of user keywords. Journal
of Universal Computer Science, 13(12):1908–1935.
Vossen, P. (1998). EuroWordNet: A multilingual database
with lexical semantic networks. Kluwer Academic
Boston.
Wilbur, W. J. and Sirotkin, K. (1992). The automatic iden-
tification of stop words. Journal of Information Sci-
ence, 18(1):45–55.
TheGENIEProject-ASemanticPipelineforAutomaticDocumentCategorisation
171