A LOGIC-BASED APPROACH TO SEMANTIC INFORMATION EXTRACTION

Massimo Ruffolo

Exeura s.r.l. - ICAR-CNR

University of Calabria, 87036 Rende (CS), Italy

Marco Manna

Department of Mathematics

University of Calabria, 87036 Rende (CS), Italy

Keywords:

Information Extraction, Knowledge Representation, Logic Programming, Two-Dimensional Grammars,

Knowledge Management.

Abstract:

Recognizing and extracting meaningful information from unstructured documents, taking into account their semantics, is an important problem in the field of information and knowledge management. In this paper we describe a novel logic-based approach to semantic information extraction, from both HTML pages and flat text documents, implemented in the HıLεX system. The approach is founded on a new two-dimensional representation of documents, and heavily exploits DLP+, an extension of disjunctive logic programming for ontology representation and reasoning, which has been recently implemented on top of the DLV system. Ontologies, representing the semantics of the information to be extracted, are encoded in DLP+, while the extraction patterns are expressed using regular expressions and an ad hoc two-dimensional grammar. The execution of DLP+ reasoning modules, encoding the HıLεX grammar expressions, yields the actual extraction of information from the input document. Unlike previous systems, which are merely syntactic, HıLεX combines both semantic and syntactic knowledge for powerful information extraction.

1 INTRODUCTION

Existing systems for storing unstructured information

such as document repositories, digital libraries, and

Web sites, consist mainly of a huge amount of HTML

pages or ﬂat text documents, organized according to

syntactic, semantic and presentation rules, recogniz-

able only by human readers. Such repositories tend to

be practically useless both for the vastness of the in-

formation they hold and the lack of machine readabil-

ity. Moreover, they are unable to manage the actual

knowledge that the information sources convey.

Recognizing and extracting relevant information

automatically from these rapidly changing sources,

according to their semantics, is an important problem

in the information and knowledge management area.

In the recent literature a number of approaches for

information extraction from unstructured documents

have been proposed. An overview of the large body

of existing literature and systems is given in (Eikvil,

1999; Feldman et al., 2002; Kuhlins and Tredwell,

2003; Laender et al., 2002; Rosenfeld et al., 2004).

The currently developed systems are purely syntactic,

and they are not aware of the semantics of the infor-

mation they are able to extract.

In this work we present a logic-based approach, implemented in the HıLεX system, which combines both syntactic and semantic knowledge for powerful and expressive information extraction from unstructured documents. Logic-based approaches to the information extraction problem are not new (Baumgartner et al., 2001a; Baumgartner et al., 2001b); however, the approach we propose is original. Its novelty is due to:

• The two-dimensional representation of an unstructured document. A document is viewed as a Cartesian plane composed of a set of nested rectangular regions called portions. Each portion, uniquely identified through the Cartesian coordinates of two opposite vertices, contains a piece of the input document (element) annotated w.r.t. an ontology.

• The exploitation of a logic-based knowledge representation language called DLP+, extending DLP (Gelfond and Lifschitz, 1991) with object-oriented features, including classes, (multiple) inheritance, complex objects, and types, which is well-suited for representation and powerful reasoning on ontologies. This language is supported by the DLV+ system (Ricca et al., 2005), implemented on top of


Ruffolo M. and Manna M. (2006).

A LOGIC-BASED APPROACH TO SEMANTIC INFORMATION EXTRACTION.

In Proceedings of the Eighth International Conference on Enterprise Information Systems - AIDSS, pages 115-123

DOI: 10.5220/0002458601150123

 SciTePress

DLV (Eiter et al., 2000; Eiter et al., 1997; Faber

and Pfeifer, 1996; Leone et al., 2004).

• The use of an ontology, encoded in DLP+, describing the domain of the input document. A concept of the domain is represented by a DLP+ class; each class instance is a pattern representing a possible way of writing the concept and is used to recognize and annotate an element contained in a portion.

• The employment of a new grammar, named HıLεX grammar, for specifying the above-mentioned patterns. The HıLεX grammar extends regular expressions for the representation of two-dimensional patterns (like tables, item lists, etc.), which often occur in web pages and textual tabular data. The patterns are specified through DLP+ rules, whose execution yields the semantic information extraction, by associating (the part of the document embraced by) each portion to an element of the domain ontology.

It is worth noting that, besides the domain ontologies, the HıLεX system also uses a core ontology, containing (patterns for the extraction of) general linguistic elements (e.g., date, time, numbers, email, words); presentation elements (e.g., font colors, font styles, background colors); and structural elements (e.g., table cells, item lists, paragraphs), which are not bound to a specific domain but occur generally.

The advantages of the HıLεX system over other exist-

ing approaches are mainly the following:

• The extraction of information according to its semantics and not only on the basis of its syntactic structure (as in the previous approaches).

• The possibility to extract information in the same

way from documents in different formats. The

same extraction pattern can be used to extract data

from both ﬂat text and HTML documents. Im-

portantly, this is not obtained by a preliminary

HTML-to-text translation; but it comes automati-

cally thanks to higher abstraction due to the view

of the input document as a set of logical portions.

• The possibility to obtain a “semantic” classiﬁcation

of the input documents, which is much more accu-

rate and meaningful than the syntactic classiﬁca-

tions provided by existing systems (mainly based

on counting the number of occurrences of some

keywords), and opens the door to many relevant

applications (e.g., emails classiﬁcation and ﬁlter-

ing, skills classiﬁcation from curricula, extraction

of relevant information from medical records, etc.).

The distinctive features of the novel semantic approach to information extraction implemented in the HıLεX system, summarized above, allow better management and use of digital content in different application fields such as e-health, e-entertainment, e-commerce, e-government, and e-business.

Figure 1: Financial Yahoo Page.

The remainder of this work is organized as a by-example explanation of the proposed approach. In particular: section 2 shows the two-dimensional document representation idea; section 3 describes the DLP+ knowledge representation language and how ontologies are used to represent the semantics of the information to be extracted and to give a logic two-dimensional representation of unstructured documents; section 4 describes the syntax and the semantics of the two-dimensional pattern specification grammar and the logic-based pattern recognition method exploiting it; finally, section 5 shows the architecture of the HıLεX system.

2 TWO-DIMENSIONAL REPRESENTATION OF UNSTRUCTURED DOCUMENTS

The two-dimensional representation of an unstructured document is the main notion on which the semantic information extraction approach presented in this work is based. This notion is founded on the idea that an unstructured document can be considered as a Cartesian plane composed of a set of nested rectangular regions called portions. Each region, uniquely identified through the Cartesian coordinates of two opposite vertices, contains a piece of the input document including an element of the information to be extracted. Information elements, organized according to syntactic, presentation and semantic rules of a language recognizable by a human reader, can be simple or complex. Simple elements are characters, table cells, and words (classified using their part-of-speech tags, recognized using natural language techniques); complex elements are phrases, item lists, tables, paragraphs, and text boxes obtained as compositions of other simple or complex elements.

ICEIS 2006 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS

Figure 2: Example of portions.

To better explain the idea of portion, consider the web page depicted in Figure 1 (obtained from the Italian Yahoo financial portal) containing information about the stock exchange market. Suppose we would like to acquire, from this page, the table containing the stock index values and their variations (surrounded by a smooth etched box in Figure 1). A two-dimensional representation of the data contained in the highlighted document region we are interested in (Figure 2) can be obtained by drawing on it a hypothetical Cartesian plane. Each element of the table can be identified, in that plane, by a suitable rectangular region (portion).

For instance, in Figure 2, the stock index name

“Mib 30” is a simple element which is contained in

the portion identiﬁed by [(x

, y

),(x

, y

)]. In the

same way, the signed ﬂoat number representing the

absolute variation of the “Mib 30” is contained in

the portion [(x

, y

),(x

, y

)]. Since portions can be

nested, the portion containing the complex element

representing the concept of “stock index row” can be

identiﬁed by the points [(x

, y

),(x

, y

)] and so on.
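The notion of nested portions can be illustrated with a minimal Python sketch. It is not part of the HıLεX system: the class names Point and Portion, the choice of top-left and bottom-right vertices, and the coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: int
    y: int


@dataclass(frozen=True)
class Portion:
    p: Point   # one vertex (assumed here: top-left)
    q: Point   # opposite vertex (assumed here: bottom-right)
    elem: str  # name of the ontology element annotating the region

    def contains(self, other: "Portion") -> bool:
        """True if `other` lies inside this portion (borders may touch)."""
        return (self.p.x <= other.p.x and self.p.y <= other.p.y
                and other.q.x <= self.q.x and other.q.y <= self.q.y)


# Hypothetical coordinates for one row of the table in Figure 2.
name = Portion(Point(0, 0), Point(6, 1), "stock_index")
variation = Portion(Point(17, 0), Point(23, 1), "signed_float_number")
row = Portion(Point(0, 0), Point(30, 1), "stock_index_row")

print(row.contains(name), row.contains(variation), name.contains(row))
```

The row portion encloses both simple portions, while the converse does not hold, which is the nesting relation the text describes.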

3 REPRESENTING KNOWLEDGE

The semantic information extraction approach implemented in the HıLεX system is based on the DLP+ (Ricca et al., 2005) ontology representation language.

DLP+ is a powerful logic-based language which extends Disjunctive Logic Programming (DLP) (Eiter et al., 2000) by object-oriented features. In particular, the language includes, besides the concept of relations, the object-oriented notions of classes, objects (class instances), object-identity, complex objects, (multiple) inheritance, and the concept of modular programming by means of reasoning modules. This makes DLP+ a complete ontology representation language supporting sophisticated reasoning capabilities.

Moreover, the DLP+ ontology representation language is implemented on the DLV+ system, a cross-platform development environment for knowledge modeling and advanced knowledge-based reasoning. The DLV+ system (Ricca et al., 2005) makes it easy to develop complex real-world applications and to perform advanced reasoning tasks in a user-friendly visual environment. DLV+ seamlessly integrates the DLV (Eiter et al., 2000) system, exploiting the power of a stable and efficient ASP solver (for further background on DLV and DLP+ see (Ricca et al., 2005; Eiter et al., 2000)).

In the HıLεX system the DLP+ language is heavily exploited for the formal representation of the semantics of the information to be extracted (employing suitable ontologies). Furthermore, DLP+ allows the encoding of the logic two-dimensional representation of unstructured documents. Finally, DLP+ reasoning modules (which are specialized DLP+ logic programs) are exploited for the implementation of the logic-based pattern recognition method allowing the actual semantic information extraction.

In more detail, the elements of information to be extracted are modeled by using the DLP+ class element, which is defined as follows:

class element (type: expression_type,
expression: string, label: string).

The three attributes have the following meaning:

• expression: holds a string representing the pat-

tern speciﬁed by regular expressions or by the

HıLεX two-dimensional grammar (described in de-

tail in the following section), according to the

type property. Patterns contained in these at-

tributes are used to recognize the elements in a doc-

ument.

• type: defines the type of the expression (i.e. regexp_type, hilex_type).

• label: contains a description of the element in

natural language.

As pointed out in section 2, elements are located inside rectangular regions of the input document called portions. Document portions and the enclosed elements are represented in DLP+ by using the class point and the relation portion:

class point (x: integer, y: integer).

relation portion (p: point, q: point, elem: element).

Each instance of the relation portion represents the corresponding rectangular document region. It relates the two points identifying the region, expressed as instances of the class point, and an ontology element, expressed as an instance of the class element. The set of instances of the portion relation constitutes the logic two-dimensional representation of an unstructured document.

This DLP+ encoding makes it possible to exploit the two-dimensional document representation on which the semantic information extraction approach proposed in this paper is based.


The element class is the common root of two kinds of ontologies, the core ontology and the domain ontologies. Every pattern encoding information to be extracted is represented by an instance of a class belonging to these ontologies. In the following, the structures of the core and domain ontologies are described in detail.

3.1 The Core Ontology

The core ontology is composed of three parts. The

ﬁrst part represents general simple elements describ-

ing a language (like, e.g., alphabet symbols, lemmas,

Part-of-Speech, regular forms such as date, e-mail,

etc.). The second part represents elements describing

presentation styles (like, e.g., font types, font styles,

font colors, background colors, etc.). The third part

represents structural elements describing tabular and

textual structures (e.g. table cells, table columns, ta-

ble rows, paragraphs, item lists, texture images, text

lines, etc.). The core ontology is organized in the class

hierarchy shown below:

class linguistic_element isa {element}.
class character isa {linguistic_element}.
class number_character isa {character}.
...
class regular_form isa {linguistic_element}.
class float_number isa {regular_form}.
...
class italian_lexical_element isa {linguistic_element}.
class english_lexical_element isa {linguistic_element}.
class english_lemma isa {english_lexical_element}.
...
class spanish_lexical_element isa {linguistic_element}.
...
class presentation_element isa {element}.
class font_type isa {presentation_element}.
...
class structural_element isa {element}.
class table_cell isa {structural_element}.
class separator isa {structural_element}.
...

Examples of instances of the float_number class are:

unsigned_float_number: float_number (type: regexp_type,
expression:"(\d{1,3}(?>.\d{3})*,\d+)",
label: "RegExp for unsigned float number").

signed_float_number: float_number (type: regexp_type,
expression:"([+-]\s*\d{1,3}(?>.\d{3})*,\d+)",
label: "RegExp for signed float number").

percentage: float_number (type: regexp_type,

expression:"(\(?(?>[+-])?(?>(?>100(?>,0+)?)|

(?:\d{1,2}(?>,\d+)?))%\)?)",

label: "RegExp for percentage").

When in a document the regular expression char-

acterizing a particular kind of ﬂoat number is recog-

nized, a document portion is generated and annotated

w.r.t. the corresponding class instance.
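The step from a matched regular expression to an annotated portion can be sketched in Python as follows. This is an illustration under stated assumptions, not the HıLεX document preprocessor: the function annotate_line, the (column, row) coordinate convention, and the simplified patterns (non-capturing groups instead of atomic groups, Italian number format) are all hypothetical.

```python
import re

# Simplified variants of the core-ontology patterns (assumptions, see lead-in).
PATTERNS = {
    "percentage": re.compile(r"[+-]?\d{1,3}(?:,\d+)?%"),
    "signed_float_number": re.compile(r"[+-]\s*\d{1,3}(?:\.\d{3})*,\d+"),
    "unsigned_float_number": re.compile(r"\d{1,3}(?:\.\d{3})*,\d+"),
}


def annotate_line(text, row):
    """Return (p, q, element) tuples for every pattern match on one text line."""
    portions = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            p = (m.start(), row)    # first vertex: (column, row)
            q = (m.end(), row + 1)  # opposite vertex
            portions.append((p, q, name))
    return portions


line = "Mib 30 32.655,00 -94,00 -0,29%"
for portion in annotate_line(line, row=0):
    print(portion)
```

Note that overlapping matches are all kept in this sketch (the digits of a signed number also match the unsigned pattern); deciding which annotation is relevant is left to the later, semantic stage.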

3.2 Domain Ontologies

Domain ontologies contain simple and complex elements of a specific knowledge domain. The distinction between core and domain ontologies makes it possible to describe knowledge in a modular way. When a user needs to extract data from a document regarding a specific domain, only the corresponding domain ontology has to be used. This modularization improves the extraction process in terms of precision and overall performance. Referring to the example of the previous section, elements representing concepts related to the stock index market domain can be organized as follows:

class stock_market_domain isa {element}.
class stock_index isa {stock_market_domain, linguistic_element}.
class stock_index_cell isa {stock_market_domain, structural_element}.
class stock_index_row isa {stock_market_domain, structural_element}.
class stock_index_table isa {stock_market_domain, structural_element}.
class index_value isa {stock_market_domain, regular_form}.

Examples of instances of the stock_index class are:

mibtel: stock_index (type: regexp_type,
expression: "Mibtel").
mib30: stock_index (type: regexp_type,
expression: "Mib30").
dowJones: stock_index (type: regexp_type,
expression: "Dow Jones").

When a regular expression characterizing a stock in-

dex is recognized in a document, a portion is gener-

ated and annotated w.r.t. the corresponding class in-

stance.

4 A TWO-DIMENSIONAL GRAMMAR FOR EXTRACTION PATTERNS SPECIFICATION

The internal representation of extraction patterns in the HıLεX system is obtained by means of a two-dimensional grammar, founded on picture languages (Chang, 1970; Giammarresi and Restivo, 1997), which allows the definition of very expressive target patterns. Each pattern represents a two-dimensional composition of portions annotated w.r.t. the elements defined in the ontology. The syntax of the HıLεX two-dimensional grammar is presented in the following.

NEW_ELEMENT → GENERALIZATION | RECURRENCE | CHAIN | TABLE
GENERALIZATION → GEN1 | GEN2 | GEN3
GEN1 → generalizationOf (arg: ARG1)
GEN2 → orContain_generalizationOf (arg: ARG1, inArg: ARG1, condition: CND)
GEN3 → andContain_generalizationOf (arg: ARG1, inArg: ARG1, condition: CND)
CND → coincident | notCoincident | null
RECURRENCE → recurrenceOf (arg: ARG3, range: RANGE, dir: DIR)
CHAIN → CHAIN1 (arg: ARG2, dir: DIR, sep: SEP)
CHAIN1 → sequenceOf | permutationOf
TABLE → TAB1 (arg: ARG2, range: RANGE, dir: DIR, sep: SEP)
TAB1 → sequenceTableOf | permutationTableOf
ARG1 → ARG2 | ARG3
ARG2 → [ LIST ]
ARG3 → BASE_ELEM
LIST → ARG3 , ARG3 LIST1
LIST1 → , ARG3 LIST1 | ε

RANGE → < NUM , NUM > | NUM | + |

DIR → vertical | horizontal | both

SEP → ARG3 | null

According to the HıLεX grammar, a portion annotated w.r.t. a NEW_ELEMENT can be obtained by applying the composition language constructs to portions annotated w.r.t. basic ontology elements (BASE_ELEM). The semantics of each construct, together with some examples of usage, is presented in the following.

GENERALIZATION: A portion annotated w.r.t. a basic ontology element (BASE_ELEM) can be re-annotated w.r.t. a new ontology element (NEW_ELEMENT) by using the generalizationOf operator. The effect of this operator is a semantic rewriting that generalizes the portion annotation.

Example 1 Consider the HTML document presented in section 2, whose elements are properly modelled in the core and domain ontologies. Let unsigned_float_number be an instance of the float_number class defined in the core ontology. A portion annotated as unsigned_float_number can be re-annotated as an absolute_index_value by using the following expression:

absolute_index_value: index_value (type:hilex_type,

expression:"generalizationOf (

arg: unsigned_float_number)",

label:"Absolute value of a stock index" ).

The HıLεX grammar constructs orContain_generalizationOf and andContain_generalizationOf make it possible to define new annotations of existing portions on the basis of the semantics of the portions they contain. These generalization operators exploit the spatial (strict) containment of portions.

RECURRENCE: A portion annotated w.r.t. a NEW_ELEMENT, obtained by means of the recurrenceOf operator, consists of the concatenation, along a given direction, of a fixed number of portions annotated w.r.t. the same BASE_ELEM.

Example 2 Using the HıLεX recurrenceOf construct, a separator between two elements contained in a document can be defined as an instance of the separator class, constituted by a null portion (i.e. a portion without annotation whose vertices overlap along a coordinate) or by the concatenation, in the horizontal direction, of an undefined number of portions annotated w.r.t. the blank_char element, defined as an instance of the core ontology character class.

sep_01: separator (type: hilex_type,
expression: "recurrenceOf (
arg: blank_char,

range:

, dir: horizontal)’’,

label: "Blank characters separator").

Figure 3: Example of recurrence.

CHAIN: A portion annotated w.r.t. a NEW_ELEMENT by using the sequenceOf and permutationOf operators constitutes a chain of portions annotated w.r.t. BASE_ELEMs. In particular, a portion obtained by the application of the sequenceOf operator is a concatenation of at least two portions annotated w.r.t. BASE_ELEMs in a given direction and a fixed order, whereas a portion obtained by using the permutationOf operator is a concatenation of at least two portions annotated w.r.t. BASE_ELEMs in a given direction, without an established order.

Example 3 A table row containing stock index vari-

ations can be represented using the HıLεX construct

sequenceOf in the following way:


stock_index_row_01: stock_index_row( type:hilex_type,

expression:"sequenceOf( arg: [stock_index,

absolute_index_value, absolute_index_variation,

percentage_index_variation],

dir:horizontal, sep:sep_01 )",

label:"Row containing stock index variations" ).

Figure 4 shows the portion annotated w.r.t. an instance of the stock_index_row class. It is constituted by an ordered sequence, in the horizontal direction, of portions annotated w.r.t. instances of the stock_index class and the unsigned_float_number, signed_float_number and percentage instances of the float_number class. Between each pair of portions there may be a portion annotated w.r.t. the element sep_01, an instance of the separator class defined in Example 2. This expression considers only the semantics of the portions and their spatial positioning. No reference to the document structure is required to recognize the concept of stock_index_row.
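The behaviour of sequenceOf can be approximated with a short Python sketch. It is not the HıLεX implementation: portions are plain (p, q, element) tuples, the helper names strict_follow and sequence_of are hypothetical, annotations are compared by name rather than by class membership, and exactly one separator portion is required between consecutive arguments (a simplification of the optional separator described above).

```python
def strict_follow(a, b):
    """b starts exactly where a ends horizontally and spans the same rows."""
    (ap, aq, _), (bp, bq, _) = a, b
    return aq[0] == bp[0] and ap[1] == bp[1] and aq[1] == bq[1]


def sequence_of(portions, args, sep, new_elem):
    """Find horizontal chains matching `args` in order, with `sep` in between."""
    wanted = []
    for i, arg in enumerate(args):
        if i:
            wanted.append(sep)
        wanted.append(arg)

    found = []

    def extend(chain):
        if len(chain) == len(wanted):
            # The new portion spans from the first vertex to the last one.
            found.append((chain[0][0], chain[-1][1], new_elem))
            return
        for cand in portions:
            if cand[2] == wanted[len(chain)] and (
                    not chain or strict_follow(chain[-1], cand)):
                extend(chain + [cand])

    extend([])
    return found


# Hypothetical portions for one table row, using (column, row) coordinates.
row_portions = [
    ((0, 0), (6, 1), "stock_index"),
    ((6, 0), (7, 1), "sep_01"),
    ((7, 0), (16, 1), "absolute_index_value"),
    ((16, 0), (17, 1), "sep_01"),
    ((17, 0), (23, 1), "absolute_index_variation"),
    ((23, 0), (24, 1), "sep_01"),
    ((24, 0), (30, 1), "percentage_index_variation"),
]
rows = sequence_of(
    row_portions,
    ["stock_index", "absolute_index_value",
     "absolute_index_variation", "percentage_index_variation"],
    "sep_01", "stock_index_row_01")
print(rows)
```

The sketch only looks at annotations and coordinates, which mirrors the point made in the text: nothing in it depends on HTML tags or any other format-specific structure.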

Figure 4: Example of chain.

TABLE: A portion annotated w.r.t. a NEW_ELEMENT can be defined, by using the sequenceTableOf or permutationTableOf HıLεX operators, as a table of portions annotated w.r.t. BASE_ELEMs.

A portion obtained from the sequenceTableOf operator is composed of portions having a fixed composition along one direction, repeated a certain number of times along the other direction, whereas a portion obtained from the permutationTableOf operator is composed of portions having an unordered composition along one direction, repeated with the same structure a fixed number of times along the other direction. This construct makes it possible to recognize tables in both HTML and text documents. In fact, portions provide an abstract representation of unstructured documents that is independent of the document format.

Example 4 Figure 5 depicts a portion annotated w.r.t. an instance of the stock_index_table class, obtained by using the sequenceTableOf HıLεX grammar construct as shown in the following:

stock_index_table_01:stock_index_table( type: hilex_type,

expression:"sequenceTableOf( arg: [stock_index,

absolute_index_value, absolute_index_variation,

percentage_index_variation],

range:<2,5>, dir:vertical, sep:sep_01 )",

label:"table containing stock_index_row" ).

The instance stock_index_table_01 represents a table of stock index variations composed of a vertical sequence of at least 2 and at most 5 rows. Each row (i.e. a stock_index_row) is a sequence of portions annotated w.r.t. instances of the class stock_index and the unsigned_float_number, signed_float_number and percentage instances of the float_number class.

Figure 5: Example of table.

4.1 Logic-Based Pattern Recognition

Extraction patterns expressed by means of the HıLεX

two-dimensional grammar allow the actual semantic

information extraction from unstructured documents.

The pattern recognition mechanism is implemented by encoding the HıLεX grammar expressions in DLP+. In particular, each pattern is rewritten in a DLP+ reasoning module as a set of rules exploiting the following basic operators, which manipulate points and portions.

relation strictFollow(p1: point, q1: point,

elem1: element, p2: point, q2: point, elem2: element).

relation strictBelow(p1: point, q1: point,

elem1: element, p2: point, q2: point, elem2: element).

relation minContain (p1: point, q1: point,

elem1: element, p2: point, q2: point, elem2: element).

relation min_max_horizontalRecurrence(p: point,

q: point, elem: element, min: integer, max: integer).

relation min_max_verticalRecurrence(p: point,

q: point, elem: element, min: integer, max: integer).

The strictFollow operator, for example, is implemented by means of the following DLP+ rule:

strictFollow (P1, Q1, E1, P2, Q2, E2) :-

portion (p: P1, q: Q1, elem: E1),

portion (p: P2, q: Q2, elem: E2),

P1: point (y: YP),

Q1: point (x: X, y: YQ),

P2: point (x: X, y: YP),

Q2: point (y: YQ).

The semantics of the ﬁve basic operators is intu-

itively given in Figure 6.

Figure 6: Basic operators.

The table containing the stock index variations, incorporated in the page presented in section 2, can be extracted using the pattern presented in Example 4. The corresponding DLP+ rewriting is shown below.

module(stock_index_table_01){

portion(p:P1, q:Q7, elem:row_of_stock_index_table_01):-

strictFollow(p1:P1, q1:Q1,

elem1:E1,

p2:P2, q2:Q2,

elem2:sep_01),

strictFollow(p1:P2, q1:Q2,

elem1:sep_01,

p2:P3, q2:Q3,

elem2:absolute_index_value),

strictFollow(p1:P3, q1:Q3,

elem1:absolute_index_value,

p2:P4, q2:Q4,

elem2:sep_01),

strictFollow(p1:P4, q1:Q4,

elem1:sep_01,

p2:P5, q2:Q5,

elem2:absolute_index_variation),

strictFollow(p1:P5, q1:Q5,

elem1:absolute_index_variation,

p2:P6, q2:Q6,

elem2:sep_01),

strictFollow(p1:P6, q1:Q6,

elem1:sep_01,

p2:P7, q2:Q7,

elem2:percentage_index_variation),

instanceOf(E1,stock_index).

portion(p:P, q:Q, elem:stock_index_table_01):-
min_max_verticalRecurrence(p:P, q:Q,
elem:row_of_stock_index_table_01,
min:2, max:5).

}

The new portion, whose structure satisfies the extraction pattern, is recognized by applying the rules contained in the reasoning module shown above. These rules exploit the logic two-dimensional representation of the unstructured document. The row_of_stock_index_table_01 element is a temporary instance of the class stock_index_row, having the same structure shown in Example 3. After the module execution, this instance is deleted.
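The second rule of the module, which stacks between 2 and 5 rows vertically, can be sketched in Python as follows. This is only an illustration: the definition of strict_below is an assumption (the paper shows the rule for strictFollow only), the function names are hypothetical, and the sketch reports every qualifying stack of rows rather than committing to a particular semantics such as maximal stacks.

```python
def strict_below(a, b):
    """b starts exactly where a ends vertically and spans the same columns."""
    (ap, aq, _), (bp, bq, _) = a, b
    return aq[1] == bp[1] and ap[0] == bp[0] and aq[0] == bq[0]


def min_max_vertical_recurrence(portions, elem, lo, hi, new_elem):
    """Return portions made of lo..hi vertically adjacent `elem` portions."""
    rows = [p for p in portions if p[2] == elem]
    found = []
    for start in rows:
        stack = [start]
        while True:
            if lo <= len(stack) <= hi:
                found.append((stack[0][0], stack[-1][1], new_elem))
            below = [r for r in rows if strict_below(stack[-1], r)]
            if not below or len(stack) == hi:
                break
            stack.append(below[0])
    return found


# Three hypothetical row portions, one per text line, spanning columns 0..30.
row_elem = "row_of_stock_index_table_01"
table_rows = [((0, y), (30, y + 1), row_elem) for y in range(3)]
tables = min_max_vertical_recurrence(
    table_rows, row_elem, 2, 5, "stock_index_table_01")
print(tables)
```

With three adjacent rows the sketch finds the two-row stacks as well as the full three-row table; a real system would typically keep only the largest one.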

The result of the extraction process is graphically

shown in Figure 7. Figure 7 (a) depicts portions iden-

tiﬁed using patterns represented by regular expres-

sions. Regular expressions are recognized by a docu-

ment preprocessor based on a pattern matching mech-

anism. Figure 7 (b) and (c) show portions identiﬁed

by the pattern recognizer exploiting the logic repre-

sentation of the HıLεX grammar expressions.

Figure 7: Portions Extracted from the Yahoo Page.

It is worth noting that patterns are very concise and expressive. Moreover, patterns are general in the sense that they are independent of the document format. This implies that the extraction patterns presented above are more robust w.r.t. variations of the page structure than the extraction patterns defined in previous approaches. For example, the table containing the stock index variations could appear anywhere in the page. Furthermore, the same extraction patterns can also be used to extract information from flat text having the structure depicted in Figure 8. The result of the extraction process on flat text is depicted in Figure 8 (a), (b), (c), which have the same structure as those of Figure 7.

5 THE HıLεX SYSTEM

The architecture of the HıLεX system, implementing the semantic information extraction approach described in the previous sections, is represented in Figure 9. The Knowledge Base (KB) of HıLεX stores the core and domain ontologies by means of the DLV+ system persistence layer. The information extraction process is executed in three main steps: document pre-processing, pattern recognition, and pattern extraction. Each step is performed by a suitable architectural module.

In the first step a Document Pre-Processor takes as input an unstructured document and a query containing the names of the class instances that represent the information the user needs to extract. After the execution, the Document Pre-Processor returns the two-dimensional logic document representation and a set of reasoning modules, which constitute the input for the pattern recognizer. In particular, the Document Pre-Processor is composed of three sub-modules: Query Analyzer, Document Analyzer, and HıLεX Rewriter.

The Query Analyzer takes as input the user query and explores the ontologies to identify the patterns to use for the extraction process. Patterns represented through regular expressions (simple elements), together with the corresponding ontology instance names (named O in Figure 9), are the input of the Document Analyzer module. Patterns expressed using the HıLεX pattern representation grammar (complex elements), together with the corresponding ontology instance names (named O in Figure 9), are the input of the HıLεX Rewriter. The Document Analyzer applies pattern matching mechanisms to detect the simple elements constituting the document and, for each of them, generates the corresponding portion. At the end of the analysis the two-dimensional logic document representation L is returned. The HıLεX Rewriter translates each pattern represented by the HıLεX two-dimensional grammar into a reasoning module containing logic rules suitable for pattern recognition. The output of the HıLεX Rewriter is a set of Reasoning Modules (RM) executable by the DLV+ system. The translation is based on the operators for manipulating portions described in Section 4.

Figure 8: Flat Text Version of the Yahoo Page.

Figure 9: The Architecture of the HıLεX System.

The Document Analyzer output (L) together with the HıLεX Rewriter output (RM) is the input of the second step of the information extraction process, which is performed by the Pattern Recognizer module.

The Pattern Recognizer is founded on the DLV+ system. It takes as input the logic document representation (L) and the set of reasoning modules (RM) containing the translation of the HıLεX patterns in terms of logic rules, and recognizes new complex elements. The output of this step is the augmented logic representation (L) of an unstructured document, in which new document regions containing more complex elements (e.g. tables having a certain structure and containing certain concepts, phrases having a particular meaning, etc.) are identified by exploiting the semantic knowledge represented in the ontologies. The pattern recognition is completely independent of the document format.

Finally, a Pattern Extractor takes as input the augmented logic representation of a document (L) and allows the acquisition of element instances (semantic wrapping) and/or the classification of the document w.r.t. the ontology classes. Acquired instances can be stored in DLP ontologies and in relational and XML databases. Thus, extracted information can be used in other applications, and more powerful queries and reasoning tasks can be run on it. For example, the classification of documents w.r.t. the ontology can be exploited for document management purposes.
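As a rough illustration of the acquisition step, the following Python sketch pulls the text of recognized portions out of a document and stores it in a relational table and in an XML fragment. The fact layout, the `instance(element, value)` table, and the XML schema are all assumptions chosen for the example; storage in DLP ontologies is not shown.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List, Set, Tuple

# (element, row, col_start, col_end); an assumed layout for recognized portions.
Fact = Tuple[str, int, int, int]


def extract_instances(text: str, augmented: Set[Fact], element: str) -> List[str]:
    """Return the text covered by every portion recognized as `element`."""
    lines = text.splitlines()
    return sorted(
        lines[row][start:end]
        for name, row, start, end in augmented
        if name == element
    )


def store_relational(conn: sqlite3.Connection, element: str, values: List[str]) -> None:
    """Insert instances into a hypothetical `instance(element, value)` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS instance (element TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO instance (element, value) VALUES (?, ?)",
        [(element, value) for value in values],
    )
    conn.commit()


def to_xml(element: str, values: List[str]) -> str:
    """Serialize instances as a small XML fragment (illustrative schema)."""
    root = ET.Element("instances", {"element": element})
    for value in values:
        ET.SubElement(root, "instance").text = value
    return ET.tostring(root, encoding="unicode")


text = "Net total 42.5"
augmented = {("word", 0, 0, 3), ("number", 0, 10, 14), ("labelled_value", 0, 4, 14)}
values = extract_instances(text, augmented, "labelled_value")
conn = sqlite3.connect(":memory:")
store_relational(conn, "labelled_value", values)

if __name__ == "__main__":
    print(conn.execute("SELECT element, value FROM instance").fetchall())
    print(to_xml("labelled_value", values))
```

Once the instances sit in a table, ordinary SQL queries can be run over them, which is one way to read the remark that extracted information becomes usable by other applications.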

6 CONCLUSIONS AND FUTURE WORK

This work presents a novel, concrete, powerful and

expressive approach to information extraction from

unstructured documents. The approach, implemented

in the HıLεX system, is grounded on two main ideas:

• The semantic representation of the information to

extract by means of the DLP

ontology repre-

sentation language, having solid theoretical foun-

dations.

• The logic two-dimensional representation of docu-

ments allowing the deﬁnition of extraction patterns

expressed by the HıLεX two-dimensional grammar.

Thanks to these ideas, the approach constitutes a decisive enhancement in this field. Unlike previous approaches, the same extraction patterns can be used to extract information, according to their semantics,

from both HTML and flat text documents. Furthermore, the HıLεX system can be used to implement a new generation of semantic wrappers. Many functions that will be available in future “Semantic Web” technologies are already becoming a reality with the HıLεX system.

Currently the approach is under consolidation, and its theoretical foundations are under investigation and improvement. Future work will focus on the consolidation and extension of the HıLεX two-dimensional grammar, the investigation of computational complexity issues from a theoretical point of view, the extension of the approach to PDF and other document formats, and the exploitation of natural language processing techniques aimed at improving information extraction from documents with only textual content.

7 ADDITIONAL AUTHORS

• Tina Dell’Armi. Exeura s.r.l. University of Calabria, 87036 Rende (CS), Italy, dellarmi@exeura.it

• Lorenzo Gallucci. Exeura s.r.l. University of Calabria, 87036 Rende (CS), Italy, gallucci@exeura.it

• Nicola Leone. Department of Mathematics; Exeura s.r.l. University of Calabria, 87036 Rende (CS), Italy, leone@mat.unical.it

• Francesco Ricca. Department of Mathematics, University of Calabria, 87036 Rende (CS), Italy, ricca@mat.unical.it

• Domenico Saccà. Exeura s.r.l.; DEIS; ICAR-CNR, University of Calabria, 87036 Rende (CS), Italy, sacca@unical.it

REFERENCES

Baumgartner, R., Flesca, S., and Gottlob, G. (2001a). Declarative information extraction, web crawling, and recursive wrapping with Lixto. In LPNMR ’01: Proceedings of the 6th International Conference on Logic Programming and Nonmonotonic Reasoning, pages 21–41, London, UK. Springer-Verlag.

Baumgartner, R., Flesca, S., and Gottlob, G. (2001b). Visual web information extraction with Lixto. In The VLDB Journal, pages 119–128.

Chang, S.-K. (1970). The analysis of two-dimensional patterns using picture processing grammars. In STOC ’70: Proceedings of the second annual ACM symposium on Theory of computing, pages 206–216, New York, NY, USA. ACM Press.

Eikvil, L. (1999). Information extraction from world wide web - a survey. Technical Report 945, Norwegian Computing Center.

Eiter, T., Faber, W., Leone, N., and Pfeifer, G. (2000). Declarative Problem-Solving Using the DLV System. In Minker, J., editor, Logic-Based Artificial Intelligence, pages 79–103. Kluwer Academic Publishers.

Eiter, T., Leone, N., Mateis, C., Pfeifer, G., and Scarcello, F. (1997). A deductive system for non-monotonic reasoning. In Logic Programming and Non-monotonic Reasoning, pages 364–375.

Faber, W. and Pfeifer, G. (since 1996). DLV homepage.

Feldman, R., Aumann, Y., Finkelstein-Landau, M., Hurvitz, E., Regev, Y., and Yaroshevich, A. (2002). A comparative study of information extraction strategies. In Gelbukh, A. F., editor, CICLing, volume 2276 of Lecture Notes in Computer Science, pages 349–359. Springer.

Gelfond, M. and Lifschitz, V. (1991). Classical negation in logic programs and disjunctive databases. New Generation Computing, 9(3/4):365–386.

Giammarresi, D. and Restivo, A. (1997). Two-dimensional languages. In Salomaa, A. and Rozenberg, G., editors, Handbook of Formal Languages, volume 3, Beyond Words, pages 215–267. Springer-Verlag, Berlin.

Kuhlins, S. and Tredwell, R. (2003). Toolkits for generating wrappers – a survey of software toolkits for automated data extraction from web sites. In Aksit, M., Mezini, M., and Unland, R., editors, Objects, Components, Architectures, Services, and Applications for a Networked World, volume 2591 of Lecture Notes in Computer Science (LNCS), pages 184–198, Berlin. International Conference NetObjectDays, NODe 2002, Erfurt, Germany, October 7–10, 2002, Springer.

Laender, A., Ribeiro-Neto, B., Silva, A., and Teixeira, J. (2002). A brief survey of web data extraction tools. In SIGMOD Record, volume 31.

Leone, N., Pfeifer, G., Faber, W., Eiter, T., Gottlob, G., Perri, S., and Scarcello, F. (2004). The DLV System for Knowledge Representation and Reasoning.

Ricca, F., Leone, N., Dell’Armi, T., DeBonis, V., Galizia, S., and Grasso, G. (2005). A DLP system with object-oriented features. In LPNMR ’05: Proceedings of 8th International Conference on Logic Programming and Non Monotonic Reasoning, Diamante, Italy.

Rosenfeld, B., Feldman, R., Fresko, M., Schler, J., and Aumann, Y. (2004). TEG: a hybrid approach to information extraction. In Grossman, D., Gravano, L., Zhai, C., Herzog, O., and Evans, D. A., editors, CIKM, pages 589–596. ACM.
