Model-based Integration of Unstructured Web Data Sources using Graph Representation of Document Contents

Radek Burget

Faculty of Information Technology, Brno University of Technology, Bozetechova 2, Brno, Czech Republic

Keywords:

Information Integration, Domain Modelling, Document Processing, Structured Record Extraction.

Abstract:

Unstructured or semi-structured documents on the web are often used as a medium for publishing structured, domain-specific data which is not available from other sources. Integration of such documents as a data source into a standard information system is still a challenging problem because of the very loose structure of the input documents and the usually missing semantic annotation of the published data. In this paper, we propose an approach to data integration that exploits the domain model of the target information system. First, we propose a graph-based model of the input document that allows the contained data to be interpreted in different alternative ways. Further, we propose a method of aligning the document model with the target domain model by evaluating all possible mappings between the two models. Finally, we demonstrate the applicability of the proposed approach on a sample domain of public transportation timetables and we present the preliminary results achieved with real-world documents available on the web.

1 INTRODUCTION

Despite much effort dedicated to the development of different technical means for annotating the semantics of the presented data, such as Microformats, RDFa and others, the World Wide Web is still an extremely large source of mostly unannotated documents. These documents often contain structured and potentially useful data presented in a way that is convenient for human readers but completely unsuitable for automated processing. Therefore, using the documents as a data source for traditional information systems that are based on structured data models presents a challenging task.

A typical domain-oriented information system uses a structured data representation and storage (for example a relational database), which has been designed based on the analysis of the target domain, identification of the individual entities, their properties and the relationships among them. However, on the web, many potential sources of domain-specific data have the form of documents designed primarily for human readers. Although the data contained in these documents follow basically the same structure that comes from the target domain, their integration into an existing information system is difficult because of the very loose way of their presentation without any formal annotation.

Footnotes: author ORCID https://orcid.org/0000-0001-5233-0456; Microformats https://microformats.io/; RDFa https://rdfa.info/

In (Burget, 2017), we have mentioned several domains where this situation is quite typical, such as scholarly data (conference proceedings contents), sports results or public transport timetables. In all these (and many other) domains, the data has a fixed and predictable structure that potentially allows its integration into existing applications in the respective domains. However, the corresponding data sources often have the form of periodically published documents (mostly web pages; PDF documents are typical for some domains such as timetables) that assume human interpretation for understanding the presented data.

Traditionally, the integration of such web sources is implemented using different kinds of wrappers that recognize data fields in the documents by analyzing the underlying document code, mostly the HTML code represented as a Document Object Model (DOM) (Schulz et al., 2016). For each data source (the source of the input documents), the corresponding code patterns are different and therefore, a specific wrapper must be prepared. Such an approach is reliable and feasible when considering a limited number of previously known data sources that provide a larger number of documents, but it is not practical at all when the input documents come from previously unknown sources, each document has been prepared independently and uses a completely different way of data presentation.

Burget, R. Model-based Integration of Unstructured Web Data Sources using Graph Representation of Document Contents. DOI: 10.5220/0008350103260333. In Proceedings of the 15th International Conference on Web Information Systems and Technologies (WEBIST 2019), pages 326-333. ISBN: 978-989-758-386-5.

In this paper, we propose a model-based approach aiming to overcome the specific details of the individual documents by an automatic discovery of a mapping between the previously defined domain data model and the presented data records. The main presented contributions are the following:

• We present a technology- and language-independent graph-based model of the document contents that allows the contained data to be interpreted in different alternative ways.

• We propose a method for evaluating the possible mappings between the created document model and the target domain model that describes the expected structure of the contained data, and for choosing the best mapping based on a statistical analysis.

• We demonstrate the application of the described approach on a sample domain of public transport timetables.

We also include preliminary results of this work in progress that show the applicability of the proposed document model and mapping methods on real-world documents.

2 RELATED WORK

The research on data record extraction from web documents has been running for over 20 years. Apart from historical HTML-based approaches (Schulz et al., 2016), due to the evolution of web technology (mainly the HTML and CSS languages and dynamic web pages) and the increasing complexity of web documents, the recent approaches usually combine the analysis of the document code with visual presentation properties (Potvin and Villemaire, 2019; Shi et al., 2015). However, most of the current methods use the DOM (https://www.w3.org/DOM/) as the primary document representation (Figueiredo et al., 2017; Guo et al., 2019; Lockard et al., 2018; Shi et al., 2015; Yuliana and Chang, 2018). This limits the applicability of the methods to specific HTML documents where the DOM elements accurately delimit the desired data fields.

From the data integration point of view, the current methods infer the schema of the extracted records from the source documents themselves (Figueiredo et al., 2017; Shi et al., 2015; Yuliana and Chang, 2018). In all cases, the approach is to find a specific region or multiple regions (Figueiredo et al., 2017) that contain the records; then, a flat internal structure of the records is determined by finding the regular patterns in the document code and by comparing the similarity and other characteristics of the repeating sequences. This approach allows easy application of the methods to any document independently of its domain; however, the integration of the extracted data into a domain information system requires further interpretation and transformation of the extracted records.

In contrast to the above-mentioned data-driven approaches, significantly less attention has been given to the research of model-driven approaches. (Embley et al., 1999) use a conceptual domain model that is directly mapped to HTML code based on different heuristics. In (Potvin and Villemaire, 2019), a flat list of extracted data fields is used and (Lockard et al., 2018) integrate the extracted data with an existing knowledge base.

In our previous research (Burget, 2017), we have proposed a basic approach for matching individual binary relationships in a domain model to visual presentation patterns in the documents. In this paper, we generalize the matching to whole domain models and, above all, we introduce a formal graph-based model of the input documents that makes the matching possible.

3 THE DATA INTEGRATION TASK

The data integration task we consider in this paper is the following: We have a (potentially unlimited) collection of unstructured input documents on the source side and a structured domain-specific information system on the target side.

The target information system is typically designed based on the analysis of the particular domain, which results in a domain data model such as an entity-relationship diagram (ERD) or its equivalent depending on the design methodology used. The model captures the basic entity sets, their properties (attributes) and the relationships among them. Independently of whether an ERD or another formalism is used, we may define a domain model for our purpose as follows:

Definition 1. A domain model is a tuple D = (E, P, R), where E is a set of entity sets, P is a set of properties (attributes in ERD) and R ⊂ (E × (E ∪ P)) is the set of relationships.


Figure 1 shows a simple ERD for the public transport timetables domain. Note that we consider just the part of the ERD that is relevant to the considered data sources; the complete ERD for a real-world information system would obviously be significantly larger.

Figure 1: An entity-relationship model for the domain of public transportation timetables with two entity sets (Time, Stop), three properties (Hour, Minute, Name) and one relationship (stops at).
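As an illustration of Definition 1, the timetable domain from Figure 1 can be written down as a small data structure. This is only a sketch under stated assumptions: the class name `DomainModel`, the lowercase property names and the direction of the "stops at" pair (from Stop to Time) are chosen here for illustration and are not prescribed by the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainModel:
    """D = (E, P, R) in the sense of Definition 1 (hypothetical helper class)."""
    entity_sets: frozenset    # E
    properties: frozenset     # P
    relationships: frozenset  # R, a subset of E x (E u P)

    def __post_init__(self):
        # Check that every relationship starts at an entity set and ends
        # at an entity set or a property, as Definition 1 requires.
        targets = self.entity_sets | self.properties
        for source, target in self.relationships:
            if source not in self.entity_sets or target not in targets:
                raise ValueError(f"invalid relationship: {(source, target)}")


# The timetable domain of Figure 1.
timetable = DomainModel(
    entity_sets=frozenset({"Time", "Stop"}),
    properties=frozenset({"hour", "minute", "name"}),
    relationships=frozenset({
        ("Time", "hour"),
        ("Time", "minute"),
        ("Stop", "name"),
        ("Stop", "Time"),  # "stops at"; the direction is an assumption
    }),
)
```

Entity-to-property pairs and the entity-to-entity pair are stored in the same set R, which mirrors the definition R ⊂ (E × (E ∪ P)).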

At the input, we assume a collection of documents that contain visually presented structured data records consisting of several data fields. The documents generally come from different sources and therefore, the way of data presentation, formatting or the implementation may be different for every single document. However, we make the following assumptions about the input documents:

• We assume formatted text documents where the document creator may specify the visual properties (fonts, colors, etc.) for every part of the document text as well as the visual organization of the contents (alignment, spacing, etc.) by any means. For web sources, HTML web pages and PDF documents are the most typical, but our content model presented below in section 4 is independent of the actual technology.

• Every document contains multiple data records consisting of data fields that may be directly mapped to the properties in the target ERD (i.e. without any additional transformations), and the records are consistent regarding their structure and visual presentation (their visual properties and organization as mentioned above).

Figure 2 gives an overview of the document processing. First, the visual properties and positions of all parts of the document text are computed. This is the only task that depends on the document type. For some document types such as HTML, this requires rendering the document by a web browser. In PDF documents, the necessary information is available directly. In the next step, we identify the text chunks, i.e. the candidate substrings of the document text that could potentially represent a data field. Based on the extracted text chunks, we build a page contents model, which is basically a graph that describes the visual properties of the individual chunks and the visually presented relationships among them. We describe the model and its construction below in section 4.

The key part of the information integration process consists of finding the most appropriate mapping between the created document contents graph and the domain model. For this purpose, we also represent the domain model as a graph of the entity properties and the relationships among them and we search for the best mapping between the two graphs. The details of this process are described in section 5.

In the following sections, we will use the already mentioned public transport timetables as a sample domain. Our goal is to integrate the data about the stops and the corresponding times from the timetable documents as shown in Figure 1. We believe this domain is suitable for illustrating the individual steps for the following reasons:

• It is challenging. The timetables are a good example of source documents that present data in a very ambiguous way; even human readers need some experience to interpret the data properly in some more complex cases.

• It is practically useful. Although there exist different portals and aggregators in this domain, they are usually limited to certain countries, regions or groups of companies and they typically do not provide their structured data to third parties.

• There are many highly diverse documents from different transportation companies available on the web.

However, the presented integration approach is not limited to a single domain as long as the above-mentioned assumptions about the input documents are met.

4 DOCUMENT CONTENTS MODEL

The goal of the proposed document contents model is to capture the possibly relevant parts of the document contents and their mutual relationships based on their visual presentation. We define the model as a graph:

Definition 2. The document contents model is defined as a graph G = (C, E), where C is a set of text chunks that represent the relevant parts of the contents together with their visual formatting and form the vertices of the graph; E ⊂ C × C is a set of graph edges that represent the relationships among the chunks as expressed by the document layout.

Figure 2: An overview of the data integration process. Diagram labels: HTML documents, PDF documents, Page rendering, Text boxes, Text chunk extraction, Page contents model (graph), Domain model, Attributes (properties), Presentation to domain model mapping, Structured data records.

Figure 3: An example time table. The text content of the figure is:

Lincoln | County Hospital
Monday to Saturday except Bank Holidays
Lincoln Bus Station 0700 0720 0745 15 45
Monks Road 12 0710 0730 0755 25 55
Tower Estate 0712 0732 0757 27 57
County Hospital 0719 0739 0805 35 05
Tower Estate 0722 0742 0808 38 08
Monks Road 12 0725 0745 0811 41 11
Lincoln Bus Station 0740 0800 0825 55 25
Lincoln Bus Station 1545 1620 1650 1720 1750
Monks Road 12 1555 1630 1700 1730 1800
Tower Estate 1557 1632 1702 1732 1802
County Hospital 1605 1640 1710 1740 1810
Tower Estate 1608 1643 1713 1743 1813
Monks Road 12 1611 1646 1716 1746 1816
Lincoln Bus Station 1625 1700 1730 1800 1830
for Sunday journeys see line 17 & 18 timetables

The timetable also carries the note "then every 30 mins until".

By a text chunk, we understand any piece of content (a substring of the document text) that possibly represents a value of a domain property. At the moment of the chunk extraction, we do not decide whether the given substring really represents a part of a data record; the goal is to identify all substrings that “look like” a value of a given property when considered separately.

Definition 3. A text chunk is a tuple $c = (t_c, s_c, p_c)$ where $t_c$ is the text of the chunk (the actual substring of the document text), $s_c$ represents the visual style of the text and $p_c$ represents the position of the chunk as displayed in the resulting page.

Definition 4. The chunk style is further defined as $s_c = (fs, w, st, c, bc)$ where $fs$ is the average font size, $w \in [0, 1]$ is the average font weight from 0 (normal font) to 1 (bold font), $st \in [0, 1]$ is the average font style (1 for italic font, 0 for regular font) and $c$ and $bc$ are the computed foreground and background colors of the displayed chunk.

Definition 5. The position $p_c = (x, y, w, h)$ describes the x and y coordinates of the chunk in the page and its width w and height h.
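Definitions 3 to 5 translate naturally into plain record types. The following Python sketch is one possible encoding; the class and field names, the representation of colors as hexadecimal strings, the example coordinates and the character-weighted averaging of the font weight are all assumptions made for illustration, since the text only states that the weight is an average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkStyle:
    font_size: float   # fs, average font size
    weight: float      # w in [0, 1], 0 = normal, 1 = bold
    style: float       # st in [0, 1], 0 = regular, 1 = italic
    color: str         # c, computed foreground color (assumed hex string)
    background: str    # bc, computed background color (assumed hex string)


@dataclass(frozen=True)
class Position:
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class TextChunk:
    text: str            # the substring of the document text
    style: ChunkStyle    # visual style of the chunk
    position: Position   # position in the rendered page


def average_weight(runs):
    """Average font weight over (text, is_bold) runs, weighted by character
    count. The weighting scheme is an assumption of this sketch."""
    total = sum(len(text) for text, _ in runs)
    if total == 0:
        return 0.0
    return sum(len(text) for text, bold in runs if bold) / total


# Hypothetical chunk: a stop name whose first word is rendered in bold.
station = TextChunk(
    text="Lincoln Bus Station",
    style=ChunkStyle(
        font_size=9.0,
        weight=average_weight([("Lincoln", True), (" Bus Station", False)]),
        style=0.0,
        color="#000000",
        background="#ffffff",
    ),
    position=Position(x=40.0, y=120.0, width=95.0, height=11.0),
)
```

Keeping the records immutable (`frozen=True`) makes them hashable, so they can be used directly as graph vertices and as members of the relation sets introduced in section 4.2.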

The edges E of the graph represent the mutual relationships among the chunk pairs. Based on their mutual positions, we identify specific relationships that are interesting for further analysis of the whole data record organization. For example, two chunks may be in an onRight, below, sameLine or another relation as described in section 4.2.

Both the chunks and the relationships are extracted from rendered documents as shown in Figure 2. In the next sections, we provide the details of the chunk and relationship extraction.

4.1 Chunk Extraction

For the chunk extraction, we use a connected line as the smallest unit of the rendered document.

Definition 6. A connected line represents a part of the document text that is positioned on a single line (considering the y coordinates of the individual characters) and does not contain an empty space wider than a certain threshold ∆x.

In our experimental setup, we have used ∆x = 2.5 f where f is the average font size used on the considered line. The goal of this setting is to ensure that normal text formed by space-separated words forms single connected lines and that parts separated by a larger space create separate connected lines.

Example 1. In the example timetable in Figure 3, the header and footer text forms continuous text lines; however, the stop names and the time data are separated by a larger space. Thus, we obtain the connected lines “Lincoln Bus Station”, “0700 0720 0745”, “15” and “45” for the first line of the schedule, etc. Note that the connected lines are not necessarily consistent regarding their style; they may contain text with different font weights, colors, etc., as we may notice for the station names.
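The splitting rule of Definition 6 can be sketched in a few lines of Python. The sketch assumes that the words of one visual line are already available as (text, x, width) boxes and that the average font size f of the line is known; the function name and the coordinates in the example are hypothetical and chosen only so that the result reproduces Example 1.

```python
def connected_lines(words, font_size, factor=2.5):
    """Split the words of one visual line into connected lines.

    words: list of (text, x, width) boxes on the same line (assumed input).
    A new connected line starts whenever the horizontal gap to the previous
    word exceeds the threshold factor * font_size (2.5 f in the text).
    """
    threshold = factor * font_size
    lines, current, prev_end = [], [], None
    for text, x, width in sorted(words, key=lambda word: word[1]):
        if prev_end is not None and x - prev_end > threshold:
            lines.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = x + width
    if current:
        lines.append(" ".join(current))
    return lines


# First schedule row of Figure 3 with made-up coordinates (font size 10,
# so the threshold is 25 units).
words = [
    ("Lincoln", 0, 40), ("Bus", 45, 20), ("Station", 70, 45),
    ("0700", 160, 25), ("0720", 190, 25), ("0745", 220, 25),
    ("15", 300, 12), ("45", 340, 12),
]
print(connected_lines(words, font_size=10))
# ['Lincoln Bus Station', '0700 0720 0745', '15', '45']
```

The definition operates on individual characters; working with word boxes, as done here, is a simplification that gives the same result whenever the spaces inside a word are narrower than the threshold.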


For each domain model property p ∈ P (see Definition 1), we extract a set $C_p$ of chunks from all the connected lines. The algorithm for the chunk discovery within the connected lines depends greatly on the type of the property p. For our sample domain, we have used a simple algorithm that finds all one- or two-digit numbers in the appropriate range for hours and minutes; the chunks for stop names are discovered using a simple regular expression allowing a sequence of alphanumeric characters and some commonly used punctuation. In our previous experiments on other domains (Burget, 2017), we have also mentioned the usage of named entity classifiers (Finkel et al., 2005) for recognizing personal names and locations or even using the DBpedia Spotlight tool (Daiber et al., 2013) for recognizing entities from the DBpedia dataset. Advanced algorithms for numeric value discovery have been proposed as well (Neumaier et al., 2016).

As the result of the chunk extraction, we obtain a complete set of chunks for all the properties $C = C_{p_1} \cup C_{p_2} \cup \ldots \cup C_{p_n}$ where $n = |P|$. Note that the chunk detection itself may be quite inaccurate as we use very approximate methods for the chunk extraction. The extracted chunks may even overlap; e.g. in the “Monks Road 12” string, we discover three chunks: the whole string forms the name chunk and the “12” substring forms both an hour and a minute chunk because both interpretations are possible. It is the task of the mapping phase (described in section 5) to complete the data records and exclude the incorrectly discovered chunks.
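The overlapping chunks described above can be reproduced with a short Python sketch. The regular expressions, the function name and the tuple layout are assumptions made for illustration, not the patterns used in the experiments. In particular, this sketch only recognizes standalone one- or two-digit numbers; a compact time such as "0700" would need an additional splitting rule that is not shown here.

```python
import re

# Standalone one- or two-digit numbers (not adjacent to other digits).
NUMBER = re.compile(r"(?<!\d)\d{1,2}(?!\d)")
# Assumed pattern for stop names: starts with a letter, continues with
# alphanumeric characters and some common punctuation.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9 .,'&-]*[A-Za-z0-9.]")


def extract_chunks(line):
    """Return (property, text, start, end) for every candidate chunk found
    in one connected line. Chunks of different properties may overlap."""
    chunks = []
    for match in NUMBER.finditer(line):
        value = int(match.group())
        if 0 <= value <= 23:
            chunks.append(("hour", match.group(), match.start(), match.end()))
        if 0 <= value <= 59:
            chunks.append(("minute", match.group(), match.start(), match.end()))
    for match in NAME.finditer(line):
        chunks.append(("name", match.group(), match.start(), match.end()))
    return chunks


print(extract_chunks("Monks Road 12"))
# [('hour', '12', 11, 13), ('minute', '12', 11, 13), ('name', 'Monks Road 12', 0, 13)]
```

The extraction deliberately over-generates: it is left to the later mapping phase to decide which of the three chunks found in "Monks Road 12" belongs to a data record.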

4.2 Relationship Modelling

After the chunks have been detected, we analyze all the chunk pairs $(c_1, c_2) \in C \times C$ and we investigate whether there is a relationship between $c_1$ and $c_2$ given by their mutual positions $(x_1, y_1)$ and $(x_2, y_2)$. We have identified several relationships that are interesting for further analysis. Every relationship is defined by a relation $E_x \subset C \times C$ and we say that there is a spatial relationship x between $c_1$ and $c_2$ iff $(c_1, c_2) \in E_x$. Currently, we consider the following relationships:

• onRight – $(c_1, c_2) \in E_{onRight}$ when $c_1$ and $c_2$ are placed on the same line just next to each other and $c_2$ is on the right side of $c_1$.

• after – $c_2$ is on the same line anywhere to the right of $c_1$.

• sameLine – $c_1$ and $c_2$ are on the same line regardless of their mutual positions.

• below – $c_2$ is placed just below $c_1$.

• lineBelow – $c_2$ is placed on a line that is just below $c_1$.

As we may see, a chunk pair may belong to multiple relations as the spatial relationships (e.g. after and sameLine) are not mutually exclusive.

Finally, the complete set of relationships is then $E = \bigcup_x E_x$ for all the relations x listed above. Together with the set C of chunks, it creates the document content graph as defined in Definition 2.
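A possible way of computing the five relations is sketched below in Python. The text does not give exact geometric criteria, so the sketch makes several assumptions that should be read as such: every chunk already carries the index of its text line, "just next to each other" is interpreted as the nearest following chunk on the same line, and "just below" is interpreted as lying on the next line with a horizontal overlap. The `Chunk` tuple and the example coordinates are hypothetical.

```python
from collections import namedtuple
from itertools import permutations

# Hypothetical chunk record: property, text, line index, x position, width.
Chunk = namedtuple("Chunk", "prop text line x width")


def spatial_relations(chunks):
    """Compute the relations E_x of section 4.2 as sets of ordered pairs."""
    names = ("onRight", "after", "sameLine", "below", "lineBelow")
    E = {name: set() for name in names}
    for c1, c2 in permutations(chunks, 2):
        if c1.line == c2.line:
            E["sameLine"].add((c1, c2))
            if c2.x >= c1.x + c1.width:           # c2 starts after c1 ends
                E["after"].add((c1, c2))
        elif c2.line == c1.line + 1:
            E["lineBelow"].add((c1, c2))
            overlaps = c2.x < c1.x + c1.width and c1.x < c2.x + c2.width
            if overlaps:                          # assumed meaning of "just below"
                E["below"].add((c1, c2))
    # onRight: among the chunks after c1, keep only the nearest one(s).
    for c1 in chunks:
        followers = [c2 for (first, c2) in E["after"] if first == c1]
        if followers:
            nearest = min(c.x for c in followers)
            E["onRight"].update((c1, c2) for c2 in followers if c2.x == nearest)
    return E


a = Chunk("name", "Lincoln Bus Station", 0, 0, 115)
b = Chunk("hour", "07", 0, 160, 12)
c = Chunk("minute", "00", 0, 172, 12)
d = Chunk("name", "Monks Road 12", 1, 0, 90)
E = spatial_relations([a, b, c, d])
```

In this example, (a, b) is in both onRight and after, whereas (a, c) is only in after, which illustrates that the relations are not mutually exclusive. The union of all five sets gives the edge set E of the document contents graph.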

5 MAPPING TO THE DOMAIN MODEL

Our information integration approach is based on the assumption that some of the extracted text chunks may be mapped to the individual properties of the domain model as defined in Definition 1 and, similarly, some discovered spatial relationships among them may be mapped to the domain model relationships. During the mapping phase, we find all possible mappings from the constructed document contents graph to the domain model, we evaluate them and finally, we use the best mapping found.

Below, we describe the representation of the domain model used for the final mapping. Further, in section 5.2, we define a mapping formally and finally, in section 5.3, we discuss the way of evaluating the individual mappings and finding the most suitable one.

5.1 Domain Model Transformation

As the first step, we transform the domain model to a simplified graph model that describes only the properties and relationships, as the entity sets have no direct representation in the documents. An example model for the timetables domain is shown in Figure 4. The properties are divided into groups (the dashed boxes) where each group corresponds to a set of properties that are always presented together in a 1:1 relationship.

Definition 7. The domain graph model is a graph $D_G = (G, R_G)$ where $G = \{G_1, G_2, \ldots, G_n\}$ is a set of property groups, $G_i \subset P$ and $G_i \cap G_j = \emptyset$ for any $1 \le i, j \le n$, $i \ne j$. P is the set of domain properties. $R_G \subset G \times G$ is a set of relationships between groups.

The domain graph is constructed from the domain model defined in Definition 1 as follows:

• All the properties of a single entity set belong to the same group.

• If two entity sets are in a 1:1 relationship, all their properties belong to a single group.

• The 1:M relationships are transformed to the relationships between the respective groups.

Figure 4: A domain graph model corresponding to the domain model shown in Figure 1 that represents the property groups and the relationships among them. Node labels: name (Stop), minute (Time), hour (Time).

Currently, we do not consider M:N relationships in our method because they are difficult to represent in the documents in an understandable way and therefore, they are very rarely used in source documents.

Considering the example from Figure 1, we obtain two groups of properties: $G_1 = \{hour, minute\}$ and $G_2 = \{name\}$ as shown in Figure 4. Subsequently, we analyze the possible mappings between the document contents graph and the domain graph model.
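The three construction rules can be sketched as a small Python function. The input format (a dictionary of entity sets with their properties and a list of relationships annotated with a cardinality string) and the orientation of the resulting edges from the "one" side to the "many" side are assumptions of this sketch.

```python
def build_domain_graph(entity_props, relationships):
    """Build the property groups and group relationships of the domain graph.

    entity_props:  {entity set: [properties]}
    relationships: [(entity1, entity2, cardinality)], cardinality "1:1" or "1:M"
    Returns (set of groups as frozensets, set of (group, group) edges).
    """
    representative = {entity: entity for entity in entity_props}

    def find(entity):
        # Follow the chain of merged entity sets to its representative.
        while representative[entity] != entity:
            entity = representative[entity]
        return entity

    # Rule 2: entity sets in a 1:1 relationship share a single group.
    for e1, e2, cardinality in relationships:
        if cardinality == "1:1":
            representative[find(e2)] = find(e1)

    # Rule 1: all properties of one entity set belong to the same group.
    merged = {}
    for entity, props in entity_props.items():
        merged.setdefault(find(entity), set()).update(props)
    groups = {rep: frozenset(props) for rep, props in merged.items()}

    # Rule 3: 1:M relationships become relationships between the groups.
    edges = set()
    for e1, e2, cardinality in relationships:
        if cardinality == "1:M":
            g1, g2 = groups[find(e1)], groups[find(e2)]
            if g1 != g2:
                edges.add((g1, g2))
        elif cardinality != "1:1":
            raise ValueError(f"unsupported cardinality: {cardinality}")
    return set(groups.values()), edges


groups, edges = build_domain_graph(
    {"Time": ["hour", "minute"], "Stop": ["name"]},
    [("Stop", "Time", "1:M")],
)
# groups: {hour, minute} and {name}; one edge from {name} to {hour, minute}
```

Any cardinality other than 1:1 or 1:M raises an error, which corresponds to M:N relationships not being considered.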

5.2 Mapping Representation

When considering a particular document represented by the document contents graph, there exist many possible mappings between the chunks and the properties in the domain graph and, similarly, between the relationships in the two graphs. In our approach, a mapping represents a hypothesis about the visual presentation of data records, which is subsequently evaluated and compared with other hypotheses.

Based on the above-mentioned assumption that there exist multiple visually consistent data records in the source documents, a mapping basically describes two aspects of the records:

1. The visual style of the text chunks used for presenting each property p ∈ P in the input document.

2. The actual spatial relationships (as mentioned in section 4.2) among the property values.

Let us consider the chunk style defined in Definition 4 and let S be the set of all distinct chunk styles used in the input document. Further, let $R_S$ be the set of all spatial relationships between chunks discovered in the input document. Then, we may define the mapping between a property group $G_i$ in the domain graph and the document contents graph as follows:

Definition 8 (Group Mapping). For each group $G_i \in G$, the mapping is defined as $m_i = (f_s^i, f_r^i)$ where $f_s^i : G_i \mapsto S$ is a morphism that assigns a chunk style to each property in $G_i$ and $f_r^i : G_i \times G_i \mapsto R_S$ assigns spatial relationships to the property pairs.

The $f_r^i$ morphism does not necessarily assign a relationship to all possible property pairs. For a unique description of the mapping, it is sufficient that the property pairs form a connected graph. For example, considering three properties a, b and c, the mapping may contain $(a, b) \mapsto onRight$, $(a, c) \mapsto below$ (which can be read as b is on the right side of a and c is below a). We obtain a connected graph of properties and therefore, it is not necessary to find any relationship for the remaining combinations such as (b, c). Considering the group $G_1$ in Figure 4, it is sufficient to find one of the morphisms $(hour, minute) \mapsto r$ or $(minute, hour) \mapsto r$, where $r \in R_S$.

Similarly, we define an inter-group mapping that corresponds to the way the connection of two groups is visually presented in the document:

Definition 9 (Inter-group Mapping). For a pair of groups $(G_i, G_j) \in G \times G$, the inter-group mapping is $m_{ij} = (p_i, p_j, r)$ where $p_i \in G_i$, $p_j \in G_j$ and $r \in R_S$.

In other words, we define a spatial relationship r between two properties where the first property belongs to the first group and the second property belongs to the second group. Again, we have to find enough mappings between the group pairs so that we obtain a connected graph of groups. Then, the complete mapping is $m = (M_G, M_I)$ where $M_G$ is a set containing a group mapping for each group in G and $M_I$ is a set of the inter-group mappings.

Example 2. When considering our example timetable in Figure 3 and the domain graph in Figure 4, we find many different styles used for the presentation of the hour values in the document (when considering the style of all the chunks in $C_{hour}$) and similarly for the minute and name properties (the style morphisms $f_s^1$ and $f_s^2$). Moreover, we find different ways in which the (hour, minute) pair is possibly presented, e.g. minute is on the right side of hour or minute is below hour or hour is below minute, etc. (the $f_r^1$ morphism). And finally, we find the possible presentation of the inter-group relation, e.g. hour is on the same line as name. Since hour and name are in separate groups in the domain graph, we know that hour actually represents a complete (hour, minute) group and there may exist multiple such pairs related to a single name because of the 1:M relationship between the groups.
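The requirement that the assigned property pairs form a connected graph can be checked with a simple graph traversal. The following Python sketch treats the pairs as undirected edges; the function name is hypothetical and the example reuses the properties a, b and c from the text above.

```python
def is_connected(properties, pairs):
    """Return True if the given property pairs link all properties of a
    group into a single connected (undirected) graph."""
    properties = set(properties)
    if not properties:
        return True
    neighbours = {p: set() for p in properties}
    for first, second in pairs:
        neighbours[first].add(second)
        neighbours[second].add(first)
    # Depth-first traversal from an arbitrary property.
    seen, stack = set(), [next(iter(properties))]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(neighbours[current] - seen)
    return seen == properties


# The relationship assignment from the text: (a, b) -> onRight, (a, c) -> below.
assignment = {("a", "b"): "onRight", ("a", "c"): "below"}
print(is_connected({"a", "b", "c"}, assignment))  # True
```

Iterating over the dictionary yields its keys, i.e. the property pairs, so the same check applies both to the intra-group assignments and, with groups in place of properties, to the set of inter-group mappings.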

By considering all combinations of chunk styles and the applicable intra-group and inter-group relationship representations, we obtain a set M of all possible mappings from the contents graph to the domain model graph.
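For the timetable domain, the enumeration of candidate mappings amounts to a Cartesian product of the individual choices. The Python sketch below is a deliberately simplified illustration: it assumes exactly one intra-group relationship and one inter-group relationship per mapping, which is sufficient for the two groups of Figure 4 but not for larger domain graphs. The function name, the dictionary layout and the style labels are hypothetical.

```python
from itertools import product


def enumerate_mappings(styles_by_property, intra_candidates, inter_candidates):
    """Yield every combination of one style per property, one intra-group
    relationship and one inter-group relationship.

    styles_by_property: {property: [candidate styles]}
    intra_candidates:   [(property1, property2, relation)] within a group
    inter_candidates:   [(property1, property2, relation)] across groups
    """
    props = sorted(styles_by_property)
    style_options = [styles_by_property[p] for p in props]
    for style_choice in product(*style_options):
        for intra, inter in product(intra_candidates, inter_candidates):
            yield {
                "styles": dict(zip(props, style_choice)),
                "intra": intra,
                "inter": inter,
            }


candidates = list(enumerate_mappings(
    {"hour": ["regular", "bold"], "minute": ["regular", "bold"], "name": ["bold"]},
    [("hour", "minute", "onRight"), ("hour", "minute", "below"),
     ("minute", "hour", "below")],
    [("name", "hour", "sameLine"), ("name", "hour", "after")],
))
print(len(candidates))  # 24
```

The count grows multiplicatively (here 2 · 2 · 1 style choices times 3 · 2 relationship choices), which is why each candidate is treated as a hypothesis to be scored rather than inspected manually.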

5.3 Evaluation of the Mappings

The last step is the evaluation of all the mappings and choosing the most suitable one. For each mapping m ∈ M, we apply the style and spatial relationship mappings to the document text chunks and, as a result, we obtain a set of candidate data records, where each data field of the record is represented by a text chunk.

For the evaluation, the following aspects of the discovered candidate records are important:

• The number of chunks actually covered by the records. Although we admit that some of the chunks may have been incorrectly identified, we assume that the correctly identified ones prevail and thus, more chunks contained in the discovered records indicate a better result.

• Visual consistency of the records. The given mapping defines the actual visual style of the chunks mapped to the individual properties as well as the spatial relationships among them (e.g. hours and minutes being on the same line). However, the records may differ in the distance and alignment of the particular chunks. When evaluating consistency, we compare the individual records and we observe the variance of the corresponding distances among the chunks. The lower the overall variance is, the more consistent (and thus better) the records are.

Additionally, we allow using a certain number of wildcards in style specifications. It is quite common in visual presentation that some of the records or data fields are distinguished from the others by a different background color, font style, etc. In our experiments, we allow one wildcard in the chunk style, i.e. based on the style defined in Definition 4, one of the (fs, w, st, c, bc) attributes may be disregarded.
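A style wildcard can be modelled by leaving one attribute of the style tuple unspecified. In the Python sketch below, `None` plays the role of the wildcard; the tuple layout follows Definition 4, while the function names and the example values are assumptions made for illustration.

```python
from collections import namedtuple

# Style attributes in the order of Definition 4.
Style = namedtuple("Style", "fs w st c bc")


def style_matches(pattern, style):
    """Compare a style with a pattern; attributes set to None in the
    pattern are wildcards and are disregarded."""
    return all(p is None or p == s for p, s in zip(pattern, style))


def with_one_wildcard(style):
    """All patterns obtained from a style by disregarding one attribute."""
    return [style._replace(**{field: None}) for field in Style._fields]


normal = Style(fs=10, w=0, st=0, c="#000000", bc="#ffffff")
shaded = Style(fs=10, w=0, st=0, c="#000000", bc="#eeeeee")

pattern = normal._replace(bc=None)      # background color is disregarded
print(style_matches(pattern, shaded))   # True
print(style_matches(normal, shaded))    # False
```

With the background color disregarded, records printed on alternating row backgrounds are matched by a single style pattern instead of being split into two competing hypotheses.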

As we may notice, the two evaluation criteria mentioned above are contradictory to some extent. It is easy to cover a large number of chunks and discover many data records when we allow low visual consistency of records and vice versa. For our experiments, we have empirically set the total mapping score to s = 0.6p + 0.4c where p is the fraction of chunks contained in the records and c is the visual consistency, p, c ∈ [0, 1].
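The scoring can be sketched in Python as follows. The weights 0.6 and 0.4 are taken from the text. How the variance of the distances is turned into a consistency value c ∈ [0, 1] is not specified, so the normalization 1 / (1 + mean variance) used here is purely an assumption of this sketch, as are the function names and the input format.

```python
from statistics import pvariance


def coverage(records, all_chunks):
    """p: fraction of the extracted chunks that occur in some record."""
    covered = {chunk for record in records for chunk in record}
    return len(covered) / len(all_chunks) if all_chunks else 0.0


def consistency(distances_per_record):
    """c: visual consistency derived from the variance of corresponding
    distances across records. The mapping 1 / (1 + mean variance) is an
    assumed normalization, not taken from the text."""
    if len(distances_per_record) < 2:
        return 1.0
    # Each column holds the same distance measured in every record.
    variances = [pvariance(column) for column in zip(*distances_per_record)]
    if not variances:
        return 1.0
    return 1.0 / (1.0 + sum(variances) / len(variances))


def mapping_score(p, c):
    """Total mapping score s = 0.6 p + 0.4 c."""
    return 0.6 * p + 0.4 * c


# Two records covering 4 of 8 chunks, with identical chunk distances.
p = coverage([("c1", "c2"), ("c3", "c4")],
             ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"])
c = consistency([(5.0, 10.0), (5.0, 10.0)])
s = mapping_score(p, c)   # 0.6 * 0.5 + 0.4 * 1.0
```

The best mapping is then the candidate with the highest score s; because p rewards covering more chunks and c rewards uniform layout, the weights determine how the trade-off described above is resolved.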

6 EXPERIMENTAL EVALUATION

For the evaluation on real-world documents, we have implemented the proposed method in Java. For input document processing, we have used the CSSBox rendering engine for HTML documents and the PDFBox library for reading PDF documents.

Our preliminary tests (this being a work in progress) were run on 30 timetables in PDF available online on the websites of various transportation companies that operate in different countries (Czechia, Spain, Italy and the United States). As a second use case, we have extracted the publication data (authors, titles and sessions) from CEUR Workshop Proceedings (HTML documents).

The tests have shown the practical usability of the proposed document contents model described in section 4 as well as the domain mapping method. However, in about 10% of input documents, the correct mapping was not evaluated as the best one and the evaluation function had to be adjusted to obtain correct results. Therefore, we consider the mapping evaluation the main issue for our ongoing research.

7 CONCLUSIONS

In this paper, we have proposed an approach to the integration of the data contained in web documents into structured information systems. Unlike most of the existing approaches that derive the data structure from the input documents, our method is driven by a previously defined domain model of the information system.

In order to make the information integration possible, we have designed a graph-based model of the document contents and subsequently, we have proposed a method for finding the best mapping of the document contents model to the domain model. Our preliminary results show that the approach allows integration of real-world HTML and PDF documents and mapping of the published data to the fixed domain model. The evaluation of the possible mappings seems to be the most challenging topic for our further research.

Footnotes: CSSBox http://cssbox.sourceforge.net; PDFBox https://pdfbox.apache.org/; CEUR Workshop Proceedings http://ceur-ws.org/


ACKNOWLEDGEMENTS

This work was supported by the Ministry of the Interior of the Czech Republic as a part of the project Integrated platform for analysis of digital data from security incidents VI20172020062.

REFERENCES

Burget, R. (2017). Information extraction from the web by matching visual presentation patterns. In Knowledge Graphs and Language Technology: ISWC 2016 International Workshops: KEKI and NLP&DBpedia, Lecture Notes in Computer Science vol. 10579, pages 10–26. Springer International Publishing.

Daiber, J., Jakob, M., Hokamp, C., and Mendes, P. N. (2013). Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics).

Embley, D. W., Campbell, D. M., Jiang, Y. S., Liddle, S. W., Lonsdale, D. W., Ng, Y.-K., and Smith, R. D. (1999). Conceptual-model-based data extraction from multiple-record web pages. Data Knowl. Eng., 31(3):227–251.

Figueiredo, L. N. L., de Assis, G. T., and Ferreira, A. A. (2017). Derin: A data extraction method based on rendering information and n-gram. Information Processing & Management, 53(5):1120–1138.

Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 363–370.

Guo, J., Crescenzi, V., Furche, T., Grasso, G., and Gottlob, G. (2019). Red: Redundancy-driven data extraction from result pages? In The World Wide Web Conference, WWW ’19, pages 605–615, New York, NY, USA. ACM.

Lockard, C., Dong, X. L., Einolghozati, A., and Shiralkar, P. (2018). Ceres: Distantly supervised relation extraction from the semi-structured web. Proc. VLDB Endow., 11(10):1084–1096.

Neumaier, S., Umbrich, J., Parreira, J. X., and Polleres, A. (2016). Multi-level semantic labelling of numerical values. In The Semantic Web – ISWC 2016, pages 428–445, Cham. Springer International Publishing.

Potvin, B. and Villemaire, R. (2019). Robust web data extraction based on unsupervised visual validation. In Intelligent Information and Database Systems, pages 77–89, Cham. Springer International Publishing.

Schulz, A., Lässig, J., and Gaedke, M. (2016). Practical web data extraction: Are we there yet? – a short survey. In 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pages 562–567.

Shi, S., Liu, C., Shen, Y., Yuan, C., and Huang, Y. (2015). Autorm: An effective approach for automatic web data record mining. Knowledge-Based Systems, 89:314–331.

Yuliana, O. Y. and Chang, C.-H. (2018). A novel alignment algorithm for effective web data extraction from singleton-item pages. Applied Intelligence, 48(11):4355–4370.
