[Figure 4 diagram: group $G_1$ containing hour (Time) and minute (Time), and group $G_2$ containing name (Stop), connected by a 1:n relationship.]
Figure 4: A domain graph model corresponding to the domain model shown in Figure 1 that represents the property groups and the relationships among them.
• The 1:M relationships are transformed to the relationships between the respective groups.
Currently, we do not consider M:N relationships in our method because they are difficult to represent in documents in an understandable way and are therefore very rarely used in source documents.
Considering the example from Figure 1, we obtain two groups of properties: $G_1 = \{hour, minute\}$ and $G_2 = \{name\}$, as shown in Figure 4. Subsequently, we analyze the possible mappings between the document contents graph and the domain graph model.
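The two groups and the relationship between them can be written down as a small data structure. The following is only a minimal sketch: the class name `PropertyGroup`, the helper `group_of` and the `(one_side, many_side)` encoding of the relationship are illustrative assumptions, not part of the described method.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class PropertyGroup:
    """A named group of domain properties (hypothetical helper type)."""
    name: str
    properties: FrozenSet[str]


# The two groups obtained for the timetable example (Figure 4).
G1 = PropertyGroup("G1", frozenset({"hour", "minute"}))
G2 = PropertyGroup("G2", frozenset({"name"}))
groups: List[PropertyGroup] = [G1, G2]

# Relationships between groups, stored as (one_side, many_side) pairs:
# a single name may be related to several (hour, minute) pairs.
group_relations: List[Tuple[PropertyGroup, PropertyGroup]] = [(G2, G1)]


def group_of(prop: str, all_groups: List[PropertyGroup]) -> PropertyGroup:
    """Return the group that contains the given property."""
    for group in all_groups:
        if prop in group.properties:
            return group
    raise KeyError(prop)


print(group_of("minute", groups).name)
```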
5.2 Mapping Representation
When considering a particular document represented by the document contents graph, there exist many possible mappings between the chunks and the properties in the domain graph and, similarly, between the relationships in the two graphs. In our approach, a mapping represents a hypothesis about the visual presentation of data records, which is subsequently evaluated and compared with other hypotheses.
Based on the above-mentioned assumption that the source documents contain multiple visually consistent data records, a mapping describes two aspects of the records:
1. The visual style of the text chunks used for presenting each property $p \in P$ in the input document.
2. The actual spatial relationships (as mentioned in Section 4.2) among the property values.
Let us consider the chunk style defined in Definition 4 and let $S$ be the set of all distinct chunk styles used in the input document. Further, let $R_s$ be the set of all spatial relationships between chunks discovered in the input document. Then, we may define the mapping between a property group $G_i$ in the domain graph and the document contents graph as follows:
Definition 8 (Group Mapping). For each group $G_i \in G$, the mapping is defined as $m_i = (f_{si}, f_{ri})$ where $f_{si} : G_i \mapsto S$ is a morphism that assigns a chunk style to each property in $G_i$ and $f_{ri} : G_i \times G_i \mapsto R_s$ assigns spatial relationships to the property pairs.
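One possible way to hold a group mapping in code is a pair of dictionaries, one per morphism. This is a sketch under stated assumptions: the class `GroupMapping`, its field names, the method `is_well_formed`, and the style labels are hypothetical, and chunk styles are reduced to opaque strings because the contents of Definition 4 are not restated here.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple


@dataclass
class GroupMapping:
    """m_i = (f_si, f_ri) for one property group G_i (illustrative only)."""
    group: FrozenSet[str]                      # the properties in G_i
    style_of: Dict[str, str]                   # f_si: property -> chunk style
    relation_of: Dict[Tuple[str, str], str] = field(default_factory=dict)  # f_ri

    def is_well_formed(self, styles: Set[str], relations: Set[str]) -> bool:
        """Check that f_si covers G_i with styles from S and that every
        assigned pair lies in G_i x G_i with a relationship from R_s."""
        styles_ok = (
            set(self.style_of) == set(self.group)
            and all(style in styles for style in self.style_of.values())
        )
        relations_ok = all(
            a in self.group and b in self.group and rel in relations
            for (a, b), rel in self.relation_of.items()
        )
        return styles_ok and relations_ok


# Hypothetical style labels and relationship names for the timetable example.
S = {"style_A", "style_B"}
R_s = {"onRight", "below"}

m1 = GroupMapping(
    group=frozenset({"hour", "minute"}),
    style_of={"hour": "style_A", "minute": "style_B"},
    relation_of={("hour", "minute"): "onRight"},
)
print(m1.is_well_formed(S, R_s))
```

Storing $f_{ri}$ as a dictionary keyed by property pairs makes it natural to leave some pairs unassigned.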
The $f_{ri}$ morphism does not necessarily assign a relationship to all possible property pairs. For a unique description of the mapping, it is sufficient that the property pairs with an assigned relationship form a connected graph. For example, considering three properties $a$, $b$ and $c$, the mapping may contain $(a, b) \mapsto onRight$ and $(a, c) \mapsto below$ (which can be read as: $b$ is on the right side of $a$ and $c$ is below $a$). We obtain a connected graph of properties and therefore, it is not necessary to find any relationship for the remaining combinations such as $(b, c)$. Considering the group $G_1$ in Figure 4, it is sufficient to find one of the morphisms $(hour, minute) \mapsto r$ or $(minute, hour) \mapsto r$, where $r \in R_s$.
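The connectivity condition can be checked with a plain breadth-first search over the property pairs that have an assigned relationship. The sketch below is illustrative; the function name `is_connected` and the dictionary encoding of the assignments are assumptions.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple


def is_connected(properties: Set[str], pairs: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the undirected graph whose nodes are `properties`
    and whose edges are `pairs` is connected."""
    properties = set(properties)
    if not properties:
        return True
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(properties))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour in properties and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == properties


# The three-property example: (a, b) -> onRight, (a, c) -> below.
assignments = {("a", "b"): "onRight", ("a", "c"): "below"}
print(is_connected({"a", "b", "c"}, assignments.keys()))  # True
print(is_connected({"a", "b", "c"}, [("a", "b")]))        # False: c is unreachable
```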
Similarly, we define an inter-group mapping that describes how the connection between two groups is visually presented in the document:
Definition 9 (Inter-group Mapping). For a pair of groups $(G_i, G_j) \in G \times G$, the inter-group mapping is $m_{ij} = (p_i, p_j, r)$ where $p_i \in G_i$, $p_j \in G_j$ and $r \in R_s$.
In other words, we define a spatial relationship $r$ between two properties where the first property belongs to the first group and the second property belongs to the second group. Again, we have to find enough mappings between the group pairs so that we obtain a connected graph of groups. Then, the complete mapping is $m = (M_G, M_I)$ where $M_G$ is a set containing a group mapping for each group in $G$ and $M_I$ is a set of the inter-group mappings.
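The requirement that the inter-group mappings connect all groups can be sketched as follows. Here each inter-group mapping is a plain `(p_i, p_j, r)` tuple, and a small union-find structure is used for the connectivity test; the function names, the group labels and the relationship name `sameLine` are illustrative assumptions.

```python
from typing import Dict, FrozenSet, List, Tuple

InterGroupMapping = Tuple[str, str, str]  # (p_i, p_j, r)


def group_of(prop: str, groups: Dict[str, FrozenSet[str]]) -> str:
    """Return the label of the group containing the given property."""
    for label, props in groups.items():
        if prop in props:
            return label
    raise KeyError(prop)


def groups_connected(groups: Dict[str, FrozenSet[str]],
                     inter_mappings: List[InterGroupMapping]) -> bool:
    """Return True if the inter-group mappings link all groups into one
    connected graph (checked with union-find)."""
    if not groups:
        return True
    parent = {label: label for label in groups}

    def find(label: str) -> str:
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path compression
            label = parent[label]
        return label

    for p_i, p_j, _relation in inter_mappings:
        parent[find(group_of(p_i, groups))] = find(group_of(p_j, groups))
    return len({find(label) for label in groups}) == 1


groups = {"G1": frozenset({"hour", "minute"}), "G2": frozenset({"name"})}
M_I = [("hour", "name", "sameLine")]
print(groups_connected(groups, M_I))  # True
print(groups_connected(groups, []))   # False: G1 and G2 are not linked
```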
Example 2. When considering our example timetable in Figure 3 and the domain graph in Figure 4, we find many different styles used for the presentation of the hour values in the document (when considering the style of all the chunks in $C_{hour}$) and similarly for the minute and name properties (the style morphisms $f_{s1}$ and $f_{s2}$). Moreover, we find different ways in which the (hour, minute) pair may be presented, e.g. minute is on the right side of hour, minute is below hour, or hour is below minute (the $f_{ri}$ morphism). Finally, we find the possible presentations of the inter-group relation, e.g. hour is on the same line as name. Since hour and name are in separate groups in the domain graph, we know that hour actually represents a complete (hour, minute) group and that multiple such pairs may be related to a single name because of the 1:N relationship between the groups.
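To illustrate how quickly the number of hypotheses grows, the candidate group mappings for $G_1$ can be enumerated as a Cartesian product of the candidate styles and the candidate relationship assignments. The style labels and candidate lists below are made up for illustration and are not taken from the timetable in Figure 3.

```python
from itertools import product

# Hypothetical candidate styles observed for each property of G1.
candidate_styles = {
    "hour": ["style_A", "style_B"],
    "minute": ["style_A", "style_C"],
}

# Candidate assignments for the (hour, minute) pair, following the reading
# (a, b) -> onRight means "b is on the right side of a".
candidate_relations = [
    (("hour", "minute"), "onRight"),  # minute is on the right side of hour
    (("hour", "minute"), "below"),    # minute is below hour
    (("minute", "hour"), "below"),    # hour is below minute
]

# Each hypothesis pairs one style assignment with one relationship assignment.
hypotheses = [
    ({"hour": hour_style, "minute": minute_style}, {pair: relation})
    for hour_style, minute_style, (pair, relation) in product(
        candidate_styles["hour"], candidate_styles["minute"], candidate_relations
    )
]

print(len(hypotheses))  # 2 * 2 * 3 = 12
```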
By considering all combinations of chunk styles,
and the applicable intra-group and inter-group rela-
Model-based Integration of Unstructured Web Data Sources using Graph Representation of Document Contents