Figure 5: New property name detection.
is an important clue that helps us confine the boundary
of the property value. In the example in Figure 4,
the value of property “ (name)” can only be
extracted from the node “td”, which is a child node of
the first pattern node “tr”. Hence our problem is to
find those pattern nodes. As mentioned above,
pattern nodes may be composed of multiple nodes,
each of which could be any HTML element, so we propose
a novel method to detect pattern nodes automatically.
Root Node Detection. In order to find the pattern
nodes, we first find the root node. Once the root is
detected, we can extract pattern nodes from its
descendant nodes.
Given a Web page, we first resolve the Web page
into its DOM tree structure. Then we preserve all the
anchor nodes from the leaf nodes. For each anchor
node, we define its pattern as $pat\{tag, class\}$, where
$tag$ is the DOM node type such as “div” or “table”,
while $class$ is the class attribute of that node. For
example, the node “<table class=‘mt20’>...</table>”
in Figure 4 has the pattern $pat\{\text{‘table’}, \text{‘mt20’}\}$.
For each non-leaf node, we select it as a root node
$node_{root}$ when
• The node covers all the anchor nodes which have
the same pat.
• There are no other root nodes among its descendants.
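The two conditions above can be read as "take the deepest non-leaf node that covers every anchor sharing one pat". The following Python sketch illustrates that reading only; it is not the authors' implementation. The `Node` class, the `find_roots` helper, the assumption that every leaf is an anchor node, and the toy tree (including the number of “td” cells per row) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Pat = Tuple[str, str]  # (tag, class), mirroring pat{tag, class}


@dataclass(eq=False)
class Node:
    """Hypothetical minimal DOM node; not a real parser type."""
    tag: str
    cls: str = ""
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    """Collect the leaf nodes below (or equal to) `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def find_roots(tree: Node) -> List[Node]:
    # Assumption: every leaf counts as an anchor node.
    groups: Dict[Pat, Set[int]] = {}
    for anchor in leaves(tree):
        groups.setdefault((anchor.tag, anchor.cls), set()).add(id(anchor))

    roots: List[Node] = []

    def visit(node: Node) -> bool:
        """Return True if this subtree already contains a root."""
        if not node.children:
            return False
        # Visit every child first so that the deepest candidates win.
        found_below = [visit(child) for child in node.children]
        if any(found_below):
            return True  # condition 2: a root already exists below
        covered = {id(leaf) for leaf in leaves(node)}
        # Condition 1: the node covers all anchors of at least one pat.
        if any(ids <= covered for ids in groups.values()):
            roots.append(node)
            return True
        return False

    visit(tree)
    return roots


# Toy tree loosely modeled on Figure 4: table.mt20 > tbody > 8 x tr > 2 x td
rows = [Node("tr", children=[Node("td"), Node("td")]) for _ in range(8)]
table = Node("table", "mt20", [Node("tbody", children=rows)])
print([r.tag for r in find_roots(table)])  # ['tbody']
```

Under this literal reading, a leaf whose pat is unique on the page makes its parent a root; a real system would probably need an additional filter for such cases.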
Pattern Node Detection. For each $node_{root}$,
if it has more than $n$ children, we extract pattern
nodes from its children, denoted as
$patset[subpat_{1,l}, \ldots, subpat_{n,l}]$, where
$subpat_{i,l}\{pat_i, nodeset\}$ represents the node sets
that share the same $pat_i$, and $l$ is the length
of $pat_i$, while $nodeset[nodes_1, \ldots, nodes_m]$
($m >$ minimum repeat count) represents the node
sets. The minimum repeat count is a constraint on
the minimum number of occurrences of $pat_i$. Each
$nodes_i$ contains DOM tree nodes $[child_1, \ldots, child_l]$,
where $child_i$ is a descendant node of $node_{root}$.
For a given $subpat_{i,l}\{pat_i, nodeset\}$, if
$l \geq$ valid pattern length, we consider $pat_{i,l}$
a valid pattern and then extract property-value pairs
from $nodeset$.
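As a rough illustration of this step, the sketch below groups the children of a root into repeated patterns of a fixed length l. It rests on several assumptions not stated in the text: children are represented only by their (tag, class) pairs, windows of length l are taken consecutively and without overlap starting at the first child, and the parameter names and default values (`n`, `min_repeat`, `valid_len`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pat = Tuple[str, str]  # (tag, class) of one child node


def extract_patset(
    child_pats: Sequence[Pat],
    l: int = 1,           # pattern length (number of sibling nodes)
    n: int = 1,           # the root must have more than n children
    min_repeat: int = 2,  # a pattern must occur more than this many times
    valid_len: int = 1,   # patterns shorter than this are discarded
) -> Dict[Tuple[Pat, ...], List[List[int]]]:
    """Map each repeated pattern to its nodeset (lists of child indices)."""
    if len(child_pats) <= n or l < valid_len:
        return {}
    groups: Dict[Tuple[Pat, ...], List[List[int]]] = defaultdict(list)
    # Assumption: non-overlapping windows of l consecutive children.
    for start in range(0, len(child_pats) - l + 1, l):
        window = tuple(child_pats[start:start + l])
        groups[window].append(list(range(start, start + l)))
    return {pat: ns for pat, ns in groups.items() if len(ns) > min_repeat}


# Eight "tr" children under one root, as in the Figure 4 example.
patset = extract_patset([("tr", "")] * 8)
print(patset)
```

With eight “tr” children this yields a single pattern of length 1 whose nodeset holds the indices 0 to 7. Calling `extract_patset([("dt", ""), ("dd", "")] * 3, l=2)` would instead group each dt/dd pair into one `nodes_i`.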
In the example in Figure 4, the node “table” is
not a root node since it has only one child, “tbody”.
The node “tbody” is a root node since we can extract
patset from its eight child “tr” nodes, that is
$patset[subpat\{tr, nodeset[0, 1, \ldots, 7]\}]$, and then we
can extract property-value pairs from each “tr” node.
Rule-based Value Parser. Parsing values from
free text is task specific. In our task, we developed
several rule-based parsers (shown in Figure 2) for
different types of organization properties, such as date,
number, address and organization name. Note
that the procedure described earlier in this section
is independent of subject and language (except for the
handcrafted dictionary).
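The actual rules are not listed in this section. Purely as an illustration of what a rule-based parser for the date and number types might look like, here is a small Python sketch; the regular expressions, the accepted separators (including the CJK date markers) and the function names are assumptions, not the parsers shown in Figure 2.

```python
import re
from typing import Optional

# Assumed rule: four-digit year, then month and day, with common separators.
DATE_RE = re.compile(r"(\d{4})[-/.年](\d{1,2})[-/.月](\d{1,2})")
# Assumed rule: optional sign, optional thousands separators, optional fraction.
NUMBER_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def parse_date(text: str) -> Optional[str]:
    """Return the first date in `text` as YYYY-MM-DD, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    year, month, day = (int(g) for g in match.groups())
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None  # reject obviously invalid dates
    return f"{year:04d}-{month:02d}-{day:02d}"


def parse_number(text: str) -> Optional[float]:
    """Return the first number in `text`, or None."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_date("Founded: 2003/7/15"))
print(parse_number("Staff: 1,250 people"))
```

Address and organization-name parsers would need richer, language-specific rules and are omitted here.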
One problem is that the extracted property-value
pairs may describe multiple organizations. In
this paper, we simply split the property-value pairs
into different groups, ensuring that each group has
only one property representing the organization’s
name.
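One simple way to realize such a split, sketched here under the assumption that pairs arrive in document order and that each organization's name precedes its other properties, is to start a new group whenever a name property is seen. The function name and the `"name"` key are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (property, value)


def split_by_name(pairs: List[Pair], name_key: str = "name") -> List[List[Pair]]:
    """Split pairs into groups that each hold at most one name property."""
    groups: List[List[Pair]] = []
    for prop, value in pairs:
        # A name property (or the very first pair) opens a new group.
        if prop == name_key or not groups:
            groups.append([])
        groups[-1].append((prop, value))
    return groups


pairs = [
    ("name", "Org A"), ("address", "1 Main St"),
    ("name", "Org B"), ("manager", "Li"),
]
print(split_by_name(pairs))
```

Pairs that appear before the first name end up in a group without a name, which a downstream step would have to handle.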
Entity Linkage. After extracting the property-value
pairs, we link the pairs with the official data set.
Since the organization name and address are already
known (see Figure 3), we can link extracted
property-value pairs with an organization entity by matching
their name and address properties. Property-value
pairs that fail to match the source organizations
are grouped by matching their name and address
properties with each other, and the grouped pairs
form new organization entities.
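The linkage logic can be sketched as a lookup keyed on name and address. The matching function is not specified in the text, so the example below assumes exact matching after a simple normalization (lowercasing and removing whitespace); the `normalize` and `link` helpers, the record layout and the `id` field are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and drop all whitespace."""
    return "".join(text.lower().split())


def link(groups: List[Record], official: List[Record]) -> Tuple[Dict[str, Record], List[Record]]:
    """Attach groups to official entities; merge the rest into new entities."""
    index = {
        (normalize(e["name"]), normalize(e["address"])): e["id"] for e in official
    }
    linked: Dict[str, Record] = {}
    new_entities: Dict[Tuple[str, str], Record] = {}
    for group in groups:
        key = (normalize(group.get("name", "")), normalize(group.get("address", "")))
        if key in index:
            linked.setdefault(index[key], {}).update(group)
        else:
            # Unmatched groups with the same name and address are merged.
            new_entities.setdefault(key, {}).update(group)
    return linked, list(new_entities.values())


official = [{"id": "org-1", "name": "Acme Ltd", "address": "1 Main St"}]
groups = [
    {"name": "ACME Ltd", "address": "1 Main  St", "manager": "Li"},
    {"name": "Beta Co", "address": "2 Side Rd", "phone": "555"},
    {"name": "Beta Co", "address": "2 side rd", "founded": "2001-01-01"},
]
linked, new = link(groups, official)
print(sorted(linked), len(new))
```

In this toy data the first group links to `org-1`, while the two “Beta Co” groups are merged into one new entity. A production system would likely replace the exact-match key with fuzzy name and address matching.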
3.4 New Property Name Detection
As mentioned in Section 3.3.1, the handcrafted
dictionary fails to cover all the properties on the Web. Some
unknown properties may occur when handling
different Web pages. Moreover, even a known property
may have various expressions in different Web pages;
for example, the property “ (organization name)”
has other synonymous expressions like “ ”, “
”, “ ”, etc. It is therefore important to discover
more properties and expressions to enlarge the
dictionary.
Initially we craft the dictionary with some basic
properties, such as “ (organization name)”, “
(address)”, “ (manager)”, etc. Then during
the step described in Section 3.3.2, for
$nodes_i[child_1, \ldots, child_l]$, if $child_h$ is in the
property dictionary, we record its pattern $pat_h$. If there is