ExtractInMultiLine(NodeSet) {
    LineSets ← divideByLine(NodeSet);
    for each LineSet in LineSets {
        <TITLE, LINK> ← extractTitleLink(LineSet);
        Push(ResultList, <TITLE, LINK, pubDate>);
    }
}
InfoExtract(M, <r, c, n>, j, TPL) {
    TR ← total row number of M;
    for (i = r; i < r+n; i++) {
        TN ← getTimeNode(R[i]);
        TBN ← getNode(M[i, j]);
        pubDate ← getTime(TN, TPL[i]);
        NodeSetB ← searchInBorder(TBN);
        if (NodeSetB ≠ NULL) {
            isInSameLine ← checkPostion(NodeSetB);
            if (isInSameLine == TRUE) {
                <TITLE, LINK> ← extractTitleLink(NodeSetB);
                Push(ResultList, <TITLE, LINK, pubDate>);
            }
            else {
                ExtractInMultiLine(NodeSetB);
            }
        }
        else {
            NodeSetL ← searchInLine(TBN);
            if (NodeSetL ≠ NULL) {
                <TITLE, LINK> ← extractTitleLink(NodeSetL);
                Push(ResultList, <TITLE, LINK, pubDate>);
            }
            else {
                if (i ≠ TR-1) {
                    NextTBN ← getNode(M[i+1, j]);
                }
                else {
                    NextTBN ← detectSearchBorder();
                }
                AreaSet ← searchArea(TBN, NextTBN);
                ExtractInMultiLine(AreaSet);
            }
        }
    }
}
in the j-th column of M are the same or not. If the
values are not the same, splitByValues(<r, j, n>)
segments the section <r, j, n> into k sub-sections, in
each of which the values in the j-th column are the same.
When each sub-section contains only one row, the
segmentation process stops and we can extract the
information items in the current section.
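
The segmentation step can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the paper's implementation: M is taken to be a plain list of rows, each sub-section is assumed to be a contiguous run of rows, the sample values are invented, and split_by_values is a hypothetical stand-in for splitByValues.

```python
from itertools import groupby


def split_by_values(M, r, j, n):
    """Split section <r, j, n> into runs of rows sharing the same value in column j.

    Returns a list of (start_row, j, row_count) triples, one per sub-section.
    Assumption: sub-sections are contiguous runs of rows.
    """
    sections = []
    for _, run in groupby(range(r, r + n), key=lambda i: M[i][j]):
        rows = list(run)
        sections.append((rows[0], j, len(rows)))
    return sections


# Invented sample matrix: four rows, three columns.
M = [
    ["table", "tr1", "td"],
    ["table", "tr1", "td"],
    ["table", "tr2", "td"],
    ["table", "tr3", "td"],
]

sub_sections = split_by_values(M, 0, 1, 4)
print(sub_sections)
```

Here the first two rows share a value in column 1 and form one sub-section of two rows, while the remaining rows each form a single-row sub-section. Applying the same split to the next column of any multi-row sub-section would continue the process until every sub-section holds one row.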
Although HTML pages containing the time
pattern have diverse contents and structures, they
can be classified into two types in terms of
layout. In the first type, each news item has an
individual release time; the page shown in
Figure 2(a) is a typical example. The page in Figure
2(b) is an example of the second type, in which
multiple news items follow each release time. The
algorithm in Figure 6 describes the details of
information extraction based on the structure and
layout analysis.
Figure 6: Algorithm for information item extraction
Figure 7: Information Extraction in multiple lines
getTimeNode(R[i]) returns the time node TN
corresponding to the i-th row of M. getNode(M[i, j])
returns a node TBN corresponding to M[i, j], which
defines the border of the current TN. getTime extracts
the time information from TN based on the
corresponding time pattern stored in TPL and outputs
pubDate in a standard format such as ‘Tue, 18 Jan
2005 07:27:42 GMT’. searchInBorder searches for
all <a> nodes under TBN and outputs them to a node set
NodeSetB. checkPostion checks whether all the <a> nodes
in a set are displayed on the same line in a browser.
For list-oriented information, each item is
usually displayed on its own line. This is an
important layout feature. The line presentation relies
on the DOM tree structure and on specific tags such as
<ul>, <li>, <tr>, <p>, <div> and <br>, which
cause a new line in the display. extractTitleLink uses
heuristic rules to select the href attribute of a suitable
<a> node and a proper title text in the current line
as the link and title in RSS feeds. searchInLine
searches for <a> nodes in the line in which TBN is
displayed and outputs them to a set NodeSetL.
ExtractInMultiLine, described in Figure 7, extracts
information items from an <a> node set in which the
nodes are displayed on multiple lines. divideByLine
is used to divide a node set into multiple sub-sets, in
each of which all the nodes are displayed on the same line.
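
The line-based grouping can be approximated with a short Python sketch. It assumes that a new display line starts at every opening or closing occurrence of one of the tags listed in the text, which simplifies real browser rendering. The class AnchorLineCollector, the functions check_same_line and divide_by_line, and the sample markup are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser

# Tags that the text lists as causing a new line in the display.
LINE_BREAK_TAGS = {"ul", "li", "tr", "p", "div", "br"}


class AnchorLineCollector(HTMLParser):
    """Collect <a> nodes and record an approximate display-line index for each."""

    def __init__(self):
        super().__init__()
        self.line = 0        # approximate index of the current display line
        self.anchors = []    # collected anchors: {"line", "href", "text"}
        self._open = None    # anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in LINE_BREAK_TAGS:
            self.line += 1
        elif tag == "a":
            self._open = {"line": self.line,
                          "href": dict(attrs).get("href"),
                          "text": ""}

    def handle_endtag(self, tag):
        if tag in LINE_BREAK_TAGS:
            self.line += 1
        elif tag == "a" and self._open is not None:
            self.anchors.append(self._open)
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data


def check_same_line(anchors):
    """True if all anchors share one line index (or the set is empty)."""
    return len({a["line"] for a in anchors}) <= 1


def divide_by_line(anchors):
    """Group anchors into sub-sets by line index, preserving document order."""
    line_sets = {}
    for a in anchors:
        line_sets.setdefault(a["line"], []).append(a)
    return list(line_sets.values())


page = """
<ul>
  <li><a href="/n/1">First story</a> <a href="/n/1#c">comments</a></li>
  <li><a href="/n/2">Second story</a></li>
</ul>
"""
collector = AnchorLineCollector()
collector.feed(page)
line_sets = divide_by_line(collector.anchors)
hrefs_by_line = [[a["href"] for a in s] for s in line_sets]
print(hrefs_by_line)
```

In this sample, the two links inside the first <li> land in one sub-set and the link in the second <li> in another. A heuristic in the spirit of extractTitleLink would then choose one link and title from each sub-set.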
For some pages, like the example in Figure 2(b),
we detect the positions of two adjacent TBNs and
search for target nodes between them with searchArea.
For the last TBN in M, however, there is no next TBN
to serve as the end border, so detectSearchBorder is
used to decide the end border of the search. In general,
the structure of each section is similar, so we can use
the structure of the previous section to deduce the
current end border. ResultList can then be translated
directly into an RSS format.
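
As an illustration of that last step, the following Python sketch turns a list of <TITLE, LINK, pubDate> triples into RSS item elements. It assumes pubDate is held as a timezone-aware datetime in UTC, whereas in the algorithm getTime already outputs a formatted string. The function name to_rss and the sample entry are hypothetical, and the channel metadata that a complete feed needs is omitted.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def to_rss(result_list):
    """Build an <rss><channel> tree with one <item> per (title, link, pub_date)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for title, link, pub_date in result_list:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        # usegmt=True renders a UTC datetime with the "GMT" suffix.
        ET.SubElement(item, "pubDate").text = format_datetime(pub_date, usegmt=True)
    return rss


# Invented sample entry; the date matches the example format quoted in the text.
result_list = [
    ("Example headline",
     "http://example.com/news/1",
     datetime(2005, 1, 18, 7, 27, 42, tzinfo=timezone.utc)),
]

rss = to_rss(result_list)
print(ET.tostring(rss, encoding="unicode"))
```

The pubDate element produced for the sample entry reads "Tue, 18 Jan 2005 07:27:42 GMT", the same format as the example given for getTime.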
After recognizing all the items in a section, we
can determine the complete border of this section. In
some pages, such as the page in Figure 2(a), each
section has a category title summarizing the content
of the section, which corresponds to the category in
the RSS item. The category data is usually presented
on a line above and adjacent to the first item of the
section, and is contained in contiguous text nodes on
the left part of that line. If the category is presented
as an image, we can use a similar method to check the alt
attribute of the appropriate <img> node. If
necessary, we can also extract this information
automatically.
The idea of time pattern discovery can be
easily extended to mine other distinct format
patterns, such as price patterns, which can be used to
extract pairs of product name and price from
pages on e-commerce sites.
WEBIST 2005 - WEB INTERFACES AND APPLICATIONS