(1) the line following an <hr>-tag, (2) the only line
in the block starting with a number, (3) the only line
in the block having the smallest position code or (4)
the only line in the block following a blank line. After
identifying the sample candidate result records,
ViNTS builds the wrapper. For this purpose, ViNTS
determines the tag paths beginning at the root node of
the result page (<html>-tag) for each identified first
line element. The minimal sub-tree of the result page,
including all search result records, is calculated based
on the tag paths. The search result records are sub-
trees of the result page, which are siblings and have
the same or a similar tag structure. These sub-trees
can be separated by a separator that fulfils the
following conditions: (1) it is a common subset of the
sub-trees of all records, (2) it occurs only once in the
sub-tree of each record, and (3) it contains the rightmost
sub-tree of each result record. There can be several
separators for a dataset. The wrapper is built from the
smallest tag path, which is used to detect the data region
containing the search result records, and the separators,
which are used to separate the result records within the
data region.
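The step of locating the minimal sub-tree from the tag paths can be illustrated with a small sketch. It assumes, as a simplification, that each tag path is a plain list of step labels from the root and that the data region corresponds to the longest common prefix of these paths. The function name data_region_path and the sample paths are hypothetical and are not taken from ViNTS.

```python
def data_region_path(tag_paths):
    """Return the longest common prefix of the given tag paths.

    Each tag path is a list of step labels from the root (<html>) down to
    the first-line element of one candidate record. The common prefix
    addresses the smallest sub-tree that contains all of these elements.
    """
    prefix = []
    for steps in zip(*tag_paths):
        # Keep the step only if every path agrees on it at this depth.
        if all(step == steps[0] for step in steps):
            prefix.append(steps[0])
        else:
            break
    return prefix


# Hypothetical tag paths of three first-line elements on a result page.
first_line_paths = [
    ["html", "body", "div", "table", "tr[1]", "td"],
    ["html", "body", "div", "table", "tr[2]", "td"],
    ["html", "body", "div", "table", "tr[3]", "td"],
]

region = data_region_path(first_line_paths)
print("/".join(region))
```

In this simplified view, the records are the sibling sub-trees directly below the returned path. The actual tag path representation used by ViNTS may carry more information than the labels shown here.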
ViNTS needs sample result pages and an empty
result page as input. This requirement can be difficult
to meet when extracting product records from e-shop
websites, since there is usually no empty result page
that can be used.
(Walther et al., 2010) present an approach for
extracting structured product specifications from
producer websites. To retrieve a product specification,
the algorithm locates the product detail page
on the producer’s website and then extracts and structures
the product attributes of the specification.
To find the producer’s page with the product
specification, (Walther et al., 2010) perform keyword-based
Web searches using the popular search engines
Google, Bing and Yahoo. After the Web search
step, (Walther et al., 2010) rank the results with
a method called “Borda ranking” described in (Liu,
2006). They then analyse the page URI, the
page title and the page content based on domain-specific
terms to find the producer site among all candidates
returned by the Web searches. To
extract the product data in the form of key-value pairs,
(Walther et al., 2010) execute three different wrapper
induction algorithms on the product detail page.
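The merging of several search engine result lists by a Borda-style ranking can be sketched as follows. This is the standard Borda count, in which each list awards a result more points the higher it is placed; whether it matches the exact variant described in (Liu, 2006) is an assumption. The function name borda_rank and the example hosts are hypothetical.

```python
from collections import defaultdict


def borda_rank(rankings):
    """Merge several ranked result lists with a plain Borda count.

    In a list of length n, the result at position 0 receives n points,
    the next one n - 1, and so on. Results missing from a list receive
    no points from it. Ties are broken alphabetically for determinism.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, url in enumerate(ranking):
            scores[url] += n - position
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists returned by three search engines.
engine_a = ["shop.example", "producer.example", "review.example"]
engine_b = ["producer.example", "shop.example", "forum.example"]
engine_c = ["producer.example", "review.example", "shop.example"]

merged = borda_rank([engine_a, engine_b, engine_c])
print(merged)
```

The merged list only orders the candidates. The subsequent analysis of URI, title and content is still needed to decide which candidate is the producer site.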
Each of the three algorithms first clusters the HTML
nodes that contain text into node clusters.
The first algorithm is chosen if example key
phrases are provided as input. It filters the
clusters created in the first step for the nodes that
contain the example phrases. The XPath description
of these nodes is used for wrapper generation. If no key
phrases are provided as input, the second algorithm
is used, which exploits domain knowledge from
already stored product data as key phrases to find the
relevant nodes in the clusters for generating the
wrapper. If neither example key phrases nor domain
knowledge are provided as input, the third algorithm
generates the wrapper from training sets, which are
product pages of related products. In the last step,
the key-value pairs are extracted by splitting text nodes
at separators, such as a colon, identified in the text
nodes.
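The final text node splitting step can be illustrated with a minimal sketch. It assumes that a text node holds one attribute and that the first occurrence of a known separator divides key from value. The function name split_key_value and the default separator set are hypothetical and are not taken from (Walther et al., 2010).

```python
def split_key_value(text, separators=(":", "=")):
    """Split a text node into a (key, value) pair at the earliest separator.

    Returns None if no separator occurs in the text or if either side of
    the separator is empty after trimming whitespace.
    """
    hits = [(text.find(sep), sep) for sep in separators if sep in text]
    if not hits:
        return None
    index, sep = min(hits)  # earliest separator position wins
    key = text[:index].strip()
    value = text[index + len(sep):].strip()
    if not key or not value:
        return None
    return key, value


print(split_key_value("Weight: 1.2 kg"))
print(split_key_value("Free shipping"))
```

Splitting only at the first separator keeps values that contain the separator themselves, such as a ratio or a time, intact.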
The problem with using the approach of (Walther
et al., 2010) for automated product record extraction
is that it requires example product data for arbitrary
product domains, which either has to be given by the users
in the form of key phrases or must be provided
by the system as domain knowledge. To obtain good
results, the key phrases provided by the users or the
system must match the phrases on the producers’ product
detail pages; otherwise the approach will not work.
Additionally, numerous steps and different algorithms
are needed for the data extraction task.
The approach proposed in (Anderson and Hong,
2013) for extracting product records from the Web is
based on the Visual Block Model (VBM), a product of the
HTML tag tree and the Cascading Style Sheets (CSS)
of a web page. The VBM is created by the rendering
process of a layout engine like WebKit⁵. (Anderson
and Hong, 2013) filter the basic blocks of the page,
which are blocks containing other visual blocks. In
the next step, the similarity of the basic blocks is
determined by calculating the visual similarity, the width
similarity and the block content similarity. Blocks
are visually similar if all of their visual properties
are the same. Width similarity of blocks is given if
their width properties are within a 5-pixel threshold
of each other. Block content similarity exists if
the blocks include similar child blocks, which is
calculated using the Jaccard index described in (Real and
Vargas, 1996) and a preset similarity threshold. For
the product record extraction, (Anderson and Hong,
2013) select a seed candidate block, which is a single
basic block. The seed block is identified by selecting
a visual block in the centre of the page and tracing
the visual blocks around that block, moving clockwise
in the form of an Ulam spiral⁶, until reaching a
basic block, which is taken as the seed block. The seed
block lies within one or more container blocks, of which
one is assumed to be a data record block. Thus, all of
them are taken as candidate blocks. Clusters for all of
the candidate blocks are created based on the
calculation of block content similarity to all blocks in the
VBM. The cluster including the maximum number of
⁵ http://www.webkit.org/
⁶ http://mathworld.wolfram.com/PrimeSpiral.html
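The width similarity and block content similarity tests described for the VBM approach can be sketched as follows. The sketch assumes that each child block is reduced to a comparable signature string and that the 5-pixel threshold is inclusive. The function names, the signature strings and the content threshold of 0.7 are hypothetical and are not taken from (Anderson and Hong, 2013).

```python
def width_similar(width_a, width_b, threshold_px=5):
    """Two blocks are width-similar if their widths differ by at most
    threshold_px pixels (inclusive bound is an assumption)."""
    return abs(width_a - width_b) <= threshold_px


def jaccard_index(items_a, items_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def content_similar(children_a, children_b, threshold=0.7):
    """Blocks are content-similar if the Jaccard index of their child
    block signatures reaches the preset threshold (0.7 is hypothetical)."""
    return jaccard_index(children_a, children_b) >= threshold


# Hypothetical child block signatures of two candidate record blocks.
block_a = ["img", "title", "price"]
block_b = ["img", "title", "price", "rating"]

print(jaccard_index(block_a, block_b))
print(content_similar(block_a, block_b))
print(width_similar(300, 304), width_similar(300, 310))
```

How child blocks are compared for equality is left open by the description above, so the signature strings stand in for whatever comparison the original method uses.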
WEBIST 2015 - 11th International Conference on Web Information Systems and Technologies