knowledge base. However, due to text variability (as
discussed in section 3.3), this proved difficult, leading
us to produce our own manual annotations.
We have created a novel collection of product features by downloading a sample of pages from 9 major French e-commerce web sites: boulanger.fr, materiel.net, ldlc.fr, fnac.com, rueducommerce.fr, surcouf.com, darty.fr, cdiscount.com and digit-photo.com. The ldlc.fr web site changed its page template during our experiments, so we evaluated our method on both versions of the site (ldlc.fr (v1) and ldlc.fr (v2), respectively). This highlights an interesting property of our method, namely its robustness to structural changes: even when the extraction rules change, the product features themselves are usually kept as is, so our method can readily induce new rules without human intervention.
For each web site, a gold standard was produced by randomly selecting 100 web pages, without restricting them to any particular category (Movies & TV, Camera & Photo, etc.), and annotating their product features (name and corresponding value). The final corpus is composed of 1,022 web pages containing 19,402 feature pairs.
4.2 Experimental Settings
For each web site, we ran our method as follows:
• We randomly chose 5-10 unseen web pages from randomly chosen categories.
• We retrieved the corresponding feature sets from the Icecat knowledge base. The association between a web page and a feature set was established automatically by comparing the product name with the page title.
• We applied the proposed method and induced XPath extraction rules.
• Finally, we applied those rules to our gold-standard web pages in order to extract product features.
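The protocol above can be sketched as follows. All callables here (match_feature_set, induce_rules, apply_rules) are hypothetical placeholders standing in for the components described in section 3, not the system's actual code:

```python
import random

def run_site_experiment(pages_by_category, match_feature_set,
                        induce_rules, apply_rules, gold_pages, n_input=5):
    """Sketch of the per-site protocol; all callables are supplied
    by the surrounding system (placeholder names only)."""
    # 1. Randomly choose a few unseen input pages from a random category.
    category = random.choice(sorted(pages_by_category))
    input_pages = random.sample(pages_by_category[category], n_input)
    # 2. Associate each page with its Icecat feature set (in the paper,
    #    via the product name and the page title).
    feature_sets = [match_feature_set(p) for p in input_pages]
    # 3. Induce XPath extraction rules from the aligned examples.
    rules = induce_rules(input_pages, feature_sets)
    # 4. Apply the induced rules to every gold-standard page.
    return [apply_rules(rules, p) for p in gold_pages]
```

The helper signatures are assumptions made for illustration; the actual association and induction steps are those of sections 3.3 and 3.4.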
We have used standard metrics to assess the qual-
ity of our extractions:
• Precision, defined as the ratio of correctly extracted features to the total number of extracted features.
• Recall, defined as the ratio of correctly extracted features to the total number of available features.
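Both metrics can be computed over sets of (name, value) feature pairs. The sketch below is a straightforward implementation of the two definitions, not code from our evaluation scripts:

```python
def precision_recall(extracted, gold):
    """Precision and recall over (name, value) feature pairs."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold  # features that are both extracted and true
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

gold = {("Weight", "1.2 kg"), ("Screen size", "15.6 in"), ("RAM", "4 GB")}
found = {("Weight", "1.2 kg"), ("RAM", "4 GB"), ("Colour", "black")}
print(precision_recall(found, gold))  # 2 correct out of 3 for each ratio
```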
4.3 Results
As shown in table 1, our method offers very high performance. Most of the time, the system produces a perfect extraction, thanks to a high degree of templateness and little variability across each web site. This confirms that our initial hypotheses and the choice of the XPath formalism were sound. Indeed, our custom formalism derived from XPath correctly captures what is regular in templated web pages: HTML structure (tags) and attributes (such as the "class" attribute, which sometimes provides rendering clues). Moreover, dividing our extraction rules into three parts (see section 3.4) allows us to extract features precisely and robustly, which leads to high precision. Our sequential approach is a major difference from previous methods, which considered all text fragments in web pages. However, since the method is a strict sequence of steps, the failure of one step is irrecoverable, which is exactly why extractions on web sites 8 and 10 failed.
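To illustrate, a class-anchored rule of the kind our formalism induces can be approximated with standard XPath expressions. The sketch below uses Python's xml.etree.ElementTree on invented markup; the class names and document structure are assumptions for illustration, not rules actually induced by the system:

```python
import xml.etree.ElementTree as ET

# A minimal templated specification block (illustrative markup only,
# not taken from any of the evaluated sites).
html = """<body>
  <div class="specs">
    <table>
      <tr><td class="name">Weight</td><td class="value">1.2 kg</td></tr>
      <tr><td class="name">RAM</td><td class="value">4 GB</td></tr>
    </table>
  </div>
</body>"""

root = ET.fromstring(html)
# Step 1: locate the specification block through its "class" attribute.
block = root.find(".//div[@class='specs']")
# Steps 2 and 3: within each row of the block, pick the cell holding
# the feature name and the cell holding its value.
pairs = [(row.findtext("td[@class='name']"),
          row.findtext("td[@class='value']"))
         for row in block.iterfind(".//table/tr")]
print(pairs)  # [('Weight', '1.2 kg'), ('RAM', '4 GB')]
```

The three-step decomposition mirrors the rule structure of section 3.4: locate the block, then the name cell, then the value cell.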
More interestingly, we observe mixed results on web sites 6 and 7. The lower recall for web site 6 can be explained by an unrepresentative sample: the extraction rules do not cover all the HTML attributes that locate the specification block, because some of them never appeared in the examples used to induce the rules. The noise extracted for web site 7 is due to the alignment hypothesis. In-depth analysis reveals that several table cells, aligned with product feature names or values, are mislabeled. For instance, features relating to a computer screen are preceded by a "Screen" cell erroneously labeled as a feature name.
We tried to overcome some of these problems by providing more input pages for these sites, while limiting the number of input pages to 10 in order to respect our initial goal of using few input pages. Our hypothesis was that the SBS scores on these sites were wrong because of large differences between the DOM trees. Using more pages gives a more precise estimate of text variability and increases the probability of encountering known features on the web pages. The results shown in table 2 and in-depth analyses confirm this hypothesis.
On site 8, when only 5 pages were provided as input, a block containing many features written in plain text was selected instead of the specification block. This problem was avoided when more pages were provided. On site 6, recall did not increase, which means that unseen cases remain in the test set.
Results for site 10 reveal another issue, which cannot be handled by our method regardless of how many input pages we use. The main situation in which specification block detection fails is when we cannot compare the same segments across all pages. Manual analysis of each step for this web site shows that this is what happened here. Indeed, there are no specific HTML attributes (the usual "id" and "class") for locating web page segments, and each page also contains different optional elements. The
WEBIST 2012 - 8th International Conference on Web Information Systems and Technologies