Table 1: Corpus statistics.
                            Training Set   Test Set
Number of documents         240            60
Avg. number of text units   107.1          97.6
Avg. hierarchy depth        4.1            4.1
Avg. number of headings     10.8           10.8
Table 2: Feature sets used in heading classification.
Feature Set                                 Number of Features
F_n, F_{n+1}                                46
F_n, F_{n+1}, F_{n-1}                       73
F_n, F_{n+1}, F_{n+2}                       73
F_n, F_{n+1}, F_{n+2}, F_{n-1}              100
F_n, F_{n+1}, F_{n+2}, F_{n-1}, F_{n-2}     127
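The feature counts in Table 2 grow by 27 with each added context unit (46, 73, 100, 127), which is consistent with every extra neighbour contributing a fixed-size block of features. The sketch below illustrates one way such context-window feature vectors could be assembled. It rests on assumptions not confirmed in this section: that F_{n+k} denotes the features of the text unit at offset k from the current unit n, and that positions outside the document are padded with zeros. The function name `window_features` and the two-dimensional toy vectors are hypothetical.

```python
from typing import List, Sequence


def window_features(
    units: Sequence[Sequence[float]],
    n: int,
    offsets: Sequence[int],
) -> List[float]:
    """Concatenate the feature vectors of the units at positions n + offset.

    Assumption: every unit has a feature vector of the same width, and
    positions that fall outside the document are padded with zeros.
    """
    width = len(units[0])
    combined: List[float] = []
    for offset in offsets:
        idx = n + offset
        if 0 <= idx < len(units):
            combined.extend(units[idx])
        else:
            # Neighbour does not exist (start or end of document).
            combined.extend([0.0] * width)
    return combined


# Toy document with three text units and two features per unit.
toy_units = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Analogue of the feature set "F_n, F_{n+1}, F_{n-1}" for the first unit.
print(window_features(toy_units, 0, (0, 1, -1)))
```

With this layout, moving from one feature set in Table 2 to the next only changes the `offsets` tuple, for example `(0, 1, 2, -1, -2)` for the largest set.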
Table 3: Results for heading classification (Model 1).
Kernel              Recall   Precision   F-measure
Linear              0.71     0.80        0.75
Polynomial (d=2)    0.73     0.84        0.78
Polynomial (d=3)    0.72     0.84        0.78
Polynomial (d=4)    0.70     0.86        0.77
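As a consistency check on Table 3, the F-measure column can be recomputed from the recall and precision columns. The sketch below assumes that the F-measure is the balanced harmonic mean of precision and recall (F1), which this section does not state explicitly. The function name `f_measure` and the dictionary `TABLE_3` are illustrative only.

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-measure), copied from Table 3.
TABLE_3 = {
    "Linear": (0.71, 0.80, 0.75),
    "Polynomial (d=2)": (0.73, 0.84, 0.78),
    "Polynomial (d=3)": (0.72, 0.84, 0.78),
    "Polynomial (d=4)": (0.70, 0.86, 0.77),
}

for kernel, (recall, precision, reported) in TABLE_3.items():
    computed = f_measure(recall, precision)
    print(f"{kernel}: computed {computed:.2f}, reported {reported:.2f}")
```

Under this assumption, all four recomputed values agree with the reported F-measures to two decimal places.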
Table 4: Feature sets used in hierarchy extraction.
Feature Set                         Number of Features
F_{10}                              11
F_{10}, F_{01}                      24
F_{10}, F_{01}, F_{20}              36
F_{10}, F_{01}, F_{20}, F_{02}      48
Table 5: Results for hierarchy extraction (Model 2).
Method   Learning Algorithm   Model 1 headings   Manual headings
M_0      Classification       0.58               0.75
M_0      Ranking              0.56               0.75
M_1      Classification       0.55               0.76
M_2      Classification       0.49               0.66
M_3      Classification       0.51               0.69
M_4      Classification       0.51               0.67
5 CONCLUSIONS
In this paper, we considered the problem of extracting
the sectional hierarchy of Web documents and adapted
a tree learning approach as a solution. The proposed
system was evaluated on a corpus of Web documents
for both heading extraction and sectional hierarchy
extraction. Heading classification reached an
F-measure of up to 0.78 (Table 3), which indicates
that the system gives acceptable results on Web
documents from an unrestricted domain. As further
work, we will consider the division of Web documents
into content blocks; this information can be used as
additional features to improve the accuracy of the
system. The corpus of Web documents will also be
extended with additional queries.
ACKNOWLEDGEMENTS
This work was supported by the Bogazici University
Research Fund (Grant No. 07A106) and by İstanbul
Kültür University.
ICAART 2010 - 2nd International Conference on Agents and Artificial Intelligence