different approaches proposed in the literature for
XML document classification. Basic classification
methods are described in Section 3. Our proposal for
document representation and feature vector
construction is explained in Section 4. Section 5
describes the experiments and results. Finally,
Section 6 presents the conclusion and future work.
2 RELATED WORK
The continuous growth of XML documents has
motivated various efforts to develop classification
systems based on document structure. Documents
have to be represented before the classification
process. The representation models can be divided
into three groups. The first group consists of models
that do not consider the structure of the document;
these works represent the document content as a bag
of words in order to classify it, and they are the most
studied classification methods. The second group
consists of models that take into account only the
structure of a document, without considering its
content. For example, (Wisniewski et al., 2005) use a
Bayesian model to generate the possible DTDs of a
document collection, where each class represents a
DTD; they rely only on document structure for
classification. (Aïtelhadj et al., 2009) and
(Dalamagas et al., 2005) have also classified
documents using only the similarity of their
structure trees. Finally, the third group is composed
of models that consider both the structure and the
content of XML documents in the representation.
Most classification systems use the vector space
model for document representation; they differ in
the selection of vector features and the computation
of their weights. In (Doucet and Ahonen-Myka,
2002) each vector feature can be a word or a tag of
the XML document, and tf*ief (Term Frequency *
Inverse Element Frequency) is used to compute the
weight of the words or tags in the documents.
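The ief factor is analogous to idf, but it counts XML
elements rather than whole documents; a natural
formulation (our notation, which may differ in detail
from the original) is

    w(t, e) = tf(t, e) × log(N_E / ef(t)),

where tf(t, e) is the frequency of the word or tag t in
element e, N_E is the total number of elements in the
collection, and ef(t) is the number of elements
containing t.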
In (Vercoustre et al., 2006) a feature can be a path in
the XML tree or a path followed by a text leaf, with
term weights based on tf*idf. This makes it possible
to take into account either the structure alone or both
the structure and the content of the documents. (Yi
and Sundaresan, 2000) have also used a vector
model containing terms or XML tree paths as the
vector elements.
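As a rough illustration of such path-based features (a
minimal sketch in Python under our own naming, not
the exact representation of any of the cited systems),
root-to-leaf tag paths and paths followed by a leaf
word can be extracted as follows:

    import xml.etree.ElementTree as ET

    def extract_path_features(xml_string):
        # Returns structure-only features (tag paths) and
        # structure+content features (a tag path followed by
        # a word of its text leaf).
        root = ET.fromstring(xml_string)
        features = []

        def walk(node, path):
            path = path + "/" + node.tag
            if len(node) == 0:  # leaf element
                features.append(path)  # structure-only feature
                for word in (node.text or "").split():
                    # path followed by a text leaf word
                    features.append(path + "#" + word.lower())
            else:
                for child in node:
                    walk(child, path)

        walk(root, "")
        return features

    doc = ("<article><title>XML document classification</title>"
           "<body>vector space model</body></article>")
    print(extract_path_features(doc))
    # ['/article/title', '/article/title#xml', ..., '/article/body#model']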
(Ghosh and Mitra, 2008) have proposed a composite
kernel for the fusion of content and structure
information. The paths from the root to the leaves
are used as indexing elements in the structure kernel,
weighted by tf*idf. The content and structure
similarities are measured independently, and a linear
combination of these kernels is finally used for
content-and-structure classification by an SVM.
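Written explicitly (the mixing weight α is our
notation), the combined kernel for two documents
d_i and d_j is

    K(d_i, d_j) = α K_content(d_i, d_j) + (1 − α) K_structure(d_i, d_j),

with 0 ≤ α ≤ 1. Since a linear combination of kernels
with non-negative coefficients is itself a valid kernel,
it can be used directly by the SVM.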
(Wu and Tang, 2008) propose a bottom-up approach
in which the structural elements are the nodes of the
document tree and the leaves are the textual sections
of each element. First, the terms in the leaf nodes are
identified and their occurrences are extracted and
normalized. Then the key terms are enriched with
the structural information contained in the tags,
through the notion of a key path: a key path ends at a
leaf node that contains at least one key term for a
class. Using the set of key terms and the key paths,
the similarity between new documents and the class
models is computed.
(Yan et al., 2008) propose representing a document
by vectors of weighted words, one vector for each
structural element. In this method a weight is
associated with each structural element according to
its level in the document tree, and the weight of a
word is then computed from its frequency inside the
element and the importance of that element.
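Writing level(e) for the depth of element e in the
document tree, a weighting of this kind has the
general form

    w(t, e) = tf(t, e) × importance(e),

where importance(e) decreases with level(e), for
instance importance(e) = 1 / level(e), so that words
in elements close to the root weigh more (the
concrete importance function here is our illustration;
the one used by the authors may differ).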
(Yang and Wang, 2010) and (Yang and Zhang,
2008) have proposed an extension of the vector
model called the Structured Link Vector Model
(SLVM). In their model a document is represented
by a set of term vectors, one for each structural
element. The term weight is computed from the term
frequency in each document element and the inverse
document frequency of the term.
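In other words, the SLVM weight of a term t in a
structural element e of document d can be written (in
our notation) as

    w(t, e, d) = tf(t, e, d) × log(N / df(t)),

where N is the number of documents in the
collection and df(t) is the number of documents
containing t.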
The model that we propose belongs to the third
group: it constructs the document feature vector
from both structure and content.
3 DOCUMENT CLASSIFICATION
Classification can be divided into two principal
phases: document representation and classification
itself. The standard document representation used in
text classification is the vector space model;
classification systems differ in their document
representation models, and the more relevant the
representation is, the more relevant the classification
will be. The second phase includes learning from a
training corpus, building a model of the classes, and
classifying new documents according to that model.
In this section we explain the vector space model,
followed by a presentation of the SVM classification
algorithm used in our approach.
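As a generic illustration of these two phases (a
minimal sketch using the scikit-learn library with a
toy corpus; it shows a plain bag-of-words pipeline,
not the representation proposed in this paper):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Toy training corpus and labels (placeholders,
    # not our experimental data).
    train_docs = ["xml schema structure tree path",
                  "stock market price index share"]
    train_labels = ["structure", "finance"]

    # Phase 1: document representation in the vector
    # space model (tf-idf weighting).
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    # Phase 2: learn a model of the classes with a
    # linear SVM ...
    classifier = LinearSVC()
    classifier.fit(X_train, train_labels)

    # ... and classify a new document according to
    # that model.
    X_new = vectorizer.transform(["tree structure of an xml document"])
    print(classifier.predict(X_new))  # -> ['structure']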