proximity between term vectors: the smaller the
angle, the higher the cosine of the angle (the cosine
measure). Consequently, the maximum proximity is
equal to 1, and the minimum to 0.
The obtained term-term matrix measures the
proximity between terms on the basis of their co-
occurrence in documents (the coordinates of the
term vectors are the frequencies of their use in
documents). This means that the sparser the initial
term-document matrix, the worse the quality of the
term-term proximity matrix. It is therefore expedient
to free the initial matrix from information noise and
sparsity with the help of latent semantic analysis
(Deerwester et al., 1990). Noise arises because,
apart from knowledge about the subject domain, the
initial documents contain so-called commonplaces
which, nevertheless, contribute to the distribution
statistics.
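The cosine proximity between term vectors described above can be sketched as follows (a minimal illustration; the toy term-document counts are invented for demonstration):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents;
# each entry is the frequency of the term in the document (invented data).
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
    [0.0, 2.0, 2.0],
])

# Normalize each term vector to unit length; the dot product of two
# unit rows is then the cosine of the angle between them.
unit = A / np.linalg.norm(A, axis=1, keepdims=True)
term_term = unit @ unit.T  # term-term proximity matrix (cosine measure)

# The proximity of a term with itself is the maximum, 1.
print(np.allclose(np.diag(term_term), 1.0))  # True
```

Since term frequencies are non-negative, all cosines fall in [0, 1], matching the proximity bounds stated above.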
We use latent semantic analysis to clear the
matrix of information noise. The essence of the
method is to approximate the initial sparse and noisy
matrix by a matrix of lower rank with the help of
singular value decomposition. The singular value
decomposition of a matrix A of dimension M×N,
M>N, is its representation as the product of three
matrices: an orthogonal matrix U of dimension
M×M, a diagonal matrix S of dimension M×N, and
the transpose of an orthogonal matrix V of
dimension N×N:

A = U S V^T.
Such a decomposition has the following
remarkable property. Let a matrix A with the known
singular decomposition A = U S V^T need to be
approximated by a matrix A_k of a pre-determined
rank k. If only the k greatest singular values are kept
in S and the rest are replaced by zeros, and only the
first k columns of U and the first k rows of V^T are
kept, then the decomposition

A_k = U_k S_k V_k^T

gives the best approximation of the initial matrix A
by a matrix of rank k. Thus, the initial M×N matrix
A is replaced by matrices of smaller sizes M×k and
k×N and a diagonal matrix of k elements. When k is
much less than M and N, the information is
compressed significantly: part of it is lost, and only
the most important (dominant) part is retained. The
loss occurs because small singular values are
neglected, so the more singular values are discarded,
the greater the loss. As a result, the initial matrix is
freed of the information noise introduced by random
elements.
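The rank-k approximation described above can be sketched with NumPy (a minimal illustration on a small invented matrix; in practice A would be the term-document matrix):

```python
import numpy as np

# Invented 4x3 term-document matrix (M=4, N=3) for demonstration.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
    [0.0, 2.0, 2.0],
    [1.0, 1.0, 1.0],
])

# Full singular value decomposition: A = U S V^T.
# NumPy returns the singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k greatest singular values; the rest are discarded,
# along with the corresponding columns of U and rows of V^T.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A; the discarded small
# singular values carry the "noise" part of the matrix.
print(np.linalg.matrix_rank(A_k))  # 2
```

The approximation error (in the spectral norm) equals the largest discarded singular value, which is why neglecting only small singular values keeps the loss low.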
3.3 Summary
The extracted concepts and relations must be plotted
on a concept map. Let us repeat that as concepts, or
nodes of the graph, we use all terms for which
Pearson’s criterion is higher than a certain threshold
value determined experimentally. In the literature,
the value of 6.6 is indicated as a threshold, but by
varying this value it is possible to reduce or expand
the list of concepts. For example, too high a
threshold value will leave only the most important
terms, those with the highest values of Pearson’s
criterion.
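Selecting concepts by the threshold can be sketched as follows (the term scores are invented for illustration; 6.6 is close to the chi-square critical value at significance level 0.01 with one degree of freedom):

```python
# Invented chi-square (Pearson's criterion) scores for candidate terms.
scores = {
    "semantic": 63.69,
    "web": 59.95,
    "property": 59.87,
    "commonplace": 3.2,   # below the threshold: discarded
}

# Raising the threshold shortens the concept list; lowering it expands it.
THRESHOLD = 6.6

concepts = [term for term, chi2 in scores.items() if chi2 > THRESHOLD]
print(concepts)  # ['semantic', 'web', 'property']
```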
The number of extracted relations can be varied
in the same way. If, among all pairwise proximities
in the term-term matrix, the values lower than a
certain threshold are set to zero, the edges (links)
will connect only those concepts whose proximity is
higher than the indicated threshold.
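Extracting edges by thresholding the proximity matrix can be sketched as follows (a minimal illustration with an invented symmetric proximity matrix):

```python
import numpy as np

# Invented term-term proximity (cosine) matrix for four concepts.
labels = ["class", "property", "model", "query"]
P = np.array([
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.6, 0.2],
    [0.3, 0.6, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
])

THRESHOLD = 0.5  # proximities below this are nulled (no edge)

# Null the sub-threshold values and list the surviving edges
# (each unordered concept pair once, diagonal excluded).
P_cut = np.where(P >= THRESHOLD, P, 0.0)
edges = [(labels[i], labels[j], float(P[i, j]))
         for i in range(len(labels)) for j in range(i + 1, len(labels))
         if P_cut[i, j] > 0.0]
print(edges)  # [('class', 'property', 0.8), ('property', 'model', 0.6)]
```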
4 EXPERIMENTS
To carry out the experiments, we chose the subject
domain “Ontology engineering”. Documents
representing chapters from the textbook (Allemang
et al., 2011) formed the positive set of the training
collection. In addition, some articles on other
themes formed the negative set. Tokenization and
lemmatization of the collection resulted in a
thesaurus of unique terms. Applying Pearson’s
criterion with the threshold value of 6.6 made it
possible to select 500 key concepts of the subject
domain. Table 1 presents the first 12 concepts with
the greatest values of the criterion.
Table 1: The first 12 concepts of the subject domain.
No  Concept       Chi-square test value
 1  semantic      63.69
 2  Web           59.95
 3  property      59.87
 4  manner        57.08
 5  model         53.74
 6  class         52.40
 7  major         51.71
 8  side          50.78
 9  word          50.59
10  query         44.09
11  rdftype       37.41
12  relationship  35.71
DATA 2015 - 4th International Conference on Data Management Technologies and Applications