Table 2: Example keywords.
Example word connections
an ontology, ontology driven, driven similarity, similarity algorithm, algorithm tech, tech report, report kmi, kmi maria, maria vargas, vargas vera, vera and, and enrico, enrico motta, motta an, an ontology, ontology driven, driven similarity, driven similarity, similarity algorithm, knowledge media, media institute, institute kmi, kmi the, the open, open university, university walton, walton hall, hall milton, milton keynes, keynes mk, mk aa, aa united, aa united, united kingdom, kingdom m, m.vargas, vargas vera, vera open, open.ac, ac.uk, uk abstract, abstract.this, this paper, paper presents, presents our, our similarity, similarity algorithm, algorithm between, between relations, relations in, in a, a user, user query, query written, written in, in fol, fol first, first order, order logic, logic and, and ontological, ontological relations, relations.our, our similarity, similarity algorithm, algorithm takes, takes two, two graphs, graphs and, and produces, produces a, a mapping, mapping between, between elements, elements of, of the, the two, two graphs, graphs i.e, i.e.graphs, graphs associated, graphs associated, associated to, to the, the query, query a, a subsection, subsection of, of ontology, ontology relevant
3.3 WordRank
When parsing a simple text document, one has to find
relations between words. Our idea of WordRank relies
on simple natural connections between words, based on
their positions in the text. Simply put, we consider
two words connected if they are neighbours. Additionally,
following the assumptions presented in previous papers,
an initial statistical analysis of the texts in the
repository was performed and unimportant words were
chosen (see footnote 1). They should not be considered
during the analysis of connections and the selection of
connected words. An example set of connected words is
presented in Table 2. All words that are not omitted
are potential keywords.
Our procedure consists of the following steps, also
illustrated in Figure 4.
1. Mark all punctuation marks and all unimportant
words as division elements. Mark all other words
as potential keywords. Let V be the set of all
potential keywords.
2. Connect every two neighboring words that are
not marked as division elements. Consider each
connection as bidirectional.
3. Build a directed graph G = (V, E).
4. Label every connection with a weight from the domain
[0, 1] (E ⊆ V × V × [0, 1]). Let x and y be two
connected words, with y the successor of x. According
to the tests carried out, the weight of the connection
from a word to its successor should equal 1 and the
weight of the connection from a word to its predecessor
should equal 0.3:
E = E ∪ {(x, y, 1)} ∪ {(y, x, 0.3)} (13)
5. For each word in the graph compute its ranking
ω : V → [0, 2].
Figure 4: Example graph of connected words.
1 A word is unimportant if it appears often in all
analyzed documents.
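Steps 1 to 4 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `build_word_graph`, the regular-expression tokenizer, the lower-casing of words, and the idea of passing the unimportant words as a ready-made set are all hypothetical choices. Only the two weights, 1 and 0.3, come from the text.

```python
import re

SUCCESSOR_WEIGHT = 1.0    # word -> its successor (value from the text)
PREDECESSOR_WEIGHT = 0.3  # word -> its predecessor (value from the text)

# A token is either a word (group 1) or a single punctuation mark (group 2).
# This tokenizer is an assumption made for the sketch.
TOKEN = re.compile(r"(\w+)|([^\w\s])")


def build_word_graph(text, unimportant):
    """Return (V, E): potential keywords and weighted directed edges.

    E is a set of triples (x, y, w), as in equation (13).
    """
    vertices = set()
    edges = set()
    previous = None  # last potential keyword; None right after a division element
    for match in TOKEN.finditer(text.lower()):
        word = match.group(1)
        # Step 1: punctuation and unimportant words are division elements.
        if word is None or word in unimportant:
            previous = None
            continue
        vertices.add(word)
        # Steps 2-4: connect neighbours in both directions with the two weights.
        if previous is not None:
            edges.add((previous, word, SUCCESSOR_WEIGHT))
            edges.add((word, previous, PREDECESSOR_WEIGHT))
        previous = word
    return vertices, edges


V, E = build_word_graph(
    "Similarity algorithm takes two graphs, and produces a mapping.",
    unimportant={"and", "a"},
)
```

In this sketch a division element simply resets `previous`, so words on opposite sides of a comma or of an unimportant word are never connected.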
Now the main categorization algorithm presented
in (Zyglarski and Bała, 2009) can categorize those
words and prepare the final keyword lists.
3.4 Main Part of the Algorithm
The last step of the algorithm (for each word in the
graph, compute its ranking ω : V → [0, 2]) is the most
important. It is based on the idea of Google PageRank
(Page et al., 1999), where the importance of each
website depends on the number of hyperlinks pointing
to it. Similarly, in our algorithm the weight of each
word depends on the weights and the number of its
neighbors. By a word we mean the abstract class of the
word, independent of its position in the text.
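To make the PageRank analogy concrete, the following sketch runs a generic weighted PageRank-style iteration over edge triples (x, y, w) such as those of equation (13). It is not the ranking ω itself: the damping factor, the fixed iteration count, the normalization by outgoing weight, and the name `rank_words` are assumptions borrowed from the standard PageRank formulation, and the resulting values are not scaled to [0, 2].

```python
def rank_words(vertices, edges, damping=0.85, iterations=50):
    """Generic weighted PageRank-style ranking (illustrative assumption).

    vertices: iterable of words; edges: iterable of (source, target, weight)
    triples whose endpoints all appear in vertices.
    """
    vertices = list(vertices)
    # Total outgoing weight per word, used to normalize its contributions.
    out_weight = {v: 0.0 for v in vertices}
    for source, _, weight in edges:
        out_weight[source] += weight

    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        new_rank = {v: 1.0 - damping for v in vertices}
        for source, target, weight in edges:
            # A word passes on its rank in proportion to the edge weight.
            share = weight / out_weight[source]
            new_rank[target] += damping * rank[source] * share
        rank = new_rank
    return rank


# Chain a - b - c with the successor/predecessor weights 1 and 0.3.
example_edges = [("a", "b", 1.0), ("b", "a", 0.3),
                 ("b", "c", 1.0), ("c", "b", 0.3)]
ranks = rank_words({"a", "b", "c"}, example_edges)
```

In this toy chain the middle word receives rank from both neighbours, which mirrors the statement that a word's weight depends on the weights and the number of its neighbors.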
KMIS 2010 - International Conference on Knowledge Management and Information Sharing