SCIENTIFIC DOCUMENTS MANAGEMENT SYSTEM
Application of Kohonen's Neural Networks with Reinforcement in Keywords Extraction
Błażej Zyglarski
Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, Toruń, Poland
Piotr Bała
Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, Toruń, Poland
ICM, Warsaw University, ul. Pawińskiego 5a, Warsaw, Poland
Keywords:
Keywords extraction.
Abstract:
Keyword choice is an important issue in document analysis. We have developed a document management system which performs keyword-oriented document comparison. In this article we present our approach to keyword generation, which uses self-organizing Kohonen neural networks with reinforcement, and we compare this method with the standard statistical method.
1 INTRODUCTION
This article contains a complete description and comparison of two approaches to keyword selection: the statistical one and the neural network based one. The latter comes in two variants: a simple neural network and a neural network with reinforcement. In both cases we use as the running example the same input string built over a two-letter alphabet: "aab abb abaa abb bbba abb bbaa baa bbaa bbbb bbba bbbb". The last part of this document shows the effectiveness of reinforced learning compared with the other algorithms. All presented approaches were tested on an example set of about 200 documents.
2 THE STATISTICAL APPROACH
The standard approach to selecting keywords is to count their appearances in the input text. As described in (narimura), the most frequent words are mostly irrelevant, so the next step is deleting such trash words with the use of specific word lists.
In order to speed up the trivial statistical algorithm for keyword generation, we have developed an algorithm which constructs a tree for the text extracted from a document. The tree is built letter by letter and each letter is read exactly once, which guarantees that the presented algorithm works in linear time.
2.1 An Algorithm
Denote the plain text extracted from the i-th document as $T(i)$, and let $\hat{T}(i)$ be this text lowercased and devoid of punctuation. In order to select appropriate keywords, we need to build a tree, denoted as $\tau(\hat{T}(i))$:

1. Let $r$ be the root of $\tau(\hat{T}(i))$ and let $p$ be a pointer, pointing at $r$:
$$p = r$$ (1)
2. Read a letter $a$ from the source text:
$$a = \mathrm{ReadLetter}(\hat{T}(i))$$ (2)
3. If there exists an edge labeled by $a$ leaving the vertex pointed at by $p$ and coming into some vertex $s$, then
$$p = s$$ (3)
$$\mathrm{count}(p) = \mathrm{count}(p) + 1$$ (4)
else create a new vertex $s$ and a new edge labeled by $a$ leaving vertex $p$ and coming into vertex $s$, and set
$$p = s$$ (5)
$$\mathrm{count}(p) = 1$$ (6)
4. If $a$ was a white space character, then go back with the pointer to the root:
$$p = r$$ (7)
5. If $\hat{T}(i)$ is not empty, then go to step 2.
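A minimal Python sketch of this construction, under the assumption that the input is already the cleaned, lowercase text $\hat{T}(i)$; the names Node, build_word_tree and words_with_counts are ours, not the paper's:

    class Node:
        # A vertex of tau(T(i)); outgoing edges are labeled by single letters.
        def __init__(self):
            self.children = {}   # letter -> Node
            self.count = 0       # visit counter, eqs. (4) and (6)

    def build_word_tree(text):
        # Build the tree letter by letter; each letter is read exactly once.
        root = Node()
        p = root                          # step 1: pointer at the root
        for a in text:                    # step 2: read a letter
            if a.isspace():
                p = root                  # step 4: white space, back to root
            else:
                if a not in p.children:   # step 3: follow or create the edge
                    p.children[a] = Node()
                p = p.children[a]
                p.count += 1
        return root

    def words_with_counts(node, prefix=""):
        # Every leaf represents a word; its count is the word's frequency.
        if not node.children:
            yield prefix, node.count
        for letter, child in node.children.items():
            yield from words_with_counts(child, prefix + letter)

For the example string, list(words_with_counts(build_word_tree("aab abb abaa abb bbba abb bbaa baa bbaa bbbb bbba bbbb"))) yields, among others, ("abb", 3) and ("bbbb", 2).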
After these steps every leaf $l$ of $\tau(\hat{T}(i))$ represents some word found in the document and $\mathrm{count}(l)$ denotes the appearance frequency of this word. In most cases the most frequent words in every text are irrelevant, so we need to subtract them from the result. This goal is achieved by creating a tree $\rho$ which contains trash words.
1. For every document $i$:
$$\rho = \rho + \tau(\hat{T}(i))$$ (8)
Every time a new document is analyzed by the system, $\rho$ is extended. This means that $\rho$ accumulates the most frequent words over all files, which implies that these words are irrelevant (given the differing subjects of the documents). The system thus keeps learning new unimportant patterns, and as the number of analyzed documents grows, the accuracy of trash word selection improves.
Let $\Theta(\rho)$ be the set of words represented by $\rho$. We denote as a trash word any word whose frequency is higher than the median $m$ of the frequencies of the words belonging to $\Theta(\rho)$.
1. For every leaf $l \in \tau(\hat{T}(i))$: if there exists $z \in \rho$ such that $z$ and $l$ represent the same word and $\mathrm{count}(z) > m$, then
$$\mathrm{count}(l) = 0$$ (9)
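A corresponding sketch of this filter, with the trees flattened into word-count dictionaries for brevity (doc_counts for the analyzed document, rho for the accumulated trash tree; both names are illustrative):

    from statistics import median

    def filter_trash(doc_counts, rho):
        # Zero the count of every word whose accumulated frequency in rho
        # exceeds the median m of the frequencies in Theta(rho), eq. (9).
        m = median(rho.values())
        return {w: (0 if rho.get(w, 0) > m else c)
                for w, c in doc_counts.items()}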
2.2 Limitations
The main limitation of this way of generating keywords is losing the context of keyword occurrences. Words which are most frequent (and which therefore appear together on the result list) may actually be unrelated. The remedy for this issue is to pay attention to the context of keywords, which can be achieved by using neural networks to group words into locally close sets. With this approach we can select words which are less frequent but whose placement indicates that they are important, and finally give them a better position on the result list.
3 KOHONEN’S NEURAL
NETWORKS APPROACH
During the implementation of the document management system (Zyglarski et al., 2008) we discovered the previously mentioned limitations of simple statistical methods for keyword generation. The main goal of further work was to pay attention to the word context; in other words, we wanted to select the most frequent words which occur in a common neighborhood. Figure 1 presents an example text with discovered keywords. The presented algorithm selects the most frequent words which are close enough to each other. This is achieved with a word categorization (Frank et al., 2000).
This is simple text about keywords generation (discovery). Of course all keywords are difficult for automatic generation (or discovery), but in limited way we could achieve results, which will fullfil our needs and retrieve proper keywords. During implementation of document management system we've discovered mentioned previously limitations in using simple statistical methods for keywords generation. Main goal of further contemplations was to pay attention of words context. In other words we wanted to select most frequent keywords, which occur in common neighborhood. Using neural networks we show disadvantages of statistical keywords discovery. Neural networks based keywords generation is way much reliable and gives better results, including promotion of less frequent keywords. Others, which can be more frequent do not have to be part of the results. Limitations were defeated.

Figure 1: Example keywords selection.
Precise results are shown in Table 1, where each keyword is presented as a pair (keyword, count). The keywords are divided into categories with the use of Kohonen neural networks, and each category is assigned a rank related to the number of gathered keywords and their frequency. The results were filtered using the trash words from $\rho$ defined in the first part of this article.
3.1 An Algorithm
The algorithm of neural network based keyword discovery consists of three parts. At the beginning we have to compute the distances between all words. Then we have to discover categories and assign the proper words to them. At the end we need to compute the rank of each category.
3.1.1 Counting Distances
Let $\hat{T}_f = \hat{T}(f)$ be the text extracted from a document $f$ and $\hat{T}_f(i)$ be the word placed at position $i$ in this text.
Table 1: Example keywords selection.

Category  Rank  Keyword         Count
1         0.94  keywords        8
                discovery       3
                frequent        3
                simple          2
                text            1
                automatic       1
2         0.58  neural          2
                networks        2
                discovered      1
                neighborhood    1
3         0.50  fulfill         1
4         0.41  statistical     2
                words           2
                contemplations  1
                promotion       1
                disadvantages   1
5         0.36  retrieve        1
                reliable        1
Let us denote the distance between words $A$ and $B$ within the text $\hat{T}_f$ as $\delta_f(A, B)$:
$$\delta_f(A, B) = \min_{i,j \in \{1,\dots,n\}} \{\, |i - j| : A = \hat{T}_f(i) \wedge B = \hat{T}_f(j) \,\}$$ (10)
By the position $i$ we understand the number of white characters read so far during the reading of the text. Every sentence delimiter is treated as a certain number $W$ of white characters in order to avoid combining words from separate sentences (we empirically chose $W = 10$). We have modified the previously described tree generation algorithm: the tree is constructed analogously, but every time a leaf (which represents a word) is reached, we update the distances matrix $M$ (presented in Figure 2), which is an $n \times n$ upper-triangular matrix, where $n \in \mathbb{N}$ is the number of distinct words read so far.
Figure 2: The distances matrix.
The generated tree (presented in Figure 3) is used for checking previous appearances of the currently read word, for assigning the word identifier (denoted as $\xi_f(j) = \xi(\hat{T}_f(j))$, $\xi_f(j) \in \mathbb{N}$), and for deciding whether to update existing matrix elements or to increase the matrix dimension by adding a new row and a new column. Figure 2 marks with dashed borders the fields updateable in the first case. For updating existing matrix elements and computing new ones we also use an array with the positions of the last occurrences of all read words; let us denote this array as $\lambda(\xi_f(j))$.
Figure 3: The word tree.
We need to consider two cases:
1. Let us assume that $j - 1$ words have been read from the input text and that the word at position $j$ is read for the first time:
$$\forall_{i<j}\ \hat{T}_f(j) \neq \hat{T}_f(i)$$ (12)
Then we record its position and fill the new column of the matrix:
$$\lambda(\xi_f(j)) = j$$ (13)
$$\forall_{k \in \{1,\dots,\xi_f(j)-1\}}\ M[k, \xi_f(j)] = |j - \lambda(k)|$$ (14)
This case is shown in Figures 4 and 5, which visualize the state of the algorithm after reading the first three words of the example string "aab abb abaa abb bbba abb bbaa baa bbaa bbbb bbba bbbb". Figure 5 also presents the table of last occurrences.
Figure 4: The word tree generated after reading the first three words.
        λ(ξ_f)  1  2  3
        ξ_f     1  2  3
  aab   1       0  1  2
  abb   2          0  1
  abaa  3             0

Figure 5: The distance matrix generated after reading the first three words.
2. Let us assume that $j - 1$ words have been read from the input text and that the word at position $j$ was read earlier, so it already has a word identifier:
$$\exists_{i<j}\ \hat{T}_f(j) = \hat{T}_f(i)$$ (15)
We have to update the last occurrences table,
$$\lambda(\xi_f(j)) = j$$ (16)
and update the appropriate row and column in the existing matrix:
$$\forall_{k \in \{1,\dots,\xi_f(j)-1\}}\ M[k, \xi_f(j)] = \Delta(k, \xi_f(j))$$ (17)
$$\forall_{k \in \{\xi_f(j)+1,\dots,\hat{\xi}_f\}}\ M[\xi_f(j), k] = \Delta(\xi_f(j), k)$$ (18)
where
$$\hat{\xi}_f = \max_{i \in \{1,\dots,j\}} \{\xi_f(i)\}$$ (19)
$$\Delta(k, l) = \min(|\lambda(k) - \lambda(l)|, M[k, l])$$ (20)
This case is shown in Figures 6 and 7, which visualize the state of the algorithm after reading the first four words of the example string. The word "abb" has now been read twice, so we need to update the $\lambda$ array and the existing entries of the distance matrix $M$.
Figure 6: The word tree generated after reading the first four words.
        λ(ξ_f)  1  4  3
        ξ_f     1  2  3
  aab   1       0  1  2
  abb   2          0  1
  abaa  3             0

Figure 7: The distance matrix generated after reading the first four words.
Bold-faced fields indicate values that can be recomputed each time from the $\lambda$ array and therefore do not have to be memorized. Thanks to this observation we can omit the distance table and use instead only specific lists (Figure 10), containing those elements which cannot be evaluated from the $\lambda$ array. The final results for our example (after reading all words of the string "aab abb abaa abb bbba abb bbaa baa bbaa bbbb bbba bbbb") are shown in Figures 8 and 9.
Figure 8: The word tree generated after reading all words.
        λ(ξ_f)  1  6  3  11  9  8  12
        ξ_f     1  2  3  4   5  6  7
  aab   1       0  1  2  4   6  7  9
  abb   2          0  1  1   1  2  4
  abaa  3             0  2   4  5  7
  bbba  4                0   2  3  1
  bbaa  5                    0  1  1
  baa   6                       0  2
  bbbb  7                          0

Figure 9: The distance matrix generated after reading all words.
The presented algorithm guarantees that its result matrix is the matrix of shortest distances between words:
$$\delta_f(A, B) = M[\xi(A), \xi(B)]$$ (21)
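The two cases above can be condensed into the following Python sketch. It approximates positions by word indices (the paper counts white characters, with each sentence delimiter counted as W = 10 of them, which a pre-processing step would have to emulate) and stores the upper-triangle matrix M as a sparse dictionary:

    def distance_matrix(words):
        xi = {}    # word -> identifier xi_f
        lam = {}   # identifier -> position of last occurrence (lambda array)
        M = {}     # (k, l) with k < l -> shortest distance found so far
        for j, word in enumerate(words, start=1):
            if word not in xi:
                # case 1 (eqs. 12-14): first occurrence of the word
                xi[word] = w = len(xi) + 1
                for k in range(1, w):
                    M[(k, w)] = abs(j - lam[k])
            else:
                # case 2 (eqs. 15-20): shorten stored distances wherever the
                # new position j improves on them
                w = xi[word]
                for k in lam:
                    if k != w:
                        key = (min(k, w), max(k, w))
                        M[key] = min(M[key], abs(j - lam[k]))
            lam[w] = j   # eqs. (13)/(16)
        return xi, M

    words = "aab abb abaa abb bbba abb bbaa baa bbaa bbbb bbba bbbb".split()
    xi, M = distance_matrix(words)
    assert M[(xi["abb"], xi["bbba"])] == 1   # as in Figure 9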
3.1.2 Generating Categories
The knowledge of the distances between words in a document is used to categorize them with the use of a self-organizing Kohonen neural network (Figure 11) (Kohonen, 1998).
This procedure takes five steps:
1. Create a rectangular $m \times m$ network, where $m = \lfloor \sqrt[4]{\hat{\xi}_f} \rfloor$. The presented algorithm can distinguish at most $m^2$ categories. Every node (denoted as $\omega_{x,y}$) is connected with four neighbors and contains a prototype word (denoted as $\hat{T}_f^\omega(x, y)$) and a set (denoted as $\beta_f^\omega(x, y)$) of locally close words.
2. For each node choose a random prototype of the category, $p \in \{1, 2, \dots, \hat{\xi}_f\}$.
3. For each word $k \in \{1, 2, \dots, \hat{\xi}_f\}$ choose the closest prototype $\hat{T}_f^\omega(x, y)$ in the network and add the word to the list $\beta_f^\omega(x, y)$.
Figure 10: The scheme of the distance table: for every word identifier $\xi_f \in \{1, \dots, \hat{\xi}_f\}$ the table stores the last occurrence $\lambda(\xi_f)$ and the list of those distances $\delta_f$ which cannot be recomputed from the $\lambda$ array.
Figure 11: The scheme of the neural network for word categorization, with a selected element and its neighbors.
4. For each network node $\omega_{x,y}$ compute a generalized median for the words from $\beta_f^\omega(x, y)$ and the neighbors' lists (jointly denoted as $\beta$). A generalized median is defined as an element $A$ which minimizes the function
$$\sum_{B \in \beta} \delta_f^2(A, B)$$ (22)
Set $\hat{T}_f^\omega(x, y) = A$.
5. Repeat step 4 until the network is stable. Stability of the network is achieved when in two consecutive iterations all word lists are unchanged (without paying attention to the internal structure of the lists or their position among the nodes).
This algorithm divides the set of all words found in the document into separate subsets which contain only locally close words; in other words, it groups words into related sets (see Figure 12).
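A simplified sketch of this loop, reusing the sparse matrix M from Section 3.1.1; the far constant for word pairs that never co-occur, the iteration cap and the prototype-equality stopping test are our simplifications of the stability criterion in step 5:

    import random

    def delta(M, a, b, far=10**6):
        # Shortest distance between word identifiers a and b.
        if a == b:
            return 0
        return M.get((min(a, b), max(a, b)), far)

    def categorize(M, n_words, max_iter=100):
        # Step 1: an m x m grid, m = floor of the 4th root of the word count.
        m = max(1, int(n_words ** 0.25))
        nodes = [(x, y) for x in range(m) for y in range(m)]
        # Step 2: a random prototype word for every node.
        proto = {v: random.randint(1, n_words) for v in nodes}
        beta = {v: [] for v in nodes}
        for _ in range(max_iter):
            # Step 3: assign every word to the node with the closest prototype.
            beta = {v: [] for v in nodes}
            for k in range(1, n_words + 1):
                best = min(nodes, key=lambda v: delta(M, k, proto[v]))
                beta[best].append(k)
            # Step 4: move each prototype to the generalized median of its own
            # list and the lists of its four grid neighbors (equation 22).
            new_proto = {}
            for (x, y) in nodes:
                pool = list(beta[(x, y)])
                for nb in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                    pool.extend(beta.get(nb, []))
                if pool:
                    new_proto[(x, y)] = min(
                        set(pool),
                        key=lambda a: sum(delta(M, a, b) ** 2 for b in pool))
                else:
                    new_proto[(x, y)] = proto[(x, y)]
            # Step 5: stop when no prototype moved (a proxy for stable lists).
            if new_proto == proto:
                break
            proto = new_proto
        return beta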
3.1.3 Selecting Keywords
After execution of the described procedure all words from the analyzed document are divided into categories. We need to choose the most important categories and then select the most frequent words among them. Each category contains a list of words which are locally close.
As a category rank we can take
$$\rho_{x,y} = \frac{\sum_{w \in \beta_f^\omega(x,y)} \mathrm{count}(w)}{|\beta_f^\omega(x, y)|}$$ (23)
Figure 12: A two-dimensional example of a Kohonen network grouping locally close words (small circles) into different categories.
Ordering the categories by descending $\rho_{x,y}$, we can select from each a number of the most frequent words. This method allows selecting words whose appearance is not the most frequent in the complete scope of the document, but is frequent enough in some sub-scope. Sample results are shown in Table 1, which presents the first eighteen results of the presented method; a sketch of the selection follows.
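In this sketch, beta comes from the categorization above and counts is assumed to map word identifiers to their (trash-filtered) frequencies:

    def top_keywords(beta, counts, per_category=5):
        def rank(cat):
            # equation (23): average count of the words gathered in a node
            return sum(counts[w] for w in beta[cat]) / len(beta[cat])
        for cat in sorted((c for c in beta if beta[c]), key=rank, reverse=True):
            best = sorted(beta[cat], key=lambda w: counts[w], reverse=True)
            yield cat, best[:per_category]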
4 THE NEURAL NETWORK
REINFORCEMENT
At this stage of the work we decided to extend the presented algorithm with reinforcement in order to improve its accuracy. The reinforcement is performed with the system presented in (Zyglarski et al., 2008), in which we have created a huge repository of documents. The articles used for testing the algorithms were also put into this repository, which means that all of them were compared using statistics of words, statistics of n-grams (Cavnar and Trenkle, 1994) (in this particular case 3-grams) and Kolmogorov complexity (Fortnow, 2004). Three kinds of distances between documents are thus computed.
Table 2: The comparison of results for the example text.

Statistical         Kohonen             Reinforced
words 40            words 40            keyword 25
keywords 25         keywords 25         distance 7
text 22             neural 13           neural 13
neural 13           distances 3         statistical 10
abb 13              text 22             context 4
$\hat{T}_f$ 11      simple 5            matrix 11
matrix 11           elements 2          denoted 4
tree 9              reading 8           vertex 3
frequent 9          extracted 3         web 2
$\hat{T}$ 9         $\hat{T}_f$ 11      relevant 2
bbbb 8              matrix 11           string 3
networks 8          distance 7          extracted 3
extracted 3         discovery 5         networks 8
4.1 An Idea of the Reinforcement
The main idea of the reinforcement (Sutton, 1996) is to modify the behavior of the neural network depending on the weight of a keyword candidate. At the beginning we initialize the weight attribute (with values from the interval [0, 1]) of every word from the document: each word which can be found in the previously mentioned trash word set has weight 0, and the rest have weight 1. The neural network algorithm is modified in such a way that words with smaller weights are pushed away from words with greater weights, i.e. they are pushed aside from the main categories; of course they also have a small influence on the category rank. Moreover, we add a parent iteration, which modifies the weights of words and repeats the neural network steps until the proper words are selected. After the word categorization is performed, a set of proposed keywords is generated. At this stage we check every keyword for accuracy. This is done by taking a number (in our tests 10) of articles containing the tested keyword, randomly selected from the repository, and comparing the normalized distances between them and the analyzed document. If these documents are relatively close (in terms of the computed distances) to the initial one, the keyword is rewarded by increasing its weight; if the distances are relatively large, the weight is decreased. In other words, if a selected keyword is good, the network is rewarded. With this improvement the algorithm continues with the neural network learning steps, as sketched below.
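In this sketch of the parent iteration, the repository interface (sample_articles, doc_distance), the weight step of 0.1 and the closeness threshold are hypothetical stand-ins, since the paper describes these components only in prose:

    def reinforce(weights, candidates, repository, document,
                  sample_size=10, step=0.1, threshold=0.5):
        for keyword in candidates:
            # pick a few random repository articles containing the keyword
            sample = repository.sample_articles(keyword, sample_size)
            dists = [repository.doc_distance(document, other)
                     for other in sample]
            mean = sum(dists) / len(dists) if dists else 1.0
            if mean < threshold:
                # the keyword links the document to similar ones: reward it
                weights[keyword] = min(1.0, weights[keyword] + step)
            else:
                # the sampled documents are distant: penalize the candidate
                weights[keyword] = max(0.0, weights[keyword] - step)
        return weights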
4.2 The Propriety of the Results
The methodology of creating the repository, described in (Zyglarski et al., 2008), guarantees that the collected documents have various subjects. They are gathered using the most frequent words appearing in each document. Additional variety is a result of the collection which initiates the repository, containing articles from various areas of interest. This means there is a good chance that articles containing the tested keyword are really connected with the same subject. If a keyword candidate isn't really a keyword, these documents will probably differ from the tested one and the network will not be reinforced.
5 THE COMPARISON OF
RESULTS
The presented method gives better results than the simple statistical method. Table 2 shows the keywords found in this article, chosen using all three methods (the actual keywords, subjectively selected by the authors, are marked in italics). It is clear that the Kohonen network based methods give better results than the statistical method, and that the reinforcement has a very good influence on the final results.
Figure 13: Effects of the statistical method. The X axis shows effectiveness; the Y axis shows the number of documents processed with this effectiveness.
Figure 14: Effects of the neural network method. The X axis shows effectiveness; the Y axis shows the number of documents processed with this effectiveness.
In our tests we used about 200 various articles. In most cases the results given by our approach were more accurate than those of the other approaches; the accuracy was checked manually and is subjective. According to the executed tests, the statistical method gave very poor results (see Figure 13): in the best case the list of proposed keywords achieved 65% accuracy, and in the worst case about 5%. Figure 13 presents the accuracies of the results over the tested articles (for example: in 66% of the articles the accuracy of the keywords was between 20% and 40%). The better results achieved with Kohonen networks without the reinforcement are presented in Figure 14: in the best case the list of proposed keywords achieved almost 80% accuracy, and accuracies in the 10%-40% range occurred much less often.
The best results were generated using reinforced Kohonen networks (Figure 15), where the best results reached a level of 90% accuracy and results in the 10%-40% range were completely eliminated.
Figure 15: Effects of the neural network method with reinforcement. The X axis shows effectiveness; the Y axis shows the number of documents processed with this effectiveness.
A comparison of the effectiveness of all the presented methods is shown in Figure 16.
Figure 16: Comparison of the presented methods (statistical approach, Kohonen networks, reinforced learning).
6 DOCUMENT MANAGEMENT
SYSTEM IMPLEMENTATION
The Web Services based Scientific Article Manager (Zyglarski et al., 2008) has been designed to provide knowledge management for users, with journal articles and internet documents as the main sources of information. The journal articles are usually retrieved in PDF format, but their content can be processed and analyzed automatically (excluding the relatively small number of documents which are stored as direct scans). Figure 17 presents the architecture of the learning part of the system; this article describes the details of the gray shaded part of the figure.
During the selection of documents connected to the examined article, three different methods of comparison are used and finally combined. This guarantees the accuracy of the obtained results, which take into account both the content and the structure of the documents:
1. Statistics based, performed by checking the appearance of all words in two documents. Each word is treated as a dimension and a taxicab metric is used for computing distances between documents. All punctuation and white characters are excluded.
2. N-grams based, performed by checking the appearance of all n-grams in two documents. Each n-gram is treated as a dimension, analogously to the previous method.
3. Kolmogorov distance based. The Kolmogorov complexity of a text X (denoted as K(X)) is the length of the shortest compressed binary file from which the original text can be reconstructed. Formally, the Kolmogorov complexity K(x) is defined as the length of the smallest program running on a Turing machine which returns the word x. In our case the Turing machine is substituted by a compression program, and the compressed file corresponds to the Turing program.
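One common way to turn a compressor into such a distance is the normalized compression distance of Cilibrasi and Vitányi, sketched here with zlib standing in for the compression program (the paper does not name the compressor it uses):

    import zlib

    def K(data: bytes) -> int:
        # approximate Kolmogorov complexity by the compressed length
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # normalized compression distance between two texts
        kx, ky, kxy = K(x), K(y), K(x + y)
        return (kxy - min(kx, ky)) / max(kx, ky)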
The presented system is implemented in the Java language and provides data with the use of web services technology. We have also implemented three kinds of user interfaces:
1. Servlet based;
2. Portlet based (used with the Liferay portal);
3. a .NET Windows client (intended in the future to work seamlessly within Windows Explorer).
Users can share and access each other's documents through an appropriate permission subsystem.
7 CONCLUSIONS
The presented approach gives in most cases very good results of context aware keyword recognition. Its effectiveness is related to the size of the repository of available documents, which are used for the reinforcement of the neural network; this means it is more effective when working with large data collections. Moreover, a continuously working Kohonen network can improve the keyword recognition every time new documents are added to the repository. If the neural network is stable, any change of keyword weights implies a reorganisation of the network, which (given the size of the performed changes: only the weights of proposed keywords are changed) is in most cases quite fast. Most of the processing time is spent checking the accuracy of the proposed keywords; fortunately, the results of this check are also used for the categorization of documents collected in our Document Management System.
Figure 17: The Document Management System implementation. The document analysis pipeline runs from START to END: an incoming scientific article is compared with the other documents in the repository (statistical, n-grams based and Kolmogorov methods), word distances are analyzed and words are categorized, documents are categorized, the partial results are combined into a documents ranking list, and the generated keywords are checked (PROPER/IMPROPER) to prepare the reinforcement.
REFERENCES
Cavnar, W. B. and Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, US.
Fortnow, L. (2004). Kolmogorov complexity and computational complexity.
Frank, E., Chui, C., and Witten, I. H. (2000). Text categorization using compression models. In Proceedings of DCC-00, IEEE Data Compression Conference, Snowbird, US, pages 200–209. IEEE Computer Society Press.
Kohonen, T. (1998). The self-organizing map. Neurocomputing, 21(1-3):1–6.
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, volume 8, pages 1038–1044.
Zyglarski, B., Bała, P., and Schreiber, T. (2008). Web services based scientific article manager. In Information Systems Architecture and Technology, Web Information Systems: Models, Concepts and Challenges, pages 205–215. Wrocław University of Technology.
APPENDIX
Scientific work co-funded from the resources of the European Social Fund and the state budget within ZPORR (Zintegrowany Program Operacyjny Rozwoju Regionalnego), Operation 2.6 "Regional Innovation Strategies and knowledge transfer" of the Kujawsko-Pomorskie voivodship, project "Scholarships for PhD students 2008/2009 - ZPORR".