ranged to a 3x3 grid. The map receives and processes the vectors from the WCM output convolved by a Gaussian mask and produces the output which corresponds to the category of the input document. After the training, the DM output units were labeled manually.

Table 1: Results of document categorization using Document Map.

DM unit   Number of documents (in %) for category:
number    Sport   Policy   Foreign Actuality   Society
1         2.5     16.2     8.5                 0
2         45      9.4      16.5                20
3         20      2.4      0                   0
4         22.5    25.5     25                  0
5         0       0        0                   0
6         10      25.5     41.5                60
7         0       0        0                   0
8         0       0        0                   0
9         0       21       8.5                 20
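As a rough illustration of this processing step, the following pure-Python sketch smooths a two-dimensional histogram with a Gaussian mask and trains a 3x3 self-organizing map on the flattened result. It is a minimal sketch under assumptions, not the implementation used in this work: the 4x4 histogram size, the mask size and sigma, the learning-rate and radius schedules, the toy histograms, and all function names (`gaussian_mask`, `smooth`, `flatten`, `best_unit`, `train_som`) are hypothetical choices made for the example.

```python
import math
import random

def gaussian_mask(size=3, sigma=1.0):
    """Normalized size x size Gaussian mask (size is assumed to be odd)."""
    half = (size - 1) / 2.0
    g = [math.exp(-((i - half) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(a * b for a in g for b in g)
    return [[a * b / total for b in g] for a in g]

def smooth(hist, mask):
    """Apply the mask to a 2-D histogram with zero padding (same output size).
    The mask is symmetric, so correlation and convolution coincide."""
    rows, cols, k = len(hist), len(hist[0]), len(mask) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    r, c = i + di, j + dj
                    if 0 <= r < rows and 0 <= c < cols:
                        out[i][j] += hist[r][c] * mask[di + k][dj + k]
    return out

def flatten(grid):
    """Turn a 2-D grid into a flat input vector."""
    return [v for row in grid for v in row]

def best_unit(weights, x):
    """0-based index of the unit whose weight vector is closest to x."""
    dists = [sum((w - v) ** 2 for w, v in zip(unit, x)) for unit in weights]
    return dists.index(min(dists))

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a rows x cols SOM with a Gaussian neighbourhood on the grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # linearly decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighbourhood radius
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            x = data[idx]
            br, bc = coords[best_unit(weights, x)]
            for u, (r, c) in enumerate(coords):
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * radius ** 2))
                weights[u] = [w + lr * h * (v - w) for w, v in zip(weights[u], x)]
    return weights

# Hypothetical toy histograms standing in for the WCM output of two documents.
doc_a = [[3, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
doc_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 3]]
mask = gaussian_mask(3, 1.0)
data = [flatten(smooth(d, mask)) for d in (doc_a, doc_b)]
som = train_som(data)
unit_a = best_unit(som, data[0]) + 1  # 1-based unit number, as in Table 1
unit_b = best_unit(som, data[1]) + 1
```

After training, each unit number would still have to be given a category label by hand, as described above.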
The association of documents from particular categories with the clusters, which are represented by the DM output units, is presented in Table 1. It is evident that unit 2 is mostly activated for the sport category (45% of the sport documents), units 4 and 6 are activated especially for the policy category (25.5% each), etc.
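A table of this kind can be derived from a list of (category, winning unit) pairs. The sketch below shows one way to do so; the function name `unit_distribution` and the document counts are hypothetical. The 40 toy documents were chosen only so that the percentages reproduce the Sport column of Table 1, and the actual number of documents per category is not stated here.

```python
from collections import Counter, defaultdict

def unit_distribution(assignments, n_units=9):
    """assignments: iterable of (category, unit) pairs with 1-based unit numbers.
    Returns {category: [percentage of that category's documents in unit 1..n_units]}."""
    counts = defaultdict(Counter)
    for category, unit in assignments:
        counts[category][unit] += 1
    table = {}
    for category, per_unit in counts.items():
        total = sum(per_unit.values())
        table[category] = [100.0 * per_unit[u] / total for u in range(1, n_units + 1)]
    return table

# Hypothetical counts (40 documents) that happen to give the Sport column of Table 1.
toy = ([("Sport", 1)] * 1 + [("Sport", 2)] * 18 + [("Sport", 3)] * 8
       + [("Sport", 4)] * 9 + [("Sport", 6)] * 4)
table = unit_distribution(toy)
print(table["Sport"])
```

Each column is normalized separately, so the values for one category sum to 100 regardless of how many documents the other categories contain.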
The ART-2 network was developed to give an output comparable with that of the SOM-based categorizer. The ART-2 categorizer has nine output units (i.e. the network can create at most nine clusters). The set of documents used for training the SOM-based categorizer was also used here. The number of clusters actually created depended strongly on the parameter ρ (vigilance threshold). In our case ρ = 0.98 was used, because with smaller values of ρ most documents were assigned to a single cluster.
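The effect of the vigilance threshold can be illustrated with a much simplified stand-in for ART-2. The sketch below is not an ART-2 implementation: it omits the F1-layer normalization and the remaining network parameters, and only keeps the idea that an input joins the best-matching cluster when its match reaches ρ and otherwise opens a new cluster, up to nine. The cosine match, the prototype update rate `lr`, the fallback when all clusters are in use, the toy vectors, and the function names are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def vigilance_clustering(vectors, rho, max_clusters=9, lr=0.1):
    """Assign each vector to a cluster using a vigilance test on cosine similarity."""
    prototypes, labels = [], []
    for v in vectors:
        x = normalize(v)
        sims = [sum(a * b for a, b in zip(p, x)) for p in prototypes]
        best = max(range(len(sims)), key=lambda i: sims[i]) if sims else None
        if best is not None and sims[best] >= rho:
            # Match is good enough: move the winning prototype towards the input.
            moved = [(1 - lr) * p + lr * a for p, a in zip(prototypes[best], x)]
            prototypes[best] = normalize(moved)
            labels.append(best)
        elif len(prototypes) < max_clusters:
            # Vigilance test failed: open a new cluster for this input.
            prototypes.append(x)
            labels.append(len(prototypes) - 1)
        else:
            # Assumption of this sketch: with no free cluster, use the best match.
            labels.append(best)
    return labels, prototypes

# Hypothetical toy vectors with non-negative components, like term-based features.
docs = [[1.0, 0.2], [0.9, 0.3], [0.2, 1.0], [0.3, 0.9]]
labels_high, _ = vigilance_clustering(docs, rho=0.98)
labels_low, _ = vigilance_clustering(docs, rho=0.3)
print(labels_high)  # two clusters
print(labels_low)   # a single cluster
```

Because vectors with non-negative components tend to have a high mutual similarity, a low threshold lets the first cluster absorb every input, which mirrors the behaviour reported for smaller values of ρ.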
The results of categorization using the ART-2 categorizer are presented in Table 2. The values in the table have the same meaning as for the SOM-based categorizer. Documents with sport, policy and foreign actuality topics are well separated (see the values for units 7, 5 and 1, respectively), whereas documents dealing with society news were mostly assigned to the same cluster as documents about policy (output unit 5).
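The dominant output unit of each category can be read off Table 2 mechanically. The short sketch below does this with the values from the table; the names `TABLE_2` and `dominant_unit` are introduced here for illustration only.

```python
# Columns of Table 2: percentage of each category's documents per ART-2 output unit 1..9.
TABLE_2 = {
    "Sport":             [8.4, 0.4, 0.1, 14.7, 5.8, 0.2, 56.3, 5.7, 8.4],
    "Policy":            [11.4, 2.3, 0.0, 5.6, 58.3, 0.1, 14.7, 2.9, 4.7],
    "Foreign Actuality": [53.7, 1.4, 0.0, 0.5, 10.4, 0.5, 16.3, 4.1, 13.1],
    "Society":           [17.7, 2.4, 0.0, 8.9, 44.4, 0.0, 13.3, 4.4, 8.9],
}

def dominant_unit(percentages):
    """Return (1-based unit number, percentage) of the largest entry."""
    best = max(range(len(percentages)), key=lambda i: percentages[i])
    return best + 1, percentages[best]

dominant = {category: dominant_unit(column) for category, column in TABLE_2.items()}
for category, (unit, share) in dominant.items():
    print(f"{category}: unit {unit} ({share}%)")
```

Policy and Society share unit 5 as their dominant unit, which is the overlap noted above.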
The comparison of the SOM and ART-2 based categorizers is quite difficult and is still under investigation. Since changes in the SOM network parameters affect the resulting clusters less than is the case for the ART-2 network, its results seem to be more natural. The advantage of the SOM categorizer is its low number of parameters. The ART-2 network is very sensitive to parameter settings: seven parameters (including ρ mentioned above) have to be set before training the network. If the parameters are chosen properly, the network can give better categorization results than the SOM categorizer.

Table 2: Results of document categorization using ART-2 categorizer.

ART-2 output   Number of documents (in %) for category:
unit number    Sport   Policy   Foreign Actuality   Society
1              8.4     11.4     53.7                17.7
2              0.4     2.3      1.4                 2.4
3              0.1     0        0                   0
4              14.7    5.6      0.5                 8.9
5              5.8     58.3     10.4                44.4
6              0.2     0.1      0.5                 0
7              56.3    14.7     16.3                13.3
8              5.7     2.9      4.1                 4.4
9              8.4     4.7      13.1                8.9
In our future work we plan to focus on the following tasks, which could improve the results of document categorization:
• introduction of another feature set for word description,
• application of other neural networks trained with supervision (e.g. multilayer perceptron, LVQ, etc.) as a second layer,
• usage of more sophisticated approaches for comparison of categorization results.
ACKNOWLEDGEMENTS
This work was supported by grant no. 2C06009 Cot-Sewing.
COMPARISON OF NEURAL NETWORKS USED FOR PROCESSING AND CATEGORIZATION OF CZECH
WRITTEN DOCUMENTS