4 CONCLUSIONS
The evaluation of 10% of the documents used in the experiment
was performed manually by domain experts. The most relevant
clusters were created with the 2-pass correlation and joint
clusters method. The extrinsic clustering measures were 0.87
for purity and 0.22 for entropy. The evaluation on the Reuters
data set showed 0.62 for purity and 1.44 for entropy (see
Table 3). These differences arise from the genre and topics of
the texts in the Reuters data set, which is a general-language
corpus. One important aspect of our methodology is the use of a
technical thesaurus to assign importance weights to the NPs
found in the text.
Table 3: Results of the clustering methodology applied to
technical documents and general-language texts.
Text genre            Purity   Entropy
Technical documents   0.87     0.22
General texts         0.62     1.44
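For readers who want to reproduce measures of this kind, the following is a minimal sketch of purity and entropy as they are commonly defined for extrinsic cluster evaluation. It is not the evaluation code used in the experiment: the function names, the toy data, and the choice of base-2 logarithms are assumptions, since the text does not state the formulas or the logarithm base.

```python
import math
from collections import Counter, defaultdict


def _group_labels(clusters, labels):
    """Collect the gold labels of the documents assigned to each cluster."""
    grouped = defaultdict(list)
    for cluster_id, label in zip(clusters, labels):
        grouped[cluster_id].append(label)
    return grouped


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    grouped = _group_labels(clusters, labels)
    majority = sum(max(Counter(members).values()) for members in grouped.values())
    return majority / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the label entropy inside each cluster.

    Assumption: logarithms are base 2; another base only rescales the value.
    """
    grouped = _group_labels(clusters, labels)
    n = len(labels)
    total = 0.0
    for members in grouped.values():
        size = len(members)
        cluster_entropy = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * cluster_entropy
    return total


# Hypothetical toy data: six documents, two clusters, two gold topics.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
print(round(purity(toy_clusters, toy_labels), 3))   # 0.833
print(round(entropy(toy_clusters, toy_labels), 3))  # 0.459
```

Higher purity and lower entropy both indicate clusters that agree more closely with the expert-assigned topics, which is the direction of the difference between the two rows of Table 3.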
The results of this experiment were applied in the
RKB Knowledge Base. When a user views a particular
publication, the RKB Knowledge Base provides a list of
the most relevant publications.
ACKNOWLEDGEMENTS
The authors wish to thank Hugh Glaser and Ian
Millard of Southampton University for their advice
and cooperation regarding the Resilience Knowledge
Base. This research has been supported in part by
EC IST contract no. 026764, Network of Excellence
ReSIST (Resilience for Survivability in IST).
Gintare Grigonyte was supported by DAAD
(Deutscher Akademischer Austauschdienst) grant
A/07/92317.
LINGUISTICALLY ENHANCED CLUSTERING OF TECHNICAL PUBLICATIONS