hundred of the most frequent ones. All produced
roots can be divided into three categories:
grammatically correct, correct for all practical
purposes (FAP), and incorrect. FAP correct roots are
not grammatically correct, but they have been
extracted from a group of words that share a
common root. They are too long to be grammatically
correct; with some improvements in the last step
of the root extraction algorithm, they could become
grammatically correct. Incorrect roots, on the other
hand, are those extracted from groups of words that
do not share a common root.
The results are shown in Table 2.
Table 2: Accuracy of the hundred most frequent root
words extracted using morphological and semantic analysis
Category                 Share
Grammatically correct    60%
FAP correct              25%
Incorrect                15%
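The three categories defined above can be sketched as a small labelling routine. This is only an illustration, assuming that a hand-made reference root and a manual judgment of whether the source group shares a common root are available for every extracted root. The names classify_root and category_shares and the sample roots are hypothetical and do not come from the paper.

```python
from collections import Counter


def classify_root(extracted, reference_root, group_shares_root):
    """Label one extracted root with one of the three categories.

    reference_root and group_shares_root are manual judgments
    (assumed inputs, not something the algorithm produces itself).
    """
    if not group_shares_root:
        # The group mixed words with different roots.
        return "incorrect"
    if extracted == reference_root:
        return "grammatically correct"
    # Right group, but the extracted string is not the true root
    # (in the paper's experiment, typically too long).
    return "FAP correct"


def category_shares(labels):
    """Return the percentage of each category in a list of labels."""
    counts = Counter(labels)
    return {cat: 100.0 * n / len(labels) for cat, n in counts.items()}


# Made-up sample judgments: (extracted, reference root, group shares a root)
sample = [("pis", "pis", True), ("pisa", "pis", True), ("pr", "", False)]
labels = [classify_root(*item) for item in sample]
shares = category_shares(labels)
```

Applied to the hundred most frequent roots, such a tally would yield the kind of percentages reported in Table 2.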
As the word corpus grows, the number of
morpho-semantic groups would grow more slowly
and the average number of words per group would
increase. This would increase the number of
grammatically correct root words and decrease the
number of FAP (“for all practical purposes”)
correct root words. To eliminate the incorrect root
words, the clustering of morphologically similar
words needs to be improved. One way would be to
introduce a positional weight function that stresses
the digrams at the beginning, end or middle of the
stem, depending on the language being processed.
A word should also be allowed to belong to more
than one group, but then the last step of the
algorithm (extracting root words from
morpho-semantic groups of words) should be
improved, because the proposed one is too rigid.
Some sort of weighting could be used to achieve
that as well.
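The idea of a positional weight function can be illustrated with a minimal sketch. It assumes a Dice-style overlap of digrams as the similarity measure and a linear weight profile. Both are assumptions made for this example, and the names positional_weight, weighted_similarity, focus and boost are hypothetical rather than part of the described algorithm.

```python
def digrams(word):
    """Return the overlapping two-letter sequences of a word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def positional_weight(index, count, focus="start", boost=2.0):
    """Weight between 1.0 and boost for the digram at a given position.

    focus selects which part of the word is stressed:
    "start", "end" or "middle" (assumed linear profile).
    """
    if focus not in ("start", "end", "middle"):
        raise ValueError("focus must be 'start', 'end' or 'middle'")
    if count <= 1:
        return boost
    rel = index / (count - 1)  # 0.0 = first digram, 1.0 = last digram
    if focus == "start":
        closeness = 1.0 - rel
    elif focus == "end":
        closeness = rel
    else:
        closeness = 1.0 - abs(2.0 * rel - 1.0)
    return 1.0 + (boost - 1.0) * closeness


def weighted_bag(word, focus, boost):
    """Map each digram of the word to the sum of its positional weights."""
    grams = digrams(word)
    bag = {}
    for i, gram in enumerate(grams):
        weight = positional_weight(i, len(grams), focus, boost)
        bag[gram] = bag.get(gram, 0.0) + weight
    return bag


def weighted_similarity(a, b, focus="start", boost=2.0):
    """Dice-style similarity of two words over position-weighted digrams."""
    bag_a = weighted_bag(a, focus, boost)
    bag_b = weighted_bag(b, focus, boost)
    shared = sum(min(bag_a[g], bag_b[g]) for g in bag_a.keys() & bag_b.keys())
    total = sum(bag_a.values()) + sum(bag_b.values())
    return 2.0 * shared / total if total else 0.0
```

Under these assumptions, stressing the end of the word makes a prefixed pair such as "pisati" and "napisati" more similar than stressing the beginning does, while the opposite holds for a suffixed pair such as "pisati" and "pisanje". Which profile suits a given language would have to be determined experimentally.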
The same text was processed using Goldsmith's
Linguistica (a tool for morphological analysis).
Since Linguistica is oriented more toward the stem
of the word, the root words it produced were too
long (prefixes were rarely removed). Among the
hundred most frequent roots there were 15
grammatically correct ones, 75 FAP correct ones and
10 incorrect ones. Since both methods depend greatly
on the size of the corpus, more extensive tests are
needed.
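The two sets of figures can be put side by side with a few lines of arithmetic. The counts below are taken from the text (percentages of one hundred roots are read as counts). The "usable" share, meaning grammatically correct plus FAP correct, is a label introduced only for this sketch.

```python
# Counts among the hundred most frequent roots, as reported in the text.
RESULTS = {
    "proposed method": {"grammatical": 60, "fap": 25, "incorrect": 15},
    "Linguistica": {"grammatical": 15, "fap": 75, "incorrect": 10},
}


def summarize(counts):
    """Return the grammatically correct share and the combined
    (grammatically correct + FAP correct) share of the roots."""
    total = sum(counts.values())
    return {
        "grammatical_share": counts["grammatical"] / total,
        "usable_share": (counts["grammatical"] + counts["fap"]) / total,
    }


summary = {name: summarize(counts) for name, counts in RESULTS.items()}
for name, shares in summary.items():
    print(name, shares)
```

Read this way, the described method yields far more grammatically correct roots, while Linguistica yields slightly fewer incorrect ones. With only one hundred roots per method, such differences should be treated with caution.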
4 CONCLUSION
The goal of this paper is to explore methods for a
complete (morphological and semantic)
knowledge-free computer analysis of any natural
language text. The method described here gave good
results even when applied to a small text in which
only a few lexical variants of every word are
present. Further improvements of the algorithm are
necessary to avoid the creation of incorrect
morpho-semantic groups of words. Knowledge-free
tools for morphological analysis are probably the
right choice for languages that are not as
widespread as English, because creating
morphological dictionaries for "local" languages
has questionable cost effectiveness.
REFERENCES
F. C. Graham, 2004. Large Dynamic Graphs: What Can
Researchers Learn From Them?, SIAM News, vol. 37,
no. 3.
T. Landauer, S. Dumais, 1997. A Solution to Plato's
Problem: The Latent Semantic Analysis Theory of
Acquisition, Induction and Representation of
Knowledge, Psychological Review, vol. 104,
pp. 211-240.
P. Schone, D. Jurafsky, 2000. Knowledge-Free Induction
of Morphology Using Latent Semantic Analysis,
Proceedings of CoNLL-2000 and LLL-2000, Lisbon,
Portugal, pp. 67-72.
A. De Roeck, W. Al-Fares, 2000. A Morphologically
Sensitive Clustering Algorithm for Identifying Arabic
Roots, Proceedings of the 38th Annual Meeting of the
ACL, Hong Kong.
R. Scitovski, 1999. Numerička Matematika,
Elektrotehnički fakultet Osijek, Osijek.
P. Nakov, A. Popov, P. Mateev, 2001. Weight Functions
Impact on LSA Performance, EuroConference
RANLP'2001, Tzigov Chark, Bulgaria, pp. 187-193.
C. D. Manning, H. Schütze, 1999. Foundations of
Statistical Natural Language Processing, MIT Press,
Cambridge, MA, pp. 554-566.
M. Moguš, M. Bratanić, M. Tadić, 1999. Hrvatski čestotni
rječnik, Školska knjiga, Zagreb.
J. Goldsmith, 2001. Unsupervised Learning of the
Morphology of a Natural Language, Computational
Linguistics, pp. 153-189.
EXTRACTING MOST FREQUENT CROATIAN ROOT WORDS USING DIGRAM COMPARISON AND LATENT
SEMANTIC ANALYSIS