A Dictionary based Stemming Mechanism for Polish

Michał Korzycki

AGH University of Science and Technology, Department of Computer Science,

Al. Mickiewicza 30, Kraków, Poland

Abstract. In this paper we present and evaluate a robust stemming mechanism
for Polish. We use the Polish Inflection Dictionary to build a Rule Based Stemmer
and a Generative Reversed Rule Stemmer. Combining both stemmers into the Hybrid
Stemmer described here yields a high-precision stemming mechanism that is able to
match human performance. This claim is supported by an experiment whose results
are presented.

1 Introduction

Human linguistic skills clearly include potential abilities, i.e. the ability to deal

with words, forms and expressions that have not been encountered before. That is di-

rectly related to the nature of the language - an ever changing entity, conveying new

information, often using new, unknown means - such as new words and expressions.

Computer systems that are dealing with language should try to replicate this behavior

in order to be of use in any serious application. This paper describes the process and

results of building a stemmer (a mechanism for generating a base form for a word form

found in text) that also is able to assign some grammatical categories to the found form.

The process of creating the stemmer is automatic - its rules are automatically extracted

from the Polish Inﬂection Dictionary, thus are a direct result of analyzing the language

itself and are not biased by some prior grammatical preconceptions. That permits us to

postulate that the resulting mechanism is able to recreate closely something that can be

called a natural grammar - linguistic knowledge coming directly from observation and

not from formal grammar deﬁnitions.

2 Related Work

The presented work is based on the notion of generative grammar introduced by N. Chomsky
[1], which was expanded into the two-level morphology formalism of K. Koskenniemi
[2]. These formal approaches were crucial in the creation of the Polish Inflection
Dictionary [3], on which we build. A detailed description of this dictionary
and of the approach to its creation can be found in Lubaszewski et al. [4].

There are already some stemming solutions available for Polish, such as the commercial
solutions Gram from Neurosoft or PoMor from MorphoLogic. Among the free
and open stemmers, one must count Stempel created by A. Białecki and L. Galambos,
the Lametyzator and Stempelator by D. Weiss [6] and the classic SAM [7]. A stemmer

Korzycki M..

A Dictionary based Stemming Mechanism for Polish.

DOI: 10.5220/0004100301430150

In Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science (NLPCS-2012), pages 143-150

ISBN: 978-989-8565-16-7

© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

often acts as a replacement for a dictionary, so a comparison of stemmers is often a
comparison of the quality of the underlying dictionaries. Creating a comparison metric that
would evaluate the quality of the stemming mechanism in isolation from the
underlying dictionary has been suggested in [8] and is beyond the scope of this paper.

3 Requirements for a Robust Stemmer

The stemmer that will be presented should be able to mimic human skills as closely as

possible. If we were to list those skills, we would point to the following issues:

– Being able to discern exceptions from general rules. Language is an entity that
has been created through an evolutionary process, resulting in many different competing
layers and grammars. This often leaves vestigial grammars that
describe behaviors different from the rest of the language.
– Correct behavior on known words. We use the Polish Inflection Dictionary as the
base for our linguistic knowledge. We need the stemmer to be fully compliant with
that dictionary.
– For words that are not found in that dictionary, we need the stemmer to
be able to stem them correctly and identify their part of speech.

4 Polish Word Representation

Polish is a highly inflected language. Each primary word has a number of inflectional
forms: verbs have 47 (if we exclude participles), adjectives 44, numerals up to 49, nouns
and pronouns 14, and adverbs 3. These figures, and the fact that many words have irregular
stem alternations, show that Polish inflection presents real problems for the
computational linguist [5]. Consider how to inflect a Polish word properly, e.g. the personal
masculine noun aktor ('actor'):

        Singular    Plural
Nom.    aktor-0     aktorz-y
Gen.    aktor-a     aktor-ów
Dat.    aktor-owi   aktor-om
Acc.    aktor-a     aktor-ów
Instr.  aktor-em    aktor-ami
Loc.    aktorz-e    aktor-ach
Voc.    aktorz-e    aktorz-y

The grammarian's answer is that one must first learn Polish lexical grammar and
then apply that grammar to particular lexical items. But, in fact, if one wants to inflect a
particular word properly, one must first select the proper inflection ending, e.g. -0, -a, -o,
-e, -e, -i or -y to form Masc. Pers. Nom. Sing., -a, -e, or -ego to form Gen. Sing., -owi,
-u, or -emu to form Dat. Sing., etc., and then apply the proper stem alternation
rule. As we can see, the inflectional stem of the word aktor changes from aktor- to
aktorz- before the endings -e and -y, which is the result of the palatalization process. Let's


compare the behaviour of the final stem consonant before the ending -y, which can occur in
Nom. Sing. and Nom. Plur. of nouns, e.g. aktor-0 : aktorz-y, senior-0 : seniorz-y ('older
person'), amor-0 : amor-y ('cupid'), gbur-0 : gbur-y ('bumpkin'), traktor-0 : traktor-y,
and adjectives, e.g. któr-y : którz-y ('which') and star-y : starz-y ('old'). It is clear that the
global phonological rule which says that a front vowel causes consonant palatalization
is not appropriate: in aktorz-y and seniorz-y the palatalization takes place before
-y in Nom. Plur., but cf. the co-existence of amor-y, gbur-y, traktor-y, and of któr-y, star-y
alongside którz-y, starz-y (respectively Nom. Sing. and Nom. Plur. of the adjective). This shows that
there is a need for a new approach to the stem alternation process. The data show that
the belief that it is possible to develop an efficient stemming algorithm for Polish seems
to be a naïve one. We argue that if one wants to create an algorithm that recognizes a
particular word properly, one must store all inflection forms in the dictionary, word by
word.

5 The Polish Inﬂection Dictionary; its Lexical Grammar and

Generative Mechanism

The Polish Inflection Dictionary [3] is the base that has been used to create the stemmer
described in this article. The construction of the dictionary is based on over 420 identified
lexical categories. Each is defined by the inflection patterns used to generate it. The first
element of the dictionary is the set of rules that are used to assign a lexical category to
a word based on its ending.

Each inﬂection category pattern is represented by its speciﬁc local grammar, which

consists of two elements: a vector of inﬂection endings associated with the category,

and the proper local grammar rules, mainly related to stem alternation rules.

There are words in Polish that in general behave according to a specific inflection
pattern, but some of whose forms do not strictly match the pattern (for example, the words
handel 'commerce' and hotel 'hotel' have the genitive forms handlu and hotelu,
respectively). Such cases are described by additional exception rules,
which cover over 11,000 cases like the one above.

Although the generative approach used to build the Inflection Dictionary comes directly
from the concept of two-level morphology [2], it cannot be used directly for word form
recognition. The reason is that the dictionary generating mechanism has been
augmented by additional filters, the last building block used for generating the Inflection
Dictionary. These filters reject forms that are formally correct but
that the language itself has rejected. The rejected forms range from illegal comparative
forms of adjectives (bardziej chory but not chorszy 'more ill') to plurale tantum forms
(spodnie 'trousers') that for some pragmatic reasons do not possess singular forms.

Morphological relations are another problem that cannot be described in rule form.
Those relations join different words that share the same lexical meaning, e.g. the imperfective,
perfective and iterative forms of verbs, pisać : napisać : pisywać 'to write',
where one cannot specify the prefix to build the proper perfective form, cf. od-pisać 'to
answer', prze-pisać 'to copy', nad-pisać 'to overwrite' and so on. In addition, the presence
of iterative forms depends on the meaning of a specific verb. It seems impossible


to determine the rules which guide those language selection mechanisms, so the ﬁlters

had to be provided manually.

All this results in a dictionary of very high quality, but at the cost of not being able to
reverse its generative mechanism for recognition. The dictionary consists of more than
120,000 lexical entries excluding proper names, with more than 3,300,000 inflection
forms.

6 Building a Stemmer by Extracting Rules from the Dictionary

The stemmer described in this article is composed of two elements. The first part has
been automatically generated from the observation of the forms occurring in the Inflection
Dictionary. That approach gives it a large amount of flexibility, as it can be used on
any observable linguistic data. We will refer to it further in the text as the rule-based
stemmer. The second is based on reversing the generation rules of the Inflection Dictionary.
As mentioned above, such a reversal is imperfect and can lead to erroneous
results, but as it is applied only after the rule-based stemmer has failed to provide an
answer, those imperfections (mainly multiple potential results) can be accepted.

As the first stage in creating the rule-based stemmer, we need to identify homographs.
Homographs are a major issue to consider when trying to identify words in a text. Their
occurrence makes it difficult to discern between words and their forms. In a highly
inflected language such as Polish, homographs can be an incidental result of the rich
inflection (as with the word mamy, which can be a form of both the verb mieć 'to have' and
the noun mama 'mom/mother'). But they can also be the result of a much deeper phenomenon,
such as the inability to distinguish in Polish between the genitive and accusative case
of personal nouns. They can also be the result of a common etymology of different words:
palący, from palić 'to smoke', can be a form of a participle, an adjective and a
noun.

The mechanism described in this paper should be able to assign proper lexical values
to words found in texts based on the values available in the Inflection Dictionary. Its
content is therefore transformed in order to cope with the problem of homographs described
above. We introduce the notion of meta-tags that describe homographs. For example, the
meta-tag ADCAAA-2-3-6-7-9-BDC-26 designates the set of the 2nd, 3rd, 6th, 7th and 9th
forms of the lexical group ADCAAA and the 26th form of the lexical group BDC. This
is an example of a proper meta-tag that is a container for over 3,000 participles, and as
such represents a real linguistic phenomenon and not only a random event that should
be discarded as an exception.
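The construction of a meta-tag can be illustrated with a minimal sketch. It assumes that the dictionary content is available as flat (form, lexical group, position) triples and that groups and positions are joined in sorted order; both the data layout and the name build_meta_tags are assumptions made for illustration, not the actual format of the Inflection Dictionary.

```python
from collections import defaultdict


def build_meta_tags(entries):
    """Map each surface form to one meta-tag built from all of its readings.

    `entries` is an iterable of (form, lexical_group, position) triples.
    This flat layout is an assumption made for the sketch.
    """
    readings = defaultdict(lambda: defaultdict(set))
    for form, group, position in entries:
        readings[form][group].add(position)

    meta_tags = {}
    for form, groups in readings.items():
        parts = []
        for group in sorted(groups):  # assumed ordering of lexical groups
            parts.append(group)
            parts.extend(str(p) for p in sorted(groups[group]))
        meta_tags[form] = "-".join(parts)
    return meta_tags


# "form_x" is a placeholder for one homographic form, not a real word.
entries = [("form_x", "ADCAAA", p) for p in (2, 3, 6, 7, 9)]
entries.append(("form_x", "BDC", 26))
print(build_meta_tags(entries)["form_x"])  # ADCAAA-2-3-6-7-9-BDC-26
```

All forms that end up with the same meta-tag string can then be treated as one class when rules are extracted.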

In order to extract grammar rules from the Inflection Dictionary, we build a prefix
tree (trie) out of its entries, with each entry read in reversed order, from right to left. The
leaves of that tree are the meta-tags of the represented forms.

This trie is searched from top to bottom in order to find the nodes such that all
leaves below the node share the same meta-tag (grammatical description). With
such a node we can identify a key, that is, the string that can be constructed by going
from the root of the trie to this node, and its value, the unique meta-tag of the
leaves below that node. Such a key-value pair can be represented in the form -onoma
=> AAAAAA-2-4, something that will be referred to below as a rule.


The rule has a straightforward interpretation: it signifies that in the Inflection Dictionary
all forms ending with -onoma belong to the lexical group described as AAAAAA
and occupy the 2nd or 4th position of its form vector. Unknown words with such an ending
will also be identified as belonging to this lexical group.
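The rule extraction described above can be sketched as follows. This is a simplified illustration under stated assumptions: each form carries exactly one meta-tag, the trie is stored as nested dictionaries, and the tag XXX-1 and the function names are invented for the example. It is not the implementation used to build the stemmer.

```python
def build_reversed_trie(tagged_forms):
    """Build a trie over forms read right to left; every node records the
    set of meta-tags of all forms that pass through it."""
    root = {"children": {}, "tags": set()}
    for form, tag in tagged_forms:
        node = root
        node["tags"].add(tag)
        for ch in reversed(form):
            node = node["children"].setdefault(
                ch, {"children": {}, "tags": set()}
            )
            node["tags"].add(tag)
    return root


def extract_rules(root):
    """Search top-down and stop at the first node on each path below which
    all forms share one meta-tag; the path read back is the rule key."""
    rules = {}
    stack = [(root, "")]
    while stack:
        node, suffix = stack.pop()
        if len(node["tags"]) == 1:
            rules["-" + suffix] = next(iter(node["tags"]))
            continue  # nothing below this node can change the answer
        # A form that ends exactly at an ambiguous node gets no rule here;
        # in this sketch such forms are simply left out.
        for ch, child in node["children"].items():
            stack.append((child, ch + suffix))
    return rules


# Toy data: the tag XXX-1 is made up for illustration.
forms = [
    ("ekonoma", "AAAAAA-2-4"),
    ("agronoma", "AAAAAA-2-4"),
    ("mama", "XXX-1"),
    ("dama", "XXX-1"),
]
rules = extract_rules(build_reversed_trie(forms))
for key in sorted(rules):
    print(key, "=>", rules[key])
```

On this toy data the key for the first group is only -oma, because nothing else in the sample competes with it; on the full dictionary longer keys such as -onoma are needed before the forms below a node become unambiguous.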

It is important to be able to discern exceptions from general rules, as some words,
often representing a vestigial grammar, should not influence our ability to recognize
new words. That decision comes from the observation that newly appearing words tend to
have a much more regular inflection in general. The distinction between exceptions
and rules comes directly from a set of simple observations made on the trie described
above. First, we identify as 0-level exceptions the words in the dictionary that belong to
categories that are not inflected (all their forms are identical). 21,331 such words have
been found. Next, we identify as 1st-level exceptions those words that belong to categories
whose rules contain only full words as keys; such rules are just word
dictionaries and have no discerning power for unknown words. 3,882 such words
have been found. After that, the trie is rebuilt without the level 0 and 1 exceptions.
After extracting the rules from the trie, some of the rules again contain full words as
keys, and some categories have only full-word keys. Those keys (words) are identified
as 2nd-level exceptions and listed separately. There are 9,947 such rules (exceptions)
identified. They usually contain categories of rare uninflected nouns such as husky, collie
etc. Finally, we list as exceptions (not rules) those keys that are identical to the forms.
These are usually very short words, so short that they are not much different in length
from typical lexical endings. In this category we have words like efeb 'ephebe', which
found themselves in the same category group as the keys -ozof or -ligraf, all of which lead to the
lexical label AAAAAA-1. That group of 3rd-level exceptions is the largest one at 156,970
elements, as it contains short words, which are quite numerous in the language itself. These
operations lead to the following example rules after discarding the exceptions from the
trie:

-achówki ADAB-2-8-11-14 -achówka

-achtem ACAAAAA-5 -acht

Here the first column corresponds to the inflected ending of the stemmed form to
be matched, the second describes the identified lexical value (the inflection category or
categories and the corresponding indexes in their form vectors), and the third gives the ending of
the base form, which has to replace the inflected ending in column 1 in order to obtain the
sought base form.
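The three-column rule format can be read mechanically, as the following sketch shows. The whitespace-separated line layout is taken from the examples above; the function names and the sample word jachtem ('yacht', instrumental) are illustrative assumptions.

```python
def parse_rule(line):
    """Split a rule line into (inflected ending, lexical value, base ending)."""
    inflected, lexical_value, base = line.split()
    return inflected.lstrip("-"), lexical_value, base.lstrip("-")


def apply_rule(form, rule):
    """Replace the inflected ending with the base ending if the rule matches;
    return None when the form does not end with the rule's key."""
    inflected, lexical_value, base = rule
    if not form.endswith(inflected):
        return None
    return form[: len(form) - len(inflected)] + base, lexical_value


rule = parse_rule("-achtem ACAAAAA-5 -acht")
print(apply_rule("jachtem", rule))  # ('jacht', 'ACAAAAA-5')
print(apply_rule("aktorem", rule))  # None
```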

The process of creation of the stemmer guarantees us one important property - if a

rule matches the analyzed form, it will be the only rule matching and the result will be

unambiguous.

7 The Rule-based Stemming Mechanism

The application of the rule-based stemmer to a word is a two-step process:

– First, we check whether the word is an exception. If it is found on the exception list, we
return the lexical value associated with that exception.


– If it is not an exception, we find the key of a rule that matches the ending of that
word. As the keys are extracted from the original dictionary trie so as to determine
the lexical category unambiguously, the keys are mutually exclusive and do not
include each other.
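The two steps above can be sketched as a lookup in an exception table followed by a suffix match. The dictionary-based data structures, the function name and the tag UNINFLECTED are assumptions made for this illustration. Because no key is a suffix of another key, at most one suffix of the word can match, so the order in which suffixes are tried does not matter.

```python
def rule_based_stem(word, exceptions, rules):
    """Return (base form, lexical value), or None if nothing matches.

    exceptions: full word -> (base form, lexical value)
    rules: inflected ending -> (lexical value, base ending)
    """
    # Step 1: exceptions are stored as full words.
    if word in exceptions:
        return exceptions[word]
    # Step 2: try every suffix of the word as a rule key.
    for start in range(len(word)):
        rule = rules.get(word[start:])
        if rule is not None:
            lexical_value, base_ending = rule
            return word[:start] + base_ending, lexical_value
    return None


# Toy data: the tag UNINFLECTED is hypothetical.
exceptions = {"husky": ("husky", "UNINFLECTED")}
rules = {"achtem": ("ACAAAAA-5", "acht")}

print(rule_based_stem("husky", exceptions, rules))    # ('husky', 'UNINFLECTED')
print(rule_based_stem("jachtem", exceptions, rules))  # ('jacht', 'ACAAAAA-5')
print(rule_based_stem("xyz", exceptions, rules))      # None
```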

8 The Reversed Generating Rule-based Stemmer

As described above, the Inflection Dictionary has been generated by two sets of rules.
The first is the set of category recognition rules based on the word ending (as in -iwiec
=> AAACBA). The second is a set of stem alternation rules. After applying these
two sets of rules, a vector of inflected endings is applied to generate the form vector for the
given word.

The Inflection Dictionary generation rules are used to create the mechanism of the
reversed generating rules stemmer. The set of lexical endings from each specific category
is prepended with the alternation rules for that category (or taken as such without
them). That leads to a set of rules that can create many false positives, because in the generative
case the alternation mechanism was optional, i.e. it was used only if it was applicable to a
specific stem. Here we must consider the potential applicability of alternations in each
case, which leads to superfluous answers.

The recognition rules derived from the generative rules of the Inflection Dictionary
(Section 5) include, for example:

ACACBC awiec awca

ACACBC ec cowi

Here the first column corresponds to the lexical category, the second to the ending
of the base form, and the third to the inflected ending that has to be removed and
replaced by the value in the 2nd column in order to obtain the sought base form.
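A minimal sketch of applying such reversed rules is given below. Unlike the rule-based stemmer, every matching rule contributes a candidate, so the result is a list that may contain several answers or incorrect ones. The function name is an assumption, and the sample words krawca (from krawiec 'tailor') and malcowi (from malec 'tot') are only assumed to fall under category ACACBC for the sake of the example.

```python
def reversed_stem(word, reversed_rules):
    """Return every (base form, category) candidate produced by a matching rule.

    reversed_rules: list of (category, base ending, inflected ending) triples.
    """
    candidates = []
    for category, base_ending, inflected_ending in reversed_rules:
        if word.endswith(inflected_ending):
            stem = word[: len(word) - len(inflected_ending)]
            candidates.append((stem + base_ending, category))
    return candidates


REVERSED_RULES = [
    ("ACACBC", "awiec", "awca"),
    ("ACACBC", "ec", "cowi"),
]

print(reversed_stem("krawca", REVERSED_RULES))    # [('krawiec', 'ACACBC')]
print(reversed_stem("malcowi", REVERSED_RULES))   # [('malec', 'ACACBC')]
# The second rule also fires on krawcowi and yields the non-word 'krawec',
# an example of the superfluous answers mentioned above.
print(reversed_stem("krawcowi", REVERSED_RULES))  # [('krawec', 'ACACBC')]
```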

9 The Hybrid Stemmer

Combining the two stemmers (the rule-based and the reversed generating rule stemmer) by using
them sequentially gives us a solution that combines the power of both approaches:

– By using the rule-based stemmer first, we obtain a mechanism that works perfectly
on all known words and generates high-precision, unambiguous results for recognized
unknown words.
– If the former approach fails (does not provide a result because the form has not been
matched by any rule), a more "fuzzy" mechanism based on the reversed generative
rules is applied. As the analyzed word has not been matched in the first step, it can
be seen as "hard" and can be interpreted in an ambiguous way. Additional tools
working on a wider scope than a single form (such as a contextual morphosyntactic
disambiguator) may be required for further processing, but issues dealing with more
than one form are beyond the scope of this paper.
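The sequential combination can be sketched as a simple fallback. The two component stemmers are passed in as functions; the toy stand-ins below exist only to make the example runnable and are not the stemmers described in this paper.

```python
def hybrid_stem(word, rule_based, reversed_rules):
    """Try the unambiguous rule-based stemmer first and fall back to the
    reversed generating rules, which may return several candidates."""
    result = rule_based(word)
    if result is not None:
        return [result]
    return reversed_rules(word)


# Toy stand-ins for the two component stemmers (illustrative only).
KNOWN = {"jachtem": ("jacht", "ACAAAAA-5")}


def toy_rule_based(word):
    return KNOWN.get(word)


def toy_reversed(word):
    if word.endswith("awca"):
        return [(word[:-4] + "awiec", "ACACBC")]
    return []


print(hybrid_stem("jachtem", toy_rule_based, toy_reversed))    # [('jacht', 'ACAAAAA-5')]
print(hybrid_stem("krawca", toy_rule_based, toy_reversed))     # [('krawiec', 'ACACBC')]
print(hybrid_stem("jazzmenów", toy_rule_based, toy_reversed))  # []
```

Returning a list in both branches lets a later disambiguation step treat the unambiguous and the ambiguous case uniformly.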


10 Benchmarking the Hybrid Stemmer

The goal set for the created mechanism is to mimic human behavior in word recognition
as closely as possible, so the proper benchmark for its efficiency should
be based on human evaluation. The task of stemming words was presented to human
participants in a survey. The survey was composed of words to be stemmed in 5 categories:

– Category A - 10 "hard" word forms from the dictionary: those having homographs,
both incidental (damy from dać 'to give' and dama 'dame') and systemic (participles
versus deverbative nouns such as palący 'smoking' and 'smoker').
– Category B - 11 word forms not in the Inflection Dictionary, but recognized by the
rule-based stemmer, such as edynburskim from edynburski ('from Edinburgh').
– Category C - 10 word forms not in the Inflection Dictionary that are correctly recognized
by the Hybrid Stemmer, but not by the rule-based stemmer: words like
handlu from handel ('commerce').
– Category D - 6 word forms not found in the Inflection Dictionary that are correctly
recognized by the Hybrid Stemmer, but not by the rule-based stemmer, and not
unambiguously. They include such
forms as krasnoarmiejców from krasnoarmiejec (adopted from Russian: 'soldier of
the Red Army').
– Category E - 3 words unrecognized by the stemmer, like jazzmenów from jazzman.

The score has been evaluated in each category separately, with a value of 50% given
for a partially correct (ambiguous or incomplete) answer.

Table 1. Survey results overview and comparison with the Hybrid Stemmer performance.

          Hybrid     Survey Results
Category  Stemmer    Average   Median    4th Quartile  95th Percentile
A         100.00%    50.80%    46.60%    54.10%        83.30%
B         100.00%    60.41%    63.64%    70.45%        78.64%
C         100.00%    90.00%    90.00%    100.00%       100.00%
D         50.00%     75.56%    83.33%    83.33%        85.00%
E         0.00%      38.60%    33.33%    66.66%        66.66%
Average   85.00%     65.89%    65.63%    74.15%        83.91%
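The Average row for the Hybrid Stemmer in Table 1 is consistent with weighting each category by its number of word forms (10, 11, 10, 6 and 3). The short check below makes that arithmetic explicit; the weighting scheme is inferred from the figures and is not stated in the text, so it should be read as an assumption.

```python
# Number of word forms per survey category, as listed in the category list.
SIZES = {"A": 10, "B": 11, "C": 10, "D": 6, "E": 3}
# Hybrid Stemmer scores per category, from Table 1.
HYBRID = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.5, "E": 0.0}


def weighted_average(scores, sizes, categories):
    """Average of per-category scores weighted by category size."""
    total = sum(sizes[c] for c in categories)
    return sum(scores[c] * sizes[c] for c in categories) / total


all_categories = weighted_average(HYBRID, SIZES, "ABCDE")
without_a = weighted_average(HYBRID, SIZES, "BCDE")
print(f"{all_categories:.2%}")  # 85.00%
print(f"{without_a:.2%}")       # 80.00%
```

The second figure, restricted to categories B to E, is 24 out of 30 word forms.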

To give an estimate of the efficiency of the described mechanism, we provide
a general performance comparison to D. Weiss's Stempelator, a very popular stemmer
solution that is probably the closest in terms of general architecture to the Hybrid Stemmer
described above. The Stempelator [6] is similarly composed of two parts: the
Lametyzator, which extracts base forms from the ispell dictionary for Polish, and Stempel,
a rule-based stemmer for forms not found by the Lametyzator. As the comparison
should be between the stemming mechanisms, we compare the results obtained by
Stempel on the survey above, restricted to categories B to E, so as to compare
elements that do not belong to the dictionary from the point of view of both stemmers. The
results presented below should be regarded more as a comparison of the performance
of both stemmers in reference to human performance than as a relative comparison
between them.


Table 2. Survey results with Hybrid Stemmer and Stempelator performance.

          Hybrid     Stempelator   Survey
Category  Stemmer    Performance   Median
B         100.00%    63.64%        63.64%
C         100.00%    50.00%        90.00%
D         50.00%     33.33%        83.33%
E         0.00%      33.33%        33.33%
Average   80.00%     46.67%        66.66%

11 Conclusions

As can be seen, especially in difficult cases, the performance of the described stemmer exceeds
typical human skills (in fact, only one participant of the survey scored higher than the
described mechanism). This result shows that we were able to create a mechanism that
mimics human behavior in word recognition, extracts its data from the language itself
and does not overgeneralize its rules, thanks to its ability to discern between exceptions
and generic rules.

References

1. Chomsky, N.: Aspects of the Theory of Syntax. MIT Press (1965)
2. Koskenniemi, K.: Two-level Morphology - A General Computational Model for Word-Form
Recognition and Production. University of Helsinki, Publication No. 11 (1983)
3. Lubaszewski, W., Wróbel, H., Gajęcki, M., Moskal, B., Orzechowska, A., Pietras, P., Pisarek,
P., Rokicka, T.: Słownik Fleksyjny języka polskiego. Lexis Nexis, Kraków (2001)
4. Lubaszewski, W. (ed.): Słowniki komputerowe i automatyczna ekstrakcja informacji z tekstu.
AGH Press, Kraków (2009), original text in Polish
5. Lubaszewski, W.: A Grammar for the Polish Inflection Lexicon. TASK Quarterly: scientific
bulletin of Academic Computer Centre in Gdansk, ISSN 1428-6394, vol. 4, no. 2,
pp. 291-300 (2000)
6. Weiss, D.: Stempelator: A Hybrid Stemmer for the Polish Language. Technical Report
RA-002/05, Institute of Computing Science, Poznań University of Technology, Poland (2005)
7. Weiss, D.: A survey of freely available Polish stemmers and evaluation of their applicability
in information retrieval. In: Human Language Technologies as a Challenge for Computer
Science and Linguistics, Proceedings of the 2nd Language and Technology Conference, pages
216-221, Poznań, Poland (2005)
8. Korzycki, M.: Transducer skończenie stanowy jako narzędzie rozpoznawania form tekstowych
wyrazów [The Finite-State Transducer as a Tool for Polish Inflection Form Recognition].
PhD Thesis, AGH (2008)
