Czech is a fusional language, which means, among other things,
that each tag consists of several pairs of characters
which represent morphological features such as the
part of speech (pars orationis), gender (genus
grammaticum), number (numerus grammaticus),
paradigm (paradigma), case (casus grammaticus) or
tense (tempus grammaticum). Some of these
properties are not available to all words, so tags might
be of different lengths. The first character of a pair
identifies the property, whereas the second character
is its value. In the first line of the example with the word
“robot”, the grammatical tag k1gInSc1 is returned.
The meaning of each part of the tag is as follows:
k1 – Character ‘k’ represents the part of
speech. In this case, ‘1’ stands for noun. In
the Czech language, we recognize 10
different parts of speech, see Table 1.
gI – Character ‘g’ stands for the gender,
while ‘I’ specifies masculine inanimate.
nS – Character ‘n’ represents the number,
while ‘S’ stands for singular.
c1 – Character ‘c’ stands for the case of a
particular word. Here, ‘1’ specifies the
first case (nominative). In the Czech language, we
recognize 7 different cases.
In summary, the first morphological result tells us
that one of the basic forms of the word “robot” is
indeed the word “robot” itself, which is a masculine
noun in the singular and in the
nominative.
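
The pair-wise structure of a tag can be illustrated with a short sketch. It is written in Python purely for illustration (the application described in this paper is written in PHP) and assumes, as stated above, that every pair is exactly two characters long: one attribute letter followed by one value character. The names parse_tag, describe and ATTRIBUTE_NAMES are hypothetical and are not part of Majka.

```python
from typing import Dict

# Attribute letters explained in the text above; any other letter is
# passed through unchanged because its meaning is not covered here.
ATTRIBUTE_NAMES = {
    "k": "part of speech",
    "g": "gender",
    "n": "number",
    "c": "case",
}


def parse_tag(tag: str) -> Dict[str, str]:
    """Split a tag such as 'k1gInSc1' into {attribute: value} pairs.

    Assumption: each pair is exactly two characters long.
    """
    if len(tag) % 2 != 0:
        raise ValueError(f"tag {tag!r} is not a sequence of two-character pairs")
    return {tag[i]: tag[i + 1] for i in range(0, len(tag), 2)}


def describe(tag: str) -> Dict[str, str]:
    """Return the same pairs keyed by a readable attribute name."""
    return {
        ATTRIBUTE_NAMES.get(attribute, attribute): value
        for attribute, value in parse_tag(tag).items()
    }


if __name__ == "__main__":
    print(describe("k1gInSc1"))
```

Because tags may be of different lengths, the result is a dictionary rather than a fixed-size record: attributes that do not apply to a word are simply absent.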
2.3 Integration into the Search Engine
The strong point of a fully integrated Majka is that all
of its functions can be called on demand. Thanks to its
application programming interface (API), these functions
can be called and their outputs processed by an existing
PHP application, the OPTIMED CurrMS. This application
is built on the Symfony framework (SensioLabs), which is
designed for the development of web-based applications
and systems; the application as a whole is used for
curriculum development and management.
Our long-term goal is to enhance the existing full-text
search engine with morphological analysis, with
the aim of returning more relevant results for user
queries. We have written two pieces of code
specifically for working with Majka.
The first piece of code is the CallMajka service: it
takes any string as the input and, for each word,
returns the most relevant basic form and a
corresponding tag based on a specific decision
algorithm. This service is the part of the
application responsible for all preparation of
data for Majka and for all post-processing of the data
that Majka produces.
The second piece of code is the
CallMajkaCommand, which is a PHP command that
calls Majka on a single input word. The command is
designed to be the only point at which Majka’s output
enters the application: any action that
communicates with Majka’s interface must use the
functionality implemented in this command.
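
The single-entry-point idea can be sketched as one function through which every call to the analyser passes. The sketch below is in Python rather than PHP and is not the actual CallMajkaCommand. The command line is supplied by the caller (no Majka options are assumed), the word is assumed to be read from standard input, and each output line is assumed to have the form "basic form:tag". The names call_analyser and FAKE_ANALYSER are hypothetical; the latter is a stand-in process so that the sketch runs without Majka installed.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def call_analyser(word: str, command: Sequence[str]) -> List[Tuple[str, str]]:
    """Run the analyser once for a single word; return (basic form, tag) pairs.

    Assumptions: the analyser reads the word from standard input and
    prints one 'basic form:tag' line per analysis.
    """
    if not word or any(ch.isspace() for ch in word):
        raise ValueError("expected exactly one word")
    completed = subprocess.run(
        list(command),
        input=word + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the analyser exits with an error
    )
    pairs = []
    for line in completed.stdout.splitlines():
        lemma, separator, tag = line.strip().partition(":")
        if separator:  # ignore lines that do not match the assumed format
            pairs.append((lemma, tag))
    return pairs


# Stand-in for the real binary: echoes the word with a fixed tag.
FAKE_ANALYSER = [
    sys.executable,
    "-c",
    "import sys; w = sys.stdin.read().strip(); print(w + ':k1gInSc1')",
]

if __name__ == "__main__":
    print(call_analyser("robot", FAKE_ANALYSER))
```

Keeping the process call and the output parsing in one place means that a change in how the analyser is invoked, or in its output format, has to be handled only once.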
3 RESULTS
We were able to integrate and start using Majka
easily, thanks to its well-prepared application
programming interface, robust documentation and
a vocabulary created on demand. The vocabulary
was created from the whole content of the OPTIMED
platform. Therefore, we were sure that the vocabulary
used for our further analysis contained the words most
relevant to the current content.
During the initial phase of integrating Majka into
the web-based application, we created several
analysis outputs. Their main purpose was to show the
power of the morphological analyser for our needs. We
examined the distribution of word types
(parts of speech) in the input, the counts of the most frequent
words and the analyser’s ability to reduce the number
of unique words in the corpus.
3.1 Dataset Transformation
The process of data transformation uses the
CallMajka service, which removes specific
punctuation marks (~`!@#$%^&*()_={}[]:;'<>,.?/)
using a regular expression (regex), converts the text to
lowercase and then splits it into separate words.
Afterwards, the CallMajka service calls Majka on
each word via the command line. Finally, one basic
word form with a corresponding tag for each input
word is selected algorithmically. For the purpose of
enhancing medical query search, we agreed that
nouns, adjectives and verbs in the singular
and in the nominative are the most informative. The
algorithm therefore uses these conditions to decide
which basic form is the most relevant one. If the
vocabulary does not recognise the input word, the
algorithm labels the word as unclassified. The input
dataset that we used for the analysis contained more
than 860,000 words.
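
The transformation steps described above can be sketched as follows. This is a Python illustration under stated assumptions, not the CallMajka service itself. Punctuation marks are replaced by a space so that neighbouring words do not merge. The decision algorithm is not specified in detail, so a simple scoring rule stands in for it: one point each for a preferred part of speech, singular number and nominative case, with the first analysis winning ties. The codes k2 (adjective) and k5 (verb) are assumed from the Majka tagset and should be checked against Table 1. The analyser is passed in as a function, and TOY_ANALYSES is made-up illustrative data, not real Majka output.

```python
import re
from typing import Callable, List, Optional, Tuple

Analysis = Tuple[str, str]  # (basic form, tag)

# The punctuation marks listed in the text, as a character class.
PUNCTUATION = re.compile(r"[~`!@#$%^&*()_={}\[\]:;'<>,.?/]")
PREFERRED_POS = {"1", "2", "5"}  # noun; adjective and verb codes are assumed
UNCLASSIFIED = "unclassified"


def tokenize(text: str) -> List[str]:
    """Strip punctuation, lowercase the text and split it into words."""
    return PUNCTUATION.sub(" ", text).lower().split()


def score(tag: str) -> int:
    """Count how many of the three preference conditions a tag meets."""
    pairs = {tag[i]: tag[i + 1] for i in range(0, len(tag) - 1, 2)}
    return (
        int(pairs.get("k") in PREFERRED_POS)
        + int(pairs.get("n") == "S")
        + int(pairs.get("c") == "1")
    )


def select(analyses: List[Analysis]) -> Optional[Analysis]:
    """Pick the highest-scoring analysis; None if the word is unknown."""
    if not analyses:
        return None
    return max(analyses, key=lambda analysis: score(analysis[1]))


def transform(
    text: str, analyse: Callable[[str], List[Analysis]]
) -> List[Tuple[str, str, str]]:
    """Return (input word, basic form, tag) for every word in the text."""
    result = []
    for word in tokenize(text):
        best = select(analyse(word))
        if best is None:
            result.append((word, word, UNCLASSIFIED))
        else:
            result.append((word, best[0], best[1]))
    return result


# Made-up analyses for illustration only.
TOY_ANALYSES = {
    "robot": [("robot", "k1gInSc4"), ("robot", "k1gInSc1")],
    "roboty": [("robot", "k1gInPc1")],
}

if __name__ == "__main__":
    print(transform("Robot, roboty (xyz)!", lambda w: TOY_ANALYSES.get(w, [])))
```

Passing the analyser in as a function keeps the preprocessing and selection logic testable without a running analyser.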
Using the CallMajka service, we acquired an
output dataset of a similar length but containing
transformed words. A mapping dictionary that