Figure 7: Synonyms reconstruction via common reference.
For example, consider the article “нежный” (gentle, L₁), which
contains the sense “хрупкий, уязвимый” (fragile, vulnerable, L₁S₁).
This sense has a synonymy reference to the lexeme “мягкий” (soft, L₂).
In addition, the sense L₁S₁ (fragile, vulnerable) has an antonymy
reference to the sense “плохо поддающийся деформации или разделению”
(poorly amenable to deformation or separation, L₃S₁) of the lexeme
“жёсткий” (hard, L₃), and the sense “легко поддающийся нажиму,
деформации” (easily amenable to pressure and strain, L₂S₁) of the
lexeme L₂ also has an antonymy reference to the same sense. It is
therefore possible to replace the “sense-to-lexeme” synonymy link from
the sense L₁S₁ (fragile, vulnerable) to the lexeme L₂ (soft) with a
“sense-to-sense” synonymy reference from the sense L₁S₁ (fragile,
vulnerable) to the sense L₂S₁ (easily amenable to pressure and strain).
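The reconstruction in this example can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `CommonReferenceRule`, the method `resolveSynonym`, the string identifiers and the second sense `L2S2` are hypothetical. The sketch returns every sense of the target lexeme that shares an antonym with the source sense; how to proceed when several senses match is left open here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CommonReferenceRule {

    /**
     * Returns the ids of those senses of the target lexeme that share at least
     * one antonym sense with the source sense.
     *
     * @param sourceAntonyms      antonym sense ids of the source sense (e.g. L1S1)
     * @param targetSenseAntonyms antonym sense ids of each sense of the target lexeme
     */
    static List<String> resolveSynonym(Set<String> sourceAntonyms,
                                       Map<String, Set<String>> targetSenseAntonyms) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> sense : targetSenseAntonyms.entrySet()) {
            boolean sharesAntonym =
                    sense.getValue().stream().anyMatch(sourceAntonyms::contains);
            if (sharesAntonym) {
                matches.add(sense.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // L1S1 (fragile, vulnerable) has an antonymy reference to L3S1.
        Set<String> l1s1Antonyms = Set.of("L3S1");

        // Senses of the lexeme L2 (soft) with their antonymy references.
        Map<String, Set<String>> sensesOfL2 = new LinkedHashMap<>();
        sensesOfL2.put("L2S1", Set.of("L3S1"));
        sensesOfL2.put("L2S2", Set.of()); // hypothetical second sense, not in the paper

        // The sense-to-lexeme link L1S1 -> L2 can be narrowed to L1S1 -> L2S1.
        System.out.println(resolveSynonym(l1s1Antonyms, sensesOfL2)); // prints [L2S1]
    }
}
```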
4 SOFTWARE ARCHITECTURE AND ALGORITHM RESULTS
The developed algorithm was tested on the online Russian
Wiktionary. At the time of testing, the online resource
contained 568910 pages. These pages contain a total of
1285926 senses and 302358 references to other lexemes.
Only 101993 Russian senses were found to have at least
one reference to another lexeme. As mentioned above, the
Russian Wiktionary also contains lexemes from other
languages (English, German, etc., created by robots);
these were omitted. The Russian senses found have 206994
links to other lexemes, and 70% of the 101993 senses
(70202) are single sense lexemes, i.e. they can be subject
to the rule described in section 3.1.
The software was developed in the Java programming
language with the Spring framework. Its architecture
consists of three layers (Fig. 8).
The first layer loads the mark-up contents of Wiktionary
articles through the online REST API, parses them and
converts them into a graph. This layer performs the initial
bulk import of dictionary data and synchronizes the
dictionary daily with the online version, so that the
crowd-sourced articles are kept up to date.
The second layer is the storage layer, based on the
OrientDB (Tesoriero, 2013) DBMS. OrientDB is an open
source database built on a distributed graph engine. It
supports HTTP REST and JSON APIs, which makes it possible
to represent and visualise the deduced semantic relations
in a browser-oriented application. The developed software
uses the Gremlin API (based on SpringData) to connect to
OrientDB. The implementation of lexemes, senses and
semantic references is based on a directed multigraph
(Harary, 1994) model.
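A directed multigraph allows several directed, typed edges between the same pair of vertices. The following in-memory sketch illustrates that model only; it does not use OrientDB or Gremlin, and the names `SemanticGraph`, `Reference`, `Relation`, `addReference` and `countReferences` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SemanticGraph {

    enum Relation { SYNONYM, ANTONYM }

    /** A directed, typed edge between two vertices (lexeme or sense ids). */
    static final class Reference {
        final String from;
        final String to;
        final Relation relation;

        Reference(String from, String to, Relation relation) {
            this.from = from;
            this.to = to;
            this.relation = relation;
        }
    }

    // A list rather than a set: a multigraph may hold parallel edges.
    private final List<Reference> references = new ArrayList<>();

    void addReference(String from, String to, Relation relation) {
        references.add(new Reference(from, to, relation));
    }

    /** Counts the edges that run from one vertex to another, in that direction only. */
    long countReferences(String from, String to) {
        return references.stream()
                .filter(r -> r.from.equals(from) && r.to.equals(to))
                .count();
    }

    public static void main(String[] args) {
        SemanticGraph graph = new SemanticGraph();
        graph.addReference("L1S1", "L2", Relation.SYNONYM);   // sense-to-lexeme
        graph.addReference("L1S1", "L2S1", Relation.SYNONYM); // sense-to-sense
        graph.addReference("L1S1", "L3S1", Relation.ANTONYM);
        System.out.println(graph.countReferences("L1S1", "L2S1")); // prints 1
        System.out.println(graph.countReferences("L2S1", "L1S1")); // prints 0
    }
}
```

Keeping the edges in a list rather than a set is what lets two references between the same pair of vertices coexist.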
Explicit and implicit semantic references are created on
the third layer. It takes the results of mark-up parsing,
inserts them into the graph model and sequentially applies
the rules of the algorithm. The order of the rules is not
important. Rules based on synonymy (3.3 and 3.4) can be
applied multiple times: the algorithm applies them
repeatedly and stops when no new references are created.
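The stopping condition described above is a fixed-point iteration: rules are applied pass after pass until a full pass creates nothing new. A minimal sketch follows, assuming references are encoded as strings of the form `from>to`. The names `RuleEngine`, `applyUntilStable` and `transitive` are hypothetical, and the transitivity rule is a toy stand-in, not rule 3.3 or 3.4 of the paper.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class RuleEngine {

    /**
     * Applies every rule in turn and repeats until a full pass adds nothing.
     * Returns the number of passes, including the final pass that adds nothing.
     */
    static int applyUntilStable(Set<String> references,
                                List<Function<Set<String>, Set<String>>> rules) {
        int passes = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            passes++;
            for (Function<Set<String>, Set<String>> rule : rules) {
                // Each rule sees a snapshot, so it never iterates over a set being modified.
                Set<String> proposed = rule.apply(new HashSet<>(references));
                if (references.addAll(proposed)) {
                    changed = true;
                }
            }
        }
        return passes;
    }

    /** Toy rule (not from the paper): "a>b" and "b>c" yield "a>c". */
    static Set<String> transitive(Set<String> references) {
        Set<String> derived = new HashSet<>();
        for (String first : references) {
            for (String second : references) {
                String[] p = first.split(">");
                String[] q = second.split(">");
                if (p[1].equals(q[0])) {
                    derived.add(p[0] + ">" + q[1]);
                }
            }
        }
        return derived;
    }

    public static void main(String[] args) {
        Set<String> references = new HashSet<>(Set.of("a>b", "b>c", "c>d"));
        List<Function<Set<String>, Set<String>>> rules = List.of(RuleEngine::transitive);
        int passes = applyUntilStable(references, rules);
        // Pass 1 adds a>c and b>d, pass 2 adds a>d, pass 3 adds nothing.
        System.out.println(passes + " passes, " + references.size() + " references");
    }
}
```

Because references are stored in a set, a rule that proposes an already known reference does not count as progress, which guarantees termination as long as the number of possible references is finite.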
During the first run, the program downloads all existing
pages of the Russian Wiktionary, which takes 30 hours.
Then it applies the developed rules in a single thread on
a 3.3 GHz CPU, which takes a little more than 4 hours.
The first step of the algorithm creates mutual sense
references as described in section 3.2. We found 32562
(16%) such references between Russian lexemes’ senses.
Next, we run the rule that finds single sense lexemes
(3.1). It converts sense-to-lexeme references into
sense-to-sense references and creates 57814 explicit
references. As mentioned above, this rule also creates
implicit (complement) references based on the explicit
ones. In total, the rule creates 115628 references that
cover 56% of all Russian lexemes. The execution time of
rule 3.2 was less than 15 seconds. The third step, rule
3.3, reconstructs 5037 references between senses, which
is 2.4% of the initial number of links of the Russian
senses. The next rule (3.4) adds a further 1470
references (0.7%).
We applied rules 3.3 and 3.4 again and obtained 390 and
10 new references in 1 and 30 seconds, respectively. The
third iteration of rules 3.3 and 3.4 yielded 28 and 4
references in approximately the same time. The fourth
iteration produced no new references.
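The single sense rule (3.1) used in the second step can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `SingleSenseRule` and `apply` and the identifiers in the example are hypothetical, and the sketch assumes that the implicit (complement) reference is the reverse of the explicit one, which is consistent with the total being twice the number of explicit references (2 × 57814 = 115628) but is not spelled out in this section.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SingleSenseRule {

    /**
     * Converts a sense-to-lexeme reference into sense-to-sense references when the
     * target lexeme has exactly one sense. Each result is a pair {fromSense, toSense}.
     */
    static List<String[]> apply(String sourceSense, String targetLexeme,
                                Map<String, List<String>> sensesByLexeme) {
        List<String[]> created = new ArrayList<>();
        List<String> senses = sensesByLexeme.getOrDefault(targetLexeme, List.of());
        if (senses.size() == 1) {
            String targetSense = senses.get(0);
            created.add(new String[] {sourceSense, targetSense}); // explicit reference
            created.add(new String[] {targetSense, sourceSense}); // complement (assumed reverse)
        }
        return created; // empty when the target lexeme is ambiguous or unknown
    }

    public static void main(String[] args) {
        Map<String, List<String>> sensesByLexeme = new HashMap<>();
        sensesByLexeme.put("B", List.of("B1"));       // single sense lexeme
        sensesByLexeme.put("C", List.of("C1", "C2")); // two senses: rule does not apply

        System.out.println(apply("A1", "B", sensesByLexeme).size()); // prints 2
        System.out.println(apply("A1", "C", sensesByLexeme).size()); // prints 0
    }
}
```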