more than two letters long, we try to split it into two
parts. We accept a two-part candidate
if the word combination is found in the bigram
dictionary built on data from the general corpus.
Our first approach was to check whether the two
separated words are accepted by the spelling
checker. Unfortunately, this approach produced
candidates that never occur in real text. The
typical faults are a prefix separated
from a prefixed verb, or the two parts of a compound
written separately (in Latvian, compound parts must
always be written together). For example, the verb
aplikt (‘to put on’) is split into ap (‘around’) and
likt (‘to put’), and the compound asinsvadi (‘blood
vessels’) is split into asins (‘blood’) and vadi
(‘wires’ or ‘cords’). These results prompted us to change
the algorithm and perform the lookup in the bigram
dictionary instead.
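The bigram-based splitting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name split_candidates and the toy bigram set are assumptions, and a real bigram dictionary would be built from the general corpus.

```python
def split_candidates(word, bigrams):
    """Return two-part splits of `word` that are attested in `bigrams`.

    `bigrams` is assumed to be a set of (left, right) word pairs
    collected from a corpus; its contents here are toy data.
    """
    candidates = []
    if len(word) <= 2:  # only words longer than two letters are split
        return candidates
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if (left, right) in bigrams:
            candidates.append(left + " " + right)
    return candidates


# Hypothetical toy bigram dictionary (illustration only).
bigrams = {("paldies", "par")}

print(split_candidates("paldiespar", bigrams))  # ['paldies par']
print(split_candidates("aplikt", bigrams))      # []
```

Because the pair (ap, likt) is not attested as a bigram, the prefixed verb aplikt is left intact, which is the behaviour the spelling-checker test failed to provide.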
3.1.7 Words with Same Root
Latvian is an inflectional language;
most word-forms are formed by combining a root
and an ending. We search for the candidates in a list
of correct and popular words. We accept
candidates that have one or two extra letters at the
end compared to the current word. We cut the
typical endings from the end of the current word (the
single letter ‘a’, ‘e’, ‘i’ or ‘u’, or two letters if the
last one is ‘s’ or ‘m’) and search for candidates
that differ in length from the current word by no
more than two symbols. We also add the same-root
base form of the word supplied by the spelling
checker module. For example, for the word komanda
(‘a command’ in singular nominative), the following
candidates are chosen: komandai (‘a command’ in
singular dative), komandas (‘a command’ in singular
genitive or plural nominative), komandu (‘a
command’ in singular accusative or plural genitive),
komandām (‘a command’ in plural dative), komandē
(‘commands’, a verb in the 3rd person), komandēt (‘to
command’).
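One possible reading of this procedure is sketched below. The helper names strip_ending and same_root_candidates are hypothetical, the word list is a toy stand-in for the list of correct and popular words, and the base form supplied by the spelling checker is omitted.

```python
def strip_ending(word):
    """Cut a typical ending: two letters if the word ends in 's' or 'm',
    one letter if it ends in 'a', 'e', 'i' or 'u'."""
    if len(word) > 2 and word[-1] in "sm":
        return word[:-2]
    if len(word) > 1 and word[-1] in "aeiu":
        return word[:-1]
    return word


def same_root_candidates(word, vocabulary):
    """Return words sharing the stripped stem whose length differs
    from `word` by no more than two symbols."""
    stem = strip_ending(word)
    return sorted(
        cand for cand in vocabulary
        if cand != word
        and cand.startswith(stem)
        and abs(len(cand) - len(word)) <= 2
    )


# Toy vocabulary assembled from the words used in this paper.
vocabulary = {
    "komandai", "komandas", "komandu", "komandām",
    "komandē", "komandēt", "aplikt", "asinsvadi",
}

print(same_root_candidates("komanda", vocabulary))
```

With this toy vocabulary, the call returns the six komand- forms listed in the example above and skips the unrelated words.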
3.1.8 Diacritic Marks
The diacritic restoration module tries to add
diacritic marks to every character in the current
word. Correctness of the newly constructed word is
checked with the spelling checker module. The
words with correct diacritics are added to the
candidate list. With this method, for the incorrect
word speletajs, the correct candidate spēlētājs (‘a
player’) is generated. Likewise, for the correct
nominative form attiecības (‘relationship’), the locative
form attiecībās is generated.
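The diacritic restoration step can be illustrated with the sketch below. It assumes the eleven standard Latvian letter pairs and uses a small lexicon in place of the spelling checker module; the function name diacritic_candidates and the is_correct callback are hypothetical.

```python
from itertools import product

# Latvian base letters and their diacritic counterparts.
DIACRITICS = {
    "a": "ā", "c": "č", "e": "ē", "g": "ģ", "i": "ī", "k": "ķ",
    "l": "ļ", "n": "ņ", "s": "š", "u": "ū", "z": "ž",
}


def diacritic_candidates(word, is_correct):
    """Try every combination of plain and diacritic letters and keep
    the variants accepted by `is_correct` (a stand-in spelling check)."""
    options = [
        (ch, DIACRITICS[ch]) if ch in DIACRITICS else (ch,)
        for ch in word
    ]
    variants = ("".join(chars) for chars in product(*options))
    return sorted(v for v in variants if v != word and is_correct(v))


# Toy lexicon standing in for the spelling checker (assumption).
lexicon = {"spēlētājs", "attiecības", "attiecībās"}

print(diacritic_candidates("speletajs", lexicon.__contains__))   # ['spēlētājs']
print(diacritic_candidates("attiecības", lexicon.__contains__))  # ['attiecībās']
```

The number of variants doubles with each letter that can carry a diacritic mark, so an exhaustive enumeration of this kind is practical only for words of ordinary length.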
3.2 Ranking of the Candidates
To rank the candidates, we employ features
that describe each candidate. The list of
features is similar to that used in the
normalisation system MoNoise (van der Goot et al.,
2017). For every candidate, a feature vector is
constructed containing the following values:
• a binary value (‘0’ or ‘1’) signalling
  if the candidate is the original word;
• the candidate’s and the original word’s cosine
  similarity in the vector space and the rank of the
  candidate in the list of the 20 most similar words if
  the candidate is supplied by the word
  embeddings module;
• a binary value signalling if the candidate is
  generated by the spelling checker module and the
  candidate’s rank among the other correction
  candidates that the spelling checker generates for the
  misspelled original word;
• the number of times the original word is changed
  to the particular candidate in the dictionary of
  corrections built on the basis of the training data;
• a binary value signalling if the candidate is
  created by changing some final characters of the
  original word, i.e. has the same root, or by
  adding diacritic marks to some letters of the
  original word;
• a binary value signalling if the candidate is
  created by splitting the original word;
• the candidate’s unigram probability in the
  general corpus and in the Twitter corpus;
• the candidate’s bigram probabilities with the left
  and the right adjoining word in the general
  corpus and in the Twitter corpus;
• a binary value signalling if the candidate is in the
  good and popular word list;
• a binary value signalling if the original word and
  the candidate have a matching symbol order;
• the length of the original word and the candidate;
• a binary value signalling if the original word and
  the candidate consist of valid UTF-8
  symbols and are not e-mail addresses or Web links.
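To make the structure of such a vector concrete, the sketch below builds a small subset of the listed values. It is an illustration only: the function name feature_vector, the choice of features, their order, and the toy counts are assumptions, and the unigram probability is estimated simply as a relative frequency.

```python
def feature_vector(original, candidate, unigram_counts):
    """Build a partial feature vector for one candidate.

    Covers only a few of the listed features: the 'is original' flag,
    the 'created by splitting' flag, a unigram probability, and the
    lengths of the original word and the candidate.
    """
    total = sum(unigram_counts.values())
    is_split = " " in candidate and candidate.replace(" ", "") == original
    return [
        1 if candidate == original else 0,
        1 if is_split else 0,
        unigram_counts.get(candidate, 0) / total if total else 0.0,
        len(original),
        len(candidate),
    ]


# Toy corpus counts (invented for illustration).
unigram_counts = {"spēlētājs": 3, "komanda": 1}

print(feature_vector("speletajs", "spēlētājs", unigram_counts))
# [0, 0, 0.75, 9, 9]
```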
The random forest classifier algorithm creates an
ensemble of decision trees that take different features
into account. Different trees are responsible for
different normalisation actions. The classifier ranks
every candidate at every position in a given text
string (see Table 4). The candidates with the top score
at each position form the normalised text string.
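The final selection step can be sketched as follows. The scores are invented for illustration; in the described system they would come from the random forest classifier. The function name normalise and the input layout (one candidate-to-score mapping per position) are assumptions.

```python
def normalise(scored_positions):
    """Pick the top-scoring candidate at every position and join them.

    `scored_positions` is a list with one dict per token position,
    mapping each candidate string to its classifier score.
    """
    return " ".join(max(scores, key=scores.get) for scores in scored_positions)


# Invented scores for a two-token input "speletajs komanda".
scored_positions = [
    {"speletajs": 0.1, "spēlētājs": 0.9},
    {"komanda": 0.8, "komandai": 0.2},
]

print(normalise(scored_positions))  # spēlētājs komanda
```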
Chat Language Normalisation using Machine Learning Methods