The data in table 3 shows that Soundex works quite well for this example, and for many others as well. However, the start of the string is important, and the Soundex algorithm, which was developed with English in mind, fails for instance to find any significant equality for words beginning with “i” or “j”. A single symbol plus a three-digit representation might also be too narrow to catch more subtle similarities or differences. Soundex might be more useful than some string metrics, but unless a version is developed that takes the pronunciation of Old Swedish into account, it may be unsatisfactory as a working tool.
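To make this concrete, the sketch below implements the classic (English) Soundex encoding in Python, one letter plus three digits. The function name and the example spellings are illustrative choices made here, not taken from the study, but they show why variants that differ only in an initial “i” or “j” never receive matching codes.

```python
def soundex(word: str) -> str:
    """Minimal sketch of classic Soundex: first letter kept, rest mapped to 3 digits."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    word = word.lower()
    if not word:
        return ""
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")   # the first letter's code also blocks duplicates
    for ch in word[1:]:
        if ch in "aeiouy":          # vowels are dropped but reset the duplicate check
            prev = ""
            continue
        if ch in "hw":              # h and w are dropped without resetting it
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

# The first letter is kept verbatim, so spellings that differ only in an
# initial "i"/"j" get different codes and never match:
print(soundex("iomfru"), soundex("jomfru"))   # I516 J516
```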
3.3.2 The Winkler-Jaro Distance
Another string measure is the Winkler-Jaro distance. Despite its name, it is not a true metric, and is better described as a similarity than as a distance (Winkler, 1990). It is computed for word pairs, with the resulting value 1 for perfectly equal strings and 0 for completely unequal ones (i.e. strings sharing no characters). The input parameters for the measure are the string lengths, the number of matching characters and the number of transpositions.
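For reference, these parameters are commonly combined as follows (a standard formulation in line with Winkler, 1990, not quoted from the text itself), where m is the number of matching characters, t the number of transpositions, |s1| and |s2| the string lengths, ℓ the length of the common prefix (at most four characters) and p a prefix weight, typically 0.1:

\[
\mathrm{sim}_{J}(s_1,s_2)=
\begin{cases}
0 & \text{if } m=0,\\
\dfrac{1}{3}\left(\dfrac{m}{|s_1|}+\dfrac{m}{|s_2|}+\dfrac{m-t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_{W}=\mathrm{sim}_{J}+\ell\,p\,\bigl(1-\mathrm{sim}_{J}\bigr).
\]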
Table 4: Pairwise similarity values for “jomfru” (“virgin”).
Word forms Winkler-Jaro
umfru - iomffrv 0.6429
The computation of this measure can yield any floating-point number between zero and one, so its discriminating power should perhaps be better than that of both Levenshtein and Soundex. In table 4 we see as an example the pairwise Winkler-Jaro values for the word “jomfru” (Eng. “virgin” or “maiden”) in its different spelling variants in the corpora.
As can be seen from the table, the Winkler-Jaro measure gives high scores for these related word pairs, and the pairs also appear close to each other in the full listing. Seemingly, this measure might be a good choice for finding and grouping related word forms together.
The relation may then be either a matter of
spelling variation or a closeness due to inflectional
causes. In both cases, this information is helpful in
inventorying the text and giving clues for lexicon
look-ups, either manual or automated.
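As an illustration of how such a similarity could be computed and used to compare spelling variants, the sketch below implements the Jaro similarity with Winkler's prefix adjustment in Python. The function names, the 0.1 prefix weight and the choice of example forms are assumptions made here, not part of the original study, and small implementation differences (matching window, transposition counting) mean that the exact scores may deviate slightly from those reported in table 4.

```python
from itertools import combinations

def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 when nothing matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and no further apart than half the longer length.
    window = max(len1, len2) // 2 - 1
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 +
            (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler's adjustment: reward a shared prefix of up to four characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)

# Pairwise scores for some of the spelling variants mentioned in the text;
# exact figures depend on implementation details.
for a, b in combinations(["jomfru", "umfru", "iomffrv"], 2):
    print(f"{a} - {b}: {jaro_winkler(a, b):.4f}")
```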
3.4 Stop Word List
Stop word lists are used for removing non-specific or uninteresting words from a given text. Such a list typically consists of some of the most frequent words in a language, belonging to closed word classes such as determiners, pronouns, prepositions and conjunctions. Auxiliary verbs might also be included in such lists.
For use here, a stop word list was constructed by
examining the frequency list of the corpora. The
principles used for choosing words were in
accordance with the general ideas behind stop word
lists and resulted in a list of 74 specific words:
"honom", "hans", "a", "ok", "oc", "han", "hon",
"at", "mz", "the", "them", "ther", "swa", "af", "aff",
"ey", "foer", "i", "j", "ii", "jak", "jac", "thz", "til",
"vm", "vtan", "som", "sit", "sin", "sina", "sinom",
"aar", "aeftir", "aen", "aer", "alle", "alt", "aat", "een",
"enkte", "for", "haenna", "haenne", "hanom",
"hanum", "henna", "henne", "hona", "hulkin",
"hwar", "hwat", "iak", "mik", "sidhan", "sidhe", "sik",
"tel", "tha", "thaen", "thaer", "thaes", "then", "thera",
"thik", "thin", "tho", "thu", "thy", "tik", "war",
"wara", "wardth", "wilde" and "hafdhe”.
Further, variants of the name Maria were brought
together into the most frequent form.
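A minimal sketch of how this filtering step could look in Python is given below. The token sequence, the subset of stop words and the set of Maria variants are illustrative assumptions made here, not taken from the project's actual material or code.

```python
from collections import Counter

# Illustrative subset of the 74-word stop list given above (the real
# pipeline would use the full list).
STOP_WORDS = {"ok", "oc", "han", "hon", "at", "mz", "the", "i", "j", "aff"}

# Hypothetical spelling variants of the name Maria, merged into the most
# frequent form as described in the text.
MARIA_VARIANTS = {"maria", "marie", "mariam"}

def filter_tokens(tokens):
    """Remove stop words and conflate Maria variants, then count what remains."""
    kept = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        if token in MARIA_VARIANTS:
            token = "maria"   # most frequent form (illustrative choice)
        kept.append(token)
    return Counter(kept)

# Toy usage with a made-up token sequence, not a quotation from the corpora.
print(filter_tokens(["Maria", "ok", "jomfru", "han", "Marie", "jomfru"]))
# -> Counter({'maria': 2, 'jomfru': 2})
```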
The result of applying this procedure to the already normalised text of the corpora, as described previously, can be seen in figure 3 below. Words with lexical meaning now appear that seem to be characteristic of the texts in the corpora. This is probably the best we can accomplish in terms of text analysis for finding key words in the absence of a reference corpus defining what would constitute a “normal” text in Old Swedish.