Table 6: Accuracy of various part of speech prediction
methods for missing words.
Method          Accuracy
Posgrams        43.2%
Always noun     28.0%
Random choice   7.1%
State-of-the-art algorithms for word prediction
reported in the literature have been tested with the
Microsoft Sentence Completion Challenge dataset.
This is currently the only complete training-test
dataset specifically developed for evaluating
automatic sentence completion algorithms.
However, because of the limitations discussed in
Section 1 (uniformity of the part of speech within the
answer set of any given question, and parts of speech
that are never represented among the missing words),
this dataset could not be employed. Therefore, in
order to test the full word prediction methodology,
a new questionnaire has been built. Its format is the
same as the one used for the Microsoft Sentence
Completion Challenge: each question is composed of
a sentence with a missing word and five candidate
words as answers. However, our questionnaire is
more general, since it addresses the aforementioned
limitations. First of all, the words of the same answer
set may belong to different parts of speech.
Secondly, the missing word can belong to any part
of speech, including conjunctions, prepositions,
determiners and pronouns. The questionnaire has
been built by selecting 368 random Italian sentences
from the Paisà dataset (Lyding et al., 2014). From
each sentence a question is built by removing a
random word; the other candidate words are chosen
randomly from the nearby sentences. These words
and the removed word constitute the answer set for
the question. We employed three different word
prediction methods for the second step: ngrams,
Latent Semantic Analysis (LSA) (Spiccia et al.,
2015) and random choice. Table 7 shows the results
in terms of accuracy.
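The question-building procedure described above can be sketched as follows. This is only an illustrative sketch, not the code used for the questionnaire: the function name build_question, the whitespace tokenisation, the size of the neighbourhood (window) and the requirement that distractors differ from the removed word are all assumptions.

```python
import random


def build_question(sentences, index, rng, n_candidates=5, window=2):
    """Build one fill-in-the-blank question from sentences[index].

    Assumptions (not taken from the paper): whitespace tokenisation,
    `window` neighbouring sentences on each side, and distractors that
    are distinct from each other and from the removed word.
    """
    tokens = sentences[index].split()
    position = rng.randrange(len(tokens))      # random word to remove
    answer = tokens[position]
    stem = tokens[:position] + ["_____"] + tokens[position + 1:]

    # Pool of distractors: words appearing in the nearby sentences.
    lo = max(0, index - window)
    hi = min(len(sentences), index + window + 1)
    pool = sorted({w for i in range(lo, hi) if i != index
                   for w in sentences[i].split()} - {answer})
    distractors = rng.sample(pool, n_candidates - 1)

    # The answer set is the removed word plus the distractors, shuffled.
    candidates = distractors + [answer]
    rng.shuffle(candidates)
    return {"sentence": " ".join(stem),
            "candidates": candidates,
            "answer": answer}
```

Passing an explicit random.Random instance as rng keeps the sampling reproducible; no part of speech constraint is applied to the distractors, in line with the first property of the questionnaire stated above.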
Table 7: Accuracy of various word prediction methods on
the Italian questionnaire, with (2 steps) and without (1
step) employing the proposed part of speech prediction
algorithm.
Method                      Accuracy
ngrams (2 steps)            51.1%
ngrams (1 step)             50.3%
LSA (2 steps)               30.7%
Random choice (2 steps)     29.3%
LSA (1 step)                25.5%
Random choice (1 step)      20.4%
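The absolute gain of each two-step variant over its one-step counterpart follows directly from the figures in Table 7. The short computation below is only a convenience check on those figures; the variable names are illustrative.

```python
# Accuracy values (in percent) copied from Table 7.
accuracy = {
    "ngrams": {"1 step": 50.3, "2 steps": 51.1},
    "LSA": {"1 step": 25.5, "2 steps": 30.7},
    "Random choice": {"1 step": 20.4, "2 steps": 29.3},
}

# Absolute gain in percentage points, rounded to one decimal place
# to hide floating-point noise.
gains = {
    method: round(acc["2 steps"] - acc["1 step"], 1)
    for method, acc in accuracy.items()
}

for method, gain in gains.items():
    print(f"{method}: +{gain} percentage points")
```

The gains are 0.8, 5.2 and 8.9 percentage points for ngrams, LSA and random choice respectively.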
Each two-step method provides better results than
its single-step counterpart. Since any part of speech,
except punctuation, is admissible in a question's
answer set, most stopwords, such as conjunctions,
prepositions and determiners, had to be included
during the training phase. This has undermined the
quality of the semantic spaces employed by LSA,
leading to lower results than those reported in the
literature for English. Furthermore, Italian is more
agglutinative than English: for example, the word
“accettandoglielo” stands for “accettando esso da
lui”, which means “accepting it from him”; one verb,
two pronouns and a preposition are agglutinated into
a single word.
This hinders the performance of methods, like LSA,
that attempt to find a single fixed-length encoding
for such words: some information will inevitably be
lost unless an ad hoc preprocessing step is taken to
split them. Since these kinds of words are very
frequent in Italian, the obtained results are not
directly comparable with those reported for the
Microsoft Sentence Completion Challenge. Even
though the problem negatively affects the two-step
methodology too, we have purposely not added the
preprocessing step: this has allowed us to assess a
lower bound for the prediction accuracy achievable
by the methodology in the worst-case scenario.
Results show that word prediction methods with
lower accuracy exhibit greater improvements
(+8.9 percentage points for random choice) than
methods with higher accuracy (+0.8 percentage
points for ngrams). In general, the greater the
accuracy of the word prediction method, the more
accurate the part of speech prediction step must be
in order to be advantageous. While this is not
surprising, it should be noted that even a word
prediction algorithm with 50.3% accuracy (i.e.
ngrams) can be improved by a part of speech
predictor with 43.2% accuracy: depending on the
predicted part of speech, the accuracy of the first
step may be greater, and the step is therefore still
advantageous; when this is not the case, the part of
speech prediction is automatically discarded, as
described in Section 5.
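The interaction between the two steps can be illustrated with the sketch below. It is not the procedure of Section 5: the discard rule shown here (keep the part of speech prediction only if its estimated accuracy for the predicted tag exceeds that of the word predictor, and at least one candidate has that tag) is an assumption made for illustration, and every name in the code is hypothetical.

```python
def two_step_predict(candidates, predicted_pos, pos_of, score,
                     pos_precision, word_accuracy):
    """Choose a candidate word, optionally restricted to the predicted tag.

    All arguments are hypothetical stand-ins:
      candidates    -- list of candidate words for the question
      predicted_pos -- tag produced by the part of speech predictor
      pos_of        -- callable mapping a word to its part of speech tag
      score         -- callable giving the word predictor score (higher is better)
      pos_precision -- dict: estimated accuracy of the predictor for each tag
      word_accuracy -- estimated accuracy of the word predictor on its own
    """
    # Step 1: keep only the candidates carrying the predicted tag.
    filtered = [w for w in candidates if pos_of(w) == predicted_pos]

    # Assumed discard rule: trust the tag only when it is estimated to be
    # more reliable than the word predictor and some candidate survives.
    use_pos = (bool(filtered)
               and pos_precision.get(predicted_pos, 0.0) > word_accuracy)

    # Step 2: rank the remaining candidates with the word predictor.
    pool = filtered if use_pos else candidates
    return max(pool, key=score)
```

Under this assumed rule, a predictor whose overall accuracy is below that of the word predictor can still help, because the comparison is made per predicted tag rather than on the overall figure.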
7 CONCLUSION
In this work we presented a word prediction
methodology based on posgrams. The proposed
approach differs from other algorithms by
introducing an additional preparatory step aimed at
predicting the part of speech of the missing word.
The number of candidate words can therefore be
reduced accordingly. This has led to an absolute
accuracy improvement of up to 8.9% as shown by