The main goal of the proposed WMD based Pair-Wise AEE approach is not to reproduce human graders' scores exactly, since human graders themselves vary in their evaluations, but to provide acceptable scores together with immediate and helpful feedback. The proposed AEE approach is compared with approaches based on LSA, WordNet and cosine similarity.
A qualitative evaluation measure based on pairwise ranking, called prank, is also proposed in this paper for assessing the performance of various models with respect to human scoring. To the best of the authors' knowledge, this is the first attempt to evaluate AEE approaches qualitatively; so far, the recent literature has relied only on quantitative evaluation measures (e.g., the mean squared error between human and machine scores).
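As an informal illustration (the formal definition of prank is given later in the paper), a pairwise ranking agreement can be computed as the fraction of essay pairs whose relative order under the machine scores matches their order under the human scores. The function name and the tie handling below are illustrative assumptions.

from itertools import combinations

def pairwise_rank_agreement(human_scores, machine_scores):
    """Illustrative pairwise ranking agreement: the fraction of essay
    pairs ordered the same way by human and machine scores.
    (Not the paper's exact prank definition; ties are simply skipped.)"""
    concordant, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        h_diff = human_scores[i] - human_scores[j]
        m_diff = machine_scores[i] - machine_scores[j]
        if h_diff == 0 or m_diff == 0:
            continue  # skip tied pairs in this simplified sketch
        total += 1
        if (h_diff > 0) == (m_diff > 0):
            concordant += 1
    return concordant / total if total else 0.0

# Example: all 3 comparable pairs are ordered consistently.
print(pairwise_rank_agreement([2, 4, 5], [1.5, 3.9, 4.2]))  # 1.0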
Experiments conducted on real-world datasets, provided by the Hewlett Foundation for a Kaggle challenge, showed the proposed WMD based Pair-Wise AEE approach to be promising: in general, it achieved higher performance than the baseline AEE approaches according to both the quantitative (normalized root mean squared error) and the qualitative (the proposed prank) evaluation measures.
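For reference, a minimal sketch of the quantitative measure is given below; normalizing the RMSE by the human score range is one common convention and is an assumption here, as normalization choices vary.

import numpy as np

def nrmse(human, machine):
    """Root mean squared error between human and machine scores,
    normalized here by the human score range (one common convention)."""
    human, machine = np.asarray(human, float), np.asarray(machine, float)
    rmse = np.sqrt(np.mean((human - machine) ** 2))
    return rmse / (human.max() - human.min())

print(nrmse([2, 4, 5], [1.5, 3.9, 4.2]))  # ~0.183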
The rest of this paper is organized as follows: Sec-
tion 2 reviews existing AEE approaches. In Section 3,
the proposed Word Mover's Distance based Pair-Wise
AEE approach is introduced. Experiments and results
are described in Section 4. Section 5 concludes the
paper and discusses prospective plans for future work.
2 RELATED WORK
Research on automatically evaluating and scoring essay question answers has been ongoing for more than a decade, with Machine Learning (ML) and NLP techniques being used to evaluate such answers.
Project Essay Grade (PEG) was the first AEE
system developed by Ellis Page and his colleagues
(Page, 1966). Earlier versions of this system used 30 computer-quantifiable predictive features to approximate the intrinsic features valued by human markers.
Most of these features were surface variables such as
the number of paragraphs, average sentence length,
length of the essay in words, and counts of other tex-
tual units. PEG has been reported as being able to pro-
vide scores for separate dimensions of writing such as
content, organization, style, mechanics (i.e., mechan-
ical accuracy, such as spelling, punctuation and capi-
talization) and creativity, as well as providing an over-
all score (Shermis et al., 2002; Wang, 2005; Zhang,
2014). However, the exact set of textual features un-
derlying each dimension as well as details concerning
the derivation of the overall score are not publicly dis-
closed (Ben-Simon and Bennett, 2007; Shermis et al.,
2002).
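To make the notion of surface variables concrete, the sketch below computes a few PEG-style features; since PEG's exact feature set is not disclosed, these are merely representative examples.

def surface_features(essay: str) -> dict:
    """A few representative PEG-style surface variables.
    (Illustrative only; PEG's actual feature set is undisclosed.)"""
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = essay.split()
    return {
        "num_paragraphs": len(paragraphs),
        "num_words": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }

print(surface_features("A short essay. It has two sentences.\n\nAnd one more."))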
E-Rater (Attali and Burstein, 2006), the basic technique of which is identical to PEG's, uses statistical and NLP techniques. E-Rater utilizes a vector-space model to measure semantic content. It examines the structure of an essay using transitional phrases, paragraph changes, etc., and examines its content by comparing it to other scored essays. However, if an essay presents a new argument in an unfamiliar argument style, E-Rater will not recognize it.
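Such a vector-space comparison can be illustrated with TF-IDF vectors and cosine similarity (cosine similarity is also one of the baselines this paper compares against); the scikit-learn calls and toy sentences below are illustrative assumptions, not E-Rater's actual implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

graded = ["photosynthesis converts light into chemical energy",
          "plants use sunlight to make glucose"]
new_essay = ["plants convert light energy into glucose"]

vectorizer = TfidfVectorizer()
graded_vecs = vectorizer.fit_transform(graded)  # vector-space model of graded essays
new_vec = vectorizer.transform(new_essay)       # project the new essay into it
print(cosine_similarity(new_vec, graded_vecs))  # similarity to each graded essay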
Intelligent Essay Assessor (IEA), based on LSA,
is an essay grading technique developed in the late
1990s that evaluates essays by measuring semantic
features (Foltz et al., 1999). IEA is trained on a
domain-specific set of essays that have been previ-
ously scored by expert human raters. IEA evaluates
each ungraded essay basically by comparing through
LSA, i.e. how similar the new essay is to those it has
been trained on. By using LSA, IEA is able to con-
sider the semantic features by representing each essay
as a multidimensional vector.
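The sketch below illustrates the general idea behind LSA-based comparison, not IEA's actual implementation: essays are projected into a low-dimensional latent semantic space via truncated SVD of a term-document matrix, and an ungraded essay is compared with graded ones by cosine similarity in that space. The corpus and the number of components are placeholder assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

graded_essays = ["cells divide by mitosis to produce identical copies",
                 "mitosis splits one cell into two identical daughter cells",
                 "the water cycle moves water between oceans and clouds"]
ungraded = ["one cell becomes two identical cells through mitosis"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(graded_essays + ungraded)

# LSA: project term vectors into a 2-dimensional latent semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)

# Compare the ungraded essay (last row) with each graded essay.
print(cosine_similarity(Z[-1:], Z[:-1]))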
IntelliMetric (Shermis and Burstein, 2003) uses a
blend of Artificial Intelligence (AI), NLP and statisti-
cal techniques. IntelliMetric needs to be trained with
a set of essays that have been scored before by human
expert raters. To analyze essays, the system first in-
ternalizes the known scores in a set of training essays.
Then, it tests the scoring model against a smaller set
of essays with known scores for validation purposes.
Finally, once the model scores the essays as desired,
it is applied to new essays with unknown scores.
AEE systems that use LSA ignore the order of words and the arrangement of sentences in their analysis of the meaning of a text, because LSA has no such feature. A text document in LSA is simply treated as a "bag of words", an unordered collection of words. As such, the meaning of a text as derived by LSA is not the same as that which a human being would derive from grammatical and syntactic relations, logic, or morphological analysis. A second problem is that LSA does not deal with polysemy: each word is represented in the semantic space as a single point, and its meaning is the average of all its different meanings in the corpus (Dumais and Landauer, 2008). In this paper, we use the "skip-gram" model of word2vec (Mikolov et al., 2013), which learns word embeddings by training a neural network to predict the context of each word; the resulting semantic vectors help to address the issue of word polysemy (Mikolov et al., 2013; Kusner et al., 2015).
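As a hedged illustration of this pipeline, the sketch below trains skip-gram embeddings (sg=1 in gensim) on a toy corpus and computes the Word Mover's Distance between a student answer and a reference answer; the corpus, hyperparameters, and preprocessing are placeholder assumptions rather than the paper's actual setup (gensim 4.x API).

from gensim.models import Word2Vec

# Toy tokenized corpus; in practice the embeddings would be trained on
# (or loaded from) a much larger collection of text.
corpus = [["plants", "use", "sunlight", "to", "make", "glucose"],
          ["photosynthesis", "converts", "light", "into", "energy"],
          ["cells", "divide", "by", "mitosis"]]

# sg=1 selects the skip-gram architecture (predict context from word).
model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50, seed=0)

# Word Mover's Distance between a student answer and a reference answer:
# the minimum cumulative cost of "moving" one answer's word embeddings
# onto the other's (lower distance means more similar answers).
# Note: wmdistance drops out-of-vocabulary words and relies on an
# optimal-transport backend (the POT package in recent gensim).
student = ["plants", "make", "glucose", "using", "sunlight"]
reference = ["photosynthesis", "converts", "light", "into", "energy"]
print(model.wv.wmdistance(student, reference))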