used. In our case, the normalization is a Euclidean
distance norm, resulting in a roughly sigmoid shape
for the graph.
Experiments reproduce the predicted behavior,
where increasing sparseness in A causes increasing
deviations from purely monotonic behavior. If we
find 5% deviations acceptable (that is, 5% of the
works receive an out-of-order q_j), the matrix A can
be as sparse as 10%. In other words, for a cluster
size as large as 50, with five reviews per work, the
algorithm is capable of finding an approximation to
the correct order with no more than 5% errors.
It turns out, however, that the outcome of these tri-
als is sensitive to the assumptions with respect to the
precise form of (3). If we assume students are slightly
more competent in reviewing, the performance of the
algorithm is drastically better; if students are less
competent, the performance is considerably worse—
which is not too unexpected. Therefore, although
these model trials suggest a cluster size of 50 with
five reviewed works per student, we may want to be
a bit more conservative when we actually implement
the scenario for the first time.
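To make the structure of these model trials concrete, the
following minimal sketch simulates a single cluster of 50
students, each reviewing five works, so that A is about 10%
filled. The noise model standing in for (3) and the update
rules used here (q as a competence-weighted mean of the
reviews, c as one minus a normalized Euclidean distance to
the consensus) are simplified stand-ins chosen for
illustration only; the numerical outcome depends entirely on
these assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_works = 50          # cluster size used in the trials above
reviews_per_work = 5  # five reviews per work, so A is 10% filled
noise = 0.02          # stand-in for the reviewer-error model of (3)

# Ground truth for the simulation: true qualities and reviewing competences.
q_true = rng.uniform(0.0, 1.0, n_works)
c_true = rng.uniform(0.5, 1.0, n_works)

# Sparse review matrix A: A[i, j] is student i's score for work j, np.nan
# where no review was made.  A cyclic assignment guarantees exactly five
# reviews per work and per student.
A = np.full((n_works, n_works), np.nan)
for i in range(n_works):
    for k in range(1, reviews_per_work + 1):
        j = (i + k) % n_works
        A[i, j] = q_true[j] + noise * (1.0 - c_true[i]) * rng.standard_normal()

# Self-consistent iteration with simplified update rules: q_j is the
# competence-weighted mean of the scores for work j; c_i is one minus the
# normalized Euclidean distance between student i's scores and the consensus.
q = np.nanmean(A, axis=0)
c = np.ones(n_works)
for _ in range(100):
    w = np.where(np.isnan(A), 0.0, c[:, None])
    q = np.sum(w * np.nan_to_num(A), axis=0) / np.maximum(w.sum(axis=0), 1e-12)
    for i in range(n_works):
        mask = ~np.isnan(A[i])
        d = np.linalg.norm(A[i, mask] - q[mask]) / np.sqrt(mask.sum())
        c[i] = max(1.0 - d, 1e-3)

# Deviation from purely monotonic behavior: the fraction of adjacent pairs
# (in the true order) whose estimated q_j values come out inverted.
order = np.argsort(q_true)
deviations = np.mean(np.diff(q[order]) < 0)
print(f"fraction of out-of-order adjacent pairs: {deviations:.2%}")
```

Making the simulated reviewers more or less competent (here,
by changing noise) changes this fraction strongly, which is
the sensitivity to the precise form of (3) noted above.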
5 DISCUSSION
An algorithm for calculating students’ reviewing
competence and the quality of students’ work may
be a necessary ingredient of peer reviewing, but it is
definitely not sufficient. In (van Zundert, 2012), ed-
ucational considerations regarding peer reviewing are
studied. In this section, we list a number of assump-
tions that should hold for an algorithm like the present
one to be trustworthy.
• Peer groups should be unbiased and uncorrelated
so that every assessment can be seen as an in-
dependent measurement of each student’s perfor-
mance. Careful randomization helps to remove
correlations; bias is more subtle to deal with,
though. For instance, in the case of misconceptions
shared by a majority of the students (‘homework is
boring’), correct answers (such as ‘homework is
exciting’) may score systematically low, and the
algorithm has no means to detect this error. It will
manifest itself in an order, calculated by the
algorithm, that consistently differs from the order
obtained by staff. In preparing assignments,
therefore, questions with likely answers that are
objectively ‘wrong’, but that could result from
collectively shared misconceptions, should be avoided.
Rather, assignments should be such that students
can base their scores on how much detail is pro-
vided, how elaborate an answer is, how clearly the
answer has been written, how convincing the an-
swer is, et cetera.
• Peer ranking should be applied to a series of as-
signments rather than a single assignment, so that
statistical evidence can be used to assess the reli-
ability of the final outcome. Statistical evidence
could be, e.g., the standard deviation σ of the
marks over a series of assignments in one term.
If σ decreases as one over the square root of the
number of assignments, N, it may be the case that the
outcome indeed measures students’ performance
during that term. In case σ does not decrease with
increasing N when averaging over the series of as-
signments, the per-assignment scores apparently
do not measure the actual performance level of
a student, and the peer review gives no informa-
tion about this level. There could be various rea-
sons for such inconclusive outcomes: perhaps the
assignments do not accurately measure students’
performance levels, or students’ performance levels
vary wildly over the term. From a methodological
point of view, it would be good to include the σ’s
in the final marks.
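To make this check concrete, the sketch below reads σ as the
standard error of a student’s running mean mark and compares
two simulated situations: a stable performance level with
independent noise per assignment, and a level that drifts
over the term. The mark generator, all numbers, and the
reading of σ as a standard error are assumptions made purely
for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n_students, n_assignments = 30, 12
N = np.arange(2, n_assignments + 1)

def sigma_curve(marks):
    """sigma_N: spread of each student's first N marks, read as the standard
    error of the running mean mark, averaged over the class."""
    return np.array([marks[:, :n].std(axis=1, ddof=1).mean() / np.sqrt(n)
                     for n in N])

# Case 1: a stable per-student level plus independent noise, i.e. the case
# in which the marks do measure a constant performance level.
level = rng.uniform(5.0, 9.0, n_students)
stable = level[:, None] + 0.8 * rng.standard_normal((n_students, n_assignments))

# Case 2: a performance level that drifts over the term (random walk).
drifting = level[:, None] + np.cumsum(
    0.8 * rng.standard_normal((n_students, n_assignments)), axis=1)

for name, marks in (("stable level", stable), ("drifting level", drifting)):
    sigma = sigma_curve(marks)
    slope = np.polyfit(np.log(N), np.log(sigma), 1)[0]
    print(f"{name:15s} log-log slope of sigma versus N: {slope:+.2f}")
# A slope around -0.5 is consistent with 1/sqrt(N) scaling; a slope near 0
# corresponds to the inconclusive case described above.
```

In the first case σ shrinks roughly as 1/√N; in the second it
does not, which is the situation in which the per-assignment
scores tell us little about a stable performance level.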
6 POSSIBLE VARIATIONS
We briefly present three possible variations.
1. The algorithm calculates both c and q from
scratch, using the matrix A as only input. We
may expect, however, that the students’ reviewing
competence will not vary much over time. This
suggests bootstrapping the algorithm with the
results of the first week and using the c found
there as a first estimate in the next week. We may
even consider using the running average of the c’s
over subsequent weeks, representing the intuition
that we obtain increasingly accurate estimates of
the individual students’ reviewing competence.
2. Teachers may consider having one or more ‘example
elaborations’ reviewed, unknowingly, by the
students. Since works are reviewed anonymously,
students will not know that they review a teacher’s
work instead of one of their peers’. Assuming that
the teacher’s works are of unsurpassed quality, the
associated q_j must keep a constant value of 1
during the iterations. Therefore, they
serve to further stabilize the algorithm.
3. Despite the efficiency improvement offered by the
algorithm, reviewing still requires works to be as-
sessed by teachers, which takes time. To reduce
waiting time for students, feedback can be given
in three tiers. The first tier is immediately after the