of which were written about one of two topics. From
the results of the two-way ANOVA test, they concluded that there is a significant correlation between stylometric features and the topic of a text, and that the use of such features for authorship attribution over multi-topic corpora should be approached with caution.
The second study, conducted by Koppel, Schler, and Bonchek-Dokow, examined the effect of topic variability on authorship attribution using an unmasking technique (Koppel et al., 2008). The intuition behind this technique is to gauge how quickly cross-validation accuracy degrades as the features that best distinguish two classes are iteratively removed. They used
a corpus of 1,139 Hebrew-Aramaic legal query re-
sponse letters written by three distinct authors about
three distinct topics. They concluded that it is more
difficult to distinguish writings by the same author on
different topics than writings by different authors on
the same topic.
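To make the idea concrete, the following is a minimal sketch of an unmasking curve, assuming scikit-learn, a simple bag-of-words feature set, and a linear SVM; the feature budget and removal schedule are illustrative choices, not the exact configuration of Koppel et al. (2008).

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmasking_curve(docs_a, docs_b, iterations=10, drop_per_iter=3):
        """Cross-validation accuracy after each round of feature removal."""
        texts = list(docs_a) + list(docs_b)
        labels = np.array([0] * len(docs_a) + [1] * len(docs_b))
        X = CountVectorizer(max_features=250).fit_transform(texts).toarray().astype(float)
        keep = np.arange(X.shape[1])          # indices of surviving features
        curve = []
        for _ in range(iterations):
            acc = cross_val_score(LinearSVC(), X[:, keep], labels, cv=5).mean()
            curve.append(acc)
            clf = LinearSVC().fit(X[:, keep], labels)
            top = np.argsort(np.abs(clf.coef_[0]))[::-1][:drop_per_iter]
            keep = np.delete(keep, top)       # drop the most discriminative features
        return curve  # a fast drop suggests only superficial differences between the classes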
The third study, conducted by Corney (Corney,
2003), showed that the topic did not adversely affect
the identification of the author in e-mail messages.
In order to support this claim, Corney used a cor-
pus of 156 e-mail messages from three distinct au-
thors about three distinct topics. He then developed a
model for each of the three authors, using one of the
three topics. Next, he used a support vector machine
to test for authorship on e-mails from the remaining
two topics. He reported a success rate of approxi-
mately 85% when training on one topic and testing on
the others, which was consistent with the rate of suc-
cess for authorship attribution across all topics. We
attribute Corney’s results to the length and structure
of e-mail communications. Often, the words most discriminative of topic appear in the subject line of an e-mail; therefore, if only the body of the e-mail is evaluated, the impact of content-specific words could easily be negligible.
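As an illustration of this cross-topic protocol, the sketch below trains an author classifier on documents from a single topic and tests it on the remaining topics; the character n-gram features, the scikit-learn pipeline, and the function name are our assumptions, not Corney's actual setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def cross_topic_accuracy(docs, train_topic):
        """docs: list of (text, author, topic) tuples; train on one topic, test on the rest."""
        train = [(t, a) for t, a, topic in docs if topic == train_topic]
        test = [(t, a) for t, a, topic in docs if topic != train_topic]
        model = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),  # assumed feature set
            LinearSVC(),
        )
        model.fit([t for t, _ in train], [a for _, a in train])
        preds = model.predict([t for t, _ in test])
        truth = [a for _, a in test]
        return sum(p == y for p, y in zip(preds, truth)) / len(truth)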
In contrast to the results obtained by Corney (Corney, 2003), the fourth study, by Madigan et al. (Madigan et al., 2005), tested the effect of topic on authorship attribution with 59 Usenet postings by two distinct authors on three distinct topics. As in Corney's study,
they constructed a model of each author on one of
the three topics and tested for authorship on postings
written about the remaining two topics. Their results
demonstrated poor performance for a unigram model; however, their bigram part-of-speech model proved to be one of the best among the models they tested.
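For reference, a bigram part-of-speech representation of a document can be computed along the following lines; NLTK's default tokenizer and tagger are used purely for illustration and are not necessarily what Madigan et al. used.

    from collections import Counter
    import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' models

    def pos_bigram_counts(text):
        """Map a document to counts of adjacent part-of-speech tag pairs."""
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
        return Counter(zip(tags, tags[1:]))   # e.g. ('DT', 'NN') -> 3

    # These counts can replace word unigrams as input to any standard classifier.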
Finally, the fifth study, conducted by Baayen et
al. (Baayen et al., 1996), used principal compo-
nents analysis (PCA) and linear discriminant analy-
sis (LDA) to evaluate the effectiveness of grouping
text by author, using stylometric features. Their data
set consisted of 576 documents written by eight stu-
dents. Each student wrote a total of 24 documents
in three different genres about three different top-
ics. They found that compensating for imbalanced topic coverage led to increased cross-validation performance.
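A minimal sketch of this kind of analysis, assuming scikit-learn and a small, illustrative set of function-word frequencies as the stylometric features (not Baayen et al.'s exact feature set), might look as follows.

    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.feature_extraction.text import CountVectorizer

    # A tiny, illustrative list of topic-neutral function words.
    FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "it", "is", "was"]

    def fit_author_model(texts, authors, n_components=5):
        """Project function-word counts with PCA, then discriminate authors with LDA."""
        vec = CountVectorizer(vocabulary=FUNCTION_WORDS)
        X = vec.fit_transform(texts).toarray().astype(float)   # dense matrix for PCA
        pca = PCA(n_components=n_components)
        lda = LinearDiscriminantAnalysis().fit(pca.fit_transform(X), authors)
        return vec, pca, lda   # apply vec, then pca, then lda.predict to new texts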
The recent review by Stamatatos (Stamatatos,
2009) points to a small number of additional similar
studies. A key difference between our data set and
those used by previous researchers is size. The num-
ber of observations in our multi-category data set is
much larger than any of the previous examples. In
addition, our multi-category data set has many more
topics, which we will later argue is advantageous in a
novel topic evaluation. The nature of our evaluation also differs somewhat in that it simulates what happens when an author attribution classifier encounters a new topic. Many of the previous studies are “in-sample” analyses or examine other questions pertaining to topics.
2.2 Evaluation Methodology
Typical evaluations of author attribution divide a cor-
pus into a train/test split. In some cases standard-
ized train/test splits have been developed for repro-
ducibility (Stamatatos et al., 2000). When developing
an evaluation, researchers typically attempt to control for factors that can influence the outcome. In addition to topic (the focus of the current work), age, sex, or other attributes of the author may have predictive power that needs to be controlled for. In our opinion, a consensus has formed within the literature that an evaluation should ideally have a balanced number of documents per author in the test set; this greatly simplifies the interpretation of test-set accuracy. In
practice, requiring data set balance limits the qualified
data sets available to the author attribution researcher.
In particular, it is challenging to locate a data set with
many authors writing very many documents if we re-
quire these authors to write on the same topics and
with the same frequency.
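As a sketch of this balance requirement, the helper below draws an equal number of test documents per author; the function name and the per-author count are illustrative assumptions.

    import random
    from collections import defaultdict

    def balanced_split(docs, per_author_test=10, seed=0):
        """docs: list of (text, author). Returns (train, test) with an equal
        number of test documents per author, so test accuracy is easy to interpret."""
        by_author = defaultdict(list)
        for doc in docs:
            by_author[doc[1]].append(doc)
        rng = random.Random(seed)
        train, test = [], []
        for author_docs in by_author.values():
            rng.shuffle(author_docs)
            test.extend(author_docs[:per_author_test])
            train.extend(author_docs[per_author_test:])
        return train, test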
In his recent review of author attribution methods, Stamatatos comments on evaluation: “Ideally, all the texts of the training corpus should be on exactly the same topic for all the candidate authors.” (Stamatatos, 2009). This advice is important for certain aspects of algorithm evaluation. However, we see the field
of author attribution progressing by embracing topic
and social distinctions as a source of complexity
with scientifically (and functionally) interesting con-
sequences. We believe there is an important place for
evaluation methodologies that focus on exploring fac-