biased models, where the accuracy of author detection
is highly dependent on the degree to which the topics
in training and test sets match each other (Luyckx and
Daelemans, 2008). Our experiments are based on a closed dataset, i.e., every test author also appears in the training set, so the task simplifies to author classification rather than detection.
The paper is organized as follows. Sec. 2 introduces the database used for this project. Sec. 3 explains the methodology of the NNLM, including cost function definition, forward-backward propagation, and weight and bias updates. Sec. 4 describes the implementation of the NNLM, provides the classification metrics, and compares results with conventional baseline N-gram models. Finally, Sec. 5 presents the conclusion and suggests future work.
2 DATA PREPARATION
The database is a selection of course transcripts from
Coursera, one of the largest Massive Open Online
Course (MOOC) platforms. To make author detection rely less on domain information, 16 courses were selected from a single text domain, the technical science and engineering fields, covering 8 areas: Algorithm, Data Mining, Information Technologies (IT), Machine Learning, Mathematics, Natural Language Processing (NLP), Programming, and Digital Signal Processing (DSP). Table 1 lists further details for each course in the database, such as the number of sentences and words, the average number of words per sentence, and the vocabulary size at successive preprocessing stages. For privacy reasons, the exact course titles and instructor (author) names are concealed. However, for the purpose of detecting the authors, it should be noted that all courses are taught by different instructors, except courses 7 and 16, which share the same instructor. This was done intentionally to allow us to investigate how topic variation affects performance.
The transcripts for each course were originally collected as short phrases of varying length, shown one at a time at the bottom of the video lectures. They were first concatenated and then segmented into sentences, using straightforward boundary detection based on punctuation. The sentence-wise datasets were then stemmed using the Porter stemming algorithm (Porter, 1980). To further control the vocabulary size, words occurring only once in the entire course or with relative frequency below 1/100,000 are considered to have negligible influence on the outcome and are pruned by mapping them to an Out-Of-Vocabulary (OOV) mark ⟨unk⟩.
Table 1: Subtitle database from selected Coursera courses.

ID | Field            | No. of sentences | No. of words | Words/sentence | Vocab. size (original / stemmed / pruned)
 1 | Algorithm        |  5,672 | 121,675 | 21.45 | 3,972 / 2,702 / 1,809
 2 | Algorithm        | 14,902 | 294,055 | 20.87 | 6,431 / 4,222 / 2,378
 3 | DSP              |  8,126 | 129,665 | 15.96 | 3,815 / 2,699 / 1,869
 4 | Data Mining      |  7,392 | 129,552 | 17.53 | 4,531 / 3,140 / 2,141
 5 | Data Mining      |  6,906 | 129,068 | 18.69 | 3,008 / 2,041 / 1,475
 6 | DSP              | 20,271 | 360,508 | 17.78 | 8,878 / 5,820 / 2,687
 7 | IT               |  9,103 | 164,812 | 18.11 | 4,369 / 2,749 / 1,979
 8 | Mathematics      |  5,736 | 101,012 | 17.61 | 3,095 / 2,148 / 1,500
 9 | Machine Learning | 11,090 | 224,504 | 20.24 | 6,293 / 4,071 / 2,259
10 | Programming      |  8,185 | 160,390 | 19.60 | 4,045 / 2,771 / 1,898
11 | NLP              |  7,095 | 111,154 | 15.67 | 3,691 / 2,572 / 1,789
12 | NLP              |  4,395 | 100,408 | 22.85 | 3,973 / 2,605 / 1,789
13 | NLP              |  4,382 |  96,948 | 22.12 | 4,730 / 3,467 / 2,071
14 | Machine Learning |  6,174 | 116,344 | 18.84 | 5,844 / 4,127 / 2,686
15 | Mathematics      |  5,895 | 152,100 | 25.80 | 3,933 / 2,697 / 1,918
16 | Programming      |  6,400 | 136,549 | 21.34 | 4,997 / 3,322 / 2,243
The top bar graph in Figure 1 shows how the vocabulary size of each course dataset shrinks after stemming and pruning. Only 0.5-1.5% of the words in each dataset are mapped to ⟨unk⟩, yet the vocabulary sizes are reduced significantly, to an average of about 2,000. The bottom bar graph provides a profile of each instructor in terms of word frequency, i.e., the database coverage of the k most frequent words after stemming and pruning, where k = 500, 1000, 2000. For example, the 500 most frequent words cover at least 85% of the words in all datasets.
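The coverage statistic itself is simple to compute; below is a minimal sketch, assuming a flat token list per course (the function name database_coverage is ours, chosen to mirror the DC notation of Figure 1):

```python
from collections import Counter

def database_coverage(tokens, k):
    """Fraction of all word tokens accounted for by the k most
    frequent word types (the DC_k statistic plotted in Figure 1)."""
    counts = Counter(tokens)
    top_k_total = sum(c for _, c in counts.most_common(k))
    return top_k_total / len(tokens)

# e.g., database_coverage(course_tokens, 500) should be >= 0.85
# for every stemmed and pruned course dataset described above.
```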
[Figure 1 comprises two bar graphs over dataset index C: "Vocabulary size for each dataset" (V_original, V_stemmed, V_stemmed-pruned) and "Database coverage from most frequent k words for each dataset" (DC_500, DC_1000, DC_2000 for the stemmed and pruned datasets).]
Figure 1: Database profile with respect to vocabulary size
and word coverage in various stages.
3 NEURAL NETWORK LANGUAGE MODEL
The language model is trained using a feed-forward
neural network, illustrated in Figure 2. Given a sequence of N words $W_1, W_2, \ldots, W_i, \ldots, W_N$ from