veys the related work, Section 3 describes the corpus
we built by exploiting movie subtitles, Section 4 for-
malizes the problem we face and introduces the fea-
tures we use, Section 5 discusses our Naive Bayes
Classifier, and Section 6 concludes and highlights
some future research directions.
2 RELATED WORK
The most widely used methods to programmatically
identify the language of a given text compare the char-
acteristics (usually called features) of the text with
those most common in the various languages. The
features most often compared are n-grams. Given a
string, an n-gram is defined as any sequence composed of n consecutive characters. 3-grams are commonly called trigrams and are the most widely used n-grams.
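As a concrete illustration, the following Python sketch extracts the n-grams of a string (tokenization and padding conventions vary between implementations and are left out here):

def ngrams(text, n=3):
    """Return all n-grams (sequences of n consecutive characters) of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# ngrams("hello") -> ['hel', 'ell', 'llo']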
Most free programs providing language identifi-
cation of text are based on the work of Cavnar and
Trenkle (Cavnar and Trenkle, 1994). These authors
described a simple algorithm comparing the 300 most
frequent trigrams in every language with the top 300 trigrams of the input text. The trigrams are
ordered from the most frequent to the least frequent,
without keeping any additional frequency informa-
tion. The language whose profile is most similar to the
profile of the input text is then chosen. This method
has been shown to work well in practice for sufficiently long input strings: accuracy is nearly 100% for texts longer than 300 characters. For simplicity, in the rest of the paper this algorithm will be called
ROCNNN, where ROC stands for Rank Order Clas-
sifier, and NNN is the number of features stored for
each language (thus the original version will be called
ROC300).
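As an illustrative sketch of this rank-order comparison (the original method also mixes n-grams of several lengths and pads tokens with spaces; those details are omitted here), a minimal Python version could look as follows:

from collections import Counter

def profile(text, size=300, n=3):
    """Top-`size` n-grams of `text`, ordered from most to least frequent."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language
    profile receive a fixed maximum penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank.get(g, penalty)) for i, g in enumerate(doc_profile))

def roc_classify(text, lang_profiles):
    """Pick the language whose stored profile is closest to the text profile."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))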
A number of improvements have since been pro-
posed. Prager (Prager, 1999) compared the results ob-
tained using n-grams of various lengths, words, and combinations thereof. Those features were weighted by their inverse document frequency (features found in fewer languages were weighted higher; the terminology is carried over from the document retrieval field), while the distance used was the cosine distance
of the vectors normalized in feature space.
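A hedged sketch of this kind of weighting and comparison is given below; the exact weighting and normalization scheme in (Prager, 1999) may differ, and lang_features is an assumed mapping from each language to the features seen in its training text:

import math
from collections import Counter

def idf_weights(lang_features):
    """Features occurring in fewer languages get a higher weight."""
    n_langs = len(lang_features)
    df = Counter(f for feats in lang_features.values() for f in set(feats))
    return {f: math.log(n_langs / df[f]) for f in df}

def cosine_similarity(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0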
Ahmed et al. (Ahmed et al., 2004) calculate the distance from a given feature model using a custom “cumulative frequency addition” measure, and compare it with an algorithm similar to naive Bayes. A
database is used to store the frequencies of all tri-
grams encountered in the training set.
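A rough sketch of such a cumulative scheme is shown below; the exact normalization used in (Ahmed et al., 2004) may differ, and freq_table is an assumed mapping from language to relative trigram frequencies:

def cumulative_classify(text, freq_table, n=3):
    """Sum, for each candidate language, the stored relative frequencies
    of the trigrams occurring in the input; the highest total wins."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return max(freq_table,
               key=lambda lang: sum(freq_table[lang].get(g, 0.0) for g in grams))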
Other classification methods such as decision trees
(Hakkinen et al., 2001) and vector quantization (Pham
and Tran, 2003) have been proposed. While these look promising, it is hard to draw final conclusions, as little information is provided about the exact methodology by which the results were obtained.
MacNamara et al. (MacNamara et al., 1998) explore the application of a specific recurrent neural network architecture to the problem, showing that it performs worse than trigram-based methods. It must however be noted that a wide variety of neural network architectures exists, and other variants might give better results.
Elworthy (Elworthy, 1998) proposes to avoid pro-
cessing the whole document, consuming only as many characters as needed to reach the required confidence level. This speeds up text categorization, especially
in the case of very long texts, under the assumption
that all documents are monolingual.
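A sketch of this early-stopping strategy follows; the chunk size, scoring function, and confidence test are illustrative assumptions, not Elworthy's exact procedure:

def classify_incremental(text, score_fn, languages, chunk=50, margin=10.0):
    """Read the text in fixed-size chunks and stop as soon as the best
    language leads the runner-up by `margin`.  `score_fn(prefix, lang)`
    returns a higher-is-better score for the prefix under `lang`."""
    best = languages[0]
    for end in range(chunk, len(text) + chunk, chunk):
        prefix = text[:end]
        scored = sorted((score_fn(prefix, lang), lang) for lang in languages)
        best = scored[-1][1]
        if len(scored) > 1 and scored[-1][0] - scored[-2][0] >= margin:
            break
    return best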
3 LANGUAGE CORPORA
All algorithms using statistical comparison of lan-
guage features need some amount of text in the languages of interest for training. The larger and more representative of the language the data is, the better the algorithm will perform.
Large corpora are freely available for some languages (http://corpus.byu.edu/, http://www.clres.com/corp.html), but we needed homogeneous corpora for a relatively large number of languages. The Universal Declara-
tion of Human Rights has been translated into at
least 375 languages and dialects, and is therefore often used when text in a large number of languages is needed. Wikipedia is also available in a wide range of languages, so it is quite easy to harvest a large amount of data in any of the languages in which Wikipedia exists. The Internet may be the most
obvious place to look for data, as it contains huge
amounts of text in almost every language, spanning
every field. Movie subtitles in various languages
can be downloaded from a number of websites.
They offer a nice compromise between formal and informal language (whereas most other sources only provide text written in formal language). In the following we describe how subtitles were used to build our own multilingual corpus. BBC Worldservice contains articles in a wide range of languages (http://www.bbc.co.uk/worldservice/languages/index.shtml) which can easily be scraped. Combining some of
the sources above, in a clever and statistically sound
way, may provide even better results.
Since none of the corpora described above
met our needs, we built our own corpus by
using movie subtitles. To obtain subtitles open-