A CRYPTOGRAPHIC APPROACH TO LANGUAGE
IDENTIFICATION: PPM
Ebru Celikel
Ege University, International Computer Institute, 35100 Bornova/Izmir, Turkey
Keywords: Language identification, Statistical modelling, PPM
Abstract: The problem of language discrimination arises in situations when many texts belonging to different
source languages are at hand but we are not sure to which language each belongs. This is often
the case during information retrieval via the Internet. We propose a cryptographic solution to the language
identification problem: employing the Prediction by Partial Matching (PPM) model, we generate a
language model and then use this model to discriminate languages. PPM is a cryptographic tool based on an
adaptive statistical model. It yields compression rates (measured in bits per character, bpc) far better
than those of many other conventional lossless compression tools. Language identification experiment
results obtained on sample texts from Corpora in five different languages, namely English, French, Turkish, German and
Spanish, are given. The rates of success show that the performance of the system is highly
dependent on the diversity of the languages, as well as on the target text and training text file sizes. The results also indicate that
the PPM model is highly sensitive to the input language. From a cryptographic aspect, if the training text itself is
kept secret, our language identification system would provide security to promising degrees.
1 INTRODUCTION
With ongoing improvements in network facilities,
the number of computers connected to the
Internet is increasing every day. This widespread use
of the Internet makes a diversity of languages available
online. Hence, there should be a sound way of
determining the correct language of a retrieved text.
This would not only help discriminate texts
belonging to different source languages, but also
facilitate text categorization, natural language
processing and information retrieval.
Of the many techniques used to determine the correct
language, exploiting the statistical characteristics of
the disputed languages is quite common. Frequency
values over the alphabet, average word length,
and statistics of vowels/consonants throughout the text
might all be used to characterize
different languages (Ganesan and Sherman, 1988).
Using the frequency of short word occurrences
(Kulikowski, 1991; Ingle, 1991), the independent
as well as joint probabilities of symbols (Rau, 1974),
n-grams (Beesley, 1988; Cavner and
Trenkle, 1994), n-graphs, i.e. groups of n words
(Batchelder, 1992), and diacritics and special characters
(Newman, 1987) may all help extract the linguistic
characteristics. Exploiting statistics can be
extended to incorporate the syllable characteristics
(Mustonen, 1965) and the morphology and syntax
(Ziegler, 1991) of the texts under suspicion. Being a
statistical approach, the Hidden Markov Model (HMM)
has been employed in language discrimination studies, too
(Preez and Weber, 1996).
Another approach to language discrimination is to
use an artificial intelligence system with neural
networks. In this way, the system can learn the
characteristics of a language, and the language of
unknown texts can then be determined (Braum and
Levkowitz, 1998).
Our approach to language identification
incorporates the Prediction by Partial Matching
(PPM) algorithm. The premises of our study are
cryptographic as well as statistical. It is
cryptographic in that PPM is basically a lossless
compression tool which encodes texts into
formats that are hard to recover as long as the
algorithm itself is hidden. It is also statistical
because PPM extracts a statistical model from a
given text (called the training text) and uses this model
to decide whether the text in question belongs to the
same language as the text from which the
model was extracted. Teahan used the PPM technique to
discriminate the language of Bibles written in six
languages: English, French, German, Italian, Latin
and Spanish (Teahan, 2000).
2 COMPONENTS
The scheme we designed and developed for
language identification is made up of an adaptive
statistical model PPM, followed by an encoder,
Arithmetic Coding. In the following subsection,
these components are introduced. We then give the
details of Corpora we used during our
implementation.
2.1 PPM Model

PPM is a compression technique that makes use of a
statistical model of the input text. It was first
introduced by Cleary and Witten (Cleary and
Witten, 1984), and was improved by Moffat
(Moffat, 1990). The algorithm reads the upcoming
symbol from the input text, assigns a probability to it,
and sends it to an adaptive encoder, i.e. the
Arithmetic Coder. The probabilities assigned to each
symbol are determined by blending together and
adaptively updating several fixed-order context
models (Nelson, 1991).
PPM is a bijective algorithm: by employing fixed
input symbol sizes, it is capable of interpreting any
collection of bytes as valid compressed input. This
property has positive implications for compression
efficiency and security, since there is no way to
distinguish random data from valid output
(Nodeworks Encyclopedia URL).
PPM employs a context of length k to predict and
assign a probability to the upcoming symbol. If the
upcoming symbol is a new one that has never
occurred so far, then the length-k context cannot
predict it. In this case, an escape symbol is issued
and the model falls back to a lower level, i.e. the
length-(k-1) context. This reduction may continue
until a previously seen symbol is reached. Once the
upcoming symbol's probability is determined, the
Arithmetic Coder is used to encode it (Teahan,
1998).
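To make the escape mechanism concrete, the following minimal sketch (our own simplified Python illustration, not the exact PPM variant used in the experiments; real implementations such as PPMC also apply exclusions and subtler escape estimates) tries the longest matching context first and multiplies in an escape probability at each fall-back:

    from collections import defaultdict

    class SimplePPM:
        def __init__(self, max_order=2):
            self.max_order = max_order
            # counts[k][context][symbol]: frequency of symbol after a length-k context
            self.counts = [defaultdict(lambda: defaultdict(int))
                           for _ in range(max_order + 1)]

        def train(self, text):
            for i, sym in enumerate(text):
                for k in range(self.max_order + 1):
                    if i >= k:
                        self.counts[k][text[i - k:i]][sym] += 1

        def prob(self, history, sym):
            p = 1.0
            for k in range(self.max_order, -1, -1):
                ctx = history[-k:] if k > 0 else ''
                seen = self.counts[k][ctx]
                total = sum(seen.values())
                if total == 0:
                    continue                      # context never observed; fall back
                d = len(seen)                     # number of distinct symbols seen here
                if sym in seen:
                    return p * seen[sym] / (total + d)
                p *= d / (total + d)              # escape probability (PPMC-style)
            return p / 256.0                      # order -1: uniform over all bytes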
Arithmetic Coding is an encoding technique
providing compression rates close to the language
entropy. Arithmetic Coding is considered to be
optimal because, on average, it is not possible to
encode at a rate better than the entropy (Witten, Moffat
and Bell, 1999). The algorithm for Arithmetic
Coding is as follows:
procedure ArithmeticCoding()
    set low to 0.0 and high to 1.0
    while there are more input symbols do
        get an input symbol
        code_range = high - low
        high = low + code_range * high_range(symbol)
        low = low + code_range * low_range(symbol)
    end while
    output low
end procedure
Arithmetic Coding represents a stream of input
symbols as a floating-point number between 0 and 1,
instead of replacing each input symbol with a new
symbol. As the text is encoded, the interval narrows
according to the frequencies of the symbols in the
source text. Source message symbols with high
frequency narrow the interval less than low-frequency
symbols do. In this manner, high-frequency source
message symbols require fewer bits in the output
sequence.
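As a runnable illustration of the pseudocode above (a floating-point sketch only; production coders use scaled integer arithmetic with renormalisation and incremental bit output), consider:

    def arithmetic_encode(message, cum_ranges):
        # cum_ranges maps each symbol to its cumulative probability
        # interval, e.g. {'a': (0.0, 0.5), 'b': (0.5, 1.0)}.
        low, high = 0.0, 1.0
        for symbol in message:
            code_range = high - low
            sym_low, sym_high = cum_ranges[symbol]
            high = low + code_range * sym_high
            low = low + code_range * sym_low
        return low    # any number in [low, high) identifies the message

    # Example with P(a)=0.5, P(b)=0.25, P(c)=0.25:
    ranges = {'a': (0.0, 0.5), 'b': (0.5, 0.75), 'c': (0.75, 1.0)}
    print(arithmetic_encode('aab', ranges))    # the frequent 'a' narrows the interval least

Encoding 'aab' here narrows [0, 1) to [0.125, 0.1875), so 0.125 suffices to identify the whole message.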
2.2 Language Identification System

The language identification system we designed and
developed is made up of two components: a
statistical model generator and an encoder (Fig. 1).
There are two inputs to the system: one is the
text whose language is already known, and the other
is the text in question (the disputed text), i.e. the one whose
language is to be determined. Within the system, the
former is called the training text and is fed into the
PPM model, and the latter is called the plaintext and
is fed into the encoder itself.
Figure 1: The language identification system (the training text feeds the PPM model, whose statistics drive the encoder; the plaintext is fed into the encoder, which outputs the compressed text).
During implementation, we incorporated the PPM
algorithm as follows: we used the training text to gather
symbol frequencies, and with these frequencies we
compressed different texts whose languages are in
question. When the languages of the training text
and the disputed text are the same, the compression
performance of PPM is expected to be better. But
factors such as linguistic characteristics and the
type and length of the texts might affect the
performance of our system.
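The identification step itself can be sketched as follows (a minimal illustration under stated assumptions: models is a dictionary of per-language models such as the SimplePPM sketch above, and the ideal code length -log2 p stands in for the actual arithmetic coder output):

    import math

    def bits_to_encode(model, text):
        # Ideal code length: sum of -log2 p(symbol | history) over the text.
        return sum(-math.log2(model.prob(text[:i], sym))
                   for i, sym in enumerate(text))

    def identify(disputed_text, models):
        # models: language name -> model trained on that language's training text.
        bpc = {lang: bits_to_encode(m, disputed_text) / len(disputed_text)
               for lang, m in models.items()}
        return min(bpc, key=bpc.get)    # the language that compresses best wins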
In order to evaluate our scheme's rate of success
in determining the right language, we employed the
cross-entropy. The cross-entropy H(L,M) is defined
as an upper bound on the actual entropy H(L) of a
language, obtained by using a probability model M. We need
the cross-entropy because the exact probability
distribution of a language is never known with certainty; hence
H(L) is just a theoretical value. The cross-entropy of
a language L calculated using a model M is
formulated in Eq. 1. The entropy and cross-entropy
are both measured in bits per character (bpc).
H(L, M) = \lim_{m \to \infty} -\frac{1}{m} \sum p(x_1, x_2, \ldots, x_m) \log_2 p_M(x_1, x_2, \ldots, x_m)    (1)
In Eq. 1, the entropy of a language L, i.e. H(L), is
calculated as follows (Eq. 2):

H(L) = \lim_{m \to \infty} -\frac{1}{m} \sum p(x_1, x_2, \ldots, x_m) \log_2 p(x_1, x_2, \ldots, x_m)    (2)

where the sums run over all sequences x_1, x_2, \ldots, x_m of length m.
The entropy H is the average number of bits
per symbol needed to encode a message (Shannon,
1948).
Assuming that the k possible symbols are independent
and have a probability distribution P = p(x_1), p(x_2), \ldots, p(x_k),
with the probabilities summing to 1, the
entropy can be calculated as below (Eq. 3):

H(P) = -\sum_{i=1}^{k} p(x_i) \log_2 p(x_i)    (3)
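As a quick numeric check of Eq. 3 (our own example distribution, not one drawn from the paper's Corpora):

    import math

    def entropy(probs):
        # H(P) per Eq. 3; probs is a list of symbol probabilities summing to 1.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.25, 0.25]))    # 1.5 bpc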
For this study, we incorporated Corpora from five
European languages: English, French, Turkish,
German and Spanish. These languages differ from
each other to varying degrees. For example, while
English, French and Spanish are more alike, Turkish
is totally different from these languages, and German
shares some linguistic characteristics with the
above-mentioned group of three. The alphabet sizes
for these languages are 26 letters for English and
French, 29 letters for Turkish and 30 letters for both
German and Spanish.
The languages of concern in this study span the
analytical, agglutinative and fusional (affixing)
types. Analytical languages either do not
combine concepts into single words at all (as in
Chinese) or do so economically, as is the case in
English and French. The sentence itself is of primary
concern in analytical languages, while the word is of
minor interest. Turkish is a synthetic, free-constituent-order
language, morphologically extendible through its rich
set of derivational and inflectional suffixes. In a
synthetic language, the concepts cluster more thickly
and the words are more richly chambered, but there is
a tendency to keep the range of concrete significance
in the single word down to a moderate compass (Sapir, 1921).
German makes extensive use of inflectional endings, and
compound words are quite common (German
Linguistic URL). Spanish has much the same
linguistic characteristics as French, with a higher
average word length.
Our Corpora consist of these five different
languages, and each language has a group of seven
texts. The texts within each language have been
deliberately selected from different text categories
so that changes of style are reflected in our language
discrimination implementation. These categories are
novel, technical document, poetry, manual, theatre
text, holy book (Bible or Quran) and
dictionary/encyclopedia. We first based our Corpora
construction on text files from the standard English
Canterbury Corpus (Canterbury Corpus). Modeling
after the Canterbury Corpus, we then compiled
Corpora for the other four source languages:
French, Turkish, German and Spanish. The texts in the
Turkish Corpus are from Celikel (Celikel, 2004),
and the texts in the other languages are all
from the Internet. The sizes and contents of each Corpus
are listed in Tables 1 through 5:
Table 1: English Corpus

File  Size (bytes)  Explanation
E1    152,089       Novel: Lewis Carroll's "Alice in Wonderland"
E2    426,754       Technical document
E3    481,861       English poetry
E4    4,227         GNU manual
E5    125,179       Theatre text of the play "As You Like It"
E6    4,047,392     Bible in English
E7    2,473,401     World Fact Book of the CIA

Table 2: French Corpus

File  Size (bytes)  Explanation
F1    871,286       Novel: Jules Verne's "20000 Leagues under the Sea"
F2    66,049        Technical document
F3    185,205       French poetry
F4    32,428        GNU manual
F5    135,477       Theatre text of the play "Tartuffe" by Molière
F6    4,669,107     Bible in French
F7    51,521        French dictionary of computers

Table 3: Turkish Corpus

File  Size (bytes)  Explanation
T1    167,799       Novel: Ataturk's Discourse
T2    9,664         Technical document
T3    59,386        Turkish poetry
T4    18,526        GNU manual
T5    113,545       Theatre text of the play "Galilei Galileo"
T6    937,532       Quran in Turkish
T7    765,624       Online dictionary of philosophy terms

Table 4: German Corpus

File  Size (bytes)  Explanation
G1    716,800       Novel
G2    25,872        Technical document
G3    105,868       German poetry
G4    4,622         GNU manual
G5    100,712       Theatre text of the play "Faust"
G6    4,359,878     Bible in German
G7    157,289       Online dictionary of medical terms

Table 5: Spanish Corpus

File  Size (bytes)  Explanation
S1    63,170        Novel: "Oración Cívica" by Gabino Barreda
S2    25,295        Technical document
S3    129,005       Spanish poetry
S4    5,865         GNU manual
S5    158,991       Theatre text of the play "Las Mocedades Del Cid"
S6    4,126,848     Bible in Spanish
S7    216,267       Online ophthalmology dictionary
3 RESULTS
To discriminate among languages, we applied the
PPM model to texts from each Corpus. During
implementation, we repeated the language
identification experiments for each text file: within
each language set, we employed each of the seven text
files as the training text for PPM and compressed the
texts of the whole Corpora. Since there are five
different languages, this makes 7x5=35 runs for each
training text; since there are seven texts within each language
set, 7x5x7=245 runs for each language; and
since there are five different languages,
245x5=1,225 runs in total.
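The tally over these runs can be sketched as below (a hypothetical helper, not the paper's actual harness; the accuracy measure it computes is formally defined in Eq. 4, which follows):

    def accuracy_rate(decisions):
        # decisions: one (true_language, decided_language) pair per run,
        # where decided_language is the training language achieving the
        # lowest bpc on the disputed text. A run is a success when the
        # two coincide; cf. Eq. 4.
        successes = sum(1 for true_lang, decided in decisions
                        if true_lang == decided)
        return successes / len(decisions)

    # e.g. accuracy_rate([('English', 'English'), ('English', 'German'), ...])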
In order to evaluate the performance of our
language discriminator, we used the accuracy rate
measure (Eq. 4). In this formula, the successes are
the cases when both the training text and the
disputed text belong to the same language. All other
cases are considered failures.

\text{Accuracy rate} = \frac{\text{number of successes}}{\text{total number of experiments}}    (4)

In Table 6 below, language determination
experiment results with training texts assigned from
the English Corpus are given. Rows where the lowest
bpc rate is obtained in the English column are counted
as success cases: for these, we conclude that the text in
question is in English as well. Unfortunately, this is
not the case all the time: there are some rows in
Table 6 where the lowest compression rates with
PPM are obtained with texts in languages
other than English. These cases are counted as
failures. The failures might be caused by the
insufficient size of the training text or of the
compressed text (Table 6). According to Table 6,
there are 18 failures and 31 successes, yielding an
accuracy rate of 63.27%. This is the accuracy rate
achieved by the PPM model on English.

Repeating the language identification
experiments on French texts, we obtained the results
in Table 7. Here, the effect of training text size on
the performance of the identification scheme is
obvious: in cases when the training text is small
(files 2, 4, 5 and 7) the number of failures is
greater, while with a larger training text (file 6) the
system has only two failures. Moreover, when the
compressed text is not in French but its size is
relatively large, the language identifier
incorrectly decides that it was written in French.
This is the effect of compressed text size on the
overall system performance. The accuracy rate is
46.94% for French, with 23 successes and 26
failures.
Applying our scheme with training texts in
Turkish this time, we obtained the bpc values in
Table 8. The number of failures for Turkish is 39,
while the number of successes is only 10 out of 49
experiments. Hence, the accuracy rate of our model
on Turkish is only 20.41%. This performance
degradation in the identification scheme is probably
due to the different linguistic characteristics of
Turkish.
Table 6: Language identification experiments with training texts from the English Corpus (bpc values)

TT = E1 (152,089 bytes)
File  English  French  Turkish  German  Spanish
1     1.36     2.04    2.12     1.78    2.74
2     1.95     2.16    3.78     2.60    2.48
3     2.35     2.64    2.86     2.94    2.69
4     2.86     2.71    2.75     3.88    3.72
5     2.45     2.02    2.53     2.63    1.79
6     1.61     1.78    1.95     1.76    1.80
7     1.51     2.43    1.91     1.80    2.01

TT = E2 (426,754 bytes)
File  English  French  Turkish  German  Spanish
1     2.22     2.08    2.17     1.83    2.84
2     1.34     2.24    3.81     2.65    2.54
3     2.38     2.73    2.93     3.04    2.78
4     2.69     2.77    2.62     3.73    3.61
5     2.54     2.09    2.60     2.71    1.85
6     1.63     1.80    1.97     1.77    1.81
7     1.52     2.44    1.93     1.86    2.08

TT = E3 (481,861 bytes)
File  English  French  Turkish  German  Spanish
1     2.23     2.09    2.18     1.84    2.84
2     1.99     2.25    3.88     2.70    2.59
3     1.70     2.72    2.93     3.08    2.75
4     2.91     2.83    2.78     3.88    3.71
5     2.39     2.07    2.61     2.73    1.85
6     1.61     1.80    1.98     1.78    1.82
7     1.53     2.52    1.94     1.87    2.07

TT = E4 (4,227 bytes)
File  English  French  Turkish  German  Spanish
1     2.19     1.99    2.01     1.73    2.55
2     1.93     2.00    3.52     2.39    2.27
3     2.33     2.51    2.67     2.76    2.53
4     0.88     2.48    2.62     3.58    3.44
5     2.47     1.88    2.40     2.45    1.68
6     1.60     1.76    1.91     1.74    1.78
7     1.49     2.26    1.87     1.70    1.92

TT = E5 (125,179 bytes)
File  English  French  Turkish  German  Spanish
1     2.17     2.03    2.11     1.78    2.73
2     1.95     2.14    3.77     2.58    2.47
3     2.31     2.62    2.83     2.93    2.66
4     2.85     2.69    2.73     3.83    3.70
5     1.47     1.99    2.52     2.60    1.77
6     1.60     1.78    1.94     1.76    1.80
7     1.50     2.41    1.91     1.79    2.00

TT = E6 (4,047,392 bytes)
File  English  French  Turkish  German  Spanish
1     2.35     2.25    2.27     1.97    3.11
2     2.17     2.45    3.98     2.87    2.78
3     2.52     2.71    3.03     3.28    2.94
4     3.06     3.13    2.92     4.10    4.00
5     2.43     2.25    2.71     2.89    1.96
6     1.43     1.87    2.04     1.86    1.88
7     1.63     2.76    2.01     1.98    2.23

TT = E7 (2,473,401 bytes)
File  English  French  Turkish  German  Spanish
1     2.49     2.20    2.32     1.92    3.09
2     2.11     2.41    3.98     2.86    2.80
3     2.58     2.92    3.14     3.25    2.98
4     3.02     3.05    2.89     3.96    3.80
5     2.77     2.28    2.76     2.91    2.01
6     1.70     1.85    2.03     1.81    1.86
7     1.07     2.69    2.02     2.03    2.26
Table 7: Language identification experiments with training texts from the French Corpus (bpc values)

TT = F1 (871,286 bytes)
File  English  French  Turkish  German  Spanish
1     2.45     1.54    2.19     1.86    2.92
2     2.13     2.03    3.85     2.72    2.67
3     2.41     2.39    2.95     3.09    2.83
4     3.41     2.10    2.98     3.92    3.72
5     2.76     1.79    2.63     2.74    1.89
6     1.66     1.78    1.99     1.79    1.85
7     1.56     2.23    1.95     1.89    2.15

TT = F2 (66,049 bytes)
File  English  French  Turkish  German  Spanish
1     2.27     1.99    2.07     1.76    2.65
2     1.98     1.18    3.69     2.50    2.39
3     2.37     2.51    2.77     2.85    2.61
4     3.21     2.39    2.78     3.74    3.57
5     2.57     1.88    2.47     2.54    1.74
6     1.61     1.77    1.93     1.75    1.79
7     1.50     2.27    1.89     1.75    1.97

TT = F3 (185,205 bytes)
File  English  French  Turkish  German  Spanish
1     3.40     1.97    2.13     1.79    2.75
2     2.04     2.01    3.80     2.61    2.53
3     2.42     1.60    2.87     2.97    2.70
4     3.34     2.34    2.90     3.88    3.71
5     2.65     1.81    2.55     2.65    1.80
6     1.65     1.76    1.96     1.76    1.80
7     1.52     2.30    1.91     1.80    2.04

TT = F4 (32,428 bytes)
File  English  French  Turkish  German  Spanish
1     2.25     1.97    2.06     1.75    2.60
2     1.96     1.94    3.64     2.46    2.34
3     2.36     2.48    2.74     2.83    2.58
4     3.08     1.22    2.69     3.56    3.32
5     2.54     1.83    2.46     2.51    1.72
6     1.60     1.76    1.92     1.75    1.79
7     1.50     2.13    1.88     1.74    1.95

TT = F5 (135,477 bytes)
File  English  French  Turkish  German  Spanish
1     2.31     1.97    2.10     1.77    2.69
2     2.00     1.98    3.72     2.55    2.44
3     2.39     2.44    2.85     2.90    2.65
4     3.25     2.30    2.86     3.83    3.63
5     2.59     1.23    2.51     2.59    1.77
6     1.61     1.76    1.94     1.75    1.80
7     1.51     2.28    1.90     1.78    2.00

TT = F6 (4,669,107 bytes)
File  English  French  Turkish  German  Spanish
1     2.61     2.08    2.28     1.95    3.18
2     2.28     2.16    4.10     2.90    2.89
3     2.66     2.51    3.08     3.25    3.02
4     3.69     2.29    3.17     4.19    4.01
5     2.93     1.84    2.74     2.89    2.02
6     1.72     1.60    2.06     1.87    1.96
7     1.64     2.48    2.02     1.99    2.33

TT = F7 (51,521 bytes)
File  English  French  Turkish  German  Spanish
1     2.27     1.99    2.07     1.76    2.64
2     1.96     2.01    3.65     2.47    2.36
3     2.37     2.53    2.78     2.86    2.61
4     3.10     2.30    2.74     3.64    3.47
5     2.56     1.90    2.48     2.55    1.74
6     1.61     1.77    1.93     1.75    1.79
7     1.50     1.08    1.89     1.75    1.96
For German, we repeated the experiments and
gathered the figures in Table 9. Whenever the
training texts are in German, the performance of the
PPM model in discriminating the languages of texts
is not very good: it has 36 failures and only 13
successes, yielding an accuracy rate of 26.53%.
Lastly, we ran our language discriminator with
training texts assigned from the Spanish language set
and recorded the results in Table 10. According to
the values in Table 10, our scheme yielded 34
failures versus 15 successes, so the accuracy rate
indicated by our system for Spanish is 30.61%.
To observe the overall success rate of our scheme
on the five languages of concern, we have plotted
the accuracy rate vs. language chart in Fig. 2.
Figure 2: Accuracy rate (%) of the language identification system for different languages.
According to Fig. 2, the language identifier employing
the PPM model achieves its best accuracy rate with
training texts in English (63.27%). The system
indicates accuracy rates of 46.94% with French,
30.61% with Spanish, 26.53% with German and,
worst of all, 20.41% with Turkish. The
differences in performance probably stem from the
characteristics of the source languages,
as well as the category (i.e. the type) and
the different time span each text belongs to.
Table 8: Language identification experiments with training texts from the Turkish Corpus (bpc values)

TT = T1 (167,799 bytes)
File  English  French  Turkish  German  Spanish
1     2.33     2.03    1.20     1.78    2.74
2     2.00     2.15    3.23     2.60    2.49
3     2.40     2.62    2.64     2.93    2.68
4     3.42     2.73    2.73     3.93    3.84
5     2.63     2.00    2.33     2.61    1.78
6     1.61     1.78    1.92     1.75    1.80
7     1.52     2.44    1.86     1.80    2.01

TT = T2 (9,664 bytes)
File  English  French  Turkish  German  Spanish
1     2.22     1.99    1.99     1.74    2.58
2     1.94     2.03    0.94     2.42    2.31
3     2.34     2.53    2.63     2.79    2.55
4     3.18     2.54    2.60     3.67    3.54
5     2.51     1.90    2.36     2.48    1.69
6     1.60     1.76    1.90     1.74    1.78
7     1.49     2.30    1.86     1.72    1.93

TT = T3 (59,386 bytes)
File  English  French  Turkish  German  Spanish
1     2.29     2.02    2.01     1.76    2.68
2     1.98     2.11    3.33     2.55    2.44
3     2.38     2.59    1.25     2.88    2.63
4     3.34     2.66    2.74     3.86    3.74
5     2.59     1.98    2.33     2.56    1.75
6     1.61     1.77    1.91     1.75    1.79
7     1.51     2.40    1.87     1.77    1.98

TT = T4 (18,526 bytes)
File  English  French  Turkish  German  Spanish
1     2.22     2.00    2.02     1.74    2.59
2     1.94     2.04    3.36     2.42    2.31
3     2.34     2.54    2.68     2.81    2.57
4     2.91     2.52    0.89     3.25    3.15
5     2.51     1.92    2.41     2.49    1.71
6     1.60     1.77    1.91     1.74    1.78
7     1.49     2.29    1.87     1.73    1.94

TT = T5 (113,545 bytes)
File  English  French  Turkish  German  Spanish
1     2.31     2.03    1.97     1.77    2.72
2     2.00     2.14    3.15     2.59    2.48
3     2.39     2.61    2.52     2.92    2.66
4     3.41     2.71    2.75     3.93    3.82
5     2.61     1.99    1.26     2.59    1.77
6     1.61     1.77    1.90     1.75    1.79
7     1.51     2.44    1.85     1.79    2.00

TT = T6 (937,532 bytes)
File  English  French  Turkish  German  Spanish
1     2.44     2.08    2.07     1.83    2.90
2     2.08     2.28    3.40     2.76    2.66
3     2.47     2.75    2.62     3.08    2.81
4     3.58     2.88    2.90     4.11    4.04
5     2.75     2.11    2.33     2.75    1.86
6     1.63     1.80    1.44     1.78    1.82
7     1.54     2.59    1.93     1.86    2.12

TT = T7 (765,624 bytes)
File  English  French  Turkish  German  Spanish
1     2.42     2.08    1.98     1.82    2.87
2     2.06     2.22    3.07     2.69    2.59
3     2.47     2.71    2.65     3.06    2.79
4     3.48     2.81    2.75     3.88    3.87
5     2.73     2.09    2.29     2.72    1.86
6     1.63     1.79    1.96     1.77    1.82
7     1.54     2.52    1.33     1.89    2.08
Table 9: Language identification experiments with training texts from the German Corpus (bpc values)

TT = G1 (716,800 bytes)
File  English  French  Turkish  German  Spanish
1     2.42     2.09    2.18     1.34    2.84
2     2.08     2.25    3.83     2.34    2.58
3     2.50     2.73    2.93     2.84    2.75
4     3.41     2.86    2.95     2.89    3.80
5     2.74     2.08    2.61     2.49    1.83
6     1.64     1.80    1.99     1.78    1.81
7     1.54     2.54    1.94     1.83    2.07

TT = G2 (25,872 bytes)
File  English  French  Turkish  German  Spanish
1     2.25     2.00    2.05     1.73    2.61
2     1.96     2.06    3.62     0.94    2.09
3     2.35     2.56    2.74     2.77    2.58
4     3.16     2.57    2.72     3.11    3.53
5     2.54     1.93    2.45     2.45    1.71
6     1.60     1.77    1.92     1.74    1.79
7     1.50     2.31    1.88     1.71    1.95

TT = G3 (105,868 bytes)
File  English  French  Turkish  German  Spanish
1     2.34     2.04    2.12     1.75    2.73
2     2.01     2.16    3.81     2.42    2.49
3     2.41     2.64    2.85     1.53    2.67
4     3.34     2.73    2.89     3.28    3.79
5     2.64     2.01    2.54     2.31    1.78
6     1.62     1.78    1.95     1.74    1.79
7     1.51     2.45    1.91     1.77    2.01

TT = G4 (4,622 bytes)
File  English  French  Turkish  German  Spanish
1     2.21     1.99    2.02     1.72    2.55
2     1.94     2.01    3.51     2.27    2.27
3     2.33     2.52    2.68     2.74    2.53
4     3.04     2.48    2.57     1.00    2.95
5     2.50     1.89    2.41     2.42    1.68
6     1.60     1.76    1.91     1.74    1.78
7     1.49     2.26    1.87     1.69    1.92

TT = G5 (100,712 bytes)
File  English  French  Turkish  German  Spanish
1     2.33     2.03    2.11     1.74    2.71
2     2.00     2.15    3.79     2.39    2.47
3     2.40     2.63    2.85     2.64    2.65
4     3.34     2.71    2.78     3.31    3.78
5     2.62     2.00    2.53     1.39    1.77
6     1.61     1.78    1.94     1.74    1.79
7     1.51     2.44    1.90     1.76    2.00

TT = G6 (4,359,878 bytes)
File  English  French  Turkish  German  Spanish
1     2.62     2.21    2.30     1.97    3.05
2     2.21     2.41    4.05     2.67    2.79
3     2.66     2.91    3.09     2.76    2.91
4     3.65     3.09    3.17     3.36    4.16
5     2.95     2.23    2.75     2.39    1.93
6     1.72     1.88    2.07     1.55    1.87
7     1.60     2.74    2.03     2.02    2.20

TT = G7 (157,289 bytes)
File  English  French  Turkish  German  Spanish
1     2.32     2.03    2.12     1.77    2.71
2     2.00     2.13    3.76     2.45    2.46
3     2.40     2.62    2.85     2.88    2.65
4     3.33     2.70    2.89     3.44    3.75
5     2.62     1.99    2.53     2.55    1.78
6     1.61     1.78    1.94     1.75    1.79
7     1.51     2.42    1.91     0.91    1.99
Table 10: Language identification experiments with training texts from the Spanish Corpus (bpc values)

TT = S1 (63,170 bytes)
File  English  French  Turkish  German  Spanish
1     2.28     2.02    2.08     1.76    1.40
2     1.98     2.10    3.67     2.51    2.23
3     2.37     2.58    2.78     2.87    2.44
4     3.22     2.61    2.79     3.74    2.91
5     2.57     1.96    2.49     2.54    1.64
6     1.61     1.77    1.93     1.75    1.78
7     1.51     2.37    1.89     1.76    1.91

TT = S2 (25,295 bytes)
File  English  French  Turkish  German  Spanish
1     2.23     2.00    2.04     1.74    2.52
2     1.95     2.04    3.60     2.20    0.92
3     2.35     2.55    2.73     2.81    2.53
4     3.10     2.54    2.70     3.66    3.08
5     2.53     1.92    2.44     2.50    1.69
6     1.60     1.77    1.92     1.74    1.78
7     1.50     2.30    1.88     1.73    1.91

TT = S3 (129,005 bytes)
File  English  French  Turkish  German  Spanish
1     2.33     2.04    2.13     1.78    2.40
2     2.01     2.15    3.77     2.58    2.28
3     2.40     2.63    2.86     2.92    1.49
4     3.33     2.69    2.89     3.86    3.09
5     2.62     2.01    2.54     2.60    1.62
6     1.61     1.78    1.95     1.75    1.78
7     1.52     2.44    1.91     1.79    1.96

TT = S4 (5,865 bytes)
File  English  French  Turkish  German  Spanish
1     2.21     1.99    2.02     1.73    2.48
2     1.94     2.00    3.51     2.38    2.17
3     2.33     2.52    2.68     2.77    2.50
4     3.01     2.45    2.55     2.94    1.02
5     2.49     1.89    2.41     2.46    1.66
6     1.60     1.76    1.91     1.74    1.78
7     1.49     2.25    1.87     1.71    1.90

TT = S5 (158,991 bytes)
File  English  French  Turkish  German  Spanish
1     2.31     2.03    2.11     1.77    2.47
2     2.00     2.14    3.74     2.56    2.34
3     2.39     2.62    2.83     2.90    2.43
4     3.31     2.68    2.87     3.87    3.22
5     2.60     1.99    2.52     2.58    1.08
6     1.61     1.78    1.94     1.75    1.78
7     1.52     2.42    1.91     1.79    1.97

TT = S6 (4,126,848 bytes)
File  English  French  Turkish  German  Spanish
1     2.56     2.29    2.29     1.91    2.38
2     2.23     2.46    4.02     2.86    2.56
3     2.62     2.96    3.08     3.19    2.49
4     3.62     3.11    3.15     4.19    3.17
5     2.91     2.27    2.75     2.84    1.68
6     1.70     1.92    2.06     1.82    1.56
7     1.61     2.80    2.02     1.96    2.23

TT = S7 (216,267 bytes)
File  English  French  Turkish  German  Spanish
1     2.34     2.05    2.13     1.79    2.54
2     2.03     2.19    3.80     2.60    2.26
3     2.41     2.67    2.87     2.95    2.60
4     3.26     2.74    2.86     3.81    3.06
5     2.63     2.03    2.55     2.62    1.76
6     1.62     1.79    1.96     1.76    1.80
7     1.52     2.45    1.91     1.79    1.16
4 CONCLUSION AND FUTURE WORK

We have designed and implemented a scheme
incorporating the Prediction by Partial Matching
(PPM) model, and applied this scheme, which is both
a cryptographic and a statistical tool, to the language
identification problem. We implemented our
system on texts from five different languages:
English, French, Turkish, German and Spanish.
The results revealed that the cryptographic tool we
introduced achieves accuracy rates of 63.27%,
46.94%, 20.41%, 26.53% and 30.61% for English,
French, Turkish, German and Spanish, respectively.
Experiments with the scheme can be extended with
equally sized training and disputed texts
to eliminate the size effect. Implementations can further
be carried out with more languages belonging to
various language families. The performance of our
system can also be compared with that of existing
automatic language identification methods by
running the algorithms on the same data set.
REFERENCES
Batchelder, E.O., 1992. A learning experience: Training
an artificial neural network to discriminate languages.
Technical report.
Beesley, K.R., 1988. Language identifier: A computer
program for automatic natural-language identification
of on-line text. In Proceedings of the 29th annual
conference of the American Translators Association,
47-54.
Braum, J., Levkowitz, H., 1998. Automatic language
identification with perceptually guided training and
recurrent neural networks. In Int’l conf. on spoken
language processing (ICSLP 98), Sydney, Australia.
Canterbury Corpus, http://corpus.canterbury.ac.nz
Cavner, W.B., Trenkle, J.M., 1994. N-gram based text
categorization. In Proceedings of the 3rd annual
symposium on document analysis and information
retrieval, 261-269.
Celikel, E., 2004. The compression and modelling of
Turkish texts. PhD Thesis, Ege University,
International Computer Institute, Izmir/Turkey.
Cleary, J.G., Witten, I.H., 1984. Data compression using
adaptive coding and partial string matching. In IEEE
transactions on communications. 32(4), 396-402.
Ganesan, R., Sherman, A.T., 1988. Statistical techniques
for language recognition: An introduction and guide
for cryptanalysts. Cryptologia XVII:4, 321-366.
German Linguistic URL:
http://www.infoplease.com/ce6/society/A0858390.html
Ingle, N.C., 1991. A language identification table. In The
Incorporated Linguist, Vol. 15(4), 98-101.
Kulikowski, S., 1991. Using short words: A language
identification algorithm. Unpublished technical
report.
Moffat, A., 1990. Implementing the PPM data
compression scheme. In IEEE transactions on
communications, 38(11), 1917-1921.
Mustonen, S., 1965. Multiple discriminant analysis in
linguistic problems. In statistical methods in
linguistics, no: 4, Skriptor Fack, Stockholm.
Nelson, M., 1991. Arithmetic coding + statistical
modeling = data compression part 1 - arithmetic
coding. Dr. Dobb's Journal.
Newman, P., 1987. Foreign language identification: First
step in the translation process. In Proceedings of the
28th annual conference of the American Translators
Association, 509-516.
Nodeworks Encyclopedia URL:
http://pedia.nodeworks.com/P/PP/PPM/PPM_compres
sion_algorithm/
Preez, J., Weber, D., 1996. Automatic language
recognition using high-order HMMs. In International
conference on spoken language processing (ICSLP),
Sydney, Australia.
Rau, M.D., 1974. Language identification by statistical
analysis. Master's thesis, Naval Postgraduate School.
Sapir, E., 1921. Language: An introduction to the study of
speech.
Shannon, C.E., 1948. A mathematical theory of
communication. In Bell system technical journal. vol.
27, 623-656.
Teahan, W.J., 1998. Modeling English text. PhD Thesis,
Univ. of Waikato, NZ.
Teahan, W.J., 2000. Text classification and segmentation
using minimum cross-entropy. In proceedings of
RIAO’2000. Vol. 2, Paris, France, 943-961.
Witten, I., Moffat, A., Bell, T.C., 1999. Managing
Gigabytes: Compressing and Indexing Documents and
Images, 2nd ed., Morgan Kaufmann Pub., CA, USA.
Ziegler, D.V., 1991. The identification of languages using
linguistic recognition signals. PhD thesis, SUNY
Buffalo.