IMPROVING N-GRAM LINGUISTIC STEGANOGRAPHY

BASED ON TEMPLATES

Alfonso Muñoz, Justo Carracedo Gallardo

Universidad Politécnica de Madrid, Escuela de Ingeniería Técnica de Telecomunicación

Departamento de Ingeniería y Arquitecturas Telemáticas, Carretera de Valencia Km. 7, 28031 Madrid, Spain

Irina Argüelles Álvarez

Universidad Politécnica de Madrid, Escuela de Ingeniería Técnica de Telecomunicación

Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología

Carretera de Valencia Km. 7, 28031 Madrid, Spain

Keywords: Linguistic Steganography, Automatic generation of stegotext, Stelin.

Abstract: The automatic generation of stegotexts is an area of research within linguistic steganography with a great

interest. Different proposals have been published about the development of stegotexts with statistical and

linguistic validity in natural language but, unfortunately, the stegotexts generated following some of these

procedures suffer from different types of lexical, syntactic, semantic problems or problems of global

coherence that would eventually lead the automatically generated texts to be detected by a human reader.

This article presents an improved implementation of an algorithm of N-Gram statistic imitation and

introduces a new concept of manual edition of the stegotexts generated, based on the idea of templates. This

procedure permits the creation of high quality stegotexts to hide a small quantity of information, hundreds

of bits, useful to distribute a symmetrical key, a short message, an url, etc. The procedure developed in the

article can be applied to different languages, but here, its usefulness is demonstrated for Spanish. The Stelin

tool is presented free at http://stelin.sourceforge.net.

1 IMPROVING N-GRAM

LINGUISTIC

STEGANOGRAPHY

BASED ON TEMPLATES

Using a N-Gram model (the statistics that guides the

relation between words and collocations in a

language) to conceal information can be very

productive today. In the following subsections, a

series of improvements applied to Wayner’s model,

mimic function, (Bergmair 2007) are commented and

the concept of manual annotation is introduced. As

an example, fragments of stegotexts in Spanish are

included. Spanish language was chosen because it is

a language spoken by more than 400 million people

in the Internet and because it is the native language

of the authors. A similar reasoning can be followed

for other languages as for example, English.

1.1 Improving Lexical and

Grammatical Quality of the

Generated Stegotext

The tests carried out indicate that it is more useful to

measure the statistics of every "word" of the training

text instead of every "letter" (in Wayner's original

idea). This change implies a lexical and grammatical

improvement of the generated stegotexts quality, at

least in Spanish language. In the same way, the use

of the punctuation marks (.:; ¿?!, etc.) as individual

words, separating them if they are joined to another

word, not only influences the grammatical

improvement of the generated stegotext but it also

facilitates the concealment of more information. For

example, if we consider the phrase: "The sky is

cloudy.", it is more probable that possible following

words are written after the full stop rather than after

the word "cloudy.", and following Wayner's idea, if

the number of following words is higher, more

209

Muñoz A., Carracedo Gallardo J. and Argüelles Álvarez I. (2010).

IMPROVING N-GRAM LINGUISTIC STEGANOGRAPHY BASED ON TEMPLATES.

In Proceedings of the International Conference on Security and Cryptography, pages 209-212

DOI: 10.5220/0002918202090212

 SciTePress

information could be hidden. Having in mind these

principles the following algorithm is defined:

1. Create a root table with all the different

existing words in the training text. Every entry of the

table will connect to another table (level) that

contains the words that can follow it and this way

recursively up to an order N (Figure1). With this

information make a Huffman tree for each level.

2. A "word" is selected at random from the root

table. This selection allows for the same training text

to generate different stegotexts.

3. If the word does not have successors, it does

not point to another node, another term in the root

table is chosen. If the node only has a successor

word, this word is added to the stegotext and the

following available node is chosen. Information

cannot be concealed in this case. If the successor

node has several possible words among which to

choose, choose one with a binary code (path of the

Huffman’s tree) that coincides with the information

to be concealed. After, the following available node

is chosen.

4. If the last node is reached (order n=8, for

example, 8 consecutive words) the last word is

selected for the stegotext and back to step b). This

process is repeated until the stegotext that conceals

the information is generated.

5. The receiver needs to build the table of

frequencies of the selected training text to know the

bits that conceal every word of the received

stegotext.

Although this variant generates stegotexts of

larger size, it is easier to obtain texts with lexical and

syntactic validity (Figure 2). The tests carried out

indicate, at least in Spanish language, that in general,

an order 7 or higher provides lexical and syntactic

reasonable results. Also, following the results of

tests carried out, a higher order than 7 or 8

(depending on the training text) can be sufficient. In

fact, over this order the generated stegotext does not

improve substantially its linguistic appearance

whereas its size is substantially bigger.

The importance of the statistical imitation is that

on having imitated statistics, the linguistic validity

of the training text would be imitated. Depending on

the channel used to transmit it, the stegotext should

keep the statistical conditions and those linguistic

properties typical of the channel not to raise

suspicions.

Figure 1: Example of implemented algorithm in Stelin.

Mode=word.

1.2 Manual Annotation and Templates

The Stelin tool (Muñoz 2009) developed generates

automatically stegotexts in natural language based

on the statistical imitation (N-Gram) of one or more

training texts, known only by the transmitter and the

receiver. The transmitter and the receiver only need

to keep the training sources private and a

cryptographic key if they want to cipher the

information to be concealed; the rest of the system is

public.

Imitating the statistics of the words in the

training texts leads to stegotexts with better lexical

and grammatical quality. Nevertheless, depending

on the training source, the generated stegotexts still

might have less important grammatical mistakes as

for example question marks that are opened but are

not closed, as well as possible problems of global

coherence which is a main problem in the science of

linguistic steganography. An interesting idea, while

an algorithm that solves the above mentioned

problems is obtained, would be editing the

stegotexts generated after their generation to correct

those mistakes. In general, this procedure has the

fundamental problem of synchronization with the

receiver, that is to say, introducing changes in the

generated stegotext implies that the receiver must

know what changes have been made in order not to

have problems in the recovery of the secret

information. Up to what is known, solutions have

not been published in this respect, though there are

solutions that permit to improve the quality of the

stegotext in the process of generation (Bergmair

2007). The proposed idea consists of improving the

linguistic appearance of the generated stegotext a

posteriori without the need that the receiver has any

additional information about the changes. This idea

can be put into practice with the Stelin tool with a

series of considerations and limitations.

Let's think for a moment about the structure in

the example in Figure 1. A piece of information is

hidden by means of the selection of a word among

the available ones in the following level. That is to

say, if the word “La” (the) is selected, the following

SECRYPT 2010 - International Conference on Security and Cryptography

210

word might be “habitación” (room) (bit 0), "había”

(had) (bit 10), “limpió” (cleaned) (bit 11) depending

on the bits to be concealed. The receiver would

construct the frequency table as the transmitter and

would relate the words in the stegotext received with

the generated table. The receiver would know that

the words "La habitación", “La había” or “La

limpió” hidden one bit (0) or two (10-11)

respectively.

The receiver might operate not considering the

words that are read until one finds one of the

possible words at the level of the linked list where

we are. This small not published detail, permits the a

posteriori manual improvement of generated

stegotexts by the transmitter without involving the

receiver. The only condition is that the transmitter

can use any word that one cannot find in the

following level so that the receiver does not lose

synchrony. That is to say, in the previous example

among the words that conform the pairs "La

habitación”, “La había”, “La limpió”, one might use

any word that was not “habitación”, “había” or

“limpió”. This way, the transmitter can correct

possible grammatical mistakes and improve the

coherence of the text without any need that the

receiver has this information. For example, if we

conceal a small information of 126 bits, using an

order 9 and Antonio Machado's complete poetry as

training text we would obtain among the possible

stegotexts one like the following one:

[…] suena el rebato de la tarde en la arboleda!

Mientras el corazón pesado. El agua en sombra

pasaba tan melancólicamente, bajo los arcos del

puente al ímpetu del río sus pétreos tajamares; […]

Figure 2: Stegotext that conceals 126 bits. Order 9. Clave:

Alfonso. Training source Antonio Machado's complete

poetry.

As it can be observed, this stegotext has a series

of small grammatical mistakes (we leave aside

problems of global coherence), for example in the

fragment“la tarde en la arboleda! Mientras” (the

evening in the grove! While[…]), a word with a

capital letter that is not preceded by a full stop and a

question mark closed but not opened before. To

solve these problems the Stelin tool generates a

template with the possible words in every level, so

that the transmitter could select the words to be

added among the words of the stegotext, words that

will be rejected by the receiver. For example, we

select the fragment: “de la tarde en la arboleda!

Mientras el corazón” (of the evening in the grove!

While the heart"). Let's see an example of the

generated template:

[WORD:en][muerta][flota][.][bella][roja][en][,]

[sobre][arrebolada][y]

[WORD:la][sus][la]

[WORD:arboleda] [arboleda]

[WORD:!] [!]

[WORD:Mientras] [Mientras]

Bearing this in mind we edit the fragment.

Among the multiple possible options we choose the

following one:

“tarde en la dulce arboleda, ¡qué sensación!.

Mientras el corazón” (late in the sweet grove, what

a sensation!. While the heart)

This so simple way the quality of the generated

stegotext can be improved substantially, for

example, solving problems derived from punctuation

marks or others. The modification of the stegotexts

by this procedure needs a bit of practice. Thankfully

in the texts in natural language in Spanish (and in

other languages) the words have certain positions,

that is to say, it is more probable that certain words

appear after others (Zipf law) and it is more probable

that more words appear after some others. In

general, there would be few words after which it

would be difficult to choose a new word because

there are many options in the level of the

corresponding table, and many words after which

few options will exist so there will be easier to add

new words. For example, in texts in Spanish

language words as de, la que, el, en, y, a, los, etc,

are more probable, then it is more probable that

there are more words that can appear with these. If

we concentrate on the previous example “de la tarde

en la arboleda! Mientras el corazón” (of the evening

in the grove! While the heart ") it would turn out to

be trivial to add information before the article “el”

(the) but more difficult to find words after this word

and before the word “corazón” (heart). Bearing this

principles in mind can help to reduce the time and

the positions where it is better to work to correct the

possible grammatical mistakes. So it is possible to

automatically generate stegotexts, at least in Spanish

language, and to correct the mistakes manually

without involving the receiver and obtaining

stegotexts with a quality more than average.

Nowadays the concealment of information with

the Stelin tool has a capacity of approximately 1

WORD-2 BIT (it depends on the training text and on

the information to be concealed), this capacity of

concealment diminishes depending on the number of

words that are added by means of manual edition.

To conceal information over a hundred bits the

generated stegotexts will be of large size (it depends

on the size and on the "quality" of the training texts)

and therefore the manual edition will take a

IMPROVING N-GRAM LINGUISTIC STEGANOGRAPHY BASED ON TEMPLATES

211

considerable amount of time. Nevertheless, another

function of the tool is possible taking advantage of

the manual edition characteristics. The procedure

would consist in the generation of "bad quality"

stegotexts but with the quality of occupying little

space. Provided that we can modify the resultant

stegotext we might use a very low order (what

would produce a bad quality stegotext) to generate a

small size stegotext to improve it manually later. In

addition, it would turn out to be simpler, when the

generated stegotext had fewer words, to make a

substantial improvement of the global coherence

(Figure 3 and 4).

está bien visto bastante por la puerta del salón

del trono. El hombre no se podía dejarlas salir de la

mano, y la cabeza, la idea de si no me pareció tan

grande como si no me apetecía estar ni nada de los

que le hizo un gesto de asombro, Marla se

Figure 3: Created stegotext that conceals 126 bits. Order 3.

52 words. 2,4 bits/Word. Training source: Antonio

Machado's Complete Poetry.

está bien visto sus modales por bastante gente,

especialmente cuando luce sus encantos por la

puerta del salón del trono. El hombre no se

esconde, es más podía ser el amante perfecto

enamorando a las mujeres sin dejarlas decidir. Es

famosa la doncella que no quiso salir de la mano

con él, y orgullosa levantó la cabeza, la valiente

Marla. Una idea disparatada de asombrosa

valentía. Soy el narrador y si no lo escribo morirá

está hazaña, una hazaña que me pareció tan grande

como si un animal hablara. Es extraño, no me

apetecía estar callado ni guardar silencio, […]

Figure 4: Stegotext of the Figure 3 with manual annotation

and coherence. 128 words. 1 bit/word.

her modals are well seen by most people,

especially when she shows her charms through the

throne-room door. The man does not hide, what is

more, he could be the perfect lover, making women

fall in love without letting them decide. Famous is

the lady who did not want to walk hand in hand with

him and proud raised her head, brave Marla. An

absurd idea of amazing courage. I am the narrator

and if I do not write it, the heroic deed will die, an

heroic deed that I thought to be so incredible as if an

animal talked. It is strange, I did not feel like being

quiet or keep silent, [...]

Figure 5: Stegotext of the Figure 4 in English.

Approximate Translation.

As it can be observed, even generating bad

quality stegotexts (low order or as a result of a bad

selection of the training source) the use of templates

can help not only to correct grammatical mistakes

but also to provide coherence to the text. As it can

be seen, if the chosen order is high (7 or more) the

correction of the texts will be faster (as fewer

mistakes exist) though the generated stegotext will

be larger. On the contrary for low orders (for

example, N=3) the text will be smaller but the

correction will need of more time and possibly it

will be necessary to add more words.

2 CONCLUSIONS

Linguistic steganography is a science that in the last

years is arousing the interest of the scientific

community due to its enormous potential.

This article has presented the Stelin tool that

implements a series of improvements of a stegotext

generation automatic algorithm based on N-GRAM

statistical imitation. A new concept has been

introduced, that of stegotexts manual correction that

allows the correction of grammatical mistakes and

even to improve the coherence of the generated

stegotexts by means of the use of templates.

In the practice good quality stegotexts are

obtained but it is necessary to invest a lot of time (it

depends on the “writing skills” of every transmitter)

to conceal more than one hundred bits

(cryptographical keys, urls, mobilization messages,

GPS coordinates, etc) by means of manual edition

(more time for low order). In any case, the quality of

the stegotext can always be modified and perfected

by the transmitter. If the information to conceal is

important might be that the invested time was worth.

The transmitter and the receiver must share a

cryptographic key if they want to cipher the

information and one or more training texts. In the

case that the training texts were compromised, they

would always have the option to change without any

problem choosing new texts with which to work.

Today it is not clear if it is possible to obtain

secure automated procedures of linguistic

steganography that allow to conceal thousands of

bits. Meanwhile Stelin can be a good tool to obtain

stegotexts of a reasonable quality.

REFERENCES

Bergmair, R., 2007. A comprehensive Bibliography of

Linguistic Steganography. SPIE International

Conference on Security, Steganography, and

Watermarking of Multimedia Contents, volume 6505,

January 2007.

Muñoz, A., 2009. The Stelin Tool http://

stelin.sourceforge. net

SECRYPT 2010 - International Conference on Security and Cryptography

212