COZO - CONTENT ZONING FOR SPAM EMAILS

Claudine Brucks, Cynthia Wagner, Michael Hilker and Ralph Weires

University of Luxembourg, Campus Kirchberg, CSC, 6, Rue Richard Coudenhove-Kalergi, L-1359 Luxembourg

Keywords:

Content Zoning, Spam Email Detection, Text Analysis and Statistics, Human-Computer Interaction.

Abstract:

Spam is a growing problem for email as a communication medium. Spam is detected and removed by spam filters. However, spammers use increasingly intelligent and complex techniques, so novel approaches are required to enhance existing spam filters. One promising technique is Argumentative Zoning, which divides a text into parts, each of which carries a meaning. In this paper, we use this technique to divide an email into different zones and to evaluate whether the email is spam or not. We introduce the technique Content Zoning, the way we use it, the implementation, and our results.

1 INTRODUCTION

In the 21st century, with the growing importance of electronic messaging, the problem of spam emails, also called junk emails, has arisen. Many solutions already exist, and improvements for avoiding this kind of message are common. The most widely used techniques are spam filters (Androutsopoulos et al., 2000; Wittel and Wu, 2004; Rigoutsos and Huynh, 2004) and IDS (Intrusion Detection Systems) (Roesch, 1999; Hilker and Schommer, 2006). An IDS checks whether each network packet contains a certain pattern; if it does, the packet is removed in order to secure the network against intrusions. In contrast, spam filters are specialised in detecting spam. They use different techniques. White and black lists describe allowed and denied email addresses. Another approach is to define patterns used by spammers, or to analyse the header and content of an email in order to evaluate whether it is spam or not. In addition, language filters remove all emails written in a foreign language, and user-defined rules remove spam emails, too.

Unfortunately, spammers continuously introduce novel techniques in order to defeat spam filters and to hide spam emails. For example, spam emails can contain text hidden between complex HTML tags or replace characters, e.g. 0 by o, so that it is hard for normal spam filters to detect them. Another example is the recent trend in spam emails of using scrambled words, i.e. words that are recognisable by the human brain but not by normal spam filters, e.g. “online” becomes “onilne”. In the ongoing fight against spam emails, the need for new ways of detecting spam emails grows steadily.
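
The two obfuscation tricks named above, text split by HTML tags and scrambled words, can be illustrated with a minimal Python sketch. It is not part of CoZo; the function names `strip_tags` and `is_scrambled` are hypothetical, and the scrambling rule (same first and last letter, same inner letters in a different order) is an assumption about how such words are usually built.

```python
import re


def strip_tags(html: str) -> str:
    """Remove anything that looks like an HTML tag, so that words
    split by empty tags are joined again."""
    return re.sub(r"<[^>]*>", "", html)


def is_scrambled(word: str, target: str) -> bool:
    """Return True if `word` differs from `target` only in the order
    of its inner letters (first and last letter stay in place)."""
    w, t = word.lower(), target.lower()
    if w == t or len(w) != len(t) or len(w) < 4:
        return False
    return (
        w[0] == t[0]
        and w[-1] == t[-1]
        and sorted(w[1:-1]) == sorted(t[1:-1])
    )
```

A filter could, for instance, compare each token of the cleaned email text against a list of known spam keywords with `is_scrambled`.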

In this article, we analyse whether Content Zoning can increase the performance of spam filters. In Content Zoning, an email is divided into different zones and each zone has a meaning, e.g. a price zone describes the price, with currency, of an offered product and an offer zone describes the product. We introduce Content Zoning, implement a program for Content Zoning, and perform tests on two types of spam emails: financial and pharmaceutical spam. Thereafter, we conclude the paper with some of our next steps.

2 RELATED WORK

Major inspirations for this work can be found in (Teufel, 1999; Feltrim et al., 2005), which describe a technique called “Argumentative Zoning”. In that work, zones are used to describe scientific documents. The aim of that approach is to provide an intelligent library search tool that generates summaries of scientific articles and detects the intellectual ownership of documents, e.g. for better document management or for plagiarism detection.

Brucks C., Wagner C., Hilker M. and Weires R. (2007). COZO - CONTENT ZONING FOR SPAM EMAILS. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Internet Technology, pages 23-27. DOI: 10.5220/0001263400230027. SciTePress.

We apply similar principles to the analysis of spam emails, as a means of separating spam from non-spam emails. Part of this work is also described in (Brucks and Wagner, 2006), which deals with the manual analysis of up-to-date spam emails to check whether they contain significant patterns in the subject and content.

In this paper, we focus on a tool for analysing emails by their content structure in order to find common patterns in spam emails. By doing this, we want to gain information that can be used to realise automatic recognition of zones in further work. Information derived automatically from spam emails can then be used to help classify spam emails and separate them from normal emails.

The aim of the Content Zoning technique is to provide a useful function or add-on to normal spam filters for detecting patterns in the content of spam emails and for analysing whether spam emails have characteristic content structures. In this way, spam filters can achieve better results in spam detection, because the technical analysis is extended by content analysis.

3 CONTENT ZONING

Before going into detail about applying Content Zoning to spam emails, we first give some general information in this section about what Content Zoning actually is.

Generally, Content Zoning means dividing a given document into different parts. The separation is done according to the given structure of the document (e.g. different paragraphs) and especially according to the semantic meaning of the document parts. In the end, the zones are supposed to represent the various semantic units of the document.

Besides the plain division of a document into zones, additional analysis can also be performed to gain further information about the separate zones. Information that is rather simple to extract includes statistics about the size and layout of zones, but a more sophisticated analysis of their text content is also possible. The latter can lead to an extraction of the semantic content and purpose of a zone. This kind of information can be used for various purposes, such as comparing documents to each other (regarding their analysed zone structure and content). In the following, we refer to such information about the zones as zone variables.

4 APPLYING CONTENT ZONING TO SPAM EMAIL

By using Content Zoning, the email text can be divided into different regions, and it can be seen whether there is redundancy in the structure and in the content of spam emails. Analyses of emails suggest that spam emails share the same structure or zone ordering, so they are in general more similar to each other than to other emails, which have a completely different structure.

The test set of spam emails to which the process of “zoning” has been applied mainly belongs to two different categories: spam emails from the pharmaceutical domain (i.e. drug offerings) and spam emails from the financial domain (more precisely, stock spam emails). Financial emails are typically long emails in which the offer and the company are described in great detail. Pharmaceutical emails, on the other hand, can be described as emails with short content, direct offers and only a brief description. By analysing the characteristics of financial and pharmaceutical spam emails, it became obvious that emails from the pharmaceutical domain generally had fewer zones than financial spam emails (see also (Brucks and Wagner, 2006)).

Figure 1: Example of a zoned pharmaceutical email.

WEBIST 2007 - International Conference on Web Information Systems and Technologies

In the following, we give a more detailed description of how the zones for these two types of spam emails are defined and which zone variables are used. Figure 1 visualises a pharmaceutical email that has been zoned.

4.1 Deﬁnition of a Zone

A zone can be defined as a region in a document that is assigned to a specific piece of information in that document. For example, in a text document with a price offer, this price offer can be annotated by a tag; a logical name for that zone could be “price offer”. Of course, in a large document, zones can reoccur multiple times. Figure 1 illustrates an example of a pharmaceutical spam email in order to show what can be considered as a zone; each zone is represented by a colour.

The analysis of the content and structure of pharmaceutical and financial spam emails showed a kind of redundancy in content rubrics and structures. These observations led to the following concept: the most frequently observed recurring rubrics in the spam email data set are taken as the default zones.

Not all text of an email can be attributed to a zone, because spam emails often include irrelevant text. In that case, no zone is attributed to it. Once a set of zones has been defined, it can be used for describing the content of a spam email.
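
To make the notion of a zone concrete, the following minimal Python sketch models a zone as a labelled character range and computes the regions of an email that are left unzoned. The `Zone` class and the `unzoned_spans` function are hypothetical names chosen for illustration; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    label: str  # logical zone name, e.g. "Offer" or "Price"
    start: int  # index of the first character (inclusive)
    end: int    # index after the last character (exclusive)


def unzoned_spans(text_length: int, zones: list[Zone]) -> list[tuple[int, int]]:
    """Return the (start, end) character ranges not covered by any zone.

    Nested zones (e.g. a "Price" zone inside an "Offer" zone) are
    handled by tracking the furthest zone end seen so far.
    """
    gaps = []
    cursor = 0
    for zone in sorted(zones, key=lambda z: z.start):
        if zone.start > cursor:
            gaps.append((cursor, zone.start))
        cursor = max(cursor, zone.end)
    if cursor < text_length:
        gaps.append((cursor, text_length))
    return gaps
```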

Table 1 shows the most important zones we defined for the financial and pharmaceutical domains (this does not mean that both domains have an equal number of zones, as already stated above):

Table 1: Most frequently occurring zones.

Information               Offer
Additional Information    Product Description
Price                     Expected Price
Name of company           Company Description
Testimony                 Address
Mail Signature            Date
Symbol                    Greetings
Forward Looking Stmt.     News
Stock Volume              Link

The annotation tags have logical names, so that in most cases no supplementary information is necessary for understanding what a zone is about. Some examples are shown below to illustrate how a zone is defined.

• Information: The introductory information zone informs the reader what the email is about.

• Offer: The offer zone contains the offer itself, a promised product with its specific information.

• Price: In the price zone, the actual price of a promised product is found (this zone can be considered as a sub-zone of “Offer”).

• Link: This zone contains the web pages indicated in spam emails.

Defining the zones alone is not sufficient for effectively realising Content Zoning, because zones only indicate the position of a piece of information in a text but do not give information on the content of the text. For this, “zone variables” are introduced.

4.2 Deﬁnition of Zone Variables

A zone variable is defined as a parameter that describes specific information about a zone. By describing the zones with zone variables, more structured information can be extracted from the zone content. The zone variables are one of the most important factors in Content Zoning, because they contribute to the statistical evaluation and finally to the detection of similarities in spam emails. In this work, two kinds of variables have been defined:

Zone independent variables are defined as parameters which can be applied to every zone. Examples of zone independent variables are:

• position of the zone in the complete email

• length of the zone, expressed in number of characters

• number of words contained in the zone

• most frequently occurring word (not considering stopwords)

Zone dependent variables, on the other hand, are defined as parameters which are specific to a certain type of zone and hence can only be applied to zones of that type. Mostly, this kind of variable represents semantic parameters. The following list shows some examples of zone dependent variables:

• top-level domain for links (e.g. .com, .net, .lu)

• telephone numbers for the address zones

• value for price zones

• currency for price zones (€, £, $, etc.)
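
The variables listed above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the CoZo implementation: the function names, the tiny `STOPWORDS` set, the use of the character offset as "position", and the price pattern are all illustrative choices.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Illustrative stopword list only; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is"}

# Assumed price format: currency symbol followed by a number.
PRICE_RE = re.compile(r"([€£$])\s*(\d+(?:[.,]\d{1,2})?)")


def independent_variables(email_text: str, start: int, end: int) -> dict:
    """Zone independent variables for the zone email_text[start:end]."""
    zone_text = email_text[start:end]
    tokens = re.findall(r"[a-z]+", zone_text.lower())
    content = [t for t in tokens if t not in STOPWORDS]
    top = Counter(content).most_common(1)
    return {
        "position": start,                      # character offset in the email
        "length": len(zone_text),               # number of characters
        "word_count": len(zone_text.split()),   # whitespace-separated words
        "top_word": top[0][0] if top else None,
    }


def link_top_level_domain(url: str):
    """Zone dependent variable for link zones, e.g. '.com' or '.lu'."""
    host = urlparse(url).hostname
    return "." + host.rsplit(".", 1)[-1] if host and "." in host else None


def price_variables(zone_text: str):
    """Zone dependent variables for price zones: currency and value."""
    match = PRICE_RE.search(zone_text)
    if match is None:
        return None
    value = float(match.group(2).replace(",", "."))
    return {"currency": match.group(1), "value": value}
```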


5 IMPLEMENTATION

The implemented system - CoZo - integrates a graphical user interface (GUI) which allows users to perform Content Zoning on emails. The most commonly used email types (HTML, EML, etc.) are supported by CoZo. Information that is not relevant for Content Zoning, like the email header or HTML tags, can be removed automatically on user demand, so that essentially only the content part remains. We implemented CoZo in Delphi and Perl. The user can easily zone the content, and the results are stored in XML files for further analysis.

Figure 2: Screenshot of the CoZo Interface.

Figure 2 shows a screenshot of the CoZo application, in which an example of selecting the zones is given. After loading the email into the GUI, the different zones can be selected. To facilitate the zoning, 18 predefined zones have been implemented. However, if there is no suitable zone for a certain email region, the user can create their own zones or leave that region unzoned. Each zone is visualised using a colour, and CoZo calculates variables for each zone, e.g. the position of the zone in the email, the density of the zone, and the number of characters in the zone. The output - definition of zones, calculated variables, and a picture of the zoned email with colours - is also stored in files for further analysis, e.g. picture matching.
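
As an illustration of how zone definitions and calculated variables might be written to an XML file, consider the following minimal Python sketch. CoZo itself is implemented in Delphi and Perl, and its actual XML schema is not given here; the element and attribute names (`email`, `zone`, `variable`) and the function `zones_to_xml` are purely hypothetical.

```python
import xml.etree.ElementTree as ET


def zones_to_xml(email_id: str, zones: list[dict]) -> str:
    """Serialise zones and their variables to an XML string.

    Each zone is a dict with a "name" and a "variables" mapping.
    The schema used here is an assumption made for illustration.
    """
    root = ET.Element("email", {"id": email_id})
    for zone in zones:
        node = ET.SubElement(root, "zone", {"name": zone["name"]})
        for key, value in zone["variables"].items():
            ET.SubElement(node, "variable", {"name": key}).text = str(value)
    return ET.tostring(root, encoding="unicode")
```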

The focus of the implementation lies on an application to test the idea of Content Zoning. Therefore, we decided to use a manual zoning system in order to quickly zone many emails and compare them (proof-of-concept implementation). Furthermore, the application provides the output that we analyse in order to evaluate the Content Zoning approach for spam detection.

In order to use Content Zoning for real spam detection, it should be implemented as part of a spam filter. It should zone the email automatically, calculate the required variables, and evaluate the output to check whether the email is spam or not. Consequently, Content Zoning would be a technique for the evaluation of emails.

6 RESULTS

After finishing the implementation of CoZo, we tested the approach. We used a data set of emails captured from January 2005 to September 2005. The data set contains emails from various domains (e.g. credit offers, movie downloads, porn emails), but only emails from the pharmaceutical and financial domains were used for the tests of CoZo. For more details on the data set, we refer to (Brucks and Wagner, 2006).

For the tests, no supplementary zones had to be generated. This means that the 18 predefined zones mostly covered the email content. The zones of the emails from the data set were coloured as in Figure 1. After zoning, the statistics are calculated and stored in an XML file:

...

...

We obtained the following results:

• It was possible to zone the emails, and CoZo calculated the output without any problems.

• Pharmaceutical spam mostly has the same zone structure. This means that, after zoning, the order and the size of the zones are similar. Furthermore, the calculated variables of the zones are similar.

• Financial spam has a different zone architecture but always contains some significant zones, e.g. offer, information, and stock details.

• Financial, pharmaceutical, and normal advertising emails differ in the order and architecture of their zones.

These results suggest that Content Zoning is a technique that helps to classify emails as spam or not.

After zoning several emails, we obtained many pictures of zoned emails. These pictures were sorted by their average colour using ImageSorter (http://mmk.f4.fhtw-berlin.de/?page_id=40). As a result, the emails are clustered into normal non-spam advertising emails, pharmaceutical spam emails, and financial spam emails. Consequently, zoning emails and automatically ordering the resulting pictures may help to identify spam emails.
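
The idea behind sorting by average colour can be sketched without any imaging library: if each zone is painted in a fixed colour, the average colour of the picture is roughly the length-weighted mean of the zone colours. The following Python sketch illustrates this idea only; it does not reproduce how ImageSorter works internally, and the palette `ZONE_COLOURS`, the white background for unzoned text, and the assumption of non-overlapping zones are all hypothetical.

```python
# Hypothetical RGB palette; CoZo's real zone colours are not specified here.
ZONE_COLOURS = {
    "Offer": (255, 0, 0),
    "Price": (0, 0, 255),
    "Link": (0, 255, 0),
}
UNZONED = (255, 255, 255)  # assumed background colour for unzoned text


def average_colour(zones, text_length):
    """Length-weighted mean RGB colour of a zoned email.

    `zones` is a list of (label, length in characters) pairs and is
    assumed to contain non-overlapping zones.
    """
    if text_length <= 0:
        return UNZONED
    totals = [0.0, 0.0, 0.0]
    covered = 0
    for label, length in zones:
        colour = ZONE_COLOURS.get(label, UNZONED)
        for i in range(3):
            totals[i] += colour[i] * length
        covered += length
    rest = max(text_length - covered, 0)  # unzoned characters
    for i in range(3):
        totals[i] += UNZONED[i] * rest
    return tuple(t / text_length for t in totals)
```

Emails could then be ordered, for example, with `sorted(emails, key=...)` on this colour tuple, so that emails with a similar zone composition end up next to each other.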

7 FUTURE WORK AND CONCLUSION

This project and the implementation of CoZo are not finished. With the current status of the implementation, we tested the approach of Content Zoning. One of our next challenges is to adapt the application so that a semi-automatic recognition of the zones becomes possible. Semi-automatic means in this case that the most obvious zones are detected automatically, e.g. price, currency symbol, etc. This would simplify the zoning for the user, but the more complex zoning should still be done by the user in order to get more precise results. The difficulty of this function lies in detecting the zones correctly even when their content does not follow a regular format. An example is the “price” zone, where numbers are represented by characters, e.g. “0” replaced by “o”.
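
A semi-automatic detector for the "price" zone that tolerates the "0" replaced by "o" trick could look like the minimal Python sketch below. The pattern and the function `find_price_zones` are assumptions made for illustration; thousands separators and currency names written out as words are not handled.

```python
import re

# A currency symbol followed by digits, where the letters "o"/"O"
# may stand in for the digit "0" (assumed obfuscation pattern).
PRICE_CANDIDATE = re.compile(r"([€£$])\s*([0-9oO]+(?:[.,][0-9oO]{2})?)")


def find_price_zones(text: str) -> list[dict]:
    """Return candidate price zones with their position and decoded value."""
    zones = []
    for match in PRICE_CANDIDATE.finditer(text):
        digits = (
            match.group(2)
            .replace("o", "0")
            .replace("O", "0")
            .replace(",", ".")
        )
        zones.append(
            {
                "zone": "Price",
                "start": match.start(),
                "end": match.end(),
                "currency": match.group(1),
                "value": float(digits),
            }
        )
    return zones
```

Candidates found this way could be proposed to the user, who confirms or corrects them in the GUI.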

The next step is to implement automated zoning, so that the system calculates the zones and important variables without any input from users. Thereafter, the evaluation of the zoned emails must be automated, and the system can be added to an existing spam filter.

There are different possibilities for classifying an email into the various types of spam emails using Content Zoning. One approach would be to define a similarity metric based on the calculated zone statistics (e.g. zone ordering, zone sizes, etc.). Another approach is to use picture matching to compare two emails. To do this, the emails are, for example, coloured in order to denote the different zones. The actual comparison is of course not limited to basic (exact) picture matching but can also include more sophisticated image analysis techniques. By doing this, we can compare a new email with our existing spam email types (financial and pharmaceutical here) to decide whether or not the email is spam at all.
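
One possible form of such a similarity metric is sketched below in Python. It combines the similarity of the zone ordering with the similarity of the relative zone sizes. This is only one of many conceivable metrics, not a result of this work; the function names, the use of `difflib.SequenceMatcher` for the ordering, and the equal default weighting are assumptions.

```python
from difflib import SequenceMatcher


def relative_sizes(zones):
    """Share of the zoned text taken by each zone label.

    `zones` is a list of (label, length in characters) pairs.
    """
    total = sum(length for _, length in zones)
    shares = {}
    for label, length in zones:
        share = length / total if total else 0.0
        shares[label] = shares.get(label, 0.0) + share
    return shares


def zone_similarity(zones_a, zones_b, order_weight=0.5):
    """Similarity in [0, 1] based on zone ordering and zone sizes."""
    labels_a = [label for label, _ in zones_a]
    labels_b = [label for label, _ in zones_b]
    order = SequenceMatcher(None, labels_a, labels_b).ratio()

    sizes_a, sizes_b = relative_sizes(zones_a), relative_sizes(zones_b)
    labels = set(sizes_a) | set(sizes_b)
    diff = sum(abs(sizes_a.get(l, 0.0) - sizes_b.get(l, 0.0)) for l in labels)
    size = 1.0 - diff / 2.0  # diff is at most 2 for two share distributions

    return order_weight * order + (1.0 - order_weight) * size
```

A new email could then be compared with typical zone layouts of financial and pharmaceutical spam, and assigned to the type with the highest similarity if that value exceeds some threshold.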

To conclude, Content Zoning is a promising technique for spam detection and supports existing spam filters with additional information. The results show that the picture of the zoned email and the calculated variables contain indicators of whether an email is spam or not.

ACKNOWLEDGEMENTS

This project is part of the TRIAS (TRIAS, 2005) project of the Computer Science and Communication research unit of the University of Luxembourg. We want to thank the staff of the MINE team for their support and advice.

REFERENCES

Androutsopoulos, I., Koutsias, J., Chandrinos, K., Paliouras, G., and Spyropoulos, C. (2000). An evaluation of naive Bayesian anti-spam filtering.

Brucks, C. and Wagner, C. (2006). Spam analysis for network protection. TFE Thesis - University of Luxembourg.

Feltrim, V., Teufel, S., Nunes, G. G., and Aluísio, S. (2005). Argumentative Zoning applied to Critiquing Novices’ Scientific Abstracts. In Computing Attitude and Affect in Text: Theory and Applications, pages 233–245. Springer, Dordrecht, The Netherlands.

Hilker, M. and Schommer, C. (2006). SANA security analysis in internet traffic through artificial immune systems. Proceedings of the Trustworthy Software Workshop Saarbruecken, Germany.

Rigoutsos, I. and Huynh, T. (2004). Chung-kwei: a pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (spam). In Proc. of the Conference on Email and Anti-Spam (CEAS).

Roesch, M. (1999). SNORT - lightweight intrusion detection for networks. LISA, 13:229–238.

Teufel, S. (1999). Argumentative zoning: Information extraction from scientific text. PhD Thesis, University of Edinburgh, England.

TRIAS (2005). Logic of trust and reliability of information agents in science. CSC, University of Luxembourg, Project description, Link: http://wiki.uni.lu/mine/TRIAS.html.

Wittel, G. and Wu, S. (2004). On attacking statistical spam filters. In Proc. of the Conference on Email and Anti-Spam (CEAS).
