The messages collected have characteristics that
differ from formal conversation. Grammar rules and
punctuation are rarely respected, words are often
misspelt, and slang and abbreviations are used
frequently. This lack of standardisation makes
preprocessing more difficult, since the same word
can appear in several forms. For example, a user
may repeat a specific letter to imitate spoken
language, as in "hiii" (the correct word is "hi").
Words may also be typed without their accents,
either by mistake or to save time. Another common
scenario is the use of an abbreviation in place of
the word itself.
Punctuation is another problem. Punctuation marks
such as semicolons, exclamation marks, question
marks, full stops and commas are rarely used, and
even when they are, there is no guarantee that they
are used correctly. Moreover, these characters also
appear in emoticons, which are graphical
representations of feelings, such as ":-)" (happy),
":@" (angry) and ":-(" (unhappy).
All these particularities pose challenges for the
classification process. The solution adopted in
this paper was to identify the main problems and to
create a sequence of methods to standardise the
sentences. Some of these methods were proposed in
related works and others were created by the authors
of this work. Methods such as lowercasing, stripping
punctuation and stemming were also used by
(Morris, 2013), who proposed replacing emoticons and
proper names, although this paper differs in the
word chosen for the replacement. The methods for
removing accents and stopwords were also used by
(Leite, 2015).
The preprocessing techniques applied, as described
in Figure 1, are listed below; a simplified sketch
of the whole sequence is given after the list.
1. Lowercasing: every character is converted to lowercase.
2. Links removal.
3. Identification of laughs and replacement with a
single word that represents laughter. The first part
of this step is to identify a laugh. In Brazilian
Portuguese, laughter can be written in several ways,
such as "hahaha", "kkkkk" and "huehue". Once one of
these variants is identified, it is replaced by a
unique word that represents laughter, so that the
different forms of laughter are reduced to a single
form and the algorithm can more easily identify a
word that represents a laugh.
4. Exclusion of sequences of repeated characters.
The user may mistype and produce several identical
characters in sequence, or may do so on purpose to
imitate a way of speaking. In both cases, the
repeated characters are collapsed into a single one.
5. Emoticon replacement by a word that represents
the corresponding feeling, since it is easier for
the classifier to work only with words. For example,
":-)" is replaced by "happy".
6. Accents, punctuation marks and numbers
removal.
7. Replacement of nicknames, abbreviations and
pejorative words by an appropriate word. Some
nicknames and abbreviations are well known and
widely used in chats, so they are replaced by the
word they represent. In this way, whether the user
writes the abbreviation or the full word, it is
represented by a single word at the end of this
step. Pejorative words with several variations are
also replaced by a single word.
8. Stopword removal. Stopwords are words that are
not relevant for the classification process, such as
articles and prepositions.
9. Proper names removal. Proper names are also not
relevant for this analysis.
10. Removal of one-character words. Due to a typing
error, the user may have typed an isolated
character, which has no relevance for the
classification.
11. Stemming. In Brazilian Portuguese, suffixes are
added to the end of a word to generate another word,
called a "derived word". For example, consider the
word "Cachorro" (dog in Portuguese). We can also
have the words "Cachorrinho" (small dog),
"Cachorrão" (big dog) and "Cachorra" (female dog).
All these words have essentially the same meaning,
and for the classification we are only interested in
the "root" of the word. The words "Cachorro",
"Cachorrinho", "Cachorrão" and "Cachorra" are all
transformed into "Cachorr" (the root) at the end of
this step.
3.3 The Algorithm
The Naive Bayes algorithm is well known in
machine learning. It uses Bayes' theorem to
calculate the probability that an instance belongs
to a particular class given its attributes. It is
called "naive" because it assumes that the
attributes are independent, a naive premise. In text
document classification, the attributes are the
words of the document.
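In its usual formulation (the notation below is generic rather than taken from this paper), a document d composed of the words w_1, ..., w_n is assigned the class c that maximises

    P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c),

where the independence assumption allows the joint likelihood of the words to be written as the product of the individual word likelihoods P(w_i \mid c).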
The Naive Bayes classifier has two different
models. In the binary model, the document is
represented by a vector of binary attributes
indicating which words occur and which do not occur
in the document. The number of times the word