4 EXPERIMENTS
To evaluate the performance of each technique and their combinations, we conducted two types of experiments. The first experiment contained two stages. In the first stage we re-implemented existing normalisation methods to find the best cleaning technique for each problem (misspelled words, abbreviations, and repeated characters). In the second stage we combined the techniques used to address each problem in different orders, in order to find the combination of techniques that best solves all of the problems. The second experiment used the same tweet dataset to show that the best combination of normalisation techniques found in the first experiment outperforms the baseline model, in terms of accuracy and time efficiency, at normalising a noisy tweet into a clean and readable sentence. All techniques were implemented in Python using the NLTK framework. The baseline model is the Text Cleanser of Gouws et al. (2011).
Dataset. The experiments used a dataset of 1200
tweets containing messages from popular celebrities
in the entertainment area and the replies from their
fans. The dataset contains 489 abbreviations, 152
words with repeated characters, and 375 misspelled
words.
For our evaluation setup, we formed the datasets
for each of our tests by manually normalising those
1200 tweets and creating four reference datasets. In
the first reference dataset, we corrected all the
abbreviations from the original tweets. For instance,
if the original tweet was “That viedo is fuuunnnnyy
LOL”, in the first reference dataset (Ref_AB) the
tweet became “That viedo is fuuunnnnyy laugh out
loud.” In the second dataset (Ref_RC), we corrected
only the repeated characters. Thus, the tweet became
“That viedo is funny LOL”. In the third dataset
(Ref_MSW), we corrected only the misspelled
words, i.e. “That video is fuuunnnnyy LOL”. In the
last dataset (Ref_All), we corrected all of those
cases, i.e. “That video is funny laugh out loud.” To
sum up, the first three reference datasets were used to evaluate the techniques addressing each individual problem, and the fourth reference dataset was used to evaluate the combined models and to compare our best combination model against the baseline model.
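For illustration only, the mapping from the example tweet to its four manual references can be summarised as follows (a minimal sketch; the variable names are ours and are not part of the evaluation code):

original = "That viedo is fuuunnnnyy LOL"
references = {
    "Ref_AB":  "That viedo is fuuunnnnyy laugh out loud",  # abbreviations expanded only
    "Ref_RC":  "That viedo is funny LOL",                   # repeated characters removed only
    "Ref_MSW": "That video is fuuunnnnyy LOL",              # misspelled word corrected only
    "Ref_All": "That video is funny laugh out loud",        # all three corrections applied
}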
Evaluation Metrics. BLEU and WER are widely used metrics for measuring the accuracy of a normalisation method. We use the iBLEU tool developed by Madnani (2011) and the WER evaluator of Gouws et al. (2011). The efficiency of a technique is evaluated by the time it requires to perform the data cleaning procedure. Furthermore, a paired t-test was used to examine whether there is a statistically significant difference between the performances of the techniques at the 95% confidence level.
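The snippet below is only a minimal sketch of how the two accuracy metrics behave when a normalised tweet is scored against its manual reference; it uses NLTK's sentence-level BLEU and a token-level edit distance for WER purely for illustration, whereas the reported scores come from the iBLEU and WER tools cited above.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def wer(reference, hypothesis):
    # Word error rate: token-level edit distance divided by reference length.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "That video is funny laugh out loud"
hypothesis = "That video is funny LOL"   # output of a normaliser that misses "LOL"
bleu = sentence_bleu([reference.split()], hypothesis.split(),
                     smoothing_function=SmoothingFunction().method1)
print(bleu, wer(reference, hypothesis))
# Per-tweet scores of two techniques can then be compared with a paired t-test,
# e.g. scipy.stats.ttest_rel, at the 95% confidence level.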
4.1 Results from Individual Methods
To find the best technique for each problem, the techniques that address the same noise problem are compared on the same tweet dataset. The results of the first experiment are presented according to the type of noise problem they address and are explained in detail as follows.
4.1.1 Detecting Abbreviations
Two techniques for normalising abbreviations are compared: DAB1 and DAB2. Like DAB2, DAB1 expands abbreviations by performing a dictionary look-up; however, DAB1 does not convert each word to lowercase prior to the look-up. Ref_AB is used as the reference dataset in the BLEU and WER score calculations.
Both techniques achieve BLEU scores above 90% and WER values below 4%, and take only about 30 seconds. However, neither is able to resolve abbreviations that require sentence-level context, which is out of the scope of this research. For example, “ur” in “I love that ur in” is currently resolved to “your” instead of “you are”. Although our abbreviation dictionary defines both meanings of “ur”, neither technique can select the right meaning for the given sentence. There is no significant difference in performance between DAB1 and DAB2, but DAB1 yields lower accuracy because it cannot detect abbreviations that contain an upper-case letter (e.g. “Wld”), as our dictionary only stores abbreviations in lower case (e.g. “luv”).
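A minimal sketch of this dictionary look-up, using a tiny hypothetical sample of our abbreviation dictionary, illustrates why the lowercasing step in DAB2 matters:

# Hypothetical sample; the real dictionary holds many more lower-case entries.
ABBREVIATIONS = {"lol": "laugh out loud", "wld": "would", "luv": "love", "ur": "your"}

def expand_dab1(tweet):
    # DAB1: look up each token as typed, so upper-case forms such as "Wld" are missed.
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tweet.split())

def expand_dab2(tweet):
    # DAB2: lowercase each token before the look-up, so "Wld" and "LOL" are expanded.
    return " ".join(ABBREVIATIONS.get(tok.lower(), tok) for tok in tweet.split())

print(expand_dab1("Wld luv to see that viedo LOL"))
# -> Wld love to see that viedo LOL
print(expand_dab2("Wld luv to see that viedo LOL"))
# -> would love to see that viedo laugh out loud

Note that neither variant touches the misspelled "viedo"; misspellings are handled by a separate technique.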
4.1.2 Removing Repeated Characters
Three variants of Perkins' (2014) technique for removing repeated characters are evaluated: RRC1, RRC2 and RRC3. RRC2 is Perkins' (2014) original approach, in which repeated letters in a word are removed one letter at a time. Each time a letter is deleted, the system performs a WordNet look-up, and RRC2 stops removing repeated characters once WordNet recognises the word. Instead of WordNet, the Enchant dictionary is used in RRC3. In