4 EXPERIMENTS 
To evaluate the performance of each of the 
techniques and their combinations, we conducted 
two types of experiments. The first experiment
contained two stages. In the first stage, we
re-implemented existing normalisation methods to
find the best cleaning technique for each problem
(misspelled words, abbreviations, and repeated
characters). In the second stage, we combined the
techniques that address each problem in different
orders. This experiment was intended to help us
find the best combination of techniques that could
solve all problems. The second experiment used the
same tweet dataset to show that the best
combination of normalisation techniques found in
the first experiment is better than the baseline
model, in terms of accuracy and time efficiency, at
normalising a noisy tweet into a clean and readable
sentence. All techniques were implemented in Python
using the NLTK framework. The baseline model used
is Text Cleanser by Gouws et al. (2011).
 
Dataset.  The experiments used a dataset of 1200 
tweets containing messages from popular celebrities 
in the entertainment area and the replies from their 
fans. The dataset contains 489 abbreviations, 152 
words with repeated characters, and 375 misspelled 
words.  
For our evaluation setup, we formed the datasets 
for each of our tests by manually normalising those 
1200 tweets and creating four reference datasets. In 
the first reference dataset, we corrected all the 
abbreviations from the original tweets. For instance, 
if the original tweet was “That viedo is fuuunnnnyy 
LOL”, in the first reference dataset (Ref_AB) the 
tweet became “That viedo is fuuunnnnyy laugh out 
loud.” In the second dataset (Ref_RC), we corrected 
only the repeated characters. Thus, the tweet became 
“That viedo is funny LOL”. In the third dataset 
(Ref_MSW), we corrected only the misspelled 
words, i.e. “That video is fuuunnnnyy LOL”. In the 
last dataset (Ref_All), we corrected all of those 
cases, i.e. “That video is funny laugh out loud.” To
sum up, the first three reference datasets were used
to evaluate the individual techniques that address
each problem, and the fourth reference dataset was
used to evaluate the combined models and to compare
our best combination against the baseline model.
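The pairing of reference datasets with systems can be sketched in Python as follows. This is a minimal illustration built only from the example tweet above. The dictionary layout, the task labels and the helper name `reference_for` are assumptions made for this sketch, not part of the original setup.

```python
# Illustrative only: the four manually normalised references for the example
# tweet from the text. Layout, task labels and helper name are assumptions.
ORIGINAL = "That viedo is fuuunnnnyy LOL"

REFERENCES = {
    "Ref_AB": "That viedo is fuuunnnnyy laugh out loud",  # abbreviations only
    "Ref_RC": "That viedo is funny LOL",                  # repeated characters only
    "Ref_MSW": "That video is fuuunnnnyy LOL",            # misspelled words only
    "Ref_All": "That video is funny laugh out loud",      # all three problems
}

# Which reference a system output is scored against (hypothetical task labels).
REFERENCE_FOR_TASK = {
    "abbreviations": "Ref_AB",
    "repeated_characters": "Ref_RC",
    "misspelled_words": "Ref_MSW",
    "combined": "Ref_All",
}


def reference_for(task):
    """Return the reference sentence used to score a system built for `task`."""
    return REFERENCES[REFERENCE_FOR_TASK[task]]
```

Under this layout, a technique that only expands abbreviations is scored against Ref_AB, so it is not penalised for leaving “viedo” or “fuuunnnnyy” untouched.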
 
Evaluation Metrics. BLEU and WER are widely used
to measure the accuracy of a normalisation method.
We use the iBLEU tool developed by Madnani (2011)
and the WER evaluator of Gouws et al. (2011). The
efficiency of a technique is measured by the time
it takes to perform the data cleaning procedure.
Furthermore, a paired t-test was used to examine
whether the difference in performance between
techniques is statistically significant at the 95%
confidence level.
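To make two of these measures concrete, the sketch below computes a word error rate and a paired t statistic in plain Python. It is a generic illustration, not the iBLEU tool or the WER evaluator cited above. It assumes that WER is the word-level edit distance divided by the reference length, and that the paired samples are per-tweet scores from two techniques. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples, e.g. per-tweet scores of two techniques."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))
```

For a test at the 95% confidence level, the absolute value of the statistic would then be compared with the two-sided critical value of the t distribution with n - 1 degrees of freedom. That value is not computed here and would come from a statistics table or library.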
4.1  Results from Individual Methods
To find the best technique, the techniques that
address the same noise problem must be compared on
the same tweet dataset. The results of the first
experiment are presented below, grouped by the type
of noise problem being addressed.
4.1.1 Detecting Abbreviations 
Two techniques for normalising abbreviations are
compared: DAB1 and DAB2. Like DAB2, DAB1 expands
abbreviations by performing a dictionary look-up.
However, DAB1 does not convert each word to
lowercase prior to the look-up. Ref_AB is used as
the reference dataset in the BLEU and WER score
calculations.
Both techniques achieve a BLEU score above 90% and
a WER below 4%, and take only 30 seconds to run.
However, neither can resolve abbreviations that
require sentence-level context, which is outside
the scope of this research. For example, “ur” in “I
love that ur in” is currently resolved to “your”
instead of “you are”. Although our abbreviation
dictionary defines “ur” with two separate meanings,
neither technique can select the right meaning for
the given sentence. There is no significant
difference in performance between DAB1 and DAB2,
but DAB1 yields lower accuracy because it cannot
detect an abbreviation that contains an upper-case
letter (e.g. “Wld”), as our dictionary only holds
reference abbreviations in lower case (e.g.
“luv”).
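The difference between the two look-ups, and the ambiguity of “ur”, can be illustrated with a minimal Python sketch. The dictionary entries and their expansions are assumptions chosen for illustration, as is the rule of taking the first listed meaning when no context is available. The function names simply mirror the labels DAB1 and DAB2 and are not the original implementation.

```python
# Hypothetical abbreviation dictionary with lower-case keys only, as described
# in the text. The expansions listed here are assumptions for illustration.
ABBREVIATIONS = {
    "luv": ["love"],
    "wld": ["would"],
    "ur": ["your", "you are"],  # two meanings; context is needed to pick one
}


def expand(tweet, lowercase_first):
    """Expand abbreviations token by token via dictionary look-up."""
    out = []
    for token in tweet.split():
        key = token.lower() if lowercase_first else token
        meanings = ABBREVIATIONS.get(key)
        # Assumption: with no sentence-level context, take the first meaning.
        out.append(meanings[0] if meanings else token)
    return " ".join(out)


def dab1(tweet):
    """Look-up on each token as written (no lower-casing)."""
    return expand(tweet, lowercase_first=False)


def dab2(tweet):
    """Look-up after converting each token to lower case."""
    return expand(tweet, lowercase_first=True)
```

In this sketch, `dab1("Wld luv that")` leaves “Wld” unchanged because the key “Wld” is not in the dictionary, whereas `dab2` finds it under “wld”. Both variants expand “ur” to the first listed meaning regardless of the sentence.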
4.1.2  Removing Repeated Characters 
Three variants of Perkins’ (2014) technique for
removing repeated characters are evaluated: RRC1,
RRC2 and RRC3. RRC2 is Perkins’ (2014) original
approach, where repeated letters in a word are
removed one letter at a time. Each time a letter is
deleted, the system performs a WordNet lookup.
RRC2 stops removing repeated characters once
WordNet recognises the word. RRC3 uses the Enchant
dictionary instead of WordNet. In