Authors: Phavanh Sosamphan 1; Veronica Liesaputra 1; Sira Yongchareon 2 and Mahsa Mohaghegh 1
Affiliations: 1 Unitec Institute of Technology, New Zealand; 2 AUT, New Zealand
Keyword(s): Text Mining, Social Media, Text Normalisation, Twitter, Statistical Language Models, Lexical Normalisation.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Mining Text and Semi-Structured Data; Pre-Processing and Post-Processing for Data Mining; Soft Computing; Symbolic Systems; Web Mining
Abstract:
One of the major challenges in the era of big data is how to 'clean' the vast amounts of data generated, particularly on micro-blogging websites such as Twitter. Twitter messages, called tweets, are commonly ill-formed, containing abbreviations, repeated characters, and misspelled words. These 'noisy tweets' require text normalisation techniques to detect such errors and convert the tweets into well-formed English sentences. Several techniques have been proposed to address these issues; however, each possesses limitations and therefore cannot achieve good overall results on its own. This paper evaluates individual existing statistical normalisation methods and their possible combinations in order to find the best combination for efficiently cleaning noisy tweets at the character level, i.e. tweets containing abbreviations, repeated letters, and misspelled words. Tested on our Twitter sample dataset, the best combination achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a Word Error Rate (WER) of 7%, both better than the baseline model.
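To make the task concrete, the following is a minimal sketch (not the paper's actual pipeline) of character-level normalisation for two of the noise types the abstract names, repeated letters and abbreviations, together with a standard WER computation via word-level edit distance. The abbreviation lexicon here is a tiny illustrative placeholder, not the dictionary used in the paper.

```python
import re

# Hypothetical abbreviation lexicon (illustrative only).
ABBREVIATIONS = {"u": "you", "gr8": "great", "pls": "please"}

def squash_repeats(word):
    # Collapse runs of three or more identical characters to one
    # (e.g. "soooo" -> "so"), a common heuristic for tweet noise.
    return re.sub(r"(.)\1{2,}", r"\1", word)

def normalise(tweet):
    # Per-token: lowercase, squash repeated letters, expand abbreviations.
    out = []
    for w in tweet.lower().split():
        w = squash_repeats(w)
        out.append(ABBREVIATIONS.get(w, w))
    return " ".join(out)

def wer(reference, hypothesis):
    # Word Error Rate: Levenshtein distance over word tokens,
    # divided by the reference length.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(normalise("pls be gr8 to u soooo"))  # -> "please be great to you so"
```

A real statistical approach would replace the fixed lexicon with language-model scoring over candidate corrections; this sketch only fixes the evaluation setup and the two rule-based steps.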