$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$
The backward model works in the same way as
the forward model, but predicts the previous token
from a sequence of future tokens. These two models
are combined in ELMo, which maximizes the log
likelihood in both directions.
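Concretely, the backward model factorizes the sequence probability over future tokens, and the jointly maximized objective sums the log likelihoods of both directions (written here in the standard bidirectional language-model form):

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$$

$$\sum_{k=1}^{N} \Bigl( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_N) \Bigr)$$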
The F-Measure, which is derived from information retrieval, assesses the accuracy of pairwise relationship judgments and is also known as the pairwise F-Measure. The F-Measure is calculated as:
$$F_\beta = \frac{(1+\beta^2)\, P \cdot R}{\beta^2 \cdot P + R}$$
where P is precision, R is recall, and β is a weighting parameter that controls the relative importance of recall versus precision.
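For example, with equal weighting ($\beta = 1$), an illustrative precision of $P = 0.8$ and recall of $R = 0.6$ give

$$F_1 = \frac{2 \times 0.8 \times 0.6}{0.8 + 0.6} = \frac{0.96}{1.4} \approx 0.69.$$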
Prepositions, pronouns, and articles are commonly used stop words that do not contribute to clustering; hence, they are eliminated from the dataset texts. Following that, the pre-processed dataset is read as vectors containing numerical Term Frequency-Inverse Document Frequency (TF-IDF) values for each word (term) in the dataset. Term Frequency (TF) is the number of times a word (term) appears in a document, and Inverse Document Frequency (IDF) is the log of the ratio of the total number of documents in the dataset to the number of documents containing that word. The TF-IDF matrix is the product of these two measurements, TF and IDF, as shown below:
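$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t), \qquad \text{IDF}(t) = \log\frac{N}{n_t}$$

where $N$ is the total number of documents in the dataset and $n_t$ is the number of documents containing the term $t$.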
After that, the TF-IDF matrix is converted into a single frequency representation. TF-IDF indicates the importance of a document's terms; a word may appear more frequently in a document that is longer than in one that is shorter (Park et al., 2019).
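As a minimal sketch of this weighting step, assuming scikit-learn's TfidfVectorizer (the text does not name a specific toolkit, and the example documents are hypothetical):

```python
# Minimal sketch: stop-word removal plus TF-IDF weighting, as described above.
# scikit-learn is an assumed toolkit; the example documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "u r gr8 see u 2moro",            # noisy short text (hypothetical)
    "see you tomorrow at the game",   # canonical counterpart (hypothetical)
]

# The built-in 'english' list drops articles, pronouns, prepositions, etc.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
tfidf_matrix = vectorizer.fit_transform(docs)    # shape: (n_documents, n_terms)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())
```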
4 EXPERIMENTAL ANALYSIS
A short-text dataset for lexical normalization is available in the UCI Machine Learning Repository. It is an annotated version of a Twitter dataset, a combination of structured and unstructured data over a number of tweets (Gómez-Hidalgo et al., 2014). To test our word-level model, we construct embeddings using four word-embedding models: Word2Vec, FastText, ELMo, and BERT. The Bi-LNM word-level model takes the embedding vectors from all embedding models except BERT as input. As proposed in the BERT publication, the BERT embeddings are fed through a single feed-forward layer followed by a softmax classifier. The Word2Vec and FastText models were trained with an embedding dimension of 512 and a window size of 10. The ELMo model was also trained with an embedding dimension of 512.
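A minimal sketch of training Word2Vec and FastText with these hyperparameters, assuming the gensim library (the paper does not specify the toolkit) and a toy tokenized corpus:

```python
# Minimal sketch, assuming gensim; the corpus below is a tiny placeholder.
from gensim.models import FastText, Word2Vec

tokenized_tweets = [
    ["see", "you", "tomorrow"],
    ["u", "r", "gr8"],
]

# Embedding dimension 512 and window size 10, as stated in the text.
w2v = Word2Vec(sentences=tokenized_tweets, vector_size=512, window=10, min_count=1)
ft = FastText(sentences=tokenized_tweets, vector_size=512, window=10, min_count=1)

print(w2v.wv["tomorrow"].shape)  # (512,)
print(ft.wv["gr8"].shape)        # (512,)
```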
Word-level Lexical Normalization Evaluation
Process
The three types of embedding techniques are investigated and compared. For the dataset, any unstructured data can be copied as a corpus and saved as a ".txt" file. Following are the normalization steps (a condensed code sketch is given after the list):
Step 1 Importing the required libraries and initializing the dataset.
Step 2
- Pre-processing the data
- Substituting regular expressions
- Removing stop-words
Step 3 Assigning unique data values to the vocabulary.
Step 4 Implementing one-hot vector encoding to preprocess categorical features for the machine learning model.
Step 5 Assigning X and Y for training and testing, and then splitting them.
Step 6 Implementing word embedding.
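The sketch below condenses Steps 1-5 under the assumption that Python and scikit-learn are used; the file name, stop-word list, regular expression, and labels are illustrative placeholders, not the authors' implementation:

```python
# Condensed sketch of Steps 1-5; all names and values here are illustrative.
import re

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Step 1: initialize the dataset (file name is a placeholder)
with open("corpus.txt", encoding="utf-8") as f:
    lines = [line.strip().lower() for line in f if line.strip()]

# Step 2: regex substitution and stop-word removal (toy stop-word list)
stop_words = {"a", "an", "the", "he", "she", "it", "in", "on", "at", "of"}
tokens = [
    [w for w in re.sub(r"[^a-z0-9\s]", " ", line).split() if w not in stop_words]
    for line in lines
]

# Step 3: assign a unique integer to every vocabulary entry
vocab = {word: idx for idx, word in enumerate(sorted({w for t in tokens for w in t}))}

# Step 4: one-hot encode the word indices
encoder = OneHotEncoder(handle_unknown="ignore")
word_ids = [[vocab[w]] for t in tokens for w in t]
X = encoder.fit_transform(word_ids)

# Step 5: assign X and y, then split into training and test sets
# (y is a placeholder label vector; the real labels are the normalized forms)
y = [0] * X.shape[0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```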
5 OBSERVATIONS ON
WORD-LEVEL LEXICAL
NORMALIZATION
TECHNIQUE (WLNT)
Table 1 illustrates the results of the word-level model after applying the co-occurrence strategies to initialize the embeddings. We utilize Principal Component Analysis (PCA) to reduce the length of each embedding vector. There is not much difference in the results of the individual co-occurrence models.
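A sketch of this reduction step with scikit-learn's PCA; the target dimensionality of 128 is an assumed value, since the paper does not report it:

```python
# Illustrative sketch: shorten each embedding vector with PCA.
# n_components=128 is an assumption; the paper does not state the target size.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 512))   # placeholder: 1000 vectors of size 512

pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)         # shape: (1000, 128)
print(reduced.shape)
```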
In each iteration, the word-level model outperforms the character-level models. Despite not conducting character-level normalization like most state-of-the-art lexical normalization systems, the word-level model's capacity to use contextual information, paired with the strong word representations from ELMo, allows it to outperform the other models.
to outperform other models. Overall, it's obvious that,