POS Tagging, Sentiment Analysis, Spell Checking,
and Text Classification. These subclass nodes also
have more subclass child nodes. For instance, there
are four subclass child nodes of Text Classification,
including Logistic Regression, Naïve Bayes, Support
Vector Machine, and MaxEnt, all of which are com-
mon text classifiers. This ontology-based approach
combines the strengths of the two previous ap-
proaches (i.e., it is inexpensive and interactive): learn-
ers can download the Protégé software (Protege,
2020) for free and study NLP with its interactive tool.
However, the scope covered by this tool is very broad,
spanning 16 study areas of NLP, and it only illustrates
the main concepts and their sub-concepts without pro-
viding any in-depth explanations or descriptions,
which hardly helps new learners understand any one
of those topics. More importantly, most of the con-
cepts displayed by the ontology are NLP applications
rather than the core processing components of NLP,
such as text pre-processing, token building, and text
vectorization.
Evidently, none of the aforementioned approaches
offers all three important educational elements, i.e.,
being inexpensive, interactive, and in-focus, that ama-
teur learners need to master NLP. To mitigate the shortcom-
ings of the existing approaches, we propose the de-
velopment and implementation of a visual-based edu-
cational support platform for learning NLP Analytics
(VisNLP). Currently, there are two broad approaches
for the NLP analytical processes: (1) statistical-based
(Bengfort et al., 2018; Hastie et al., 2020; Lane et al.,
2019) and (2) neural network-based (Goldberg, 2017;
Kamath et al., 2020; Reese & Bhatia, 2018). The
former applies statistical techniques to process and
analyze text data; the latter uses deep neural networks
to conduct text mining and analytics. In this paper, we
mainly focus on
statistical-based methods using One-Hot Encoding,
Term Frequency–Inverse Document Frequency (TF-
IDF), and Word Probability approaches. Specifically,
we develop and implement a web-based, interactive
visual NLP learning platform that enables learners
to study the core processing components of statisti-
cal NLP analytics in sequence: (1) Text Preprocess-
ing (e.g., splitting sentences, spell checking, lower-
casing, converting numbers, and removing punctua-
tion); (2) Token Building (i.e., Bag of Words and
N-grams tokenization); (3) Text Vectorization (i.e.,
One-Hot Encoding, TF-IDF and Word Probability);
and (4) Text Similarity Dashboard (i.e., Heatmap Ta-
bles, Cosine Similarity Matrix, and Euclidean Dis-
tance Measurement). Using a variety of interactive vi-
sual diagrams with a practical example, novice learn-
ers can have a good grasp of the NLP process.
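The end-to-end flow sketched above (tokenize, vectorize with TF-IDF, then compare documents with cosine similarity) can be illustrated with a minimal, self-contained Python example. The toy corpus, the whitespace tokenizer, and the function names are illustrative placeholders, not VisNLP's own code:

```python
import math
from collections import Counter

# Toy corpus: two related job-ad-style documents and one unrelated one.
corpus = [
    "data scientist builds statistical models",
    "machine learning engineer builds models",
    "chef prepares food in the kitchen",
]

def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace.
    return text.lower().split()

docs = [tokenize(d) for d in corpus]
vocab = sorted({t for doc in docs for t in doc})

def tf_idf_vector(doc):
    # TF = term count / document length; IDF = log(N / document frequency).
    tf = Counter(doc)
    n = len(docs)
    vec = []
    for term in vocab:
        df = sum(1 for d in docs if term in d)
        idf = math.log(n / df) if df else 0.0
        vec.append((tf[term] / len(doc)) * idf)
    return vec

def cosine(u, v):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vectors = [tf_idf_vector(d) for d in docs]
# Documents 0 and 1 share terms ("builds", "models"), so their
# similarity exceeds that of the unrelated pair 0 and 2.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

The same pairwise scores, computed for every document pair, are what a heatmap-style similarity dashboard would render.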
The remainder of the paper is organized as fol-
lows. First, we describe our VisNLP framework to
show the process of statistical NLP analytics in Sec-
tion 2. In Section 3, we illustrate our implemented
web platform and use the classification of job posi-
tion advertisements as a pilot example to demonstrate
how novice learners can utilize our interactive plat-
form to understand and study statistical NLP analyt-
ics step by step and piece by piece. In Section 4, we
conclude and briefly outline our future work.
2 VisNLP FRAMEWORK
Fig. 1 (the platform's home page) shows the high-
level framework of our VisNLP, which consists of
five main modules:
Text Preprocessor, Token Manager, Text Vectorizer,
Text Similarity Dashboard, and Visual Web Inter-
face. Each module has sub-components that manip-
ulate and process texts.
2.1 Text Preprocessor
Text Preprocessor is composed of ten sub-modules:
Document Separator (DS), Sentence
Splitter (SS), Spelling Corrector (SC), Contraction
Expander (CE), Number Converter (NC), Punctuation
Remover (PR), Non-alphanumeric Remover (NR),
Stopword Remover (SR), Word Lemmatizer (WL)
and Lowercase Converter (LC). First, the Text Pre-
processor takes as input a text corpus, i.e., a collec-
tion of document files stored in the platform's data-
base, and the DS module separates the corpus into
individual documents. Each document is then sent to the SS mod-
ule, which splits the document into individual sen-
tences. The SC module spell-checks each sentence
and, after tokenizing it, replaces misspelled words
with their highest-probability corrections. The cor-
rected sentences are
then sent to the CE module, which expands the con-
tracted form of the words into a longer form, such as
"I'm" to "I am", "You're" to "You are", "It's" to "It
is", "S/He isn't" to "S/He is not", "They aren't" to
"They are not", and "We aren't" to "We are not", in
each sentence. Then, the NC module converts nu-
meric values into words, such as "5" to "Five", "11"
to "Eleven" and "3" to "Three".
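Three of the preprocessing steps just described, sentence splitting (SS), contraction expansion (CE), and number conversion (NC), can be sketched in a few lines of Python. The contraction table and number map below are small illustrative samples, not the platform's full resources, and the regex-based sentence splitter is a deliberately naive stand-in:

```python
import re

# Sample lookup tables; a real system would use far larger resources.
CONTRACTIONS = {"I'm": "I am", "You're": "You are", "It's": "It is",
                "isn't": "is not", "aren't": "are not"}
NUMBER_WORDS = {"3": "Three", "5": "Five", "11": "Eleven"}

def split_sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def expand_contractions(sentence):
    # Replace each contracted form with its expanded form.
    for short, full in CONTRACTIONS.items():
        sentence = sentence.replace(short, full)
    return sentence

def convert_numbers(sentence):
    # Substitute number words for numeric tokens, e.g. "5" -> "Five".
    return " ".join(NUMBER_WORDS.get(tok, tok) for tok in sentence.split())

text = "I'm hiring 5 engineers. It's a great team."
sentences = [convert_numbers(expand_contractions(s))
             for s in split_sentences(text)]
print(sentences)  # ['I am hiring Five engineers.', 'It is a great team.']
```

Each stage consumes and produces plain sentences, which is what lets the modules be chained in the fixed order the framework prescribes.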
The subsequent modules, PR, NR, and SR, re-
spectively remove punctuation marks (e.g., full stop
(.), comma (,), and colon (:)), non-alphanumeric
characters (e.g., #GoLangCode123!$! to GoLang-
Code123), and stopwords (e.g., "ourselves", "hers",
"between", "yourself", "but", and "again") from the
expanded sentences.
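The PR, NR, and SR steps can likewise be sketched with simple regular expressions. The stopword list below is a tiny sample for illustration; a real run would use a full list such as NLTK's, and the punctuation class is limited to common sentence marks:

```python
import re

# Sample stopword list; real systems use lists of a few hundred words.
STOPWORDS = {"ourselves", "hers", "between", "yourself", "but", "again",
             "the", "a", "in"}

def remove_punctuation(text):
    # PR: strip sentence punctuation such as '.', ',', and ':'.
    return re.sub(r"[.,:;!?]", "", text)

def remove_non_alphanumeric(token):
    # NR: keep only letters and digits, e.g. '#GoLangCode123!$!'
    # becomes 'GoLangCode123'.
    return re.sub(r"[^A-Za-z0-9]", "", token)

def remove_stopwords(tokens):
    # SR: drop tokens found in the stopword list.
    return [t for t in tokens if t.lower() not in STOPWORDS]

sentence = "Work between teams, but stay in #GoLangCode123!$! again:"
tokens = [remove_non_alphanumeric(t)
          for t in remove_punctuation(sentence).split()]
cleaned = remove_stopwords([t for t in tokens if t])
print(cleaned)
```

Running the three removers in this order leaves only the content-bearing tokens, which is the input the Token Manager stage expects.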
VisNLP: A Visual-based Educational Support Platform for Learning Statistical NLP Analytics