(3) Named-Entity Recognition (NER) Accuracy:
Measure the system’s ability to preserve named
entities as single tokens by testing against a
dataset annotated for NER.
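As an illustration of how this measurement could be computed, the following is a minimal sketch, assuming the tokenizer output is available as a list of strings and the gold annotations as a list of entity strings. The function name `entity_preservation_rate` and the sample sentence are hypothetical and are not part of the proposed pipeline. The string match is a simplification: a fuller evaluation would compare character offsets against the annotated spans.

```python
def entity_preservation_rate(tokens, entities):
    """Return the fraction of gold entities that appear as exactly one token.

    Hypothetical helper for illustration only: it matches on surface strings
    and ignores where in the sentence each entity occurs.
    """
    if not entities:
        return 0.0
    token_set = set(tokens)
    preserved = sum(1 for entity in entities if entity in token_set)
    return preserved / len(entities)


# Illustrative gold annotation for one sentence (assumed data).
gold_entities = ["New York", "Google"]

# A whitespace-style tokenizer splits the multi-word entity.
split_tokens = ["She", "moved", "to", "New", "York", "to", "join", "Google", "."]

# An entity-aware tokenizer keeps it as a single token.
merged_tokens = ["She", "moved", "to", "New York", "to", "join", "Google", "."]

split_rate = entity_preservation_rate(split_tokens, gold_entities)
merged_rate = entity_preservation_rate(merged_tokens, gold_entities)
print(split_rate, merged_rate)
```

Averaging this rate over all sentences in an NER-annotated dataset would give one possible corpus-level score for this criterion.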
In this paper, we have addressed critical limitations in
conventional tokenization approaches that fail to
preserve the integrity of linguistically meaningful units,
such as multi-word expressions, phrasal verbs, and
morphologically complex tokens. By analyzing
foundational morphological concepts, contemporary
tokenization strategies, and the requirements for
modeling a human language lexicon, we proposed an
encoding pipeline designed to bridge the gap between
surface-level text processing and linguistically aware
tokenization. The proposed pipeline incorporates
pre-processing, named-entity recognition (NER),
tokenization, part-of-speech (POS) tagging,
morphological analysis, lemmatization, and word embeddings.
Preliminary testing demonstrated the need for this
approach, as a comparison between a manually
annotated corpus and the output of NLTK’s Punkt
tokenizer revealed that multi-word expressions, such as
phrasal verbs, are a primary source of tokenization
errors in existing systems. By preserving such
expressions, the tokenizer shows promise in
improving parsing performance and downstream natural
language processing (NLP) tasks.
This material is based upon work supported by the
National Science Foundation (NSF) under Grant Nos.
2219712 and 2219713. Any opinions, findings, and
conclusions or recommendations expressed in this
material are those of the authors and do not
necessarily reflect the views of the NSF.
