
(3) Named-Entity Recognition (NER) Accuracy: Measure the system's ability to preserve named entities as single tokens by testing against a dataset annotated for NER; a minimal sketch of this metric appears below.
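As a rough illustration of this metric, the sketch below scores a tokenizer by the fraction of gold entities it emits as single, unsplit tokens. The corpus format, the tokenize callable, and the example sentence are hypothetical placeholders, not the paper's evaluation setup.

def ner_accuracy(gold_corpus, tokenize):
    """Fraction of gold named entities preserved as single tokens."""
    preserved = total = 0
    for sentence, entities in gold_corpus:
        tokens = set(tokenize(sentence))
        for entity in entities:
            total += 1
            if entity in tokens:  # the entity survived as one token
                preserved += 1
    return preserved / total if total else 0.0

# A tokenizer that keeps "New York" whole scores 1.0 on this item.
corpus = [("She flew to New York yesterday.", ["New York"])]
print(ner_accuracy(corpus, lambda s: ["She", "flew", "to", "New York", "yesterday", "."]))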
7 CONCLUSIONS
In this paper, we have addressed critical limitations in conventional tokenization approaches that fail to preserve the integrity of linguistically meaningful units, such as multi-word expressions, phrasal verbs, and morphologically complex tokens. By analyzing foundational morphological concepts, contemporary tokenization strategies, and the requirements for modeling a human language lexicon, we proposed an encoding pipeline designed to bridge the gap between surface-level text processing and linguistically aware tokenization. The proposed pipeline incorporates pre-processing, named-entity recognition (NER), tokenization, part-of-speech (POS) tagging, morphological analysis, lemmatization, and word embeddings.
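A minimal structural sketch of this pipeline, with trivial placeholder stages standing in for the components just listed, is shown below; none of the function bodies reflect the paper's actual implementation.

def preprocess(text):
    # Placeholder normalization stage.
    return text.strip()

def recognize_entities(text):
    # Placeholder NER stage; a real component would detect spans in the text.
    return ["New York"]

def tokenize(text, entities):
    # Shield recognized entities so whitespace splitting keeps them whole.
    for ent in entities:
        text = text.replace(ent, ent.replace(" ", "_"))
    return [tok.replace("_", " ") for tok in text.split()]

def run_pipeline(text):
    text = preprocess(text)
    entities = recognize_entities(text)
    tokens = tokenize(text, entities)
    # The remaining stages (POS tagging, morphological analysis,
    # lemmatization, word embeddings) would annotate each token here.
    return tokens

print(run_pipeline("She moved to New York"))  # ['She', 'moved', 'to', 'New York']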
Preliminary testing demonstrated the need for this approach, as a comparison between a manually annotated corpus and the output of NLTK's Punkt tokenizer revealed that multi-word expressions, such as phrasal verbs, are a primary source of tokenization errors in existing systems. By preserving such expressions, the tokenizer shows promise in improving parsing performance and downstream natural language processing (NLP) tasks.
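The kind of mismatch this comparison surfaced can be reproduced with NLTK's default word tokenizer (which applies Punkt sentence segmentation before word splitting). The sentence and gold tokenization below are hypothetical examples rather than lines from the paper's corpus, and the Punkt model must be downloaded first (e.g., nltk.download('punkt')).

import nltk

sentence = "She gave up smoking last year."
gold = ["She", "gave up", "smoking", "last", "year", "."]  # phrasal verb kept whole

system = nltk.word_tokenize(sentence)
print(system)  # ['She', 'gave', 'up', 'smoking', 'last', 'year', '.']

# The phrasal verb "gave up" is split into two tokens, exactly the
# class of multi-word-expression error the manual comparison exposed.
print([g for g in gold if g not in system])  # ['gave up']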
ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation (NSF) under Grant Nos. 2219712 and 2219713. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.