Limitations of Tokenizers for Building a Neuro-Symbolic Lexicon
Hilton Alers-Valentín, José D. Maldonado-Torres, J. Vega-Riveros
Tokenization is a critical preprocessing step in natural language processing (NLP) because it determines the units of text that will be analyzed. Conventional tokenization strategies, such as whitespace-based or frequency-based methods, often fail to preserve linguistically meaningful units, including multi-word expressions, phrasal verbs, and morphologically complex tokens. These failures cause downstream processing errors and hinder parsing performance. This paper examines contemporary tokenization approaches and their limitations in light of foundational concepts in morphology that are relevant for natural language parsing. We then describe the features required for the cognitive modeling of a human language lexicon and introduce a linguistically aware encoding pipeline. Finally, we present a preliminary assessment of this system and summarize its major points in the conclusions.
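To make the abstract's point about whitespace-based tokenization concrete, the following minimal sketch shows how a plain whitespace split breaks a phrasal verb and an idiom into separate tokens, and how a simple lookup pass could rejoin them. It is an illustration only and is not the encoding pipeline proposed in the paper: the names whitespace_tokenize, merge_multiword_units, and MULTIWORD_UNITS, as well as the tiny expression list, are hypothetical.

```python
# Illustrative sketch only; the expression list and function names are
# assumptions made for this example, not taken from the paper.
MULTIWORD_UNITS = {("give", "up"), ("kick", "the", "bucket")}


def whitespace_tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def merge_multiword_units(tokens, units=MULTIWORD_UNITS):
    """Greedily rejoin the longest listed multi-word unit at each position."""
    max_len = max(len(unit) for unit in units)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word units.
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in units:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No listed unit starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "They give up before they kick the bucket"
plain = whitespace_tokenize(sentence)
merged = merge_multiword_units(plain)
print(plain)
print(merged)
```

The whitespace split yields eight tokens, while the lookup pass yields five, keeping "give up" and "kick the bucket" as single units. This surface-form lookup would still miss inflected variants such as "gave up", which is one reason the abstract argues for morphologically informed treatment rather than string matching alone.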
Paper Citation
in Harvard Style
Alers-Valentín H., Maldonado-Torres J. and Vega-Riveros J. (2025). Limitations of Tokenizers for Building a Neuro-Symbolic Lexicon. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 1458-1464. DOI: 10.5220/0013386100003890