One of the main advantages of the EMTE approach is its
flexibility and maintainability. The dictionaries can be
updated at any time without retraining the models on new
medical terms. In addition, EMTE can be used as a document
quality enhancer, as it can unify the writing styles of
negations and replace abbreviations with their full terms.
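As a minimal sketch of this dictionary-driven design, the following Python snippet unifies a few negation phrasings and expands a few abbreviations using whole-word pattern matching. The dictionary contents, the function names (`compile_dictionary`, `apply_dictionary`, `enhance`) and the sample notes are hypothetical and chosen for illustration only; they are not the actual EMTE rules or resources.

```python
import re

# Hypothetical dictionaries for illustration; not the ones used by EMTE.
ABBREVIATIONS = {
    "sob": "shortness of breath",
    "htn": "hypertension",
    "dm": "diabetes mellitus",
}
NEGATION_VARIANTS = {
    "denies": "no",
    "without": "no",
    "negative for": "no",
}


def compile_dictionary(mapping):
    """Build one case-insensitive, whole-word pattern from the dictionary keys."""
    # Longest keys first so multi-word entries win over shorter overlapping ones.
    keys = sorted(mapping, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def apply_dictionary(text, mapping):
    """Replace every dictionary key found in text with its mapped value."""
    pattern = compile_dictionary(mapping)
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)


def enhance(text):
    """Unify negation wording first, then expand abbreviations."""
    text = apply_dictionary(text, NEGATION_VARIANTS)
    return apply_dictionary(text, ABBREVIATIONS)


note = "Pt denies SOB, negative for HTN, known DM."
print(enhance(note))

# Updating a dictionary needs no model retraining: add the entry and rerun.
ABBREVIATIONS["cp"] = "chest pain"
print(enhance("Without CP."))
```

Because the patterns are rebuilt from the dictionaries on each call, a newly added entry takes effect immediately. This is the property the paragraph above refers to as maintainability, shown here in a simplified form.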
7 CONCLUSION AND FUTURE WORK
This paper presented a cleansing approach that improves the
quality of medical term extraction from unstructured clinical
data using pattern matching rules based on dictionaries. The
solution was conceived with flexibility and maintainability in
mind for industrial use. The experiments showed that our
approach helps solve the ICD-10 prediction problem by improving
the quality of the data fed to the DNNs. As a result, the
performance of the trained models improved according to various
metrics. The proposed approach also reduced the resources
required to train the models and decreased the training time by
accelerating the convergence of the models.
In future work, we aim to further improve the quality of the
medical data by tackling several challenges, such as medical
term synonyms, abbreviation detection with additional features
(e.g., body site, gender, and age), and medical investigation
results (laboratory and radiology) in CCs.
ACKNOWLEDGEMENTS
All computations were performed on the Mésocentre of
Franche-Comté, France, and the medical data was acquired from
the Specialized Medical Center Hospital in Riyadh, KSA.
ICAART 2023 - 15th International Conference on Agents and Artificial Intelligence