CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology
Shuxin Zhou, Hao Liu, Pritam Sen, Yehoshua Perl, Mahshad Koohi H. Dehkordi
2025
Abstract
In this paper, we present a novel algorithm for annotating electronic health record (EHR) text using an interface terminology dataset. Annotated text datasets are essential for the continued development of Large Language Models (LLMs). However, creating these datasets is labor-intensive and time-consuming, which creates an urgent need for automated annotation methods. Our proposed method, the Cluster-Focused Combination (CFC) algorithm, stores intermediate results to minimize the annotation loss of terminology-based annotators such as BioPortal's mgrep, while achieving high coverage and significantly improving execution efficiency. We evaluate CFC on the benchmark dataset MIMIC-III, using the previously developed Cardiology Interface Terminology (CIT). Results show that CFC captured approximately 5,756 annotations missed by the baseline BioPortal (mgrep) annotator, while substantially improving execution speed across datasets of different sizes. These findings demonstrate CFC's scalability and robustness in processing large datasets, offering an efficient solution for EHR text annotation. This work contributes to the preparation of large, high-quality training datasets for Natural Language Processing (NLP) tasks in biomedical domains.
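The abstract does not spell out the steps of CFC, so the following is not the authors' algorithm. It is only a minimal Python sketch of the general idea of terminology-based annotation that stores intermediate results, under stated assumptions: a toy term set `TERMS` (hypothetical, not taken from CIT), greedy longest-match lookup, and per-sentence caching so that a sentence repeated across notes is annotated only once. The names `annotate_sentence` and `annotate_notes` are illustrative.

```python
from functools import lru_cache

# Hypothetical toy terminology for illustration; not taken from CIT.
TERMS = {"chest pain", "atrial fibrillation", "pain"}
MAX_LEN = max(len(term.split()) for term in TERMS)


@lru_cache(maxsize=None)
def annotate_sentence(sentence: str) -> tuple:
    """Greedy longest-match lookup of terminology terms in one sentence.

    Results are cached per distinct sentence, so repeated sentences
    are annotated only once.
    """
    tokens = sentence.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in TERMS:
                found.append(phrase)
                i += n
                break
        else:
            # No term starts at this token; move on.
            i += 1
    return tuple(found)


def annotate_notes(notes):
    """Split each note on periods (a simplification) and annotate each sentence."""
    return [
        [annotate_sentence(s.strip()) for s in note.split(".") if s.strip()]
        for note in notes
    ]
```

In this sketch, calling `annotate_notes` on two notes that share a sentence triggers only one lookup for the shared sentence; `annotate_sentence.cache_info()` reports the cache hits. A real annotator would need proper tokenization and sentence splitting, which are omitted here.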
Paper Citation
in Harvard Style
Zhou S., Liu H., Sen P., Perl Y. and Dehkordi M. (2025). CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology. In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF; ISBN 978-989-758-731-3, SciTePress, pages 195-206. DOI: 10.5220/0013244500003911
in Bibtex Style
@conference{healthinf25,
author={Shuxin Zhou and Hao Liu and Pritam Sen and Yehoshua Perl and Mahshad Dehkordi},
title={CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology},
booktitle={Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF},
year={2025},
pages={195-206},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013244500003911},
isbn={978-989-758-731-3},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF
TI - CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology
SN - 978-989-758-731-3
AU - Zhou S.
AU - Liu H.
AU - Sen P.
AU - Perl Y.
AU - Dehkordi M.
PY - 2025
SP - 195
EP - 206
DO - 10.5220/0013244500003911
PB - SciTePress