CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology

Shuxin Zhou, Hao Liu, Pritam Sen, Yehoshua Perl, Mahshad Koohi H. Dehkordi

2025

Abstract

In this paper, we present a novel algorithm designed to address the challenge of annotating electronic health record (EHR) text using an interface terminology dataset. Annotated text datasets are essential for the continued development of Large Language Models (LLMs). However, creating these datasets is labor-intensive and time-consuming, highlighting the urgent need for automated annotation methods. Our proposed method, the Cluster-Focused Combination (CFC) Algorithm, stores intermediate results to minimize the annotation loss of terminology-based annotators such as BioPortal’s mgrep, while achieving high coverage and significantly improving execution efficiency. We conduct a thorough evaluation of CFC on the benchmark dataset MIMIC-III, using the previously developed Cardiology Interface Terminology (CIT). Results show that CFC captured approximately 5,756 annotations missed by the baseline, BioPortal’s mgrep, while achieving a remarkable improvement in execution speed across datasets of different sizes. These findings demonstrate CFC’s scalability and robustness in processing large datasets, offering an efficient solution for EHR text annotation. This work contributes to the preparation of large, high-quality training datasets for Natural Language Processing (NLP) tasks in biomedical domains.

Paper Citation


in Harvard Style

Zhou S., Liu H., Sen P., Perl Y. and Dehkordi M. (2025). CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology. In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF; ISBN 978-989-758-731-3, SciTePress, pages 195-206. DOI: 10.5220/0013244500003911


in Bibtex Style

@conference{healthinf25,
author={Shuxin Zhou and Hao Liu and Pritam Sen and Yehoshua Perl and Mahshad Dehkordi},
title={CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology},
booktitle={Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF},
year={2025},
pages={195-206},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013244500003911},
isbn={978-989-758-731-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: HEALTHINF
TI - CFC Annotator: A Cluster-Focused Combination Algorithm for Annotating Electronic Health Records by Referencing Interface Terminology
SN - 978-989-758-731-3
AU - Zhou S.
AU - Liu H.
AU - Sen P.
AU - Perl Y.
AU - Dehkordi M.
PY - 2025
SP - 195
EP - 206
DO - 10.5220/0013244500003911
PB - SciTePress