Authors:
Shorabuddin Syed 1; Adam Jackson Angel 2; Hafsa Bareen Syeda 3; Carole Franc Jennings 4; Joseph VanScoy 5; Mahanazuddin Syed 1; Melody Greer 1; Sudeepa Bhattacharyya 6; Shaymaa Al-Shukri 1; Meredith Zozus 7; Fred Prior 1 and Benjamin Tharian 8
Affiliations:
1 Department of Biomedical Informatics, University of Arkansas for Medical Sciences, U.S.A.
2 Department of Internal Medicine, Washington University, U.S.A.
3 Department of Neurology, University of Arkansas for Medical Sciences, U.S.A.
4 Department of Internal Medicine, Tulane University, U.S.A.
5 College of Medicine, University of Arkansas for Medical Sciences, U.S.A.
6 Department of Biological Sciences, Arkansas State University, U.S.A.
7 Department of Population Health Sciences, University of Texas Health Science Centre at San Antonio, U.S.A.
8 Division of Gastroenterology and Hepatology, University of Arkansas for Medical Sciences, U.S.A.
Keyword(s):
Colonoscopy, Taxonomy, Annotation, Natural Language Processing, Machine Learning, Clinical Corpus.
Abstract:
Colonoscopy plays a critical role in screening for colorectal carcinomas (CC). Unfortunately, the data related to this procedure are stored in disparate documents: colonoscopy, pathology, and radiology reports. The lack of integrated, standardized documentation impedes accurate reporting of quality metrics and clinical and translational research. Natural language processing (NLP) has been used as an alternative to manual data abstraction. The performance of machine learning (ML) based NLP solutions depends heavily on the accuracy of annotated corpora. The availability of large annotated corpora is limited by data privacy laws and by the cost and effort required to create them. In addition, the manual annotation process is error-prone, making the lack of quality annotated corpora the largest bottleneck in deploying ML solutions. The objective of this study is to identify clinical entities critical to colonoscopy quality and to build a high-quality annotated corpus using domain-specific taxonomies, following standardized annotation guidelines. The annotated corpus can be used to train ML models for a variety of downstream tasks.