datasets to be used mainly by Machine Learning
algorithms, in order to predict classes given a set of
examples. The goal was to build software that
allows a group of experts to label a set of
automatically extracted sentences with a given set of
classes. We described the extraction process as well as
the labeling process, which was designed to keep the
task simple and thereby obtain the largest dataset
possible.
Proof-of-concept results show that the process
was very fast and easy for the user in most cases.
The average labeling time suggests that labeling a
large dataset is feasible with a reasonably sized team
of experts.
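The feasibility argument above is simple arithmetic, and a minimal sketch can make it concrete. The function name, its parameters, and every figure below are hypothetical placeholders chosen for illustration; they are not measurements from this work.

```python
import math


def experts_needed(num_sentences, avg_seconds_per_label,
                   labels_per_sentence, hours_per_expert):
    """Smallest team size that fits the labeling budget (illustrative only)."""
    # Total labeling effort in seconds across all redundant labels.
    total_seconds = num_sentences * labels_per_sentence * avg_seconds_per_label
    # Time each expert is assumed to be able to contribute, in seconds.
    seconds_per_expert = hours_per_expert * 3600
    return math.ceil(total_seconds / seconds_per_expert)


# Hypothetical figures, NOT results reported in this paper.
team_size = experts_needed(num_sentences=100_000,
                           avg_seconds_per_label=10,
                           labels_per_sentence=2,
                           hours_per_expert=40)
print(team_size)
```

Substituting the measured average labeling time and the intended dataset size gives a first estimate of the team required.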
Some drawbacks were identified in the exercise,
mostly regarding the balance of the classes. We
discussed how this could be addressed using
information retrieval and relevance feedback, which
we leave as future work.
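One way relevance feedback could help surface candidates for an under-represented class is sketched below. It assumes a Rocchio-style query update over bag-of-words vectors, which is one standard relevance feedback technique and not necessarily the one that will be adopted. All function names, weights, and example sentences are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a sentence (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average term weights over a list of vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()} if vectors else {}


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward labeled positives and away from negatives.

    The default weights are commonly used values, assumed here.
    """
    pos, neg = centroid(relevant), centroid(non_relevant)
    updated = {}
    for t in set(query) | set(pos) | set(neg):
        w = (alpha * query.get(t, 0.0) + beta * pos.get(t, 0.0)
             - gamma * neg.get(t, 0.0))
        if w > 0:  # negative weights are dropped, as is customary
            updated[t] = w
    return updated


def rank(query, pool):
    """Order unlabeled sentences by similarity to the query."""
    return sorted(pool, key=lambda s: cosine(query, vectorize(s)), reverse=True)


# Hypothetical example: experts labeled one sentence as belonging to the
# rare class and one as not belonging to it.
query = rocchio(
    vectorize("fever"),
    relevant=[vectorize("patient reports high fever and chills")],
    non_relevant=[vectorize("no fever reported at follow up visit")],
)
pool = [
    "follow up visit scheduled next month",
    "chills and night sweats since monday",
]
ranked = rank(query, pool)
```

In this sketch the updated query ranks the sentence mentioning chills first, even though it does not contain the original query term, so sentences likely to belong to the rare class can be shown to the experts earlier.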
In addition, we are going to study the length of
the context presented to the user. Some of the experts
felt they needed more context than a single sentence to
make a proper classification, and expanding to the
paragraph level could make the labeling task even
easier.
Finally, the tool could be extended to other
annotation tasks, such as annotating named entities in
sentences or chunk detection to identify the structure
of the text. The entities that sponsor this project hope
to make the next version of LABAS-TS available to the
community.
ACKNOWLEDGEMENTS
This work is part of the project entitled “Sistema de
análisis de indicadores de adherencia a las guías de
práctica clínica”, funded by Hospital Universitario
San Ignacio and Pontificia Universidad Javeriana.