proposed, and it will be further investigated through
academic research in the future. This architecture will
enable medical data to be extracted from scanned documents,
standardized using FHIR/OpenEHR, and then stored in EHRs.
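To make the extract-standardize-store flow concrete, the following is a minimal sketch in Python, not the proposed implementation. It assumes the OCR text is already available as a string, uses a simple regular expression as a stand-in for the NLP/ML extraction step, and maps a single value (heart rate) to a FHIR Observation resource. The sample text, the function names, and the patient identifier are hypothetical and chosen only for illustration.

```python
import json
import re

# Hypothetical OCR output; in practice this would come from an OCR engine.
OCR_TEXT = "Patient visit 2022-03-14. Heart rate: 72 bpm. BP 120/80."


def extract_heart_rate(text):
    """Return the heart rate as an int, or None if it is not found.

    Assumption: a regex stands in for the NLP/NER model described in the text.
    """
    match = re.search(r"heart rate\s*[:=]?\s*(\d{2,3})", text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def to_fhir_observation(value, patient_id, date):
    """Build a minimal FHIR Observation resource as a plain dict."""
    return {
        "resourceType": "Observation",
        "status": "final",
        # LOINC 8867-4 is the code for heart rate.
        "code": {
            "coding": [
                {"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": date,
        # Units are expressed with UCUM, as FHIR recommends for quantities.
        "valueQuantity": {
            "value": value,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",
            "code": "/min",
        },
    }


rate = extract_heart_rate(OCR_TEXT)
# "example" is a hypothetical patient id; a real system would resolve it from the EHR.
observation = to_fhir_observation(rate, "example", "2022-03-14")
print(json.dumps(observation, indent=2))
```

The resulting JSON could then be sent to the EHR's FHIR endpoint for storage; that step is omitted here because it depends on the target system.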
7 CONCLUSION
In recent years, there has been increasing interest in using the
large quantity of data found in scanned medical documents. Using
these documents to predict patient outcomes could improve healthcare
delivery: it has the potential to improve patient outcomes, save
doctors' time, and reduce costs for healthcare systems. The approach
combines three steps: extracting structured data from scanned
documents using technologies such as OCR, NLP, and ML models;
standardizing the extracted information in a common format such as
FHIR or OpenEHR; and generating predictions with methods such as ML
and BERT models. This is a relatively new field, and it has gained
attention as healthcare organizations look for ways to use this data
to enhance patient outcomes and reduce costs. However, it is
important to note that the process of collecting, standardizing, and
analyzing the data can be challenging, time-consuming, and expensive.
Nevertheless, the advantages of using scanned documents in this way
make it a worthwhile endeavor for healthcare researchers.
A Machine Learning Approach to Digitize Medical History and Archive in a Standard Format