Text Analysis of User-Generated Contents for Health-care Applications - Case Study on Smoking Status Classification

Deema Abdal Hafeth, Amr Ahmed, David Cobham

Abstract

Text mining techniques have demonstrated the potential to unlock significant patient health information from unstructured text. However, most published work has used clinical reports, which are difficult to access because of patient confidentiality. In this paper, we present an investigation of text analysis for smoking status classification from User-Generated Contents (UGC), such as online forum discussions. UGC are more widely available than clinical reports. Based on an analysis of the properties of UGC, we propose the use of Linguistic Inquiry and Word Count (LIWC), an approach being used for the first time for such a health-related task. We also explore various factors that affect classification performance. The experimental results and evaluation indicate that classification of forum posts performs well with the proposed features, achieving an accuracy of up to 75% for smoking status prediction. Furthermore, the feature set used is compact (only 88 features) and independent of the dataset size.
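
As a rough illustration of the kind of feature the abstract describes, the sketch below computes LIWC-style category percentages for a single post. It is a minimal sketch under stated assumptions, not the authors' pipeline: the category names and word lists in `TOY_CATEGORIES` are hypothetical stand-ins (the actual LIWC dictionary is a separate licensed resource and is much larger), and the function name `category_features` is invented for this example.

```python
import re
from collections import Counter

# Hypothetical toy categories for illustration only; these are NOT the
# real LIWC categories or word lists.
TOY_CATEGORIES = {
    "first_person": {"i", "me", "my", "mine"},
    "negation": {"no", "not", "never"},
    "past_verbs": {"quit", "smoked", "stopped", "was"},
}


def category_features(text, categories=TOY_CATEGORIES):
    """Return, for each category, the percentage of tokens that belong to it."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter()
    for tok in tokens:
        for name, words in categories.items():
            if tok in words:
                counts[name] += 1
    # One value per category, so the vector length is fixed by the
    # dictionary rather than by the vocabulary of the corpus.
    return {
        name: (100.0 * counts[name] / total if total else 0.0)
        for name in categories
    }


features = category_features("I quit last year and I have not smoked since")
print(features)
```

Because each post is reduced to one number per category, the length of the feature vector is set by the number of categories rather than by the size of the corpus, which is consistent with the abstract's remark that the feature set is compact and independent of dataset size. A vector of this kind could then be passed to any standard classifier.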


Paper Citation


in Harvard Style

Abdal Hafeth D., Ahmed A. and Cobham D. (2014). Text Analysis of User-Generated Contents for Health-care Applications - Case Study on Smoking Status Classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014), ISBN 978-989-758-048-2, pages 242-249. DOI: 10.5220/0005080502420249


in Bibtex Style

@conference{kdir14,
author={Deema Abdal Hafeth and Amr Ahmed and David Cobham},
title={Text Analysis of User-Generated Contents for Health-care Applications - Case Study on Smoking Status Classification},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={242-249},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005080502420249},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Text Analysis of User-Generated Contents for Health-care Applications - Case Study on Smoking Status Classification
SN - 978-989-758-048-2
AU - Abdal Hafeth D.
AU - Ahmed A.
AU - Cobham D.
PY - 2014
SP - 242
EP - 249
DO - 10.5220/0005080502420249