Juliano Gaspar, Emanuel Catumbela, Bernardo Marques, Alberto Freitas


Background: Patient medical records contain many entries relating to patient conditions, treatments and lab results. They generally involve multiple types of data and produce a large amount of information. These databases can provide important information for clinical decisions and can support hospital management. Medical databases have some specificities not often found in non-medical databases. In this context, outlier detection techniques can be used to detect abnormal patterns in health records (for instance, data quality problems), thus contributing to better data and better knowledge in the decision-making process. Aim: This systematic review aims to provide a better understanding of the techniques used to detect outliers in healthcare data, in order to automate those methods and facilitate access to quality information in healthcare. Methods: The literature was systematically reviewed to identify articles mentioning outlier detection techniques or anomalies in medical data. Four distinct bibliographic databases were searched: Medline, ISI, IEEE and EBSCO. Results: Of the 4071 distinct papers selected, 80 were included after applying the inclusion and exclusion criteria. By medical specialty, 32% of the techniques were intended for oncology, and 37% of them used patient data. Considering only articles that used administrative medical data, 59% of the techniques were statistics-based. Conclusion: Statistical techniques are the outlier detection techniques most widely used on administrative medical data, compared with data mining techniques such as clustering and nearest neighbor.
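As an illustration of what a statistics-based outlier detection technique can look like on administrative data, the following is a minimal sketch using Tukey's interquartile-range fences. It is not taken from the reviewed articles: the function name `iqr_outliers`, the multiplier `k = 1.5`, and the length-of-stay values are all assumptions made up for this example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences.

    Hypothetical helper for illustration only; k=1.5 is the
    conventional multiplier, not a value from the review.
    """
    # Quartiles of the data (quantiles() sorts the input internally).
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the flagged values.
    return [v for v in values if v < lower or v > upper]


# Made-up lengths of stay (in days) for ten hospital episodes.
length_of_stay_days = [2, 3, 3, 4, 4, 5, 5, 6, 7, 45]
print(iqr_outliers(length_of_stay_days))  # prints [45]
```

A flagged value such as the 45-day stay is only a candidate: it may be a data quality problem or a genuine but unusual case, so it would still need review before any correction.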



Paper Citation

in Harvard Style

Gaspar J., Catumbela E., Marques B. and Freitas A. (2011). A SYSTEMATIC REVIEW OF OUTLIERS DETECTION TECHNIQUES IN MEDICAL DATA - Preliminary Study. In Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2011), ISBN 978-989-8425-34-8, pages 575-582. DOI: 10.5220/0003168705750582
