patients do not remain that way every year. Also,
two patients may share very similar profiles while
only one of them is high-cost. Studying these
seemingly anomalous patients could clarify how a
high-cost patient differs from other patients. In
addition, the current sampling approach and the
available classification techniques could be tuned
further to improve results.
Apart from these possibilities, the most
promising future direction is to work with key
data partners. This avenue offers the opportunity
to obtain information on the cost containment
methods in use and their efficiency, as well as real
data on the cost benefits obtained from previous
predictive models. Working with such partners, we
endeavor to provide a reasonable, patient-specific
answer to this question that would significantly
impact cost containment in the healthcare industry.
HEALTHCARE RISK MODELING FOR MEDICAID PATIENTS - The Impact of Sampling on the Prediction of
High-Cost Patients