Prediction for Disease Risk and Medical Cost using Time Series
Healthcare Data
Masatoshi Nagata, Kazunori Matsumoto and Masayuki Hashimoto
KDDI R&D Labs, Saitama, Japan
Keywords: Sequential Latent Dirichlet Allocation, LDA, Sequential LDA, Lifestyle-related Disease, Medical Cost.
Abstract: Forecasting medical expenditure is beneficial for both insurance companies and individuals. In this paper
we propose a new methodology to predict disease risk and medical cost. Based on sequential latent Dirichlet
allocation (SeqLDA), which classifies hierarchical sequential data into segments of topics, we predicted
the number of people with diseases and the one-year cost of lifestyle-related diseases. Using the
health checkup information and medical claims of 6500 people over three years, we obtained a lower prediction
error than with conventional LDA, and in terms of classification accuracy the AUC exceeded 0.71. The results suggest
that the SeqLDA method can serve to predict the number of people with diseases and the related medical costs
from time series healthcare data.
1 INTRODUCTION
The increasing incidence of lifestyle-related diseases
and non-communicable diseases has become a major
issue in many regions (WHO, 2009; Lim et al.,
2012). In Japan, medical expenditures are increasing
dramatically, and exceeded 4 trillion yen in 2013.
Moreover, lifestyle-related diseases now account for
one-third of all medical expenditures (Ministry of
Health, Labour and Welfare, 2011). Predicting
such diseases and the related medical costs would
provide valuable information for healthcare
enterprises and administrative policymakers.
Several studies have attempted to predict medical
costs based on medical claims (receipts). Many of
the studies achieved accurate results by means of
general regression and Cox regression calculations
based on an analysis of billing claims (Brandle et al.,
2003; Zhao et al., 2005; Bertsimas et al., 2008).
However, most research focused on people with
a disease and did not include healthy people.
In practice, health insurance associations and
municipalities incur medical expenditures for
anyone who seeks medical care, even a person who
was healthy in previous years. For this reason,
it would be more desirable to predict medical
expenditure for a population that includes
healthy people.
When medical costs arise, and how large they are,
depends on a patient’s health status. If it were
possible to estimate and classify patients’ health
states, we could predict disease risks and medical
costs. Previous studies using latent Dirichlet
allocation (LDA), a topic model from machine
learning that is widely used in natural
language processing, showed that it is possible to
predict disease risk from data on medical checkups
and claims (Kashima et al., 2013; Ogawa et al.,
2014). However, the data were not treated as time
series, so the approach leaves room for refinement
before practical use.
In this paper, we evaluate whether adding the
time-series structure of healthcare data
to LDA improves prediction performance for disease
risk and medical cost. To this end, we applied
sequential LDA (SeqLDA), which was developed to
handle sequential data as segments of topics, to
healthcare data (Teh et al., 2006; Lan Du et al.,
2010; Lan Du et al., 2012). SeqLDA gives a
sequence of topic distributions, one for each period.
Because the current health status of a person may
depend on past data, SeqLDA may be better suited
than LDA to predicting the risk of diseases.
We present preliminary results of predicting the
risks and medical costs of lifestyle-related diseases
using three years of health checkups and claims.
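
To make the contrast between the two data views concrete, the following minimal sketch shows one way healthcare records might be arranged for a topic model. It is an illustration under stated assumptions, not the encoding used in this paper: the record layout, the token names (such as bmi_high) and the helper functions flat_documents and segmented_documents are all hypothetical. In the flat view, each person is a single bag of tokens, as a plain LDA input would be; in the segmented view, each person is an ordered list of yearly bags, which is the kind of document-with-segments structure that SeqLDA is designed to handle.

```python
from collections import Counter

# Hypothetical records: (person_id, year, token). The tokens are made-up
# discretised checkup findings and claim codes, not the paper's features.
RECORDS = [
    ("p1", 2010, "bmi_high"),
    ("p1", 2010, "bp_normal"),
    ("p1", 2011, "bmi_high"),
    ("p1", 2011, "bp_high"),
    ("p1", 2012, "bp_high"),
    ("p1", 2012, "claim_hypertension"),
    ("p2", 2010, "bmi_normal"),
    ("p2", 2011, "bmi_normal"),
    ("p2", 2012, "bmi_normal"),
]


def flat_documents(records):
    """LDA-style view: one bag of tokens per person, with the year ignored."""
    docs = {}
    for person, _year, token in records:
        docs.setdefault(person, Counter())[token] += 1
    return docs


def segmented_documents(records):
    """SeqLDA-style view: per person, one bag of tokens per year, in year order."""
    by_person = {}
    for person, year, token in records:
        by_person.setdefault(person, {}).setdefault(year, Counter())[token] += 1
    return {
        person: [(year, bags[year]) for year in sorted(bags)]
        for person, bags in by_person.items()
    }


flat = flat_documents(RECORDS)
segmented = segmented_documents(RECORDS)
```

In the flat view, the two bp_high tokens of person p1 are indistinguishable from a case in which both occurred in the first year. The segmented view keeps the information that blood pressure was normal in 2010 and high afterwards, which is the temporal signal a sequential model can exploit.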