agnostics even if they are not well described in the
medical resources. To achieve this goal, we will use
different kinds of information available in the
inpatient stay to build a decision tree and finally
highlight the important variables used for each diagnostic.
2 RELATED WORK
Few research works focus on the use of data mining
to predict a diagnostic code. The methods used
include the following:
• Sequential patterns. (Djennaoui et al., 2014) used
sequential patterns to detect similar patterns of
medical procedures across different inpatient stays
and to extract rules from these patterns to predict
missing diagnostics. They studied three diagnostics
and were able to extract three rules, two of which
predict the same diagnostic. Similar work was done
by (Pinaire et al., 2015).
• Text Mining. A few other works tried to extract the
diagnostic codes directly from the medical letter,
using a thesaurus such as MeSH (Medical Subject
Headings) in (Pereira et al., 2006) or probabilistic
methods as in (Lecornu et al., 2009).
• Clustering. (Erraguntla et al., 2012) used the
K-nearest method to cluster similar inpatient stays
and predict a missing diagnostic.
In our work, we want to explore the use of the decision
tree method. Decision trees are useful in contexts
where clear, visually understandable results are
needed, especially when they must be validated by
non-specialists. We hypothesized that if we can
determine which variables help to predict a
secondary diagnosis, we could help coders pay
attention to these variables while coding.
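As a hedged illustration of how a decision tree can surface such variables, the sketch below scores boolean variables by how much a split on each one reduces Gini impurity, a CART-style criterion. The criterion, the function names and the toy stays are assumptions made for illustration; they are not the procedure or the data of this study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(rows, labels):
    """Return (variable, impurity_decrease) for the boolean variable whose
    split reduces Gini impurity the most, or (None, 0.0) if none helps."""
    parent = gini(labels)
    best = (None, 0.0)
    for var in rows[0]:
        left = [y for r, y in zip(rows, labels) if r[var]]
        right = [y for r, y in zip(rows, labels) if not r[var]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        decrease = parent - weighted
        if decrease > best[1]:
            best = (var, decrease)
    return best


# Toy stays (hypothetical): the label follows "emergency" and ignores "male".
rows = [
    {"emergency": True, "male": True},
    {"emergency": True, "male": False},
    {"emergency": False, "male": True},
    {"emergency": False, "male": False},
]
labels = ["positive", "positive", "negative", "negative"]

print(best_binary_split(rows, labels))  # ('emergency', 0.5)
```

A full tree would apply the same selection recursively to each side of the split; the variables chosen near the root are the ones a coder would be asked to look at first.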
3 METHODS
3.1 Used Data
We used an anonymized data sample extracted from
the PMSI database of the “Pays d’Autan” hospital. It
contains around 75,000 inpatient stays between 2011
and 2014. To build the decision tree, we decided to use
the information recorded in the PMSI database that is
usually well encoded because it is easy to detect
(primary diagnoses, sex, age, stay duration...). We also
used two levels of diagnostic grouping: the first level
groups the diagnostics into 19 general categories
depending on their similarities, and the second level
groups them into 126 more specific categories.
After fixing the primary and the secondary diagnostic of
the inpatient stay, we retained the following
information to include in the construction of the decision tree:
Table 1: Variables used in building the decision tree.
Variable                            Description
Sex                                 Male or female.
Mode of Entry (ME)                  Mode of admission of the patient to the inpatient stay (GUIDE, 2006).
Mode of Sortie/Exit (MS)            Mode of discharge of the patient from the inpatient stay (GUIDE, 2006).
Age                                 Patient’s age at admission to the inpatient stay.
Duration                            Duration of the inpatient stay in days.
Season                              Season in which the patient is admitted to the inpatient stay.
Frequency                           Number of inpatient stays of the patient in the hospital.
Gap                                 Number of days between the entry date and the first medical procedure.
Passage count                       Number of movements between different sections during the inpatient stay.
Medical procedures count            Number of medical procedures during the inpatient stay.
ICR                                 The quota cost of medical procedures in the inpatient stay.
Classified                          Whether the inpatient stay contains a classified/important medical procedure.
Emergency                           Whether the inpatient stay contains an emergency case.
Example/Label                       Positive if the inpatient stay has both the principal and the secondary diagnostics; negative if it has only the principal diagnostic.
Medical procedure chapters          A set of 19 variables, each indicating whether the inpatient stay contains a corresponding medical procedure.
Urgent medical procedure chapters   A set of 5 variables, each indicating whether the inpatient stay contains a corresponding urgent medical procedure category.
First level diagnostic grouping     A set of 19 variables, each indicating whether the inpatient stay contains a corresponding diagnostic grouping.
Second level diagnostic grouping    A set of 126 variables, each indicating whether the inpatient stay contains a corresponding diagnostic grouping.
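To make the Example/Label row and the date-derived rows of Table 1 (Duration, Season, Gap) concrete, the following minimal sketch computes them for one stay. The `Stay` record, its field names, the month-based season rule and the day-difference definitions of Duration and Gap are illustrative assumptions; the actual PMSI extraction may define them differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set


@dataclass
class Stay:
    """Hypothetical record for one inpatient stay (field names are illustrative)."""
    entry_date: date
    exit_date: date
    first_procedure_date: Optional[date]  # None if no procedure was performed
    principal: str                        # principal diagnostic code
    secondary: Set[str]                   # secondary diagnostic codes


def season(d: date) -> str:
    """Season of a date by calendar month (assumed northern-hemisphere rule)."""
    if d.month in (12, 1, 2):
        return "winter"
    if d.month in (3, 4, 5):
        return "spring"
    if d.month in (6, 7, 8):
        return "summer"
    return "autumn"


def label(stay: Stay, principal: str, secondary: str) -> Optional[str]:
    """Label a stay for a fixed (principal, secondary) pair of diagnostics.

    Stays with a different principal diagnostic are outside the sample (None).
    """
    if stay.principal != principal:
        return None
    return "positive" if secondary in stay.secondary else "negative"


def derived_variables(stay: Stay) -> dict:
    """Compute Duration, Season and Gap from the dates of a stay."""
    gap = None
    if stay.first_procedure_date is not None:
        gap = (stay.first_procedure_date - stay.entry_date).days
    return {
        "duration": (stay.exit_date - stay.entry_date).days,
        "season": season(stay.entry_date),
        "gap": gap,
    }


stay = Stay(date(2013, 1, 10), date(2013, 1, 15), date(2013, 1, 12), "I10", {"E11"})
print(label(stay, "I10", "E11"), derived_variables(stay))
```

With one such row per stay, the positive and negative examples for a chosen pair of diagnostics form the training set of the corresponding tree.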
In total, we have 181 information variables that we can
use to train our model. The diagnostics were encoded
according to the 10th revision of the International
Classification of Diseases (ICD-10) (WHO, ).
The French version contains 33,816 codes; the
first three characters of the codes stand for code cate-
Increasing Alertness while Coding Secondary Diagnostics in the Medical Record