architectures. The concatenated vector is then used in further dense layers, which end in a final sigmoid-activated dense layer.
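As an illustration, the tail of the merged architecture could look like the following Keras sketch; the branch sizes and layer widths here are hypothetical stand-ins, not the paper's exact configuration.

```python
# Keras sketch of the merged architecture's tail (hypothetical sizes).
from tensorflow.keras import Input, Model, layers

structured_vec = Input(shape=(64,))    # hidden vector from the structured branch
text_vec = Input(shape=(128,))         # hidden vector from the text branch

merged = layers.Concatenate()([structured_vec, text_vec])
hidden = layers.Dense(32, activation="relu")(merged)      # further dense layers
output = layers.Dense(1, activation="sigmoid")(hidden)    # final sigmoid layer

model = Model([structured_vec, text_vec], output)
```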
4.4 Vote Architecture
This architecture is initially the same as the merged
architecture presented in Section 4.3. However, rather
than merging hidden layer vectors and using the re-
sulting concatenated vector in further processing, the
structured data and text data are separately used to de-
termine “votes” for classification, each via its own sigmoid-activated dense layer. These two votes are
then used to determine the final output.
Concretely, at the end of each branch, just before the concatenation, the architecture has a sigmoid-activated dense layer whose output can be taken to be that branch's individual prediction. These predictions are then concatenated and used in a third sigmoid-activated layer to produce the final prediction.
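A corresponding Keras sketch of the vote mechanism is shown below, again with hypothetical branch sizes: each branch emits its own sigmoid "vote", and a third sigmoid layer combines them.

```python
# Keras sketch of the vote architecture's final stage (hypothetical sizes).
from tensorflow.keras import Input, Model, layers

structured_vec = Input(shape=(64,))    # hidden vector from the structured branch
text_vec = Input(shape=(128,))         # hidden vector from the text branch

# Per-branch sigmoid "votes": the individual prediction of each branch.
structured_vote = layers.Dense(1, activation="sigmoid")(structured_vec)
text_vote = layers.Dense(1, activation="sigmoid")(text_vec)

# The two votes are concatenated and combined by a third sigmoid layer.
votes = layers.Concatenate()([structured_vote, text_vote])
output = layers.Dense(1, activation="sigmoid")(votes)

model = Model([structured_vec, text_vec], output)
```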
5 TASKS AND RESULTS
Two tasks were prepared from the BRATECA datasets: length-of-stay classification and mortality classification. A dataset was built for each task, and the architectures discussed in Section 4 were adapted to the required inputs and outputs of each.
All models were trained for up to 50 epochs, and the model from the epoch with the best validation loss was kept. These models are also available on the project's GitHub page. The models were evaluated using the following metrics: Precision, Recall, and F1 at the 0.5 threshold, to complement the 0.5-threshold confusion matrices analyzed in this section; AUPRC, to better analyze the unbalanced (i.e., proportional) test set; and AUROC, to better analyze the balanced test set.
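For reference, these metrics can be computed with scikit-learn as in the minimal sketch below; the arrays are toy stand-ins for gold labels and model sigmoid outputs, not the paper's data.

```python
# Sketch of the evaluation metrics with scikit-learn (toy data for illustration).
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, average_precision_score,
                             roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0])              # gold labels (toy)
y_prob = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.7])  # model sigmoid outputs (toy)

y_pred = (y_prob >= 0.5).astype(int)               # binarize at the 0.5 threshold

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)              # 0.5-threshold confusion matrix
auprc = average_precision_score(y_true, y_prob)    # threshold-free; proportional set
auroc = roc_auc_score(y_true, y_prob)              # threshold-free; balanced set
```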
Since these datasets were derived from limited-access data, only the code for recreating them and instructions on how to use that code have been made available on this project's GitHub page. Thus, acquiring access to the BRATECA collection through PhysioNet is required to recreate the datasets and to
reproduce the results in this paper.
5.1 Length-of-Stay Task
The length-of-stay (LoS) classification task requires a model to determine whether an admission will exceed 7 days. To make this prediction,
the model has access to data from the first 24 hours of
admission.
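In code, the label and the 24-hour input window might be derived as in the following pandas sketch; the column names and toy rows are assumptions made for illustration, not the actual BRATECA schema.

```python
# Hypothetical pandas sketch of the LoS label and 24-hour input window.
import pandas as pd

adm = pd.DataFrame({
    "admit_time": pd.to_datetime(["2020-01-01 08:00", "2020-01-02 10:00"]),
    "discharge_time": pd.to_datetime(["2020-01-10 12:00", "2020-01-05 09:00"]),
})

adm["stay"] = adm["discharge_time"] - adm["admit_time"]
adm = adm[adm["stay"] >= pd.Timedelta(days=1)].copy()             # at least 24 h
adm["label"] = (adm["stay"] > pd.Timedelta(days=7)).astype(int)   # positive: > 7 days
adm["cutoff"] = adm["admit_time"] + pd.Timedelta(hours=24)        # input window limit
```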
This dataset is composed of 32,159 admissions of patients who stayed at least 24 hours in hospital. Of
these admissions, 10,495 were of patients who were
hospitalized for more than 7 days, henceforth consid-
ered to be the positive class, and 21,664 were of pa-
tients who were hospitalized for less than or equal to
7 days, henceforth considered to be the negative class.
This means that, proportionally, for every patient who exceeds 7 days of hospitalization, 2.06 patients are hospitalized for 7 days or fewer. To balance the dataset, 10,495 examples of each category were randomly selected and the remainder were initially discarded.
The dataset was divided into three parts: training,
composed of 70% of all examples; testing, composed
of 20% of all examples; and validation, composed of
10% of all examples. This left the training set with
7,346 examples of each category, the test set with
2,099 examples of each category and the validation
set with 1,050 examples of each category.
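The balancing and 70/20/10 split could be reproduced along the following lines; this is a sketch under the hypothetical schema of the previous snippet, not the project's released code.

```python
# Sketch of balanced subsampling and the stratified 70/20/10 split
# (hypothetical `adm` frame with a binary `label` column, as above).
import pandas as pd
from sklearn.model_selection import train_test_split

pos = adm[adm["label"] == 1]
neg_all = adm[adm["label"] == 0]
neg = neg_all.sample(n=len(pos), random_state=42)   # equal counts of each class
discarded_neg = neg_all.drop(neg.index)             # set aside for later use

balanced = pd.concat([pos, neg])

# Stratifying keeps equal class counts in every part (7,346 / 2,099 / 1,050).
train, rest = train_test_split(balanced, test_size=0.3,
                               stratify=balanced["label"], random_state=42)
test, val = train_test_split(rest, test_size=1/3,
                             stratify=rest["label"], random_state=42)
```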
Another version of the test set was also created, one which maintained the original 2.06:1 proportion. This alternative set had 6,423 examples for testing: it used the balanced test set as a base and added examples from the initially discarded 'less than or equal to 7 days of hospitalization' category until the desired proportion was reached. This set will be referred to as 'Proportional', while the first will be referred to as 'Balanced'. Regardless of the kind of set used for testing, the models were always trained and validated using a balanced set.
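Continuing the sketch, the proportional test set can be built by adding discarded negatives back to the balanced test split until the 2.06:1 ratio is restored (hypothetical names carried over from the previous snippet).

```python
# Sketch of the proportional test set, reusing `test` and `discarded_neg`
# from the previous snippet (hypothetical names throughout).
import pandas as pd

n_pos = int((test["label"] == 1).sum())                   # 2,099 positives
n_neg_target = round(2.06 * n_pos)                        # negatives for 2.06:1
n_extra = n_neg_target - int((test["label"] == 0).sum())  # negatives to add

extra_neg = discarded_neg.sample(n=n_extra, random_state=42)
proportional_test = pd.concat([test, extra_neg])          # 6,423 examples in total
```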
As can be seen in Table 2, the best results were
achieved by the free-text architecture. The structured
architecture was significantly worse than the rest, and
the use of structured data in the merged and vote architectures only slightly worsened the results.
The AUPRC score drops significantly when com-
paring the balanced test set to the proportional test set.
This reveals that in a more realistic scenario, the mod-
els do not perform as well as in a balanced scenario.
Overall, the tests show that the unstructured free-text information is meaningfully helpful when attempting to predict whether admissions will be short or long. The structured exam data did not help, and at times seemed to hinder the models in this task, which points either to the need for better data integration when creating inputs or to the possibility that exam and admission data are wholly unhelpful for this task. Further tests are needed to discern which of these possibilities is the case.
As for the individual architectures, the structured