architectures. The concatenated vector is then used in further dense layers, which end in a final sigmoid-activated dense layer.
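As an illustration, the tail of the merged architecture could look like the following Keras sketch; the branch sizes and layer widths here are hypothetical stand-ins, not the paper's exact configuration.

```python
# Keras sketch of the merged architecture's tail (hypothetical sizes).
from tensorflow.keras import Input, Model, layers

structured_vec = Input(shape=(64,))    # hidden vector from the structured branch
text_vec = Input(shape=(128,))         # hidden vector from the text branch

merged = layers.Concatenate()([structured_vec, text_vec])
hidden = layers.Dense(32, activation="relu")(merged)      # further dense layers
output = layers.Dense(1, activation="sigmoid")(hidden)    # final sigmoid layer

model = Model([structured_vec, text_vec], output)
```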
4.4 Vote Architecture
This architecture is initially the same as the merged
architecture presented in Section 4.3. However, rather
than merging hidden layer vectors and using the re-
sulting concatenated vector in further processing, the
structured data and text data are separately used to de-
termine “votes” for classification, each via its own sigmoid-activated dense layer. These two votes are
then used to determine the final output.
Concretely, at the end of each branch, just before the concatenation, the architecture has a sigmoid-activated dense layer whose output can be taken to be that branch's individual prediction. These predictions are then concatenated and used in a third sigmoid-activated layer to produce the final prediction.
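A corresponding Keras sketch of the vote mechanism is shown below, again with hypothetical branch sizes: each branch emits its own sigmoid "vote", and a third sigmoid layer combines them.

```python
# Keras sketch of the vote architecture's final stage (hypothetical sizes).
from tensorflow.keras import Input, Model, layers

structured_vec = Input(shape=(64,))    # hidden vector from the structured branch
text_vec = Input(shape=(128,))         # hidden vector from the text branch

# Per-branch sigmoid "votes": the individual prediction of each branch.
structured_vote = layers.Dense(1, activation="sigmoid")(structured_vec)
text_vote = layers.Dense(1, activation="sigmoid")(text_vec)

# The two votes are concatenated and combined by a third sigmoid layer.
votes = layers.Concatenate()([structured_vote, text_vote])
output = layers.Dense(1, activation="sigmoid")(votes)

model = Model([structured_vec, text_vec], output)
```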
5 TASKS AND RESULTS
Two tasks were prepared from the BRATECA datasets: length-of-stay classification and mortality classification. A dataset was built for each task, and the architectures discussed in Section 4 were adapted to the required inputs and outputs of each.
All models were trained for up to 50 epochs, and the model from the epoch with the best validation loss was kept. These models are also available on the project's GitHub page. The models were evaluated using the following metrics: Precision, Recall, and F1 at the 0.5 threshold, to complement the 0.5-threshold confusion matrices analyzed in this section; AUPRC, to better analyze the unbalanced (i.e., proportional) test set; and AUROC, to better analyze the balanced test set.
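For reference, these metrics can be computed with scikit-learn as in the minimal sketch below; the arrays are toy stand-ins for gold labels and model sigmoid outputs, not the paper's data.

```python
# Sketch of the evaluation metrics with scikit-learn (toy data for illustration).
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, average_precision_score,
                             roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0])              # gold labels (toy)
y_prob = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.7])  # model sigmoid outputs (toy)

y_pred = (y_prob >= 0.5).astype(int)               # binarize at the 0.5 threshold

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)              # 0.5-threshold confusion matrix
auprc = average_precision_score(y_true, y_prob)    # threshold-free; proportional set
auroc = roc_auc_score(y_true, y_prob)              # threshold-free; balanced set
```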
Since these datasets were derived from limited-access data, only the code for recreating them and instructions on how to use that code have been made available on this project's GitHub page. Thus, acquiring access to the BRATECA collection through PhysioNet is required to recreate the datasets and to
reproduce the results in this paper.
5.1 Length-of-Stay Task
The length-of-stay (LoS) classification task requires a model to determine whether an admission will exceed 7 days. To make this prediction,
the model has access to data from the first 24 hours of
admission.
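In code, the label and the 24-hour input window might be derived as in the following pandas sketch; the column names and toy rows are assumptions made for illustration, not the actual BRATECA schema.

```python
# Hypothetical pandas sketch of the LoS label and 24-hour input window.
import pandas as pd

adm = pd.DataFrame({
    "admit_time": pd.to_datetime(["2020-01-01 08:00", "2020-01-02 10:00"]),
    "discharge_time": pd.to_datetime(["2020-01-10 12:00", "2020-01-05 09:00"]),
})

adm["stay"] = adm["discharge_time"] - adm["admit_time"]
adm = adm[adm["stay"] >= pd.Timedelta(days=1)].copy()             # at least 24 h
adm["label"] = (adm["stay"] > pd.Timedelta(days=7)).astype(int)   # positive: > 7 days
adm["cutoff"] = adm["admit_time"] + pd.Timedelta(hours=24)        # input window limit
```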
This dataset is composed of 32,159 admissions of patients who stayed at least 24 hours in hospital. Of
these admissions, 10,495 were of patients who were
hospitalized for more than 7 days, henceforth consid-
ered to be the positive class, and 21,664 were of pa-
tients who were hospitalized for less than or equal to
7 days, henceforth considered to be the negative class.
This means that, proportionally, for every patient who exceeds 7 days of hospitalization, 2.06 patients are hospitalized for 7 days or fewer. To balance the dataset, 10,495 examples of each category were randomly selected and the remainder were initially discarded.
The dataset was divided into three parts: training,
composed of 70% of all examples; testing, composed
of 20% of all examples; and validation, composed of
10% of all examples. This left the training set with
7,346 examples of each category, the test set with
2,099 examples of each category and the validation
set with 1,050 examples of each category.
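The balancing and 70/20/10 split could be reproduced along the following lines; this is a sketch under the hypothetical schema of the previous snippet, not the project's released code.

```python
# Sketch of balanced subsampling and the stratified 70/20/10 split
# (hypothetical `adm` frame with a binary `label` column, as above).
import pandas as pd
from sklearn.model_selection import train_test_split

pos = adm[adm["label"] == 1]
neg_all = adm[adm["label"] == 0]
neg = neg_all.sample(n=len(pos), random_state=42)   # equal counts of each class
discarded_neg = neg_all.drop(neg.index)             # set aside for later use

balanced = pd.concat([pos, neg])

# Stratifying keeps equal class counts in every part (7,346 / 2,099 / 1,050).
train, rest = train_test_split(balanced, test_size=0.3,
                               stratify=balanced["label"], random_state=42)
test, val = train_test_split(rest, test_size=1/3,
                             stratify=rest["label"], random_state=42)
```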
Another version of the test set was also created, one which maintained the original 2.06:1 proportion. This alternative set had 6,423 examples for testing: it used the balanced test set as a base and added examples from the initially discarded 'less than or equal to 7 days of hospitalization' category until the desired proportion was reached. This set will be referred to as 'Proportional', while the first will be referred to as 'Balanced'. Regardless of the kind of set used for testing, the models were always trained and validated using a balanced set.
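Continuing the sketch, the proportional test set can be built by adding discarded negatives back to the balanced test split until the 2.06:1 ratio is restored (hypothetical names carried over from the previous snippet).

```python
# Sketch of the proportional test set, reusing `test` and `discarded_neg`
# from the previous snippet (hypothetical names throughout).
import pandas as pd

n_pos = int((test["label"] == 1).sum())                   # 2,099 positives
n_neg_target = round(2.06 * n_pos)                        # negatives for 2.06:1
n_extra = n_neg_target - int((test["label"] == 0).sum())  # negatives to add

extra_neg = discarded_neg.sample(n=n_extra, random_state=42)
proportional_test = pd.concat([test, extra_neg])          # 6,423 examples in total
```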
As can be seen in Table 2, the best results were
achieved by the free-text architecture. The structured
architecture was significantly worse than the rest, and
the use of structured data in the merged and vote architectures only slightly worsened the results.
The AUPRC score drops significantly when com-
paring the balanced test set to the proportional test set.
This reveals that in a more realistic scenario, the mod-
els do not perform as well as in a balanced scenario.
Overall, the tests show that the unstructured free-text information is meaningfully helpful when attempting to predict whether admissions will be short or long. The structured exam data did not help, and at times seemed to hinder the models in this task, which points either to the need for better data integration when creating inputs or to the possibility that exam and admission data are wholly unhelpful for this task. Further tests are needed to discern which of these possibilities is the case.
As for the individual architectures, the structured