are presented in the form of a mathematical expression
(e.g. n = 19). In stubs, we often expect the total
number of patients in a single cell. Since the header may
have values per arm in each column, we created a list of
candidates. First, all the values are added to the list.
If the content of a cell contains the word “overall” or
“total”, or the phrase “all patients”, that value is
considered the total number of participants. However,
if no such cell exists, we check whether the stub’s
header cell has a value for the number of patients. If
neither is the case, the values from the header columns
are summed (an example can be seen in Figure 3).
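The header heuristic described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the regular expression for "n = 19" style expressions, and the representation of header cells as plain strings are all hypothetical.

```python
import re

# Words that mark a header cell as holding the total (taken from the text).
TOTAL_WORDS = ("overall", "total", "all patients")
# Assumed pattern for expressions such as "n = 19" or "N=120".
N_PATTERN = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)


def parse_n(text):
    """Return the integer from an 'n = 19' style expression, or None."""
    match = N_PATTERN.search(text)
    return int(match.group(1)) if match else None


def total_from_headers(stub_header, column_headers):
    """Pick the total number of participants from header cells.

    stub_header: text of the header cell above the stub column.
    column_headers: texts of the remaining header cells.
    """
    # Candidate list: every header cell that carries a patient count.
    candidates = []
    for text in column_headers:
        n = parse_n(text)
        if n is not None:
            candidates.append((text, n))

    # 1. A header explicitly marked as a total wins.
    for text, n in candidates:
        if any(word in text.lower() for word in TOTAL_WORDS):
            return n

    # 2. Otherwise use a value in the stub's header cell, if present.
    stub_n = parse_n(stub_header)
    if stub_n is not None:
        return stub_n

    # 3. Otherwise sum the per-arm values.
    return sum(n for _, n in candidates) if candidates else None
```

For example, `total_from_headers("Characteristic", ["Placebo (n = 19)", "Drug (n = 21)"])` falls through to the third step and sums the two arms.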
The number of patients may also be placed in the
body of the table. As with headers, data cells may
present the number of patients in parts (e.g. per arm),
as a single total number, or, in some tables, as both
partial and total numbers. Since the data
cells may contain only numerical values, the search for
trigger words and patterns has to be done in the
appropriate stub cells. We have defined trigger phrases
that our method searches for in the stub (Number
of patients, Num. of participants, etc.). If one is found,
the values from the data cells are extracted and added to
the list of candidates. Headers also need to be analysed
(checking whether a header value contains the words
“overall”, “total” or “all patients”) to determine whether
a column presents the total number of participants.
If there is no such column, the sum of the extracted
values represents the total number of participants.
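The body-based variant could be sketched in a similar way. This is an illustrative sketch only: the row representation, the function name, and the restriction to cells holding a bare integer are assumptions, and only the two trigger phrases named in the text are included because the full list is not given.

```python
# Trigger phrases named in the text; the complete list is not given there.
TRIGGERS = ("number of patients", "num. of participants")
TOTAL_WORDS = ("overall", "total", "all patients")


def total_from_body(headers, rows):
    """Find the total number of participants in the table body.

    headers: column header texts, excluding the stub column.
    rows: list of (stub_text, [data cell texts]) aligned with headers.
    """
    for stub, cells in rows:
        # Trigger phrases are searched for in the stub, not the data cells.
        if not any(trigger in stub.lower() for trigger in TRIGGERS):
            continue

        # Candidate list: one value per column (assumes bare integers).
        values = {}
        for header, cell in zip(headers, cells):
            cell = cell.strip()
            if cell.isdigit():
                values[header] = int(cell)
        if not values:
            continue

        # A column whose header marks it as a total takes precedence.
        for header, n in values.items():
            if any(word in header.lower() for word in TOTAL_WORDS):
                return n

        # Otherwise the per-arm values are summed.
        return sum(values.values())
    return None
```

With headers `["Placebo", "Drug"]` and a row `("Number of patients", ["19", "21"])`, no total column exists and the two values are summed.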
3.4 Extracting Body Mass Index and Weight
The second case study extracts information about
BMI and mean weight of trial participants. This task
is much more complex because we want to extract the
values together with the names of the participant
groups in which they were measured.
For the BMI extraction, our approach is to look in
the stub of the table for the trigger phrases “body mass
index” or “bmi”. If a table contains these trigger
phrases, values from the table body are extracted.
However, we also check whether the value is in
the appropriate range (15-40). If the value is not in
this range, it does not represent a mean BMI value,
but some other value such as BMI change, standard
deviation, etc. If there is more than one column with
BMI values, the headers are probably the names of
the participant groups. To identify header cells that do
not represent participant group names, a list of terms
was created with tokens such as “range”, “p*”, “±”, “T”,
“p-value”, “p* value”, “%”, “significance”. The appearance
of these words indicates that the column does not
contain BMI values.
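A minimal sketch of the BMI heuristic is given below. It is an illustration, not the implementation used in the study. The text does not say how header terms are matched, so the sketch assumes that “T” must match the whole header while the other terms are matched as case-insensitive substrings. Taking the first number in a cell as the mean is a further assumption, and all function names are hypothetical.

```python
import re

BMI_TRIGGERS = ("body mass index", "bmi")
BMI_MIN, BMI_MAX = 15.0, 40.0  # plausible range for a mean BMI (from the text)

# Header terms indicating a column that is not a participant group.
# Assumption: "T" is matched against the whole header, the rest as substrings.
SUBSTRING_TERMS = ("range", "p*", "±", "p-value", "p* value", "%", "significance")
EXACT_TERMS = ("T",)

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def is_group_header(header):
    """Return True if the header looks like a participant group name."""
    text = header.strip()
    if text in EXACT_TERMS:
        return False
    lowered = text.lower()
    return not any(term in lowered for term in SUBSTRING_TERMS)


def extract_bmi(headers, rows):
    """Map participant group name to mean BMI.

    headers: column header texts, excluding the stub column.
    rows: list of (stub_text, [data cell texts]) aligned with headers.
    """
    results = {}
    for stub, cells in rows:
        if not any(trigger in stub.lower() for trigger in BMI_TRIGGERS):
            continue
        for header, cell in zip(headers, cells):
            if not is_group_header(header):
                continue
            match = NUMBER.search(cell)  # assumed: first number is the mean
            if match is None:
                continue
            value = float(match.group())
            # Values outside the range are treated as BMI change, SD, etc.
            if BMI_MIN <= value <= BMI_MAX:
                results[header] = value
    return results
```

In this sketch a row such as “BMI change” still matches the trigger phrase, but its values are discarded by the range check, which mirrors the role the range plays in the description above.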
These heuristics do not yield arm names only, but
rather patient groups in general, since
authors may create demographic tables that divide
patients by treatment (placebo, penicillin),
location (Paris, Toulouse), follow-up period (data on
enrolment, 1 week and 1 month after treatment) or
outcomes (survivors, non-survivors).
Similarly, the weight of patients was also extracted.
In this case, the trigger phrases were “weight” and
“bodyweight”. Since tables can present a number of
different measures related to weight, a stop list was
introduced to discard entries whose
stub contains a word from the list near the trigger
phrases. The stop list contained words like “loss”, “gain”
and “change”. In this case, we were not able to
define a range of values, since values may be given in
different measurement units (g, kg, lb) and a wide variety
of values is possible.
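The stop-list check for weight could be sketched as follows. The text does not define what “near” means, so this sketch assumes a window of two tokens on either side of the trigger. The window size, the tokenisation, and the function name are hypothetical, and only the three stop words named in the text are included.

```python
import re

WEIGHT_TRIGGERS = ("weight", "bodyweight")
STOP_WORDS = ("loss", "gain", "change")  # examples given in the text
WINDOW = 2  # assumed meaning of "near": within two tokens of the trigger


def is_weight_stub(stub):
    """Return True if the stub names a weight measure that should be kept."""
    tokens = re.findall(r"[a-z]+", stub.lower())
    for i, token in enumerate(tokens):
        if token not in WEIGHT_TRIGGERS:
            continue
        # Tokens within WINDOW positions of the trigger, trigger included.
        nearby = tokens[max(0, i - WINDOW): i + WINDOW + 1]
        if not any(word in nearby for word in STOP_WORDS):
            return True
    return False
```

Under these assumptions “Body weight (kg)” is kept, while “Weight loss (kg)” and “Change in weight from baseline” are discarded. No range check is applied, in line with the text.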
4 RESULTS
4.1 Table Decomposition Results
We have processed all 2517 PMC clinical trial documents.
Our method extracted data from 3573 tables.
The corpus contained 55.24% matrix, 42.46% super-row
(sub-header), 1.54% multi-table and 0.76% list tables.
Since each table has on average 80 cells, it would be
impractical to evaluate the whole dataset manually. We
chose 100 random tables of each type (or all tables of
a type where fewer than 100 were available, see Table 1)
and evaluated the algorithm’s output for them manually
by inspecting every table and its cell structures for
correctness. If at least one XML cell structure is not
read correctly, the table is labelled as incorrectly
decomposed.
Table 1: Accuracy of table decomposition system.

Class               Tables in dataset   Tables evaluated   Accuracy
Matrix tables       1974 (55.24%)       100                89%
Super-row tables    1517 (42.46%)       100                81%
List tables         27 (0.76%)          27                 77.7%
Multi-table tables  55 (1.54%)          55                 49.1%
Total               3573                282                84.9%
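The overall accuracy in the Total row appears to be consistent with weighting each class by its share of the dataset rather than by the number of evaluated tables. This is an inference from the figures in Table 1, not something stated in the text. The short check below reproduces the arithmetic; the variable and function names are illustrative.

```python
# (tables in dataset, accuracy in %) per class, copied from Table 1.
classes = {
    "Matrix": (1974, 89.0),
    "Super-row": (1517, 81.0),
    "List": (27, 77.7),
    "Multi-table": (55, 49.1),
}


def weighted_accuracy(classes):
    """Accuracy averaged over classes, weighted by tables in the dataset."""
    total = sum(count for count, _ in classes.values())
    return sum(count * acc for count, acc in classes.values()) / total


overall = weighted_accuracy(classes)
print(round(overall, 1))  # 84.9
```

Weighting by the number of evaluated tables instead would give a lower figure, because the small list and multi-table classes were evaluated in full.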
In Table 1, we present the results of our evaluation.
Matrix tables were the easiest to decompose,
and the accuracy would be even higher if our
dataset had perfect markup. Due to non-standard
XML labelling, our method was in some cases not
able to correctly recognize the table type or the borders
of navigational areas. Examples of mislabelling include
spanning cells (encoded as multiple cells rather than
with the spanning attribute) and incorrect labelling of
headers with thead tags (incorrectly tagging something as a
header). Super-row and list tables performed slightly
HEALTHINF 2016 - 9th International Conference on Health Informatics