Extracting Patient Data from Tables in Clinical Literature

Case Study on Extraction of BMI, Weight and Number of Patients

Nikola Milosevic, Cassie Gregson, Robert Hernandez and Goran Nenadic (1, 3)

(1) School of Computer Science, University of Manchester, Manchester, U.K.

(2) AstraZeneca plc, Cambridge, U.K.

(3) Health e-Research Center, University of Manchester, Manchester, U.K.

Keywords:

Text Mining, Table Mining, Information Extraction, Natural Language Processing, Clinical Trials.

Abstract:

Current biomedical text mining efforts are mostly focused on extracting information from the body of research

articles. However, tables contain important information such as key characteristics of clinical trials. Here, we

examine the feasibility of information extraction from tables. We focus on extracting data about clinical

trial participants. We propose a rule-based method that decomposes tables into cell level structures and then

extracts information from these structures. Our method performed with an F-measure of 83.3% for extraction of

number of patients, 83.7% for extraction of patients' body mass index and 57.75% for patients' weight. These

results are promising and show that information extraction from tables in biomedical literature is feasible.

1 INTRODUCTION

The amount of published scientiﬁc research is accel-

erating: the number of published papers is growing at

a double-exponential pace (Hunter and Cohen, 2006).

MEDLINE contains over 25 million references and

it is impossible to cope with this amount of published

research.

Text mining provides tools and methods to deal

with large numbers of articles in biomedicine. How-

ever, these efforts have been focused mainly on the

processing of unstructured text and most of them ig-

nored lists, tables and ﬁgures.

Tables are used for storing large amounts of fac-

tual or statistical data in a structured, concise and

human-readable way (Tengli et al., 2004). They also

provide a way for storing multidimensional data. The

visual layout of a table often describes relationships

between the items in the table. Because of the vari-

ety of layouts, it is challenging to perform analysis of

data in this form.

In biomedicine, important experimental information, such as the settings and results of experiments, interactions between substances, drug side effects, and information about arms and patients, is usually stored in tables. In the PMC database, more than 72% of research articles contain tables. On manual inspection, we found that some of the documents in the database do not contain the whole article in XML format (scanned documents, containing only parts in XML). We also calculated that PMC articles contain on average 2.72 tables.

MEDLINE: http://www.ncbi.nlm.nih.gov/pubmed

In this paper, we present a method for table decomposition and a case study on extracting information from tables in biomedical literature. The aim of our study is to examine the feasibility of extracting information about patients from tables in clinical literature. Our case study extracts the number of patients, body mass index (BMI) and weight of patients from tables.

2 BACKGROUND

Hurst (Hurst, 2000) was among the ﬁrst to examine

tables from the text mining perspective. He proposed

a model of tables with ﬁve components: graphical,

physical, functional, structural and semantic. Also,

Hurst created one of the ﬁrst table mining engines.

He split the process of table mining into three parts:

table detection, functional analysis and information

extraction.

The table detection step examines how to correctly detect tables in the documents. Work has been done in detecting tables from PDF, HTML and ASCII documents using Optical Character Recognition (Kieninger and Strieder, 1999), machine learning algorithms such as C4.5 decision trees (Ng et al., 1999) and SVM (Son et al., 2008) or heuristics (Yildiz et al., 2005).

Milosevic, N., Gregson, C., Hernandez, R. and Nenadic, G.

Extracting Patient Data from Tables in Clinical Literature - Case Study on Extraction of BMI, Weight and Number of Patients.

DOI: 10.5220/0005660102230228

In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 5: HEALTHINF, pages 223-228

ISBN: 978-989-758-170-0

The second step is functional analysis, which examines the purpose of areas of the table. The aim of this step is to identify which cells contain raw data and which contain navigational data. Machine learning methods such as C4.5 decision trees (Chavan and Shirgave, 2011) or CRF (Wei et al., 2006) have been used for this step.

The final step is semantic processing. In this step, relationships and semantics of the table elements are analysed. Semantic processing of tables is used for information retrieval (Hearst et al., 2007; Divoli et al., 2010), information extraction (Mulwad et al., 2010; Wong et al., 2009) and question answering systems (Wei et al., 2006).

To our knowledge, no work has so far been conducted on extracting information from tables in clinical literature.

3 METHOD

We aim to extract information from tables about participants of clinical trials, such as their number, BMI and weight. The method we propose is composed of two parts: table decomposition into structures that are more suitable for further processing, and information extraction. We propose a way to decompose a table into cell-level data structures while maintaining information about relationships between elements of the table. Table decomposition, viewed through Hurst's model, represents functional and structural table analysis. The second part considers information extraction from the tables, which corresponds to semantic analysis in Hurst's model.

Data

Our dataset had 2517 documents collected as clinical trial publications from PubMedCentral (PMC, http://www.ncbi.nlm.nih.gov/pmc/). Out of these documents, 568 had no XML representation of tables; instead, they had a reference to an image of the scanned table. The total number of tables in our dataset was 4141.

Firstly, we conducted a manual analysis on a small sample of 70 PMC documents with 217 tables. Based on our analysis, we were able to create rules to identify structure, decompose tables in a structured manner and extract information.

3.1 Table Decomposition

Table decomposition contains ﬁve steps.

In the first step, the algorithm locates the table together with its metadata, such as the caption and footer. These data are stored in particular XML tags.

In the second step, our algorithm locates the headers and stubs of the table. Cells that are inside thead tags are labelled as header cells. The left-most column cells are labelled as stub cells. If this column has row-spanning cells, then the following column is also labelled as part of the stub. Row-spanning cells are usually used to group and categorise other stub cells in the following column. The first column with no row-spanning cells outside the header will be the last column labelled as the stub. Similarly, if there is no thead tag, complex headers with column-spanning cells are labelled as headers. If there are no thead tags, our method also checks whether the table has a header at all by comparing the value types of the first five rows. Since the table might have multiple layers of headers, five was the optimal number of rows for this check, since it indicates the separation between types in an unambiguous way. If the cell in the first row has a different type (e.g. text) than the cells in the following rows (e.g. numeric), the first cell is labelled as part of the header. If all five cells have values of the same type, the table has no header. The type of a cell can be empty, numeric (integer or floating point number), partially numeric (number with special characters and punctuation) or string.
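The type-based header check described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names, the regular expressions used to tell the four cell types apart, and the decision to flag a column only when every following row differs in type from the first row are all assumptions. The table is assumed to be a rectangular list of rows of strings.

```python
import re

def cell_type(value: str) -> str:
    """Classify a cell as 'empty', 'numeric', 'partially numeric' or 'string'."""
    text = value.strip()
    if not text:
        return "empty"
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return "numeric"
    # Digits combined only with punctuation or special characters, e.g. "27.4 (4.5)".
    if re.search(r"\d", text) and not re.search(r"[A-Za-z]", text):
        return "partially numeric"
    return "string"

def header_columns(rows, check_rows=5):
    """Return the indexes of columns whose first-row cell looks like a header.

    A column is flagged when the type of its first-row cell differs from the
    type of every cell below it within the first `check_rows` rows.
    """
    sample = rows[:check_rows]
    flagged = []
    for col in range(len(sample[0])):
        types = [cell_type(row[col]) for row in sample]
        if len(types) > 1 and all(t != types[0] for t in types[1:]):
            flagged.append(col)
    return flagged

# Hypothetical table without thead markup (values are illustrative only).
table = [
    ["Characteristic", "Group A", "Group B"],
    ["Age", "54", "57"],
    ["BMI", "27.4", "27.0"],
    ["Weight", "80.1", "79.5"],
    ["Height", "1.70", "1.72"],
]
print(header_columns(table))
```

In this example the stub column is not flagged, because all of its cells are strings, while the two data columns are flagged because their first-row cells are text and the cells below are numeric.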

In the third step, spanning cells (recognised by the appropriate XML attribute) are split and the content of the cell is copied to all the newly created cells (Chen et al., 2000).
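A minimal sketch of the span-splitting step is given below. It assumes that each cell has already been read into a tuple of text, row span and column span (the tuple layout and the function name are hypothetical), and it copies the text of a spanning cell into every grid position the cell covers.

```python
def expand_spans(rows):
    """Split spanning cells and copy their content into every covered position.

    rows: list of table rows; each cell is a (text, rowspan, colspan) tuple.
    Returns a rectangular grid (list of lists of strings).
    """
    grid = {}  # (row index, column index) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from a row above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

# Hypothetical two-row header: "Group" spans two rows, "Treatment" spans two columns.
header = [
    [("Group", 2, 1), ("Treatment", 1, 2)],
    [("Placebo", 1, 1), ("Drug", 1, 1)],
]
for line in expand_spans(header):
    print(line)
```

After expansion every position of the grid holds a value, which makes the later column-wise and row-wise lookups straightforward.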

The fourth step is classification of the table by number of dimensions. Navigational paths are read differently for one-, two- or multi-dimensional tables. Our algorithm identifies three types of tables using heuristic rules. List (one-dimensional) tables contain a list of items in one or more columns (multiple columns are used for space-saving reasons). A table is recognised as a list table if it has only one column, if the header spans all the columns, or if all the columns have the same header. Matrix (two-dimensional) tables contain data arranged in a simple matrix of cells (an example can be seen in Figure 3). Super-row (multi-dimensional) tables are similar to matrix tables, but the presence of super-rows (Tengli et al., 2004) changes the way they are read (an example can be seen in Figure 1). Super-rows are usually presented as a row inside the data part of the table that spans all columns, or as a row with a value in only one cell.
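The classification heuristics above can be illustrated with a short sketch. It assumes that spanning cells have already been split and copied, so a header or super-row that spans all columns appears as a row of identical values. The function name and the input layout are hypothetical, and multi-tables are not handled.

```python
def classify_table(header, body):
    """Classify a table as 'list', 'matrix' or 'super-row'.

    header: list of header rows, body: list of data rows (lists of strings),
    both with spanning cells already expanded.
    """
    n_cols = max(len(row) for row in header + body)
    if n_cols == 1:
        return "list"
    # A header spanning all columns, or the same header for every column,
    # shows up as a header row with a single distinct value.
    if any(len(row) == n_cols and len(set(row)) == 1 for row in header):
        return "list"
    for row in body:
        filled = [cell for cell in row if cell.strip()]
        spans_all_columns = len(filled) == n_cols and len(set(filled)) == 1
        if len(filled) == 1 or spans_all_columns:
            return "super-row"
    return "matrix"

# Illustrative inputs (not taken from the corpus).
print(classify_table([["Item", "Item"]], [["a", "b"], ["c", "d"]]))
print(classify_table([["", "A", "B"]], [["Age", "1", "2"], ["BMI", "3", "4"]]))
print(classify_table([["", "A", "B"]], [["Baseline", "", ""], ["Age", "1", "2"]]))
```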

HEALTHINF 2016 - 9th International Conference on Health Informatics

Figure 1: Example of the table (PMC29053) and the decomposition XML output for one cell from that table.

In the last step, our method iterates through all data cells and tries to find the correct navigation path. A navigation path is a path through the navigational cells (header, stub, super-rows) that logically annotates the data from the data cell. In list tables, only the header value is part of the navigation path. For matrix tables, the algorithm has to read the header cell in the same column as the given cell, the stub cell in the same row as the cell and the header value for the stub's column. Since a super-row table may have a number of super-row levels in a tree-like structure, we created a stack structure that stores the current super-row paths as the algorithm iterates through the cells. For this kind of table, our method reads the header value for the stub (the stub's label), all levels of super-rows above the item of interest, the stub value and the header value above the cell.
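The stack-based reading of super-row tables can be sketched as below. The sketch assumes a single header row and assumes that the nesting level of each super-row is already known (how the level is determined is not described here, so the `level` field is a hypothetical input). The function name and the example values are illustrative.

```python
def navigation_paths(header, rows):
    """Build a navigation path for every data cell of a super-row table.

    header: list of column labels; header[0] labels the stub column.
    rows:   list of (level, cells) pairs; level is an int >= 1 for a
            super-row and None for an ordinary data row.
    Returns a list of (path, value) pairs.
    """
    stack = []  # labels of the currently open super-rows, outermost first
    paths = []
    for level, cells in rows:
        if level is not None:
            # A super-row closes every open super-row at the same or a deeper level.
            del stack[level - 1:]
            stack.append(cells[0])
            continue
        stub = cells[0]
        for col in range(1, len(cells)):
            path = [header[0], *stack, stub, header[col]]
            paths.append((path, cells[col]))
    return paths

# Hypothetical super-row table (values are illustrative only).
header = ["Characteristic", "Placebo", "Drug"]
rows = [
    (1, ["Baseline", "", ""]),
    (None, ["BMI", "27.4", "27.0"]),
    (1, ["Follow-up", "", ""]),
    (None, ["BMI", "26.9", "26.1"]),
]
for path, value in navigation_paths(header, rows):
    print(" > ".join(path), "=", value)
```

For a matrix table the same function can be used with no super-rows, in which case the path reduces to the stub label, the stub value and the column header.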

Data retrieved from the tables are stored in the

XML elements (see Figure 1).

Figure 2: Workﬂow of table decomposition method.

A work-ﬂow diagram of our method can be seen

in Figure 2.

3.2 Table Information Extraction

We performed two case studies on information extrac-

tion from tables. The ﬁrst study’s objective was to

extract the total number of patients, while the second

had to extract BMI and weight of patients from a clin-

ical trial publication. In the second task, the partici-

pant group names had to be extracted together with

the appropriate mean BMI or weight. For example, the table shown in Figure 3 has two participant groups. The extracted information will be: [Absolute Risk (n = 232): BMI: 27.4 (4.5)] and [NNT (n = 225): BMI: 27.0 (4.3)].

Figure 3: Example of a clinical trial demographic table that contains information about patients' BMI (PMC58836).

3.3 Extraction of Number of Trial

Participants

The number of participants is a numerical value and

there is a limited set of trigger words to indicate its

appearance in a table. The number of patients can be

presented in different places in the table, and it may

be presented not only as a single (overall) number but also

as a number of participants for each arm of the trial.

The table caption usually presents the total number of clinical trial participants. We extract the number of participants using two rules. The first rule looks for a number followed, in its vicinity, by one of the trigger words (subject, patient, person, individual, people, infant) in either singular or plural form. The trigger word does not need to be the word next to the number, since in some cases the authors may want to specify the participants more precisely (e.g. 16 1-month-old infants, 1239 blood donors). The second rule looks for a pattern consisting of the letter n, the equals sign and a number (e.g. n=19).
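The two caption rules can be approximated with regular expressions, as in the sketch below. The size of the "vicinity" window (up to three intervening words) is an assumption, since it is not specified above, and the pattern and function names are hypothetical.

```python
import re

# Trigger words in singular or plural form.
TRIGGERS = r"(?:subjects?|patients?|persons?|individuals?|people|infants?)"

# Rule 1: a number, then up to three intervening words (assumed window),
# then one of the trigger words.
NUMBER_THEN_TRIGGER = re.compile(
    rf"\b(\d+)\b(?:\s+[\w-]+){{0,3}}?\s+{TRIGGERS}\b", re.IGNORECASE
)

# Rule 2: the letter n, an equals sign and a number, e.g. "n=19" or "n = 19".
N_EQUALS = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)

def participant_counts(caption):
    """Return all candidate participant counts found in a caption."""
    counts = [int(m.group(1)) for m in NUMBER_THEN_TRIGGER.finditer(caption)]
    counts += [int(m.group(1)) for m in N_EQUALS.finditer(caption)]
    return counts

print(participant_counts("Characteristics of 16 1-month-old infants"))
print(participant_counts("Baseline data (n = 19)"))
```

A wider window catches more paraphrases but also increases the chance of pairing an unrelated number with a trigger word, so the window size is a precision and recall trade-off.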

There are several ways to store the number of clinical trial participants in the navigational cells of the table. One way is to store the total number of patients in a stub, while the other is to store it in the header. Usually, in stubs and headers, the number of patients is presented in the form of a mathematical expression (e.g. n = 19). In stubs, we often expect the total number of patients in one cell. Since the header may have values per arm in each column, we create a list of candidates. Firstly, all the values are added to the list. If the content of a cell contains the word "overall" or "total", or the phrase "all patients", that value is considered to be the total number of participants. However, if no such cell exists, we check whether the stub's header cell has a value for the number of patients. If neither is the case, the values from the header columns are summed (an example of this can be seen in Figure 3).
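The candidate-list logic for header cells can be sketched as follows. The function and variable names are hypothetical, and the sketch only recognises counts written in the n = number form.

```python
import re

N_EQUALS = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)
TOTAL_MARKERS = ("overall", "total", "all patients")

def total_participants(header_cells, stub_header=""):
    """Pick the total number of participants from header cells.

    1. Collect every "n = <number>" value found in the header cells.
    2. If a cell is marked as a total ("overall", "total", "all patients"),
       return its value.
    3. Otherwise use the value in the stub's header cell, if there is one.
    4. Otherwise sum the per-column values.
    """
    candidates = []
    for cell in header_cells:
        match = N_EQUALS.search(cell)
        if match:
            candidates.append((cell.lower(), int(match.group(1))))
    for text, count in candidates:
        if any(marker in text for marker in TOTAL_MARKERS):
            return count
    stub_match = N_EQUALS.search(stub_header)
    if stub_match:
        return int(stub_match.group(1))
    return sum(count for _, count in candidates) if candidates else None

# Header labels in the style of the Figure 3 example.
print(total_participants(["Absolute Risk (n = 232)", "NNT (n = 225)"]))
```

Returning None when no candidate is found keeps "no count in this table" distinct from a count of zero.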

The number of patients may also be placed in the body of the table. Similarly to headers, data cells may present the number of patients in parts (e.g. per arm), as a single total number, or, in some tables, as both partial and total numbers. Since the data cells may contain only numerical values, the search for trigger words and patterns has to be done in the appropriate stub cells. We have defined trigger phrases which our method searches for in the stub (Number of patients, Num. of participants, etc.). If one is found, the values from the data cells are extracted and added to the list of candidates. Headers also need to be analysed (by checking whether the header value contains the words "overall" or "total", or the phrase "all patients") in order to determine whether some column presents the total number of participants. If there is no such column, the summed value represents the total number of participants.

3.4 Extracting Body Mass Index and

Weight

The second case study extracts information about

BMI and mean weight of trial participants. This task

is much more complex because we want to extract in-

formation, together with the participant group names

in which these values were measured.

For BMI extraction, our approach is to look in the stub of the table for the trigger phrases "body mass index" or "bmi". If a table contains these trigger phrases, values from the table body are extracted. However, we also check whether the value is in the appropriate range (15-40). If the value is not in this range, it does not represent a mean BMI value, but some other value such as BMI change, standard deviation, etc. If there is more than one column with BMI values, the headers are probably the names of the participant groups. To identify header cells that do not represent participant group names, a list of terms was created with tokens such as "range", "p*", "±", "T", "p-value", "p* value", "%" and "significance". The appearance of these terms in a header indicates that the column does not contain BMI values.
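A minimal sketch of the BMI rule is shown below. The trigger phrases and the 15-40 range come from the description above. The term list is a simplified version of the one described, the function name and the table layout are hypothetical, and the Age and p-value cells in the example are made up for illustration.

```python
import re

BMI_TRIGGERS = ("body mass index", "bmi")
# Simplified list of header terms that do not name a participant group.
NON_GROUP_TERMS = ("range", "p-value", "p value", "%", "significance", "±")
NON_GROUP_LABELS = ("p", "t")  # headers consisting of a single statistic symbol
BMI_RANGE = (15.0, 40.0)

def extract_bmi(header, rows):
    """Return (participant group, cell content) pairs for mean BMI values.

    header: column labels, header[0] labels the stub column.
    rows:   data rows as lists of strings, row[0] is the stub cell.
    """
    results = []
    for row in rows:
        if not any(trigger in row[0].lower() for trigger in BMI_TRIGGERS):
            continue
        for label, cell in zip(header[1:], row[1:]):
            lowered = label.strip().lower()
            if lowered in NON_GROUP_LABELS or any(t in lowered for t in NON_GROUP_TERMS):
                continue  # statistical column, not a participant group
            number = re.search(r"\d+(?:\.\d+)?", cell)
            if number and BMI_RANGE[0] <= float(number.group()) <= BMI_RANGE[1]:
                results.append((label, cell))
    return results

header = ["Characteristic", "Absolute Risk (n = 232)", "NNT (n = 225)", "p-value"]
rows = [
    ["Age", "54.1 (9.2)", "55.0 (8.7)", "0.31"],           # illustrative values
    ["BMI (kg/m2)", "27.4 (4.5)", "27.0 (4.3)", "0.40"],  # BMI values from the Figure 3 example
]
print(extract_bmi(header, rows))
```

The range check uses the first number in the cell, so a value such as a BMI change of 1.2 is discarded even when the stub matches a trigger phrase.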

Using these heuristics, it is not possible to obtain only arm names, but rather patient groups in general, since the authors may create demographic tables in which they divide patients by treatment (placebo, penicillin), location (Paris, Toulouse), follow-up period (data on enrolment, 1 week and 1 month after treatment) or outcomes (survivors, non-survivors).

Similarly, weight of patients was also extracted.

In this case trigger phrases were ”weight” and ”body-

weight”. Since tables can present a number of dif-

ferent measures related to weight, a stop list was in-

troduced, which had the role of discarding entries if

the stub contains a word from the list near the trigger

phrases. The stop list contained words like "loss", "gain" and "change". In this case, we were not able to define the range of values, since values may be in different measurement units (g, kg, lb) and a wide variety of values is possible.
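The stop-list filter for weight can be sketched as below. The three stop words come from the description above. The window of three tokens on either side of the trigger is an assumed interpretation of "near", and the function name is hypothetical. Matching the substring "weight" also covers compound spellings of the second trigger phrase.

```python
import re

STOP_WORDS = {"loss", "gain", "change"}

def is_weight_stub(stub, window=3):
    """Return True if a stub cell mentions weight without a nearby stop word.

    window: number of tokens on each side of the trigger that are checked
            for stop words (an assumed value).
    """
    tokens = re.findall(r"[a-z]+", stub.lower())
    for i, token in enumerate(tokens):
        if "weight" in token:
            nearby = tokens[max(0, i - window): i + window + 1]
            if not STOP_WORDS.intersection(nearby):
                return True
    return False

for stub in ["Weight (kg)", "Weight loss (kg)", "Body weight change", "Height (cm)"]:
    print(stub, "->", is_weight_stub(stub))
```

A short stop list of this kind still accepts stubs such as "overweight", which illustrates why a richer stop list would be needed to improve precision.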

4 RESULTS

4.1 Table Decomposition Results

We have processed all 2517 PMC clinical trial documents. Our method extracted data from 3573 tables. The corpus contained 55.24% matrix, 0.76% list and 42.46% super-row tables. Since each table has on average 80 cells, it would be impractical to evaluate the whole dataset manually. We have chosen up to 100 random tables of each type and evaluated the algorithm's output for them manually by inspecting every table and its cell structures for correctness. If at least one XML cell structure is not read correctly, the table is labelled as incorrectly decomposed.

Table 1: Accuracy of table decomposition system.

Class               Tables in dataset   N. Eval.   Accuracy
Matrix tables       1974 (55.24%)       100        89%
Super-row tables    1517 (42.46%)       100        81%
List tables         27 (0.76%)          27         77.7%
Multi-table tables  55 (1.54%)          55         49.1%
Total               3573                282        84.9%

In Table 1, we present the results of our evaluation. Matrix tables were the easiest to decompose, and the accuracy would be even higher if our dataset had perfect markup. Due to non-standard XML labelling, our method was in some cases not able to correctly recognise the table type or the borders of navigational areas. Examples of mislabelling include spanning cells (not using the attribute, but rather using multiple cells) and incorrect labelling of headers with thead tags (incorrectly tagging something as a header). Super-row and list tables performed slightly worse. We encountered a small number of tables that actually presented several similar tables merged together (we called them multi-tables). We included a simple algorithm that is able to recognise navigational paths in them based on the presence of horizontal lines. However, this algorithm was not good enough to recognise navigational paths with high performance. Due to the small number of these tables, they did not affect our overall performance. The overall accuracy of table decomposition was 84.9%.
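One reading that reproduces the overall figure in Table 1 is an average of the per-class accuracies weighted by each class's share of the dataset, rather than a pooled accuracy over the 282 evaluated tables. This interpretation is an assumption and is not stated in the text. The short check below uses only the numbers from Table 1.

```python
# class: (share of tables in the dataset, accuracy on the evaluated sample), from Table 1
classes = {
    "matrix": (0.5524, 0.89),
    "super-row": (0.4246, 0.81),
    "list": (0.0076, 0.777),
    "multi-table": (0.0154, 0.491),
}

# Weighted average: each class contributes in proportion to its share of the dataset.
overall = sum(share * accuracy for share, accuracy in classes.values())
print(f"{overall:.1%}")  # prints 84.9%
```

Weighting by class share explains why the low multi-table accuracy has little effect on the overall figure: that class accounts for only 1.54% of the tables.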

4.2 Number of Patients Extraction

Results

For the extraction of the number of patients, we pro-

cessed all documents in our dataset. The total num-

ber of participants was extracted from 758 documents.

For evaluation purposes we randomly selected 50

documents. Our system performed with an F-measure

of 83.3%. More detailed statistics can be seen in Ta-

ble 2.

Table 2: Performance of extracting total number of patients.

Precision   73.53%
Recall      96.15%
F-measure   83.3%

4.3 BMI, Weight and Patient Group

Name Extracting Results

For the extraction of BMI and weight, we selected a dataset of 113 documents that have, in at least one of their tables, a token related to BMI or weight. We separately evaluated the patient group, weight and BMI extraction. The results are shown in Table 3.

Results for BMI and weight are dependent on how the participant groups were recognised, because each extracted value is assigned to a participant group. Participant groups were extracted with an F-measure of 71.32%. They are hard to extract correctly because they may be formed from a wide range of concepts (location, drug, treatment, time, etc.) and may include acronyms or abbreviations. Complex tables with multiple levels of headers may create additional complexity, since it might be hard to determine where the name of the group ends and where the technical or statistical separation of the table's cells starts (e.g. mean and standard deviation columns).

BMI has a higher F-measure than participant group extraction. This may look strange, because in order to extract BMIs, the patient group has to be extracted correctly as well. However, the defined BMI range made a large contribution to discarding false positives.

Our method for weight extraction performed with high recall but with very low precision. This is because the method was matching trigger phrases but did not have a well-crafted stop list that could help to distinguish actual patient weight from other weight-related concepts.
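The precision, recall and F-measure values in Table 3 follow from the true positive, false positive and false negative counts through the standard definitions. The sketch below recomputes them for the BMI and weight rows (the function name is hypothetical). Applying the same formula to the participant group counts as printed gives a precision of about 62.2% rather than the reported 61.45%, so one of the printed counts for that row may differ from the one used in the evaluation.

```python
def precision_recall_f(tp, fp, fn):
    """Standard precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# (TP, FP, FN) counts taken from Table 3.
counts = {"BMI": (72, 22, 6), "Weight": (95, 133, 6)}
for name, (tp, fp, fn) in counts.items():
    p, r, f = precision_recall_f(tp, fp, fn)
    print(f"{name}: precision={p:.2%} recall={r:.2%} F-measure={f:.2%}")
```

The weight row shows how 133 false positives pull the F-measure down to under 58% despite a recall above 94%.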

5 CONCLUSION

Information extraction from tables has not been extensively researched. However, in many fields, such as biomedicine, it could be useful because of the importance of the data presented in tables. Information extraction from tables can use some of the established text mining techniques, but due to the challenge of understanding the visual layouts, new approaches have to be developed as well.

We developed a methodology for table decomposition into cell-level data structures. Our method is able to read table data with the associated navigational information. Using these structures, it is easier to perform semantic analysis and information extraction. We performed a case study on extracting the number of trial participants, BMIs and names of the participant groups from clinical literature. Although we used relatively simple rules for information extraction, our results are promising (F-measure of 83.7% for BMI extraction and over 57% for weight extraction). Our results indicate that some information classes may be easier to extract because it is possible to model expected values, while others remain a challenge.

The results of our case studies are comparable with state-of-the-art methods in table information extraction. However, not many works report information extraction from tables. Hurst (Hurst, 2000) reported an F-score of 83.13% for the combined task of functional, structural and relational analysis. However, this task matches our table decomposition task, which is just the first part of our information extraction method. Gatterbauer et al. (Gatterbauer et al., 2007) created a generic information extraction system, but they reported an F-measure of 52%. Tengli et al. (Tengli et al., 2004) reported the best F-measure, 91.4%, for information extraction from tables. However, they applied their method to the Common Data Set tables, a standardised presentation format for higher education data in the United States. In contrast to these tables, tables from PMC are not standardised in any way.

Table 3: Performance of extracting BMI, weight and patient groups from PMC clinical trial documents (TP - true positives, FP - false positives, FN - false negatives).

Class               TP    FP    FN   Precision   Recall   F-measure
BMI                 72    22    6    76.6%       92.3%    83.7%
Participant group   153   93    27   61.45%      85%      71.32%
Weight              95    133   6    41.66%      94.05%   57.75%

The performance of our method is quite promising and indicates that information extraction from tables is a feasible task. However, there is room for improvement. There is still a need for human curators to control the system and correct mistakes. We believe our system will reduce data curation time for medical documents.

6 AVAILABILITY

Our code and annotated corpus are available on GitHub (https://github.com/nikolamilosevic86/TableAnnotator). The current code is a work in progress and might be subject to changes. The documents used were clinical trial articles from PubMedCentral (http://www.ncbi.nlm.nih.gov/pmc/).

REFERENCES

Chavan, M. M. and Shirgave, S. (2011). A methodology for extracting head contents from meaningful tables in web pages. In Communication Systems and Network Technologies (CSNT), 2011 International Conference on, pages 272–277. IEEE.

Chen, H.-H., Tsai, S.-C., and Tsai, J.-H. (2000). Mining tables from large scale html texts. In Proceedings of the 18th conference on Computational linguistics-Volume 1, pages 166–172. Association for Computational Linguistics.

Divoli, A., Wooldridge, M. A., and Hearst, M. A. (2010). Full text and figure display improves bioscience literature search. PloS one, 5(4):e9619.

Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., and Pollak, B. (2007). Towards domain-independent information extraction from web tables. In Proceedings of the 16th international conference on World Wide Web, pages 71–80. ACM.

Hearst, M. A., Divoli, A., Guturu, H., Ksikes, A., Nakov, P., Wooldridge, M. A., and Ye, J. (2007). Biotext search engine: beyond abstract search. Bioinformatics, 23(16):2196–2197.

Hunter, L. and Cohen, K. B. (2006). Biomedical language processing: what's beyond PubMed? Molecular cell, 21(5):589.

Hurst, M. F. (2000). The interpretation of tables in texts. PhD thesis, University of Edinburgh.

Kieninger, T. G. and Strieder, B. (1999). T-recs table recognition and validation approach. In AAAI Fall Symposium on Using Layout for the Generation, Understanding and Retrieval of Documents.

Mulwad, V., Finin, T., Syed, Z., and Joshi, A. (2010). Using linked data to interpret tables. COLD, 665.

Ng, H. T., Lim, C. Y., and Koo, J. L. T. (1999). Learning to recognize tables in free text. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 443–450. Association for Computational Linguistics.

Son, J.-W., Lee, J.-A., Park, S.-B., Song, H.-J., Lee, S.-J., and Park, S.-Y. (2008). Discriminating meaningful web tables from decorative tables using a composite kernel. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT'08. IEEE/WIC/ACM International Conference on, volume 1, pages 368–371. IEEE.

Tengli, A., Yang, Y., and Ma, N. L. (2004). Learning table extraction from examples. In Proceedings of the 20th international conference on Computational Linguistics, page 987. Association for Computational Linguistics.

Wei, X., Croft, B., and McCallum, A. (2006). Table extraction for answer retrieval. Information retrieval, 9(5):589–611.

Wong, W., Martinez, D., and Cavedon, L. (2009). Extraction of named entities from tables in gene mutation literature. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, pages 46–54. Association for Computational Linguistics.

Yildiz, B., Kaiser, K., and Miksch, S. (2005). pdf2table: A method to extract table information from pdf files. In IICAI, pages 1773–1785.
