The Impact of Pre-processing on the Classification of MEDLINE Documents

Carlos Adriano Gonçalves 1, Célia Talma Gonçalves 2,3, Rui Camacho 1 and Eugénio Oliveira 2

1 LIAAD & FEUP - Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
2 LIACC & FEUP - Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
3 CEISE & ISCAP - Instituto Superior de Contabilidade e Administração, Porto, Portugal
Abstract. The amount of information available in the MEDLINE database makes it very hard for a researcher to retrieve a reasonable number of relevant documents using a simple query language interface. Automatic classification of documents may be a valuable technology to help reduce the number of documents retrieved for each query. To accomplish this, it is of capital importance to use appropriate pre-processing techniques on the data. The main goal of this study is to analyse the impact of pre-processing techniques on the text classification of MEDLINE documents. We have assessed the effect of combining different pre-processing techniques with several classification algorithms available in the WEKA tool. Our experiments show that the application of pruning, stemming and WordNet significantly reduces the number of attributes and improves the accuracy of the results.
1 Introduction
Scientific publications in molecular biology and biomedicine are available (at least their abstracts) in the Medical Literature Analysis and Retrieval System On-line (MEDLINE).
MEDLINE is the U.S. National Library of Medicine’s (NLM) premier bibliographic
database that contains over 16 million references to journal articles in life sciences with
a concentration on biomedicine. A distinctive feature of MEDLINE is that the records
are indexed with NLM’s Medical Subject Headings (MeSH terms). MEDLINE is the
major component of PubMed [1], a database of citations of the National Library of
Medicine in the United States. PubMed comprises more than 19 million citations for
biomedical articles from MEDLINE and life science journals. The result of a MEDLINE/PubMed search is a list of citations to journal articles, each including authors, title, journal name, the abstract of the paper, keywords and MeSH terms. Such a search quite often returns a huge number of documents, making it very hard for researchers to efficiently reach the most relevant documents for their queries. As this is a very relevant and current topic of investigation, we investigate the use of Machine Learning-based text classification techniques to help identify a reasonable number of relevant documents in MEDLINE. In this study we focus on assessing the effect of different pre-processing techniques on the quality of classifiers available in the WEKA tool.
The rest of this paper is structured as follows. In Section 2 we overview the pre-
processing techniques used. In Section 3 we present the text classification problem and
existing techniques. We report on related work in Section 4. Section 5 describes the ex-
periments and Section 6 the results obtained. Section 7 concludes the paper and points out directions for future work.
2 Pre-processing Techniques
Before applying any data analysis technique it is necessary to pre-process the collection
of documents. In our study we have used the following pre-processing techniques:
Tokenization, the process of breaking a text into tokens. A token is a non-empty
sequence of characters, excluding spaces and punctuation.
Lowercase conversion converts all terms to lower case.
Special character removal removes all the special characters (+, -, !, ?, ., ,, ;, :, {,
}, =, &, #, %, $, [, ], /, <, >, \, “, ”, |) and digits.
Stop Word Removal removes words that carry little content, such as articles, conjunctions and prepositions (e.g., a, the, at, etc.). These words are not useful for the evaluation of the document content.
Stemming is a widely used technique in text analysis. It is the process of removing the inflectional affixes of words, reducing them to their stem.
Pruning discards terms that appear either too rarely or too frequently. Such terms do not help to identify the topic of a document. The most common pruning criteria are term frequency and document frequency.
Treating synonyms: handling synonyms may be seen as another pre-processing technique. If two words or terms mean the same thing, i.e., if they are synonyms, one can be replaced by the other without changing the semantic content of the text.
Document representation: after the above listed “filters”, each document is en-
coded and stored as a standard vector of term weights.
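To make the sequence of filters concrete, the sketch below chains tokenization, lowercase conversion, special character and digit removal, and stop word removal on a string. It is only an illustration: the regular expression and the tiny stop word set are assumptions, not the exact resources used in this study.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative chain of the first pre-processing filters described above.
public class BasicPreprocessing {
    // A tiny illustrative stop word list; the study used a much larger one.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "the", "at", "of", "and", "in"));

    public static List<String> preprocess(String text) {
        // Special character and digit removal: keep only letters and whitespace.
        String cleaned = text.replaceAll("[^\\p{L}\\s]", " ");
        List<String> tokens = new ArrayList<>();
        // Tokenization on whitespace, followed by lowercase conversion.
        for (String token : cleaned.split("\\s+")) {
            String term = token.toLowerCase();
            // Stop word removal.
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("Binding of E. coli proteins at the membrane!"));
        // -> [binding, e, coli, proteins, membrane]
    }
}
```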
In the Vector Space Model there are several ways of assigning weights to the terms. Two of the most common weighting schemes are TF (Term Frequency) and TFIDF (Term Frequency - Inverse Document Frequency). We have used TFIDF in our study. [2] provides the following definitions:

TF(term, document) = the frequency of term in document    (1)

and

IDF = log( (number of documents in collection) / (number of documents with term) ) + 1.    (2)
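As a small numeric illustration of definitions (1) and (2), the sketch below computes the weight of a term that occurs 3 times in a document and appears in 100 out of 1000 documents; the counts are made-up numbers, not values from the MEDLINE sample, and the logarithm base is an assumption since the paper does not fix it.

```java
// Illustrative computation of the TFIDF weight following equations (1) and (2).
// The counts used in main() are made-up numbers, not values from the MEDLINE sample.
public class Tfidf {
    static double idf(int docsInCollection, int docsWithTerm) {
        // IDF = log(N / n_term) + 1, as in equation (2); natural logarithm assumed.
        return Math.log((double) docsInCollection / docsWithTerm) + 1.0;
    }

    static double tfidf(int termFreqInDoc, int docsInCollection, int docsWithTerm) {
        // TF is simply the frequency of the term in the document, as in equation (1).
        return termFreqInDoc * idf(docsInCollection, docsWithTerm);
    }

    public static void main(String[] args) {
        // Term occurring 3 times in a document, appearing in 100 out of 1000 documents.
        System.out.println(tfidf(3, 1000, 100)); // ~9.91
    }
}
```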
These pre-processing techniques are of the utmost importance since they significantly reduce the number of attributes that characterise each document, attenuating the curse of dimensionality. Stop word removal, stemming and pruning improve classification quality because they remove meaningless data and thereby reduce the number of dimensions of the term space. Document representation concerns the estimation of the importance of a specific term in a document. Since the number of features (attributes) in a collection of documents is very large, pre-processing techniques are essential to reduce it.
3 Text Classification
Text Classification attempts to automatically determine whether a document or part of a
document has particular characteristics of interest, usually based on whether the docu-
ment discusses a given topic or contains a certain type of information [3]. Text Classifi-
cation involves two main research areas: Information Retrieval and Machine Learning.
The first step of Text Classification is to transform documents into a representation suitable for the classifier. For this, and before applying the classifier, documents must be pre-processed using the Information Retrieval techniques mentioned in the previous section. A further pre-processing step that can be applied to the collection of documents, in order to reduce the huge number of terms in the collection, is Feature Selection: a feature selection or feature extraction phase reduces the dimensionality of the document representation.
There are three main learning settings used for classification: supervised learning, unsupervised learning and semi-supervised learning.
Supervised learning is based on a training set of examples: the key idea is to learn a classifier from a set of labelled examples (the training set).
In unsupervised learning there is no a priori output. The goal of unsupervised learning is to learn a model that explains the data well. Usually, the result of unsupervised learning is a new explanation or representation of the observed data, which can then lead to improved future decisions.
Semi-supervised learning [4][5] makes use of both labelled and unlabelled data, typically a small amount of labelled data and a large amount of unlabelled data. One key idea in semi-supervised learning is to label the unlabelled data using certain techniques and thus increase the amount of labelled training data.
4 Related Work
The authors in [6] explore the effects of different text representation approaches on the classification performance of MEDLINE documents. In that study the authors use only the title and combine a word representation, the bag-of-words approach, with a phrase representation, the bag-of-phrases approach. Motivated by previous studies [7], they compared the bag-of-words representation, the bag-of-phrases representation and a hybrid of the two, using the Support Vector Machine algorithm. The experiments were conducted on the OHSUMED data set, a database composed of 348,566 MEDLINE documents from 270 journals, classified under 14,321 MeSH categories. The authors achieved the best performance results with the hybrid approach.
In [8] the authors examined two aspects of representing text in a vector space model: (1) what a term should be and (2) how to weight a term. The authors evaluated their approach using a Support Vector Machine algorithm. They found that text representations other than the bag-of-words representation did not improve performance, while the term weighting scheme slightly improved it.
In this paper we evaluate the impact of pre-processing techniques (stop word removal, stemming, WordNet and pruning) on classification, a combination that, to the best of our knowledge, has not yet been studied on a MEDLINE data set.
5 Empirical Study
Our work is based on a MEDLINE sample. The focus of this research is to study the impact of pre-processing techniques on that MEDLINE sample data set.
5.1 Data Set Characterisation
The data set that is the subject of our study is a MEDLINE sample downloaded from the NLM site (ftp://ftp.nlm.nih.gov/nlmdata/sample/MEDLINE/). This sample has 53.2 MB and contains 30000 citations in XML format. Each citation contains several pieces of information, namely: the pmid (the PubMed id), the journal title, the PubMed date, the article title, the abstract of the paper if available, the list of authors, the list of keywords and the list of MeSH terms. MeSH (Medical Subject Headings) is the (U.S.) National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. A MeSH term is a medical subject heading, or descriptor, as defined in the MeSH thesaurus. We have made a pre-selection of 1098 papers, keeping only those containing at least one of the following MeSH headings: “Erythrocytes”, “Escherichia coli”, “Protein Binding” and “Blood Pressure”, which yields approximately 250 documents per category.
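A minimal sketch of this MeSH-based pre-selection could look as follows. The Citation class, its fields and the helper names are hypothetical; the sketch only illustrates the selection rule described above.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of the MeSH-based pre-selection described above.
// Citation is a simplified, hypothetical container for one parsed MEDLINE record.
class Citation {
    String pmid;
    String title;
    String abstractText;      // may be null when no abstract is available
    Set<String> meshHeadings; // MeSH descriptor names attached to the record

    boolean hasAbstract() { return abstractText != null && !abstractText.isEmpty(); }
}

class PreSelection {
    // The four target categories used in the study.
    static final Set<String> TARGETS = Set.of(
        "Erythrocytes", "Escherichia coli", "Protein Binding", "Blood Pressure");

    // Keep only citations with an abstract and at least one target MeSH heading.
    static List<Citation> select(List<Citation> all) {
        return all.stream()
                  .filter(Citation::hasAbstract)
                  .filter(c -> c.meshHeadings.stream().anyMatch(TARGETS::contains))
                  .collect(Collectors.toList());
    }
}
```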
5.2 Pre-processing Techniques Applied
The MEDLINE sample we used in our study was in XML format, so the first pre-processing step was to read the XML file and filter the information we need, namely the pmid (PubMed id), journal title, PubMed date, article title, abstract of the paper if available, list of authors, list of keywords and list of MeSH terms. We have developed our application in the Java programming language, which provides classes for working with XML files and SQL, both of which are needed for our approach. We have also used a MySQL database to store all the information needed for further pre-processing and classification. We have applied the following pre-processing techniques to our original data set.
Tokenization;
Stop Words removal: we have used a set of 659 stop words;
We have used WordNet [9] to search for synonyms of terms and replace each term by one of its synonyms (we have chosen the shortest one);
Stemming: we have used Porter's stemming algorithm [10];
We have implemented the standard term-frequency inverse document frequency
(TFIDF) function to assign weights to each term in the document.
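The sketch below illustrates how a single token could pass through these steps (stop word check, shortest-synonym replacement, stemming). The stem and synonymsOf methods are placeholders standing in for a Porter stemmer implementation and a WordNet lookup; they are assumptions for illustration, not the code used in the study.

```java
import java.util.List;
import java.util.Set;

// Illustrative normalisation of a single token, following the steps described above.
// stem() and synonymsOf() are placeholders for a Porter stemmer and a WordNet lookup.
class TermNormaliser {
    private final Set<String> stopWords; // e.g. the 659 stop words used in the study

    TermNormaliser(Set<String> stopWords) { this.stopWords = stopWords; }

    /** Returns the normalised term, or null if the token should be discarded. */
    String normalise(String token) {
        String term = token.toLowerCase();
        if (stopWords.contains(term)) return null;     // stop word removal
        List<String> synonyms = synonymsOf(term);       // WordNet lookup (placeholder)
        for (String s : synonyms) {
            if (s.length() < term.length()) term = s;   // keep the shortest synonym
        }
        return stem(term);                              // Porter stemming (placeholder)
    }

    private List<String> synonymsOf(String term) { return List.of(term); } // placeholder
    private String stem(String term) { return term; }                      // placeholder
}
```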
Besides these most common pre-processing techniques, we have applied some other pre-processing steps with the objective of reducing the number of attributes, namely:
Pruning:
We have removed meaningless bi-grams (such as “iz”, “tk”, etc.);
We have also removed the words that appear less than 10 times in the whole collection, because words with such a low frequency are probably not discriminative of the document content;
We have also removed the terms that appear neither in WordNet nor in a medical dictionary of terms (The Hosford Medical Terms Dictionary [11]), in order to eliminate invalid, meaningless terms (such as “tksx”).
As already mentioned, we have used the TFIDF method because in this weighting scheme terms that appear too rarely or too frequently are ranked lower than terms that balance the two extremes, and because terms with a higher weight contribute more to the classification results.
Figure 1 summarises our approach.
5.3 Data Warehouse Techniques
A Data Warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions, as defined by [12]. Data Warehouse techniques are used to represent the collection of documents as a set of vectors that can be written as a matrix. A dimensional model comprises facts, dimensions and measures [13]. The fact table represents the measure that is being tracked [14]. In our study we used a Snowflake schema with a central fact table of terms. The ETL process (Extracting, Transforming, Loading and Indexing) was implemented on top of a SQL database: the extraction step read the XML source files and populated the database. The Term fact table contains measures such as the number of occurrences of the term in the title and in the abstract (representing TF, the Term Frequency).

Fig. 1. Summary of steps.

Another measure introduced in the fact table is the TFIDF (Term Frequency Inverse Document Frequency), used to calculate the weight of a term in a document. To compute it, a further quantity is needed, the Document Frequency, i.e., the number of documents in the collection in which the term occurs; this quantity can be calculated from the term dimension table. Our model (summarised in Figure 2) includes other dimension tables: a document dimension representing the citation document; a time dimension representing the date associated with the citation; and an author dimension, which has to be represented through an author group due to the many-to-many relation between the term fact table and the authors dimension. Using these techniques, the calculation process represented by the “update metrics” step in Figure 1 performs the final calculations using only SQL instructions. With this representation, the creation of the data set is simplified, since it can be done with SQL instructions that join the records according to our needs and produce the output data set. The pruning can also be performed with SQL instructions that limit the set of terms to be used.
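As an illustration of the snowflake model just described, the DDL below creates a reduced version of the term fact table and two of its dimensions. All table and column names are hypothetical; the actual schema is the one shown in Figure 2.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Reduced, illustrative version of the snowflake schema described above.
// Table and column names are hypothetical; Figure 2 shows the actual model.
public class CreateSchema {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/medline", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "CREATE TABLE dim_term (term_id INT PRIMARY KEY, term VARCHAR(100))");
            stmt.executeUpdate(
                "CREATE TABLE dim_document (document_id INT PRIMARY KEY, pmid VARCHAR(20))");
            stmt.executeUpdate(
                "CREATE TABLE fact_term (" +
                "  term_id INT, document_id INT," +
                "  tf_title INT, tf_abstract INT," +   // term frequency measures
                "  tfidf DOUBLE," +                    // weight filled in by 'update metrics'
                "  PRIMARY KEY (term_id, document_id))");
            // The time and author (group) dimensions of Figure 2 are created similarly.
        }
    }
}
```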
The last step is represented by the “sql2arff” process in Figure 1, where the following pruning thresholds were applied through our Data Warehouse:
< 10: remove words that appear less than 10 times in the document collection;
< 100: remove words that appear less than 100 times in the document collection.
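A pruning step of this kind could be issued from Java over JDBC as sketched below. The table and column names (term, document_frequency) and the connection details are hypothetical; the SQL actually used in the study is kept in the configuration file described next.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Illustrative JDBC pruning step: drop terms whose document frequency is below a threshold.
// The schema (a term table with a document_frequency column) and credentials are hypothetical.
public class PruneTerms {
    public static void prune(int minDocumentFrequency) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/medline", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 "DELETE FROM term WHERE document_frequency < ?")) {
            stmt.setInt(1, minDocumentFrequency);
            int removed = stmt.executeUpdate();
            System.out.println("Removed " + removed + " terms");
        }
    }

    public static void main(String[] args) throws Exception {
        prune(10);   // the "< 10" setting; use 100 for the "< 100" setting
    }
}
```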
All the SQL instructions used by our Java programs are kept in the db.properties file shown in Figure 1. If we decide to change the MeSH headings, the pruning thresholds or any other characteristic, we simply need to change this file.
With this technique it was possible to reduce the number of attributes from more than 1300 to less than 90. This can have a significant impact when processing the MEDLINE database.
Fig. 2. Snowflake Schema.
6 Experimental Results
Table 1 lists the algorithms available in WEKA and used in our experiments. We have
generated several data sets combining the different pre-processing techniques (pruning,
stemming and WordNet). The quality of the classifiers was assessed by their accuracy.
Table 1. Machine Learning algorithms used in the study. RF stands for Random Forest.
Algorithm Type
SMO Support Vector Machine
RF Ensemble algorithm
IBk K-nearest neighbours
BayesNet Bayes Network
j48 Decision tree
dtnb Decision table/naive bayes hybrid
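To give an idea of how the WEKA algorithms of Table 1 can be run on the ARFF files produced by the sql2arff step, the sketch below loads a data set and estimates accuracy. The file name is hypothetical, only two of the algorithms are shown, and 10-fold cross-validation is used as one possible evaluation protocol, not necessarily the exact setup of this study.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Illustrative WEKA usage: load an ARFF data set and estimate classifier accuracy.
public class RunClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("medline_pruned100_stem_wordnet.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);   // class attribute is the last one

        for (weka.classifiers.Classifier c :
                 new weka.classifiers.Classifier[] { new RandomForest(), new BayesNet() }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));  // 10-fold cross-validation
            System.out.printf("%s accuracy: %.2f%%%n",
                              c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```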
The results obtained are shown in Table 2. In the Pruning column we present the document frequency (df) thresholds below which terms were removed (terms appearing less than 10 or less than 100 times in the document collection).
The first experiments show that the application of pruning, stemming and WordNet significantly reduces the number of attributes without affecting the accuracy of the results.
Table 2. Classification accuracy results. The majority class represents 28% of the documents.
Pruning Stemming WordNet Attributes Acc
SMO RF IBk BayesNet j48 dtnb
< 10 no no 1304 76.95 84.09 46.66 86.85 81.33 76.28
< 100 no no 62 76.62 81.70 63.02 81.60 80.74 77.01
< 10 yes no 1103 80.57 84.85 49.71 86.85 84.57 80.57
< 100 yes no 90 77.31 85.51 61.10 85.03 84.17 79.88
< 10 no yes 916 76.38 84.95 48.19 87.14 83.90 79.42
< 100 no yes 85 80.24 85.20 61.45 85.97 83.77 79.77
< 10 yes yes 904 78.00 85.14 51.42 87.42 83.71 80.66
< 100 yes yes 91 77.71 85.80 57.33 86.38 82.66 80.47
With Random Forest, the best result (85.80%) is obtained with a pruning of df < 100 together with stemming and WordNet; the accuracy is even better with only 91 terms than with 904 terms. The Bayes Network algorithm with a pruning of df < 10, with stemming and WordNet, shows the overall best result (87.42%), although with a pruning of df < 100 its accuracy drops from 87.42% to 86.38%.
The times taken to build the models are presented in Table 3.
Table 3. Time to construct the models in seconds.
Pruning Stemming WordNet Attributes Time
SMO RF IBk BayesNet j48 dtnb
< 10 no no 1304 1.32 101.98 0.01 2.5 8.16 34.05
< 100 no no 62 0.8 2.39 0.01 0.38 0.56 3.19
< 10 yes no 1103 1.27 69.69 0.01 2.03 10.55 54.64
< 100 yes no 90 0.75 2.39 0.02 0.43 1.17 5.96
< 10 no yes 916 1.45 61.23 0.01 1.74 7.81 58.78
< 100 no yes 85 1.25 3.14 0.01 0.8 1.4 3.68
< 10 yes yes 904 1.77 59.97 0.01 1.6 8.16 34.05
< 100 yes yes 91 1.3 3.96 0.04 0.74 1.24 5.56
The times taken to build the models were obtained on an Intel Core 2 Duo processor @ 2.40 GHz with 4096 MB of memory. We can see that pruning has a positive impact on the time taken to build the models, especially for the algorithms that take longer to execute.
7 Conclusions and Future Work
This paper studies the impact of pre-processing techniques on the classification of MEDLINE documents and presents the results of our empirical study. The amount of information available in the MEDLINE database can (and probably will) be an issue. Through the use of Information Retrieval and Data Warehouse techniques it was possible to significantly reduce the time needed for pre-processing without affecting the accuracy. Although we are using a MEDLINE sample, the data set contains 30000 MEDLINE citations, from which we selected only the 1098 citations with an abstract that represent the 4 classes, i.e., those containing at least one of the following MeSH headings: “Erythrocytes”, “Escherichia coli”, “Protein Binding” and “Blood Pressure”. In this sample we started with more than 1300 attributes, represented using a fact table and several dimension tables in our Snowflake schema data warehouse. The pruning pre-processing was performed using only SQL instructions on the Data Warehouse.
Unlike [6], we have also used the abstracts of the papers. Like [8], we have used a
bag-of-words representation and we implemented the standard term-frequency inverse
document frequency (TFIDF) function to assign weights to each term. We have gener-
ated several data sets combining the different pre-processing techniques. We have made
a comparison table of the accuracy obtained using different pre-processing techniques
and different classification algorithms. We have also presented the computation time re-
quired for the execution of the algorithms. The best accuracy result achieved (87.42%)
is thus promising.
As future work, and to achieve better classification, we may: 1) incorporate more information; 2) optimise the MeSH term selection for each document; 3) test with other MEDLINE sources, like the 2010 MEDLINE version; and 4) use the full MEDLINE database.
References
1. Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church,
D. M., Dicuccio, M., Edgar, R., Federhen, S., Geer, L. Y., Helmberg, W., Kapustin, Y., Ken-
ton, D. L., Khovayko, O., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Pruitt,
K. D., Schuler, G. D., Schriml, L. M., Sequeira, E., Sherry, S. T., Sirotkin, K., Souvorov,
A., Starchenko, G., Suzek, T. O., Tatusov, R., Tatusova, T. A., Wagner, L., Yaschenko, E.:
Database resources of the national center for biotechnology information. Nucleic Acids Res
34 (2006)
2. Zhou, W., Smalheiser, N.R., Yu, C.: A tutorial on information retrieval: basic terms and
concepts. Journal of Biomedical Discovery and Collaboration 1 (2006)
3. Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining. Briefings
in Bioinformatics 6 (2005) 57–71
4. Chapelle, O., et al.: Semi-Supervised Learning. MIT Press, Cambridge, MA (2006)
5. Zhu, X.: Semi-supervised learning literature survey (2006)
6. Yetisgen-Yildiz, M., Pratt, W.: The effect of feature representation on MEDLINE document classification. AMIA Annu Symp Proc (2005) 849–853
7. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1) (2002) 1–47
8. Lan, M., Tan, C.L., Su, J., Low, H.B.: Text representations for text categorization: a case study in biomedical domain. In: IJCNN'07: International Joint Conference on Neural Networks (2007)
9. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38 (1995) 39–41
10. Porter, M.F.: An algorithm for suffix stripping (1997)
11. Hosford Medical Terms Dictionary v3.0 (2010)
12. Inmon, W.: Building the Data Warehouse. John Wiley & Sons (2002)
13. Schonbach, C., Kowalski-Saunders, P., Brusic, V.: Data warehousing in molecular biology.
Briefings in Bioinformatics 1 (2000) 190–198
14. Chituc, C.M.: Data warehousing - lecture notes (2009)