Rule Management for Information Extraction from Title Pages of Academic Papers

Atsuhiro Takasu

and Manabu Ohta

National Institute of Informatics, Tokyo, Japan

Okayama University, Okayama, Japan

Keywords:

Digital Library, Document Understanding, Information Extraction, CRF.

Abstract:

This paper discusses the problem of managing rules for page layout analysis and information extraction. We

have been developing a system to extract information from academic papers that exploits both page layout and

textual information. For this purpose, a conditional random ﬁeld (CRF) analyzer is designed according to the

layout of the object pages. Because various layouts are used in academic papers, we must prepare a set of

rules for each type of layout to achieve high extraction accuracy. As the number of papers in a system grows,

rule management becomes a big problem. For example, when should we make a new set of rules, and how

can we acquire them efﬁciently while receiving new articles? This paper examines two scores to measure the

ﬁtness of rules and the applicability of rules learned for another type of layout. We evaluate the scores for

bibliographic information extraction from title pages of academic papers and show that they are effective for

measuring the ﬁtness. We also examine the sampling of training data when learning a new set of rules.

1 INTRODUCTION

Information extraction is an important technology in

utilizing documents. It helps to extract various kinds

of metadata and to provide users with rich informa-

tion access. For example, bibliographic information

extraction from academic papers is useful to create

or reconstruct metadata. It can be used for linking

identical records stored in different digital libraries

as well as for faceted retrieval. Although many re-

searchers have studied bibliographic information ex-

traction from papers and documents (e.g., (Takasu,

2003; Peng and McCallum, 2004; Councill et al.,

2008)), it is still an active research area, and several competitions have been held (see http://www.icdar2013.org/program/competitions).

For accurate information extraction, researchers have developed various rule-based methods that can exploit both logical structure and page layout. Document systems such as digital libraries usually handle various types of documents. Because the rules should be tailored for each type of document, formulating them requires effective and efficient methods. Rule management becomes harder as a system grows and contains larger numbers of articles with more variable layouts. For example, when receiving a set of articles, we must determine whether we should make a new

set of rules or whether we can apply existing rules

as in transfer learning (Pan and Yang, 2010). In ad-

dition, the rules should be properly updated because

the layout of documents may sometimes change. To

maintain such document systems, we require a rule

management facility that can measure the ﬁtness of

rules and recompile rules when required.

We have been developing a digital library sys-

tem for academic papers (Takasu, 2003; Ohta and

Takasu, 2008). We are especially interested in ex-

tracting bibliographic metadata such as authors and

titles. In previous studies, we applied a conditional

random ﬁeld (CRF) (Lafferty et al., 2001) to analyze

and extract bibliographic information from title pages

of academic papers. In these studies, we observed that

rule-based methods can extract metadata with high

accuracy, but we required multiple sets of rules and

chose one according to page layout. In other words,

we can obtain enough homogeneously laid out pages

to learn a CRF that can analyze the pages with high

accuracy for the task of metadata extraction from aca-

demic pages.

Takasu A. and Ohta M.
Rule Management for Information Extraction from Title Pages of Academic Papers.
DOI: 10.5220/0004827204380444
In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM-2014), pages 438-444
ISBN: 978-989-758-018-5
Copyright © 2014 SCITEPRESS (Science and Technology Publications, Lda.)

The use of multiple sets of rules requires rule management functions, such as choosing the appropriate set of rules for a particular document and deciding when to make a new set of rules for a change of page layout or a new page layout. This leads us to the study

of managing rules for page layout analysis and bib-

liographic information extraction from title pages of

academic papers. For this task, we ﬁrst propose a

method that uses two statistical measures calculated

using CRF. Then, we examine their effectiveness for

evaluating the ﬁtness of a CRF for a particular page

layout using three kinds of academic journals. The

experimental results show that the measures decrease

signiﬁcantly when a CRF is applied to the title page

of a journal that is different from the one used for

learning the CRF. This result indicates that the statis-

tical measures are effective for detecting page layout

changes.

2 PROBLEM DEFINITION

There are several kinds of information extraction

tasks for academic papers such as mathematical ex-

pressions, ﬁgures, and tables (Wang et al., 2004). This

paper addresses bibliographic information extraction

(Peng and McCallum, 2004), which is one of the fun-

damental tasks. This paper focuses on title page anal-

ysis, where we extract bibliographic information such

as title, authors, and abstract from a title page. Figure

1 depicts an example of a title page. As shown in the

ﬁgure, the task of bibliographic information extrac-

tion from the title page is to extract the red rectangles

shown and to apply labels, such as “title” and “au-

thors”, to them.

Figure 1: Example of page layout. The labeled regions are title, author, affiliation, keyword, abstract, and e-mail.

Because bibliographic information is located in a

two-dimensional space, some researchers have pro-

posed rules that can analyze components of a page

based on a page grammar (Krishnamoorthy et al.,

1992) and a 2D CRF (Zhu et al., 2005; Nicolas et al.,

2007). In another approach, a sequential analysis can

be applied after serializing components of the page

in the preceding step. For example, Peng and McCallum proposed a CRF-based method of extracting bibliographies from the title pages and reference sections of academic papers in PDF format (Peng and McCallum, 2004). Councill et al. developed a CRF-based

toolkit for page analysis and information extraction

(Councill et al., 2008).

We adopted the latter approach. We label each

text line on the title page of an academic article as

an appropriate bibliographic element. For this pur-

pose, linear-chain CRFs (Lafferty et al., 2001) were

used. One linear-chain CRF was constructed for each

journal to achieve high information extraction accu-

racy. The layout of a journal’s title pages is, however,

sometimes redesigned, which causes serious deterio-

ration in information extraction accuracy. Therefore,

this paper addresses the following problems:

• to detect such changes in the title page layout of

academic papers, and

• to obtain a new CRF for analyzing pages in the

new layout.

3 RULE MANAGEMENT

3.1 System Overview

We are developing a digital library system that covers

various journals published in our country. Because

their bibliographic information is stored in multiple

databases, the system creates linkages by ﬁnding the

papers in the multiple databases and provides a test-

bed for scholarly information study such as citation

analysis and paper recommendation.

This system takes both newly published papers

and those published previously but not yet included

in the system. As stated in the previous section, we

use multiple CRFs to extract information from vari-

ous journals. The system chooses a CRF according

to the journal title and applies it to papers to extract

bibliographic information.

When the layout of a paper changes or a new jour-

nal is incorporated, we must judge whether we can

use an existing CRF in the system or whether we must

build a new CRF. The system supports rule mainte-

nance by:

• checking the ﬁtness of a CRF for given papers and

alerting the user if the CRF does not analyze them

with high conﬁdence, and

• supporting labeling of training data when a new

CRF is made.

3.2 The CRF

As described above, we adopted a linear-chain CRF

for extracting bibliographic information from title

RuleManagementforInformationExtractionfromTitlePagesofAcademicPapers

439

pages of academic papers. Suppose $L$ denotes a set of labels. For a token sequence $x := x_1 \cdots x_n$, a linear-chain CRF derives a sequence $y := y_1 \cdots y_n$ of labels, i.e., $y \in L^n$. A CRF $M$ defines a conditional probability by:

$$P(y \mid x, M) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x) \right) \qquad (1)$$

where $Z(x)$ is the partition constant. The feature function $f_k(y_{i-1}, y_i, x)$ is defined over consecutive labels $y_{i-1}$ and $y_i$, and the input sequence $x$. Each feature function is associated with a parameter $\lambda_k$ giving the weight of the feature.

In the learning phase, the parameter $\lambda_k$ is estimated from labeled token sequences. In the prediction phase, the CRF assigns to the given token sequence $x$ the label sequence $y^*$ that maximizes Eq. (1).
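Eq. (1) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the three-label set, the two feature functions, and their weights are invented, and each feature function also receives the position $i$ for convenience. The partition constant $Z(x)$ is computed by enumerating every label sequence, which is only feasible for toy inputs; practical CRF toolkits use dynamic programming instead.

```python
import itertools
import math

LABELS = ["title", "authors", "other"]  # hypothetical label set L


def f_title_then_authors(y_prev, y_cur, x, i):
    """Bigram indicator: the previous label is title and the current one is authors."""
    return 1.0 if (y_prev == "title" and y_cur == "authors") else 0.0


def f_first_line_is_title(y_prev, y_cur, x, i):
    """Unigram indicator: the first token (line) is labeled title."""
    return 1.0 if (i == 0 and y_cur == "title") else 0.0


# Pairs of (feature function f_k, weight lambda_k); the weights are made-up values.
FEATURES = [(f_title_then_authors, 2.0), (f_first_line_is_title, 1.5)]


def score(y, x):
    """Unnormalized log-score: sum over i and k of lambda_k * f_k(y_{i-1}, y_i, x)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else None  # no label precedes the first token
        for f, lam in FEATURES:
            total += lam * f(y_prev, y[i], x, i)
    return total


def conditional_probability(y, x):
    """P(y | x, M) as in Eq. (1), with Z(x) obtained by brute-force enumeration."""
    z = sum(math.exp(score(cand, x))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z


def best_labeling(x):
    """y* = argmax over y of P(y | x, M); exhaustive search for tiny inputs only."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda cand: score(cand, x))
```

For a two-line input, `best_labeling(["line 1", "line 2"])` selects the labeling that fires both features, and the probabilities of all nine candidate labelings sum to one.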

3.3 Metrics for Change Detection

To detect a layout change from a token sequence, we use metrics that indicate how unlikely it is that the test token sequence was generated from the model. This problem is similar to the sampling problem in active sampling (Saar-Tsechansky and Provost, 2004a).

In Eq. (1), the CRF calculates the likelihood based on the order of the hidden label sequence and each feature vector $x_i$ generated from the estimated hidden label $y_i$. A change of page layout may affect the order of hidden labels as well as layout features in $x_i$. This leads to a decrease of the likelihood $P(y^* \mid x, M)$ given by Eqs. (1) and (2). A natural way to measure the model fitness is to use the likelihood. The CRF calculates the hidden label sequence $y^*$ that maximizes the conditional probability given by Eq. (1). Higher $P(y^* \mid x, M)$ means more confident assignment of labels, while lower $P(y^* \mid x, M)$ means that the token sequence makes it hard for the current CRF model to assign labels.

The conditional probability is affected by the length of the token sequence, $x$; therefore, we use the following normalized conditional probability as a model fitness measure:

$$C_l(x) := \frac{\log P(y^* \mid x, M)}{|x|}, \qquad (2)$$

where $|x|$ denotes the length of the token sequence, $x$. We refer to the metric given by Eq. (2) as a normalized likelihood.

The normalized likelihood is a kind of confidence measure when the model assigns labels to all tokens in the sequence, $x$. The second measure is based on the confidence measure for assigning labels to a single token in the sequence. For sequence $x$, let $Y_i$ denote a random variable for assigning a label to the $i$th token in $x$. For label $l$ in a set $L$ of labels, $P(Y_i = l)$ denotes the marginal probability that label $l$ is assigned to the $i$th token. If the token has feature values clearly supporting a specific label, for example, $l \in L$, $P(Y_i = l)$ must be significantly high and $P(Y_i = l')$ ($l' \neq l$) must be low. Hence, the following entropy quantifies this feature:

$$\sum_{l \in L} -P(Y_i = l) \log P(Y_i = l). \qquad (3)$$

Lower entropy signifies that the label of token $x_i$ is more likely to be $l$. For the sequence $x$, we use the following average entropy as another model fitness measure:

$$C_e(x) := \frac{\sum_{x_i \in x} \sum_{l \in L} -P(Y_i = l) \log P(Y_i = l)}{|x|}. \qquad (4)$$
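The two measures of Eqs. (2) and (4) are simple to compute once a CRF has produced the log-probability of its best labeling and the per-token marginal distributions. The following Python sketch assumes these quantities are already available; the function names and the dictionary representation of a marginal distribution are hypothetical choices for illustration.

```python
import math


def normalized_likelihood(log_prob_best, num_tokens):
    """Eq. (2): log P(y* | x, M) divided by the sequence length |x|."""
    if num_tokens <= 0:
        raise ValueError("the token sequence must not be empty")
    return log_prob_best / num_tokens


def token_entropy(marginal):
    """Eq. (3): entropy of the label distribution of a single token.

    `marginal` maps each label l to P(Y_i = l); zero-probability labels
    contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in marginal.values() if p > 0.0)


def average_entropy(marginals):
    """Eq. (4): token entropies averaged over the whole sequence."""
    if not marginals:
        raise ValueError("the token sequence must not be empty")
    return sum(token_entropy(m) for m in marginals) / len(marginals)
```

A token whose marginal puts all mass on one label has entropy zero, whereas a token split evenly between two labels has entropy $\log 2$, so sequences that the model labels hesitantly receive a higher average entropy.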

3.4 Change Detection

Suppose CRF M is used to label a token sequence ob-

tained from a title page. There are a couple of ways

to deﬁne the change detection problem. The most ba-

sic deﬁnition is as follows. Given a new token se-

quence x, determine whether the sequence is from the

same information source as that from which the cur-

rent CRF M is learned.

For the journal layout detection problem, one issue of a journal usually contains multiple papers, so the problem is defined as detecting the change when given a set $\{x^{(j)}\}_j$ of token sequences. The rest of this paper addresses this problem.

A token sequence $x$ is judged to come from the same information source if $C(x) > \sigma$ holds, where $\sigma$ is a threshold and $C$ is $C_l$ or $C_e$ defined in Section 3.3. Otherwise, the layout is judged to have changed.
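The threshold test can be sketched in Python as follows. The single-sequence rule follows the text directly. The set-level rule is a hypothetical aggregation introduced only for illustration, since the text does not state how the decisions for individual sequences are combined; the parameter `max_outlier_fraction` is likewise an assumption. For the average entropy, which grows as the model becomes less certain, this sketch assumes that the negated value is passed in so that higher scores always mean a better fit; that sign convention is also an assumption.

```python
def same_source(score, threshold):
    """Single-sequence rule: same information source iff C(x) > sigma."""
    return score > threshold


def layout_changed(scores, threshold, max_outlier_fraction=0.5):
    """Hypothetical set-level rule for the papers of one journal issue.

    Flags a layout change when more than `max_outlier_fraction` of the
    token sequences fail the single-sequence test.
    """
    if not scores:
        raise ValueError("need at least one token sequence")
    failures = sum(1 for s in scores if not same_source(s, threshold))
    return failures / len(scores) > max_outlier_fraction
```

With a threshold of -1.0, the scores `[-0.1, -0.2, -3.0]` contain a single outlier and are not flagged, whereas `[-2.5, -3.0, -0.2]` are flagged because two of the three sequences fall below the threshold.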

3.5 Learning a CRF for a New Layout

Once we detect papers with a page layout that is different from those already known, we must derive a new CRF for the detected papers. We apply the following active sampling technique (Saar-Tsechansky and Provost, 2004b) to this task:

1. gather a significant number of papers $T$ without labeling,
2. choose an initial small number of papers $T_0$ from $T$, label them, and learn an initial CRF $M_0$ using the labeled papers,
3. repeat until convergence:
   (a) choose a small number of papers $T_t$ from the pool $T - \bigcup_{i=0}^{t-1} T_i$ using the CRF $M_{t-1}$ we obtained in the previous loop,
   (b) label the papers $T_t$ manually,
   (c) learn a new CRF $M_t$ using the labeled papers $\bigcup_{i=0}^{t} T_i$.

The purpose of active sampling is to reduce the

cost of labeling required for learning the CRF. One

drawback is that we must delay learning the new CRF

until we have gathered enough papers in the new lay-

out in Step 1.

In active sampling, the sampling strategy for the

initial CRF in Step 2 and for updating the CRF in Step

3-(c) is important. For the initial CRF, we choose the

k papers in T with the lowest values of the metrics C

introduced in Section 3.3, where C is calculated us-

ing the CRFs that we have at that time. This strategy

means that we choose training papers for the initial

CRF having the most different layout from those that

we have so far.

In the $t$th update phase, we choose the $k$ training papers from $T - \bigcup_{i=0}^{t-1} T_i$ with the lowest values of the metrics $C$, where $C$ is calculated using the CRF $M_{t-1}$ that we obtained in the previous step, instead of the CRF for the original layout. This strategy means that we choose training papers with a different layout from those in $\bigcup_{i=0}^{t-1} T_i$.
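The sampling procedure of this section can be sketched as a short Python loop. The callables `existing_score_fn`, `train_fn`, and `label_fn` are hypothetical stand-ins for the metric $C$ under the existing CRFs, CRF training, and manual labeling; here `train_fn` is assumed to return a function that scores a paper with the newly learned CRF. A fixed number of rounds stands in for the convergence test, which the text does not specify.

```python
def select_lowest(pool, score_fn, k):
    """Return the k papers with the lowest metric value C and the remaining pool."""
    ranked = sorted(pool, key=score_fn)
    return ranked[:k], ranked[k:]


def active_sampling(pool, existing_score_fn, train_fn, label_fn, k, rounds):
    """Sketch of Steps 2 and 3; `pool` is the unlabeled set T gathered in Step 1."""
    # Step 2: choose T_0 with the CRFs already in the system, label it, learn M_0.
    batch, pool = select_lowest(list(pool), existing_score_fn, k)
    labeled = [label_fn(paper) for paper in batch]
    model_score_fn = train_fn(labeled)
    # Step 3: a fixed number of rounds replaces "repeat until convergence".
    for _ in range(rounds):
        if not pool:
            break
        batch, pool = select_lowest(pool, model_score_fn, k)   # (a) choose T_t with M_{t-1}
        labeled.extend(label_fn(paper) for paper in batch)     # (b) label T_t manually
        model_score_fn = train_fn(labeled)                     # (c) learn M_t
    return model_score_fn, labeled
```

Because each round removes the selected papers from the pool, no paper is labeled twice, which matches drawing $T_t$ from $T - \bigcup_{i=0}^{t-1} T_i$.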

4 EXPERIMENTAL RESULTS

4.1 Dataset

For this experiment, we used the dataset prepared for

our previous study (Ohta et al., 2010). It is taken from

the following three journals:

• Journal of Information Processing by the Informa-

tion Processing Society of Japan (IPSJ): We used

papers published in 2003 in this experiment. This

dataset contains 479 papers, most of them written

in Japanese.

• IEICE Transactions by The Institute of Electronics, Information and Communication Engineers in Japan (IEICE-E): We used papers published in 2003. This dataset contains 473 papers, all written in English.

• IEICE Transactions by The Institute of Electronics, Information and Communication Engineers in Japan (IEICE-J): We used papers published in 2003 and 2004. This dataset contains 174 papers, most of them written in Japanese.

We used the following labels for the bibliographic

elements as in (Ohta et al., 2010).

• Title: We used separate labels for Japanese and

English titles because Japanese articles contained

titles in both languages.

• Authors: We used separate labels for Japanese and

English authors as in the title.

• Abstract: As for title and author, we used labels

for English and Japanese abstracts.

• Other: Title pages usually contain body text of the article, such as its introductory paragraphs. We assigned the label “other” to lines in these paragraphs.

Note that different journals have different biblio-

graphic components on their title pages.

Because we used a linear-chain CRF, the tokens

must be serialized. In this experiment, we regard each

line as a token. We used lines extracted by OCR and

serialized the lines according to the order generated

by the OCR system. We manually labeled each line

for training and evaluation.

4.2 Features of the CRF

Fifteen features were adopted as in (Ohta et al., 2010).

Among them, 14 are unigram features, and the re-

maining one is a bigram feature. They are also classi-

ﬁed into two other kinds of features: layout features,

such as location, size, and gaps between lines; and

linguistic features, such as the proportions of several

kinds of characters in the tokens and the appearance

of characteristic keywords that often appear in a spe-

ciﬁc bibliographic component such as “institute” for

afﬁliations. Table 1 summarizes the set of feature

templates. Their instances are automatically gener-

ated from training token sequences.

For example, an instance of the bigram feature template < y(−1), y(0) > is:

$$f_k(y_{i-1}, y_i, x) = \begin{cases} 1 & \text{if } y_{i-1} = \text{title},\ y_i = \text{authors} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

This bigram feature indicates that the author follows the title in a token sequence, and the corresponding parameter $\lambda_k$ shows how likely it is that this label sequence occurs. CRF++ 5.8 (Kudo et al., 2004; http://code.google.com/p/crfpp/) was used to learn and label the token sequence of each title page.

Table 1: Feature templates for bibliography labeling (Ohta et al., 2010).

Type      Feature             Description
unigram   < i(0) >            Current line ID
          < x(0) >            Current line abscissa
          < y(0) >            Current line ordinate
          < w(0) >            Current line width
          < h(0) >            Current line height
          < g(0) >            Gap between current and preceding lines
          < cw(0) >           Median of characters’ width in the current line
          < ch(0) >           Median of characters’ height in the current line
          < #c(0) >           # of characters in the current line
          < ec(0) >           Proportion of alphanumerics in the current line
          < kc(0) >           Proportion of kanji in the current line
          < jc(0) >           Proportion of hiragana and katakana in the current line
          < s(0) >            Proportion of symbols in the current line
          < kw(0) >           Presence of predefined keywords in the current line
bigram    < y(−1), y(0) >     Previous and current labels

4.3 Evaluation Metrics

For our experiments, we used two evaluation metrics. One was for evaluating the performance of the sequence analysis; i.e., its accuracy. It was defined as

$$\frac{\#\ \text{successfully labeled sequences}}{\#\ \text{test sequences}}. \qquad (6)$$

Note that a CRF was only regarded as having suc-

ceeded in labeling when it assigned correct labels to

all tokens in the token sequence. In other words, if a

CRF assigned an incorrect label to one token but cor-

rectly labeled all other tokens in a sequence, x, it was

regarded as having failed.
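The all-or-nothing accuracy of Eq. (6) can be sketched in Python as follows. The function name and the list-of-label-lists representation are hypothetical; the point is that a sequence counts as correct only when every one of its token labels matches.

```python
def sequence_accuracy(predicted, gold):
    """Eq. (6): fraction of test sequences whose labels are all correct.

    `predicted` and `gold` are parallel lists; each element is the list of
    labels for one token sequence (one title page).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of sequences")
    if not gold:
        raise ValueError("need at least one test sequence")
    correct = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return correct / len(gold)
```

For example, if one of two title pages has a single mislabeled line, the accuracy is 0.5 even though most individual lines are labeled correctly.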

The other metric was the accuracy of change de-

tection. For the change detection, we ﬁrst learned a

CRF by using training data for each journal. In the

test phase, we mixed token sequences from two jour-

nals and let the CRF judge whether a token sequence

came from the journal used for learning or from the

other one. If a test token sequence was judged to

come from the same journal as that used for learn-

ing, we regarded the sequence as positive. Otherwise

it was regarded as negative.

The receiver operating characteristic (ROC) curve

was used for evaluation. That is, the mixed sets of test

token sequences were ranked according to the metrics

explained in Section 3.3. By regarding the top k token

sequences in the list as positive, we obtained the true

positive and false positive rates for each k. We plotted

the ROC curve by changing k.
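The construction of the ROC curve described above can be sketched in Python. The function name is hypothetical, and the sketch assumes that a higher metric value means the sequence is more likely to come from the journal used for learning, which holds for the normalized likelihood.

```python
def roc_points(scores, is_positive):
    """Return (false positive rate, true positive rate) for k = 0, 1, ..., n.

    Sequences are ranked by descending score, and the top k are treated as
    positive, i.e., as coming from the journal used for learning the CRF.
    """
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(is_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sequence")
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # k = 0: nothing is regarded as positive
    for j in order:
        if is_positive[j]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points
```

When the ranked list separates the two journals perfectly, the curve passes through the point (0, 1): all sequences of the learning journal are ranked above every sequence of the other journal.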

4.4 Sequence Analysis Accuracy

We ﬁrst examined the accuracy of CRFs learned sep-

arately for each journal. Although their accuracies

were not the main concern of this paper, we measured

them as one of the basic statistics of the CRFs for this

experiment; they are also helpful for the later analysis

of the change detection.

We applied ﬁvefold cross-validation to evaluate

the sequence analysis accuracy. We ﬁrst learned

CRFs for each journal by using four out of ﬁve evenly

split datasets. Then, we evaluated the learned CRFs

using the remaining dataset as a test set. Table 2

shows the average accuracies deﬁned by Eq. (6). As

shown in the table, we obtained CRFs with various

levels of accuracy.

Table 2: Average accuracy of CRFs.

Journal    IPSJ    IEICE-E  IEICE-J
Accuracy   0.947   0.891    0.752
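The fivefold cross-validation protocol used for Table 2 can be sketched in Python as follows. The function name and the interleaved way of splitting the data into folds are assumptions; the text only states that the dataset is split evenly into five parts.

```python
def k_fold_splits(items, n_folds=5):
    """Yield (training set, test set) pairs for n-fold cross-validation.

    The items are dealt into folds in round-robin order; each fold serves
    as the test set once while the other folds form the training set.
    """
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test
```

Averaging the accuracy of Eq. (6) over the five test sets gives one figure per journal, as reported in Table 2.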

4.5 Change Detection Performance

To measure the performance of the proposed metrics,

we made test data by mixing two test datasets. More

precisely, we applied each learned CRF described in

Section 4.4 to the set of sequences consisting of

• the test set of the journal used for learning the

CRF, and

• one test set from another journal.

For each pair of journals, Figure 2 depicts ROC

curves. Each panel contains the ROC curve by the

normalized likelihood (“likelihood”) and the average

entropy (“entropy”). For example, the ROC curve in

panel (a) in Figure 2 is the result of detecting token se-

quences of IEICE-E from those of IPSJ using the CRF

learned from labeled IPSJ token sequences. Similarly,

the ROC curve in panel (b) is the result of detecting

token sequences of IEICE-J from those of IEICE-E

using the CRF learned by labeled IEICE-E token se-

quences.

Figure 2: Change detection performance. Each panel plots ROC curves (true positive fraction against false positive fraction) for the two metrics, “likelihood” and “entropy”. Panels: (a) IPSJ to IEICE-E, (b) IEICE-E to IEICE-J, (c) IEICE-J to IEICE-E, (d) IPSJ to IEICE-J, (e) IEICE-E to IPSJ, (f) IEICE-J to IPSJ.

Figure 3: CRF learning by active sampling. Each panel plots accuracy against the number of training data. Panels: (a) IPSJ, (b) IEICE-E, (c) IEICE-J. Legend entries: random, nlh, and tieicee in (a); random, nlh, and tipsj in (b) and (c).

First, the ROC curves show that both the normalized likelihood and the average entropy are very effective for detecting token sequences from a journal different from the one used for learning the CRF. The ranked test token sequence lists according to these metrics are clearly separated. Panels (a), (b), and (e) in Figure 2 show that the ranked lists according to the average entropy are perfectly separated into two journals: IPSJ and IEICE-E in (a), IEICE-E and IEICE-J in (b), and IEICE-E and IPSJ in (e).

Second, the training journal used affects the

change detection. For example, compare panels (d)

and (f) in Figure 2. In both panels, the test data were

the same, i.e., the mixture of token sequences of IPSJ

and IEICE-J, but the CRF was learned from IPSJ in

(d) and from IEICE-J in (f). According to Table 2, the
CRF of (d) is more accurate than that of (f), whereas
the detection accuracy of the CRF of (f) is better than that of

(d). This is an interesting phenomenon, and we plan

to analyze it further.

Third, both the normalized likelihood and the av-

erage entropy work very well for change detection;

we observed no signiﬁcant difference in detection ac-

curacy between them.

4.6 CRF Learning

To evaluate the method for learning CRFs, we observed the accuracy of CRFs for three journals. More precisely, for each journal, we:

1. obtain a CRF $M$ for another journal,
2. choose an initial training sample using $M$ and obtain an initial CRF $M_0$,
3. repeat updating the CRF to $M_t$ by choosing a training sample using $M_{t-1}$.

In this experiment, we ﬁxed the sample size to 10

in both the initial and update phases. The accuracy

of the CRF is measured by Eq. (6). For comparison,

we measured the accuracy of the following sampling

strategies:

• random: 10 training samples are randomly chosen

in both the initial and update phases,

• nlh: 10 training samples are randomly chosen in

the initial phase, and 10 training samples are cho-

sen according to the normalized likelihood in each

update phase,

RuleManagementforInformationExtractionfromTitlePagesofAcademicPapers

443

as well as the proposed method:

• journal: 10 training samples are chosen according to the normalized likelihood of the CRF learned for another journal in the initial phase, and 10 training samples are chosen according to the normalized likelihood in each update phase.

Because we obtained similar results for the metric av-

erage entropy, we show only the results for normal-

ized likelihood in this section.

Figure 3 (a), (b), and (c) show the accuracy of CRFs for journals IPSJ, IEICE-E, and IEICE-J, respectively. Each graph in the figure plots the accuracy of the CRF with respect to the number of training samples for the three sampling strategies.

First, we observed that both the proposed strategy

and nlh obtained accurate CRFs with fewer samples

than with random. This indicates that the sampling

strategy for the update phase is effective.

Second, when we compare the proposed strategy

and nlh, the proposed strategy obtains a slightly better

initial CRF; its accuracy is plotted at the training data

size of 10. This indicates that the sampling strategy

using a CRF designed for another journal can improve

the active learning process.

5 CONCLUSIONS

We have examined two statistical measures obtained

using a linear-chain CRF for detecting layout changes

of title pages of academic papers and obtaining new

CRFs for extracting information from academic ti-

tle pages. The experiments revealed that both statis-

tical measures are very effective at detecting layout

changes. We also showed that the measures can be

used for active sampling to reduce the labeling cost of

training data.

We plan to extend this study in several directions.

First, it is unknown how the CRF’s sequence label-

ing accuracy affects the change detection accuracy.

To study this problem, we plan two kinds of exper-

iments: (1) controlling the labeling accuracy by the

size of training data, obtaining CRFs with various la-

beling accuracy, and comparing them for change de-

tection, and (2) applying our approach to more com-

plex sequence labeling problems.

In this paper, we used datasets that we prepared.

To make comparison easier, we plan to evaluate

the method using other open datasets such as the

ICDAR2009 layout dataset (Antonacopoulos et al.,

2009).

REFERENCES

Antonacopoulos, A., Bridson, D., Papadopoulos, C., and

Pletschacher, S. (2009). A realistic dataset for per-

formance evaluation of document layout analysis. In

ICDAR2009, pages 296 – 300.

Councill, I. G., Giles, C. L., and Kan, M.-Y. (2008). ParsCit: An open-source CRF reference string parsing package. In LREC, page 8.

Krishnamoorthy, M., Nagy, G., and Seth, S. (1992). Syntac-

tic segmentation and labeling of digitized pages from

technical journals. IEEE Computer, 25(7):10–22.

Kudo, T., Yamamoto, K., and Matsumoto, Y. (2004). Ap-

plying conditional random ﬁelds to Japanese morpho-

logical analysis. In EMNLP 2004.

Lafferty, J., McCallum, A., and Pereira, F. (2001). Con-

ditional random ﬁelds: Probabilistic models for seg-

menting and labeling sequence data. In 18th ICML,

pages 282–289.

Nicolas, S., Dardenne, J., Paquet, T., and Heutte, L. (2007). Document image segmentation using a 2D conditional random field model. In ICDAR 2007, pages 407–411.

Ohta, M., Inoue, R., and Takasu, A. (2010). Empirical evaluation of active sampling for CRF-based analysis of pages. In IEEE IRI 2010, pages 13–18.

Ohta, M. and Takasu, A. (2008). CRF-based authors’ name

tagging for scanned documents. In JCDL’08, pages

272–275.

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Peng, F. and McCallum, A. (2004). Accurate information

extraction from research papers using conditional ran-

dom ﬁelds. In HLT-NAACL, pages 329–336.

Saar-Tsechansky, M. and Provost, F. (2004a). Active sam-

pling for class probability estimation and ranking.

Machine Learning, 54(2):153–178.

Saar-Tsechansky, M. and Provost, F. (2004b). Active sam-

pling for class probability estimation and ranking.

Machine Learning, 54(2):153–178.

Takasu, A. (2003). Bibliographic attribute extraction from

erroneous references based on a statistical model. In

JCDL ’03, pages 49–60.

Wang, Y., Phillips, I. T., and Haralick, R. M. (2004). Table structure understanding and its performance evaluation. Pattern Recognition, 37(7):1479–1497.

Zhu, J., Nie, Z., Wen, J.-R., Zhang, B., and Ma, W.-Y.

(2005). 2D conditional random ﬁelds for web infor-

mation extraction. In ICML 2005.

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

444