STACKED CONDITIONAL RANDOM FIELDS EXPLOITING
STRUCTURAL CONSISTENCIES
Peter Kluegl 1,2, Martin Toepfer 1, Florian Lemmerich 1, Andreas Hotho 1 and Frank Puppe 1
1 Department of Computer Science VI, University of Würzburg, Am Hubland, Würzburg, Germany
2 Comprehensive Heart Failure Center, University of Würzburg, Straubmühlweg 2a, Würzburg, Germany
Keywords:
CRF, Stacked graphical models, Structural consistencies, Collective information extraction, Rule learning.
Abstract:
Conditional Random Fields (CRF) are popular methods for labeling unstructured or textual data. Like many
machine learning approaches, these undirected graphical models assume the instances to be independently
distributed. However, in real world applications data is grouped in a natural way, e.g., by its creation context.
The instances in each group often share additional consistencies in the structure of their information. This
paper proposes a domain-independent method for exploiting these consistencies by combining two CRFs in
a stacked learning framework. The approach incorporates three successive steps of inference: First, an initial
CRF processes single instances as usual. Next, we apply rule learning collectively on all labeled outputs of one
context to acquire descriptions of its specific properties. Finally, we utilize these descriptions as dynamic and
high quality features in an additional (stacked) CRF. The presented approach is evaluated with a real-world
dataset for the segmentation of references and achieves a significant reduction of the labeling error.
1 INTRODUCTION
The vast availability of unstructured and textual data has increased the interest in automatic sequence labeling and content extraction methods over the last years. Among the most popular techniques are Conditional Random Fields (CRF) and their chain-structured variant, the linear chain CRF. CRFs model conditional probabilities with undirected graphs and are trained in a supervised fashion to discriminate label sequences. Although CRFs and related methods achieve remarkable results, there remain many possibilities to increase their accuracy.
One aspect of improvement has been the relax-
ation of the assumption that the instances are inde-
pendent and identically distributed. Relational and
non-local dependencies of instances or interesting en-
tities have been in the focus of collective informa-
tion extraction. Due to the fact that these dependen-
cies need to be represented in the model structure,
approximate inference techniques like Gibbs Sam-
pling (Finkel et al., 2005) or Belief Propagation (Sut-
ton and McCallum, 2004) are applied. These approaches achieved significant improvements, but at the cost of computationally expensive inference. It has been shown by
several approaches that combined models based only
on local features and exact inference can match the
results of complex models while still being efficient.
Kou and Cohen (Kou and Cohen, 2007) used stacked
graphical learning to aggregate the output of a base
learner and to add additional features based on related
instances to a stacked model. Another example is Kr-
ishnan and Manning (Krishnan and Manning, 2006)
who exploit label consistencies with a two-stage CRF.
However, all these approaches take only similar to-
kens or related instances into account while the con-
sistencies of the structure are ignored.
Semi-structured text, like any other data, is always created in a certain context. This may often lead
to structural consistencies between the instances in
this creation context. While these instances are lo-
cally homogeneously distributed in one context, the
dataset is globally still heterogeneous and the struc-
ture of information is possibly contradictory. The bibliographic section of a scientific publication, for example, applies a single style guide, and its instances, i.e., the references, share a very similar structure, while the structure may differ between style guides. Previously published approaches, cf. (Gulhane et al., 2010), represent structural properties directly in a higher-order model and thus suffer from computationally expensive inference; furthermore, they apply a domain-dependent model.
In this paper, we propose a novel and domain-
independent method for exploiting structural consis-
tencies in textual data by combining two linear chain
CRFs in a stacked learning framework. After the in-
stances are initially labeled, a rule learning method is
applied on label transitions within one creation con-
text in order to identify their shared properties. The
stacked CRF is then supplemented with high quality
features that help to resolve possible ambiguities in
the data. We evaluate our approach with a real world
dataset for the segmentation of references, a domain
that is widely used to assess the performance of in-
formation extraction techniques. The results show a
significant reduction of the labeling error and confirm
the benefit of additional features induced online while processing the data.
The rest of the paper is structured as follows: First,
Section 2 gives a short introduction to the background
of the applied techniques. Next, Section 3 describes
how structural consistencies can be exploited with
stacked CRFs. The experimental results are presented
and discussed in Section 4. Section 5 gives a short
overview of the related work and Section 6 concludes
with a summary of the presented work.
2 BACKGROUND
The presented method combines ideas of linear chain
Conditional Random Fields (CRF), stacked graphi-
cal models and rule learning approaches. Thus, these
techniques are briefly introduced in this section.
2.1 Linear Chain Conditional Random
Fields
Linear chain CRFs (Lafferty et al., 2001) are a chain-structured class of discriminative probabilistic graphical models. The chain structure fits sequence labeling tasks well, naturally reflecting the inherent structure of the data while providing efficient inference. By modeling conditional distributions, CRFs are capable of handling large numbers of possibly interdependent features.
Let $x$ be a sequence of tokens $x = (x_1, \dots, x_T)$ referring to observations, e.g. the input text split into lexical units, and $y = (y_1, \dots, y_T)$ a sequence of labels assigned to the tokens. Taking $x$ and $y$ as arguments, let $f_1, \dots, f_K$ be real valued functions, called feature functions. To keep the model small, we restrict the linear chain CRF to be of Markov order one, i.e. the feature functions have the form $f_i(x, y) = \sum_t f_i(x, y_{t-1}, y_t, t)$. A linear chain CRF of Markov order one has $K$ model parameters $\lambda_1, \dots, \lambda_K \in \mathbb{R}$, one for each feature function, by which it assigns the conditional probability

$$P_\lambda(y|x) = \frac{1}{Z_x} \exp\left( \sum_{t=1}^{T} \sum_{i=1}^{K} \lambda_i \cdot f_i(y_{t-1}, y_t, x, t) \right)$$

to $y$ with a given observation $x$. The feature functions typically indicate certain properties of the input, e.g. capitalization or the presence of numbers, while the model parameters weight the impact of the feature functions on the inference. The partition function $Z_x$ normalizes $P_\lambda(y|x)$ by summing over all possible label sequences for $x$. The properties of a token $x_t$ indicated by feature functions usually consider a small fixed-sized window around $x_t$ for a given state transition. In the following, we will use the terms feature and feature function interchangeably.
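To make the formula concrete, the following minimal sketch computes $P_\lambda(y|x)$ for a toy linear chain CRF via the forward recursion; it merely illustrates the equation above and is not the implementation used in our experiments (the dummy start label and all function names are assumptions):

```python
import numpy as np
from scipy.special import logsumexp

def log_potentials(x, lam, feature_fns, n_labels):
    """phi[t, a, b] = sum_i lam_i * f_i(a, b, x, t): the log-space factor
    for a transition from label a to label b at position t."""
    T = len(x)
    phi = np.zeros((T, n_labels, n_labels))
    for t in range(T):
        for a in range(n_labels):
            for b in range(n_labels):
                phi[t, a, b] = sum(l * f(a, b, x, t)
                                   for l, f in zip(lam, feature_fns))
    return phi

def conditional_probability(phi, y, start=0):
    """P(y|x) = exp(score(y)) / Z_x, where Z_x is computed by the forward
    recursion, i.e. a sum over all possible label sequences."""
    T = phi.shape[0]
    prev = [start] + list(y[:-1])
    score = sum(phi[t, prev[t], y[t]] for t in range(T))
    alpha = phi[0, start, :]                 # log-scores after position 0
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + phi[t], axis=0)
    return float(np.exp(score - logsumexp(alpha)))
```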
2.2 Stacked Graphical Models
Stacked Graphical Models is a general meta learn-
ing algorithm, c.f., (Kou and Cohen, 2007; Krishnan
and Manning, 2006). First, the data is processed by
a base learner with conventional features represent-
ing local characteristics. Subsequently, every single
data instance is expanded by information about the
inferred labels of related instances. In a second stage,
a stacked learner is provided with this extra information.
The process of aggregating and projecting the pre-
dicted information on instance features to support an-
other stacked learner can be repeated several times.
Stacked Graphical Models have two central advantages: the approach makes it possible to model long-range dependencies among related instances, and the inference for each learner remains efficient. If the base learner and the stacked learner are both linear chain CRFs, then the inference time is only twice the time of a single CRF plus the effort to determine the aggregate information.
Similar to Wolpert's Stacked Generalization (Wolpert, 1992), Stacked Graphical Models use a cross-fold technique in order to avoid overfitting. That is, each instance $x$ of the training data for the stacked learner $F_k$ at level $k$ is classified by a model for $F_{k-1}$ that was trained on data that did not contain $x$. As a result, the stacked learners get to know realistic errors of their input components that would also occur during runtime. However, Stacked Generaliza-
occur during runtime. However, Stacked Generaliza-
tion and Stacked Graphical Models are essentially dif-
ferent approaches. In short, Stacked Generalization
learns a stacked learner to combine the output of sev-
eral different base learners on a per instance basis. In
contrast, Stacked Graphical Models utilize a stacked
learner to aggregate and combine the output of one
base learner on several instances, thus supporting col-
lective inference.
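A minimal sketch of this cross-fold technique, assuming placeholder train_fn and predict_fn for an arbitrary base learner (the concrete variant used in this paper is described in Section 3.4):

```python
from sklearn.model_selection import KFold

def stacked_training_predictions(instances, labels, train_fn, predict_fn, k=5):
    """Label every training instance with a base model that never saw it,
    so the stacked learner is trained on realistic base-model errors."""
    predictions = [None] * len(instances)
    for train_idx, held_out in KFold(n_splits=k, shuffle=True).split(instances):
        base = train_fn([instances[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        for i in held_out:
            predictions[i] = predict_fn(base, instances[i])
    # these predictions feed the aggregate features of the stacked learner
    return predictions
```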
2.3 Rule Learning
In this paper, we propose to utilize rule-based descrip-
tions as an intermediate step of our general approach,
cf. Section 3. For this task we will transfer the data
into a tabular form of attribute value pairs and learn
rules on this data representation. While a large number of rule learning approaches has been proposed over the last decades, we concentrate on two of them in this paper:
Ripper (Cohen, 1995) is probably the most popular algorithm for learning a set of rules. Ripper learns rules one at a time by growing and pruning each rule and then adds them to a result set until a stop criterion is met. After adding a rule to the result, examples covered by this rule are removed from the training data. Ripper is known to be on par with other rule set learners, e.g., C4.5, regarding classification performance, but is computationally more efficient.
As an alternative, we utilize Subgroup Discovery (Klösgen, 1996) (also called Supervised Descriptive Rule Discovery or Pattern Mining) to describe structural consistencies. In this approach, an exhaustive search for the best $k$ conjunctive patterns in the dataset is performed with respect to a pre-specified target concept and a quality function, e.g., the $F_1$ measure. Additionally, different constraints on the resulting patterns can be applied, e.g., on the maximum number of describing attribute value pairs or the minimum support for a rule. While the resulting rules are not intended to be used directly as a classifier, a related approach using patterns based on improvement of the target share and additional constraints has recently been applied successfully as an intermediate feature construction step for classification tasks (Batal and Hauskrecht, 2010).
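The following simplified sketch illustrates such an exhaustive conjunctive search with an $F_1$ quality function; it is a toy stand-in for the actual subgroup discovery tooling used in Section 4 and assumes rows that all carry the same binary attributes:

```python
from itertools import combinations

def subgroup_discovery(rows, target, max_len=3, top_k=1, min_quality=0.5):
    """Exhaustively score conjunctions of binary attributes against a
    binary target column, ranking the patterns by F1."""
    attributes = [a for a in rows[0] if a != target]
    positives = sum(bool(row[target]) for row in rows)
    scored = []
    for r in range(1, max_len + 1):
        for pattern in combinations(attributes, r):
            covered = [row for row in rows if all(row[a] for a in pattern)]
            tp = sum(bool(row[target]) for row in covered)
            if not covered or not positives:
                continue
            precision, recall = tp / len(covered), tp / positives
            if precision + recall == 0:
                continue
            f1 = 2 * precision * recall / (precision + recall)
            if f1 >= min_quality:
                scored.append((f1, pattern))
    return sorted(scored, reverse=True)[:top_k]
```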
3 METHOD
To introduce the proposed method, we first motivate the problem. Then, the stacked inference, the induction of the structural properties, and the parameter estimation are presented.
3.1 Problem Description
Recall the inference formula of CRFs (cf. Section 2.1). From the model designer's perspective, the classification process is mainly influenced by the choice of the feature functions $f_i$. The feature func-
tions need to provide valuable information to dis-
criminate labels for all possible kinds of instances.
This works well when the feature functions encode
properties that have the same meaning for inference
across arbitrary instances. For example in the domain
of reference segmentation, some special words have
a strong indicative meaning for a certain task: The
word identity feature “WORD=proceedings” always
suggests labeling the token as “Booktitle”. Thus, the
learning algorithm will fix the corresponding weights
to high values, leading the inference procedure into
the right direction. Some features, however, violate
the assumption of a consistent meaning. Their va-
lidity depends on a special context or is restricted
through long range dependencies. In our example of
reference extraction, the feature that indicates colons
might suggest an author label if the document finishes
author fields with colons. However, other style guides
define a different structure of the author labels. Consequently, the learning algorithm assigns the weights to average the overall meaning. On the one hand, this yields good generalization given enough training data. On the other hand, averaging the weights of such features keeps them below their discriminative potential. If we knew that a certain feature has a special meaning inside the given context, we could do better by increasing (or decreasing) the weights, dynamically adapting to the given context. This procedure cannot be performed on some particular features independently of the remaining weights. Hence, we apply a different approach in this paper. Instead of changing the model parameters, we learn the weights of additional feature functions describing these structural and context-dependent consistencies.
Structural consistencies can be found in many do-
mains, especially when several instances are created
within one creation context. Besides the already men-
tioned segmentation of references, where the knowl-
edge about the applied style guide can greatly in-
crease the accuracy, there are many other examples.
The segmentation of physicians’ letters severely de-
pends on the identification of headlines. Each author
applies different and often contradictory layouts for
the headlines using word processing software. How-
ever, the headlines within a letter always rely on an identical structure. By providing information about this consistency in addition to common keywords, the headlines can be accurately identified. As an-
other example, the labeling of interesting entities in
curriculum vitae, e.g., the employer, relies on large
dictionaries. Their incompleteness can be neglected
when exploiting the uniform composition of one cur-
riculum vitae. Besides these examples there are many
other domains like content extraction from websites
or recognition of handwriting that allow an increase
of accuracy by exploiting structural consistencies.
3.2 Stacked Inference
Sequence labeling methods like CRFs assign a sequence of labels $y = (y_1, \dots, y_T)$ to a given sequence of observed tokens $x = (x_1, \dots, x_T)$. Figure 1 contains two examples to illustrate the assignment. Let $crf(x, \Lambda, F) = y$ be the function that applies the CRF model with the weights $\Lambda = \{\lambda_1, \dots, \lambda_K\}$ and the set of feature functions $F = \{f_1, \dots, f_K\}$ on the input sequence $x$ and returns the labeling result $y$. The set of model weights must of course correspond to the set of feature functions. Since the CRF processes this sequence of tokens in one labeling task, we call $x$ an instance. All instances together form the dataset $D$, which is split into disjoint training and testing subsets. A piece of information or an entity often consists of several tokens and is encoded by a sequence of equal labels. We assume here that the given labels already specify an unambiguous encoding. An instance itself may contain multiple entities specified by an arbitrary amount of labels, one label for each token of the input sequence. Furthermore, we assume that the dataset $D = \{C_1, \dots, C_n\}$ can be completely and disjointly partitioned into subsets of instances $x$ that originate from the same creation context $C_i$. Similar to the relational template in (Kou and Cohen, 2007), we imply that a trivial context template exists for the assignment of the context set. Staying with the previous examples, the reference section of this paper defines a context $C$ with 22 instances.
x:  Cohen   ,       W       .       W       .       (      1995   ...
y:  Author  Author  Author  Author  Author  Author  Date   Date   ...

x:  McCallum  ,       A       .       (      2003   )      .      Efficiently  ...
y:  Author    Author  Author  Author  Date   Date   Date   Date   Title        ...

Figure 1: The start of the fourth and the fourteenth reference of the reference section of this paper displayed as a linear chain with correct labels. The indices for x and y start at 1 for each instance.
In stacked graphical learning, several models can be stacked in a sequence. Experimental results, e.g., of Kou (Kou and Cohen, 2007), have shown that this approach already converges with a depth of two learners and that no significant improvements are achieved with more iterations of stacking. Therefore, we only apply stacked graphical learning with CRFs in a two-stage approach like Krishnan and Manning (Krishnan and Manning, 2006). In order to extract entities collectively, we define the stacked inference task on the complete set of instances $x$ in one context $C$. The two CRFs, however, label the single instances within that context separately as usual. The following algorithm summarizes the stacked inference combined with online rule learning. Section 3.3 describes the rule learning techniques for the identification of structural consistencies and how the “meta-features” $f_m$ are induced. Details about the estimation of the weights (e.g., $\Lambda_m$) are discussed in Section 3.4.
1. Apply base CRF
Apply $crf(x, \Lambda, F) = \hat{y}$ on all instances $x \in C$ in order to create the initial label sequences $\hat{y}$.
2. Learn structural consistencies
Create a tabular database $T$ by combining all instances $x \in C$, their corresponding labeled sequences $\hat{y}$ and a feature set $F' \subseteq F$. Learn classification rules for the target attributes and construct a feature function $f_m \in F_m$ for each discovered rule.
3. Apply stacked CRF
Apply $crf(x, \Lambda \cup \Lambda_m, F \cup F_m) = y$ again on all instances $x \in C$ in order to create the final label sequences $y$.
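The following sketch summarizes these three steps in code; base_crf, stacked_crf and learn_rules are assumed wrappers around the CRF models and the rule learner of Section 3.3, not an actual API:

```python
def stacked_inference(context, base_crf, stacked_crf, learn_rules):
    """The three inference steps for all instances x of one creation context C."""
    # 1. Apply the base CRF separately to every instance of the context.
    initial = {x: base_crf.label(x) for x in context}
    # 2. Learn rules describing the label transitions that are consistent
    #    across the whole context; each rule becomes a meta-feature f_m.
    meta_features = [rule.as_feature_function() for rule in learn_rules(initial)]
    # 3. Relabel every instance with the stacked CRF, whose feature set
    #    F u F_m now contains the dynamically induced meta-features.
    return {x: stacked_crf.label(x, extra=meta_features) for x in context}
```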
3.3 Learning Structural Consistencies
during Inference
First, the overall idea of how structural consistencies are learned during the inference is addressed. The technical details are then described after a short example.
We apply rule learning techniques on all (possibly erroneous) label assignments $\hat{y} \in C$ of the base CRF. The rules are learned in order to classify certain label transitions and, thus, describe the shared properties of the transition within the context $C$. The labeling error in the input data is usually eliminated by the generalization of the rule learning algorithm. The label transition is optimally described by a single pattern that covers the majority of transitions despite erroneous outliers. The learned rules are then used as binary feature functions in the same context $C$: they return 1 if the rule applies on the observed token $x_t$, and 0 otherwise. We therefore gain additional features that indicate label transitions if the instances are consistently structured. Even if the learned rules are misleading due to erroneous input data or missing
consistency of the instances, their discriminative impact on the inference is still weighted by the learning algorithm of the stacked CRF.

x:  Klösgen  ,       W       .       (       1996    )       .       Explora  :       A      ...
y:  Author   Author  Author  Author  Author  Author  Author  Author  Author   Author  Title  ...

x:  ,       R       .       ,       editors  ,     Advances   ...
y:  Author  Author  Author  Author  Date     Date  Booktitle  ...

Figure 2: Two excerpts of the ninth reference with erroneous labeling: the date (tokens $x_5$ to $x_8$) and the beginning of the title (tokens $x_9$ and $x_{10}$) were falsely labeled as author, e.g., due to the high weight of the colon for the end of an author. The editor was additionally labeled as author (up to token $x_{41}$) and date (tokens $x_{42}$ and $x_{43}$).
This process is illustrated by a simple example concerning the author label, but it can also be applied to any other label. Let the reference section of this paper be processed by the base CRF, which classified all instances but one correctly as in Figure 1. For some reason the base CRF missed the date and editor and misclassified the tokens $x_5$ to $x_{10}$ and the tokens $x_{18}$ to $x_{43}$ in the ninth reference (cf. Figure 2). The input of the rule learning now consists of 22 transitions from author to date, of which one is incorrect. In this case, a reasonable result of the rule learning is the rule “if the token $x_t$ is a period and followed by a parenthesis, then there is a transition from author to date at $t$”. Converted to a feature function, this rule returns 1 for token $x_4$ and 0 for all other tokens of the reference in Figure 2. The weight of this new feature function is then estimated by the stacked CRF. Therefore, the stacked CRF's likelihood of a transition from author to date is increased at token $x_4$ and decreased at token $x_{41}$ due to the presence or absence of the meta-features.
In general, any classification method can be applied to learn indicators for the structural consistencies. In this work, we restrict ourselves to techniques for supervised descriptive rule discovery because their learning and inference algorithms are efficient and the resulting rules can be interpreted. This allows studies of the properties of good descriptions of structural consistencies. We disregarded the usage of Support Vector Machines (Cortes and Vapnik, 1995) because several models would need to be trained and executed during the stacked inference.
For inducing the meta-features, first one tabular database $T = (I, A)$ is created for each context $C$ as the input of the rule learning techniques described in Section 2.3. The database is constructed using all instances $x \in C$, their corresponding initially labeled sequences $\hat{y}$ and a feature set $F' \subseteq F$. Each individual of $I$ corresponds to a single token of the instances in $C$. The set of attributes $A$ consists of the possible labels and a superset of $F'$: when classification methods are applied on sequential data, attributes are also added for a fixed window, e.g., the attribute “WORD@-1=proceedings” indicates that the token before the current individual equals the string “proceedings”. Hence, this superset contains the feature functions $F'$ and additionally their manifestations in the window defined by the window size $w$. The cells in the tabular database $T$ are filled with binary values. They are set to true if the feature or label occurs at this token and to false otherwise.
In a next step, the target attributes for the rule learning are specified. In this work, we use the transition between two different labels: the target attribute is set on all transitions of two dedicated labels in the initially labeled result $\hat{y}$ of the context $C$. Finally, the set of learned rules is transformed into the set of binary feature functions $F_m$ that return true if the condition of the respective rule applies.
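A minimal sketch of this construction; the feature-object interface (f.name, f.applies) and the column naming are illustrative assumptions:

```python
def build_table(instances, predicted, feature_set, window=1,
                transition=("Author", "Date")):
    """Build the tabular database T: one row per token, binary feature
    columns at offsets -window..window, and a binary target column that
    marks transitions between the two given labels."""
    src, dst = transition
    rows = []
    for x, y_hat in zip(instances, predicted):
        for t in range(len(x)):
            row = {}
            for offset in range(-window, window + 1):
                pos = t + offset
                if 0 <= pos < len(x):
                    for f in feature_set:
                        if f.applies(x, pos):   # e.g. "WORD@-1=proceedings"
                            row["%s@%+d" % (f.name, offset)] = True
            row["TARGET"] = t > 0 and y_hat[t - 1] == src and y_hat[t] == dst
            rows.append(row)                    # absent keys count as false
    return rows
```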
3.4 Parameter Estimation
The weights of two models need to be estimated for
the presented approach: the parameters of the base
model and of the stacked model. The base model
needs to be applied on the training instances for the
estimation of the weights of the stacked model, i.e.,
step 1 and step 2 of the stacked inference in Sec-
tion 3.2 need to be performed on the training set. If
the weights of the base model are estimated as usual using the labeled training instances, then it produces unrealistic predictions on these instances and the meta-features of the stacked model are overfitted, resulting in a decrease of accuracy. Since the base model is optimized in this case on the training instances, it labels these instances perfectly. The learned rules create optimal descriptions of the structural consistencies and the stacked model assigns biased weights to the meta-features. This is of course not reproducible when processing unseen data.
The simple solution to this problem is a cross-
fold training of the base model for the training of the
stacked CRF as described in Section 2.2 and success-
fully applied by several approaches (Kou and Cohen,
2007; Krishnan and Manning, 2006). Training of the
base model in a cross-fold fashion would also be a very good solution for the presented approach, but instead we simply decrease the accuracy of the model by reducing the number of training iterations. Thus, only one model needs to be trained for the learning phase of the stacked model. For the testing phase or common application, however, a single base model learned with the default settings is applied.
The model of the stacked CRF is trained dependent on the base model and the creation context $C$, which are both applied to induce the new features online during the stacked inference. The weights $\Lambda = \{\lambda_1, \dots, \lambda_K\}$ and $\Lambda_m = \{\lambda^m_1, \dots, \lambda^m_M\}$ of the stacked CRF are estimated to maximize the conditional probability on the instances of the training dataset:
$$P_\lambda\big(y \,\big|\, x, C, crf(x, \Lambda', F)\big) = \frac{1}{Z_x} \exp\Bigg( \sum_{t=1}^{T} \sum_{i=1}^{K} \lambda_i \cdot f_i(y_{t-1}, y_t, x, t) + \sum_{t=1}^{T} \sum_{j=1}^{M} \lambda^m_j \cdot f^m_j\big(y_{t-1}, y_t, x, t, C, crf(x, \Lambda', F)\big) \Bigg)$$
The resulting model still relies on the normal feature functions but is extended with dynamic and high quality features that help to resolve ambiguities and substitute for other missing features. These meta-features possess the same meaning in the complete dataset, but change their interpretation or manifestation dependent on the currently processed creation context. Overall, they provide a very good description of the structural consistencies and are often sufficient on their own for a classification of the entities.
A short example: the induced feature function for the transition from author to date is assigned a very high weight for the corresponding state transition of the learned model. As illustrated in the example of Section 3.3, this feature function returns 1 in the reference section of this paper for a token which is a period and is followed by a parenthesis. In other reference sections with a different style guide applied, the feature function for this state transition returns 1 if the token is a colon and is followed by a capitalized word. However, both examples refer to exactly one feature function that dynamically adapts to the currently processed context.
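Such a context-adaptive meta-feature could be wrapped as follows (a sketch; rule.transition and rule.matches are hypothetical accessors of a learned rule):

```python
def make_meta_feature(rule):
    """Wrap a learned rule as a binary feature function f_m for the stacked
    CRF. The same f_m fires on different surface patterns in different
    contexts, since the rules are re-learned for every creation context."""
    def f_m(y_prev, y_cur, x, t):
        fires = (y_prev, y_cur) == rule.transition and rule.matches(x, t)
        return 1.0 if fires else 0.0
    return f_m
```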
4 EXPERIMENTAL RESULTS
The presented approach is evaluated in the domain
of reference segmentation. The common approach is
to separately process the instances, namely the refer-
ences. Within these references, the interesting entities
need to be identified. Since all tokens of a reference
are part of exactly one entity, one speaks of a segmen-
tation task. In this section, we introduce the overall
settings and present the experimental results.
4.1 Datasets
All available and commonly used datasets for the segmentation of references are listings of references without their creation context and are thus not applicable for the evaluation of the presented approach. Therefore, a new dataset was manually annotated with the label set of Peng and McCallum (Peng and McCallum, 2004), covering the fields Author, Booktitle, Date, Editor, Institution, Journal, Location, Note, Pages, Publisher, Tech, Title and Volume. The resulting dataset contains 566 references in 23 documents, extracted exclusively from complete reference sections of real publications. The amount of instances is comparable to previously published evaluations in this domain, cf. (Peng and McCallum, 2004; Councill et al., 2008).
Two different sets of features are used in the experimental study. The basic features are applied for all evaluated models and correspond to the features of well-known evaluations in this domain, cf. (Peng and McCallum, 2004; Councill et al., 2008). For an extensive definition of the set of basic features, we refer to the dataset, which contains all applied basic features. Only a part of the basic features is used for the induction of the meta-features, omitting ngram and token window features. This restriction is justified by their minimal expressiveness for the identification of the structure in relation to the increase of the search space. The dataset with all applied basic features can be freely downloaded¹.
4.2 Implementation Details
The machine learning toolkit Mallet² is used for the implementation of the CRFs in the presented approach. For rule learning, we integrated two different methods. We chose a subgroup discovery implementation³ because of its multifaceted configuration options that allow a deep study of the approach's limits. Additionally, we applied the established rule learner Ripper (Cohen, 1995) in a comparable implementation⁴.
¹ http://www.is.informatik.uni-wuerzburg.de/staff/kluegl_peter/research/
² http://mallet.cs.umass.edu
³ http://sourceforge.net/projects/vikamine/
⁴ http://sourceforge.net/projects/weka/
Table 1: Overview of the evaluated models.

CRF: A single CRF trained on the same data and features.
STACKED CRF: A two-stage CRF approach. The predictions of the base CRF are added as features to the stacked CRF.
STACKED+DESCRIPTIVE: The default approach of a stacked CRF combined with subgroup discovery for rule learning. Only transitions between the labels Author, Title, Date and Pages, which commonly occur in most references, are considered.
STACKED+RIPPER: A stacked CRF combined with the rule learner Ripper. Only the four most common labels are addressed.
STACKED+MORE: A stacked approach using subgroup discovery that additionally learns the transitions of the labels Booktitle, Journal and Volume.
STACKED+MAX: A stacked approach using subgroup discovery that considers the transitions of all labels for the rule learning task.
We used only the default parameters for the CRFs, and all evaluated models were trained until convergence. Only for the training of the stacked model, the number of iterations of the base model was reduced to 50. For the default configuration of both rule learning tasks, we set the window size $w = 1$. Additionally, for the default setting of the subgroup discovery, we used a quality function based on the $F_1$ measure, selected only one rule for each description of a label, restricted the length of the rules to at most three selectors, and set an overall minimum quality threshold of 0.5 for each rule.
The presented approach is overall easy to imple-
ment and only established standard methods are used.
Its inference is still efficient in contrast to complex
models with approximate inference techniques.
4.3 Performance Measure
The performance is measured with commonly used methods of the domain. Let $tp$ be the number of true positive classified tokens and define $fn$ and $fp$ respectively for false negatives and false positives. Since punctuations contain no target information in this domain, only alpha-numeric tokens are considered. Precision, recall and $F_1$ are computed by:

$$precision = \frac{tp}{tp + fp}, \quad recall = \frac{tp}{tp + fn}, \quad F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.$$
4.4 Results
The presented approach is compared to two base line
models in a five fold cross evaluation. Four different
settings of stacked CRFs combined with a rule learn-
ing technique are investigated. A detailed description
of all evaluated models is given in Table 1. The docu-
ments of the dataset are randomly distributed over the
ve folds.
The results of the experimental study are depicted
in Table 2. Only marginal differences can be ob-
served between the two base line models CRF and
STACKED CRF. This indicates that the normal stack-
ing approach cannot exploit the structural consisten-
cies or gain much advantage of the predicted labels.
Table 2: F1 scores averaged over the five folds.

                           average F1
Base Line
  CRF                      91.3
  STACKED CRF              91.8
Our Approach
  STACKED+DESCRIPTIVE      94.0
  STACKED+RIPPER           93.6
Variants of DESCRIPTIVE
  STACKED+MORE             94.1
  STACKED+MAX              93.6
All of our stacked models combined with rule learning techniques significantly outperform the baseline models according to a one-sided, paired t-test on the $F_1$ scores of the single references ($p \leq 0.01$). Comparing the results of STACKED+DESCRIPTIVE, which only considers the consistencies of four labels, to the baseline CRF, our approach achieves an average error reduction of over 30% on the real-world dataset.
The lower $F_1$ scores of STACKED+RIPPER can be explained by its learning algorithm. The Ripper implementation applies coverage-based learning in order to create a set of rules which together classify the target attribute. This can lead to a reproduction of errors of the predicted labels in the description of the structure. In the domain of reference segmentation, a single description of the structure is preferable. However, in other domains where disjoint consistencies of one transition can be found, a covering algorithm for inducing the rules probably performs better.
The second configuration with a subgroup discovery technique, STACKED+MORE, considers the transitions between seven labels and is able to slightly increase the measured $F_1$ score compared to our default model STACKED+DESCRIPTIVE. STACKED+MAX, which induces rules for all labels, achieves only an average error reduction of 26% compared to a single CRF. This is mainly caused by misleading meta-features for rare labels. The task of learning consistencies from a minimal amount of examples is error-prone and can decrease the accuracy, especially if the examples are labeled incorrectly.
Table 3 provides closer insights into the benefit of the presented approach using the author label as an example. STACKED+DESCRIPTIVE is able to significantly improve the labeling accuracy for all folds but one. The third fold contains an unfavorable distribution of style guides between the training and testing set for the author. If the initial base CRF labels a label systematically incorrectly, then the rule learning cannot induce any valuable and correct descriptions of the structure. Nevertheless, an average error reduction of over 50% is achieved for identifying the author of the reference.
Table 3: F1 scores of the author label.

          CRF    STACKED+DESCRIPTIVE    error reduction
Fold 1    97.7   99.6                   82.6%
Fold 2    97.0   99.2                   73.3%
Fold 3    96.4   96.5                   2.8%
Fold 4    97.1   98.8                   58.6%
Fold 5    89.5   95.1                   53.3%
average   95.5   97.8                   51.6%
To our knowledge, no domain-independent approach has been published that could be utilized as a comparable model. As a comparison, we applied the skip-chain approach of (Sutton and McCallum, 2004) with factors for capitalized words and additionally for identical punctuation marks, but no improvement over the baseline models could be measured. Furthermore, the feature induction for CRFs (McCallum, 2003) was integrated, but counter-intuitively resulted in a decrease of the accuracy.
The runtime of the presented approach for one fold, averaged over the five folds, is several times faster than that of a higher-order model with skip edges: about nine times faster using the subgroup discovery and about fourteen times faster using Ripper. The difference in speed is smaller than in previously published evaluations (Kou and Cohen, 2007). This is mainly caused by the fact that the rule learning is neither optimized for this task nor for the domain, e.g., by pruning the attributes.
The presented approach significantly outperforms the common CRF without any additional domain knowledge, integrated matching methods with a bibliographic database, or other jointly performed tasks like entity resolution. Nevertheless, the approach stays well behind its potential. The meta-features specify most of the time a perfect classification of the boundaries and transitions, but the stacked CRF still labels some entities erroneously. Providing the knowledge on a simple feature level may not be sufficient to adapt the model to the structure of the current creation context.
5 RELATED WORK
In the following, we give a brief overview of related work coming from different domains with context consistencies, approaches utilizing complex graphical models and stacked graphical models for collective information extraction, and approaches to feature induction.
Especially for Named Entity Recognition (NER), modeling long-distance dependencies is crucial. The labeling of an entity is quite consistent within a given document; however, conclusive discriminating features are sparsely spread across the document. As a consequence, leveraging predictions of one instance to disambiguate others is essential. Bunescu et al. (Bunescu and Mooney, 2004), Sutton et al. (Sutton and McCallum, 2004) and Finkel et al. (Finkel et al., 2005) extended the commonly applied linear chain CRF to higher order structures. The exponential increase in model complexity forces a switch from exact to approximate inference techniques. Stacked graphical models (Kou and Cohen, 2007; Krishnan and Manning, 2006) retain exact inference as well as efficiency by using linear chain CRFs.
In Kou and Cohen's Stacked Graphical Models framework (Kou and Cohen, 2007), information is propagated by Relational Templates C. Although each C may find related instances and aggregate their labels in a possibly complex manner, they utilize rather simple aggregators, e.g. COUNT and EXISTS. Likewise, the approach of Krishnan and Manning (Krishnan and Manning, 2006) uses straightforward but, for NER, effective aggregate features, concentrating on the label predictions of the same entity in other instances. In contrast to Stacked Graphical Models, they also include corpus-level features, aggregating predictions across documents. In this paper, we use data mining techniques to determine rich features that are applied context-sensitively. Rather than simply transferring labels of related instances, e.g., by majority vote aggregation, we exploit structural properties of a given context. We represent the gathered context knowledge by several meta-features which are conceptually independent of the label types.
A semi-supervised approach to exploiting structural consistencies of documents has been taken by Arnold and Cohen (Arnold and Cohen, 2008), who improve domain adaptation by conditional frequency information of the unlabeled data. They show that differences in the frequency distribution of tokens across different sections in biological research papers can provide useful information to extract protein names. Counting frequencies can be done efficiently, and the experimental results suggest that these features are robust across documents. However, in general unlabeled data is not enough to model the context structure, e.g., frequency information can be noisy or differences in the frequency distribution may have non-structural causes. We propose to mine the distributions of predicted labels and their combinations with observed features to capture context structure.
Yang et al. use structural consistencies for in-
formation extraction from web forums (Yang et al.,
2009). They employ Markov Logic Networks
(Richardson and Domingos, 2006) with formulas to
encode the assumed structural properties of a typical
forum page, e.g., characteristic link structures or tag
and attribute similarities among different posts and
sites. Since context structure is represented inside of
the graphical model, inference and learning have to
fight model complexity. Another example for con-
tent extraction from websites that exploits related in-
stances is Gulhane et al. (Gulhane et al., 2010). They
assume two properties of web information: The val-
ues of an attribute distributed over different pages are
similar for equal entities and the pages of one web-
site share a similar structure due to the creation tem-
plate. In contrast to those two approaches, the work presented in this paper does not rely on any structural knowledge about the domain that is known in advance.
McCallum contributed an improvement for CRF applications through feature induction (McCallum, 2003). Based on a given training set, useful combinations of features are computed, reducing the number of model parameters. The feature induction of our approach is performed online during the processing of the document and applies flexible data mining techniques to specify the properties of consistent label transitions. Learning Flexible Features for Conditional Random Fields (Stewart et al., 2008) is an approach by Stewart et al. that also induces features. They propose Conditional Fields of Experts (CFOE), a CRF augmented with latent hidden states.
6 CONCLUSIONS
We have presented a novel approach for collective
information extraction using a combination of two
CRFs together with rule learning techniques to induce
new features during inference. The initial results of
the first CRF are exploited to gain information about
the structural consistencies. Then, the second CRF
is automatically adapted to the previously unknown
composition of the entities. This is achieved by
changing the manifestation of its features dependent
on the currently processed set of instances. To the best of our knowledge, no similar domain-independent approach has been published that is able to exploit the structural consistencies in textual data. The results on a real-world dataset for the segmentation of references indicate a significant improvement over the commonly applied models.
For future work, better ways to include the high quality descriptions need to be found for exploiting the structural consistencies. One possibility is the Generalized Expectation Criteria of Mann and McCallum (Mann and McCallum, 2010), which allow learning with constraints that cover more expressive and structural dependencies than the underlying model. Another approach that could solve this problem is to use a complex model for joint inference despite the more expensive inference. Instead of performing two different tasks in the same inference, such as segmentation and matching (Poon and Domingos, 2007; Singh et al., 2009), a joint model could be used to infer the labeled sequence together with the description of the structural properties.
REFERENCES
Arnold, A. and Cohen, W. W. (2008). Intra-document Struc-
tural Frequency Features for Semi-supervised Domain
Adaptation. In Proceeding of the 17th ACM con-
ference on Information and knowledge management,
pages 1291–1300. ACM.
Batal, I. and Hauskrecht, M. (2010). Constructing Clas-
sification Features using Minimal Predictive Patterns.
In Proceedings of the 19th ACM International Con-
ference on Information and Knowledge Management,
CIKM ’10, pages 869–878, New York, NY, USA.
ACM.
Bunescu, R. and Mooney, R. J. (2004). Collective Informa-
tion Extraction with Relational Markov Networks. In
Proceedings of the 42nd Annual Meeting on Associa-
tion for Computational Linguistics, ACL ’04, Strouds-
burg, PA, USA. Association for Computational Lin-
guistics.
Cohen, W. W. (1995). Fast Effective Rule Induction. In Pro-
ceedings of the Twelfth Int. Conference on Machine
Learning, pages 115–123. Morgan Kaufmann.
Cortes, C. and Vapnik, V. (1995). Support-vector networks.
Machine Learning, 20(3):273–297.
Councill, I., Giles, C. L., and Kan, M.-Y. (2008). ParsCit:
an Open-source CRF Reference String Parsing Pack-
age. In Proceedings of the Sixth International Lan-
guage Resources and Evaluation (LREC’08), Mar-
rakech, Morocco. ELRA.
Finkel, J. R., Grenager, T., and Manning, C. (2005). In-
corporating non-local Information into Information
Extraction Systems by Gibbs Sampling. In Pro-
ceedings of the 43rd Annual Meeting on Association
for Computational Linguistics, ACL ’05, pages 363–
370, Stroudsburg, PA, USA. Association for Compu-
tational Linguistics.
Gulhane, P., Rastogi, R., Sengamedu, S. H., and Tengli,
A. (2010). Exploiting Content Redundancy for Web
Information Extraction. Proc. VLDB Endow., 3:578–
587.
Klösgen, W. (1996). Explora: A Multipattern and
Multistrategy Discovery Assistant. In Fayyad, U.,
Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R.,
editors, Advances in Knowledge Discovery and Data
Mining, pages 249–271. AAAI Press.
Kou, Z. and Cohen, W. W. (2007). Stacked Graphical Mod-
els for Efficient Inference in Markov Random Fields.
In Proceedings of the 2007 SIAM Int. Conf. on Data
Mining.
Krishnan, V. and Manning, C. D. (2006). An Effective two-
stage Model for Exploiting non-local Dependencies in
Named Entity Recognition. In Proceedings of the 21st
International Conference on Computational Linguis-
tics and the 44th annual meeting of the Association
for Computational Linguistics, ACL-44, pages 1121–
1128, Stroudsburg, PA, USA. Association for Compu-
tational Linguistics.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Condi-
tional Random Fields: Probabilistic Models for Seg-
menting and Labeling Sequence Data. Proc. 18th In-
ternational Conf. on Machine Learning, pages 282–
289.
Mann, G. S. and McCallum, A. (2010). Generalized Ex-
pectation Criteria for Semi-Supervised Learning with
Weakly Labeled Data. J. Mach. Learn. Res., 11:955–
984.
McCallum, A. (2003). Efficiently Inducing Features of
Conditional Random Fields. In Nineteenth Confer-
ence on Uncertainty in Artificial Intelligence (UAI03).
Peng, F. and McCallum, A. (2004). Accurate Information
Extraction from Research Papers using Conditional
Random Fields. In HLT-NAACL, pages 329–336.
Poon, H. and Domingos, P. (2007). Joint Inference in In-
formation Extraction. In AAAI’07: Proceedings of the
22nd National Conference on Artificial intelligence,
pages 913–918. AAAI Press.
Richardson, M. and Domingos, P. (2006). Markov Logic
Networks. Machine Learning, 62(1-2):107–136.
Singh, S., Schultz, K., and McCallum, A. (2009). Bi-
directional Joint Inference for Entity Resolution
and Segmentation Using Imperatively-Defined Factor
Graphs. In Proceedings of the European Conference
on Machine Learning and Knowledge Discovery in
Databases: Part II, ECML PKDD ’09, pages 414–
429. Springer-Verlag.
Stewart, L., He, X., and Zemel, R. S. (2008). Learning Flex-
ible Features for Conditional Random Fields. IEEE
Trans. Pattern Anal. Mach. Intell., 30(8):1415–1426.
Sutton, C. and McCallum, A. (2004). Collective Segmen-
tation and Labeling of Distant Entities in Information
Extraction. In ICML Workshop on Statistical Rela-
tional Learning and Its Connections to Other Fields.
Wolpert, D. H. (1992). Stacked Generalization. Neural Net-
works, 5:241–259.
Yang, J.-M., Cai, R., Wang, Y., Zhu, J., Zhang, L., and
Ma, W.-Y. (2009). Incorporating Site-level Knowl-
edge to Extract Structured Data from Web Forums. In
Proceedings of the 18th international conference on
World wide web, pages 181–190. ACM.