They proposed a new strategy to update the weight vector: only the weights of misclassified samples from the different distribution are updated. We can therefore apply this strategy to semi-supervised methods. More details are presented in the next section.
3 PROPOSED METHOD
To enable transfer learning, we let the unlabeled data set with the new page layout play a role in building the classification model. We call these data target-domain data. Because these target-domain data have no labels, we cannot use them directly to train a classifier. The labeled training data, whose distribution may differ from that of the target-domain data, perhaps because they are outdated, are called source-domain data. Classifiers learned from these data cannot classify the target-domain data well because the two domains differ.
Formally, let $X_t$ be the target-domain data, $X_s$ be the source-domain data, $X = X_t \cup X_s$ be the domain data, and $Y = \{y_i\}$ be the set of category labels. The training data set $T$ contains a labeled set $T_s$ and an unlabeled set $T_t$. $T_s$ represents the source-domain data, $T_s = \{(x_i^s, y_i^s)\}$, where $x_i^s \in X_s$ $(i = 1, \ldots, N)$. $T_t$ represents the target-domain data, $T_t = \{x_i^t\}$, where $x_i^t \in X_t$ $(i = 1, \ldots, M)$. $N$ and $M$ are the sizes of $T_s$ and $T_t$, respectively. The combined training set $T = \{(x_i, y_i)\}$ is defined as follows:
$$x_i = \begin{cases} x_i^s, & i = 1, \ldots, N; \\ x_i^t, & i = N + 1, \ldots, N + M. \end{cases}$$
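For concreteness, a minimal sketch of how this combined set might be represented in code is given below (in Python; the variable names and the use of None for missing labels are our own illustration, not taken from any implementation in this paper):

    # Sketch of building the combined training set T.
    # source_data plays the role of T_s (pairs of token and label sequences);
    # target_data plays the role of T_t (token sequences only).
    def build_combined_set(source_data, target_data):
        T = [(x_s, y_s) for (x_s, y_s) in source_data]  # i = 1, ..., N
        T += [(x_t, None) for x_t in target_data]       # i = N + 1, ..., N + M
        return T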
Here, $T_s$ corresponds to labeled data from the source domain that we try to reuse as much as we can; however, we do not know which part of $T_s$ is useful to us. All we have is an unlabeled data set $T_t$ from the target domain, and we use these data to find the useful part of $T_s$. The problem we are trying to solve is: given an unlabeled data set $T_t$ from the target-domain data and a labeled data set $T_s$ from the source-domain data, train an analyzer that labels each token with its type of bibliographic component.
We now present our method, Transfer-CRF, which extends Unilateral-TrAdaBoost (Quang-Hong and Takasu, 2014) to CRFs. Like most traditional machine learning methods, Unilateral-TrAdaBoost needs some labeled data to train on, so it cannot be used to learn a new model automatically. In Transfer-CRF, we therefore apply Unilateral-TrAdaBoost's learning strategy to keep only consistent samples when building the model; that is, our extension adds a mechanism for choosing useful samples.
A formal description is presented in Algorithm 1. At each iteration, if a training sample from the source-domain data is mispredicted, it may conflict with the target-domain data. We therefore reduce its effect, either by removing it from the training set or by reducing its weight in the training phase (here we remove it from the training set). The algorithm stops when the number of iterations exceeds a user-specified limit or when no sample can be removed from the source-domain data. The output of the algorithm contains the labeled data that are consistent with the unlabeled data; we can therefore use them to train a better analyzer.
Algorithm 1: Transfer-CRF.
1. procedure TRANSFER-CRF($T_t$, $T_s$, $K$)
2.   Input: two data sets $T_t$ and $T_s$ to train on, and the number of iterations $K$
3.   for $i \leftarrow 1, K$ do
4.     Call the semi-supervised CRF, providing the combined training set $T$. Return with a hypothesis $h_i : X \to Y$.
5.     if $h_i(x_i \in X_s) \neq y_i$ then
6.       Remove $x_i$ from $T_s$
7.     end if
8.     if no sample can be removed then
9.       break
10.    end if
11.  end for
12.  Output: a new analyzer $h_f$, which is the last hypothesis
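To make the procedure concrete, the following Python sketch mirrors Algorithm 1. This is a minimal illustration, not the implementation evaluated in this paper; train_semi_supervised_crf is a hypothetical placeholder for any semi-supervised CRF trainer, assumed to accept labeled and unlabeled sequences and to return a model with a predict() method mapping a token sequence to a label sequence.

    def transfer_crf(T_t, T_s, K):
        # T_t: unlabeled target-domain sequences; T_s: labeled (x, y)
        # source-domain pairs; K: maximum number of iterations.
        h = None
        for _ in range(K):
            # Step 4: train a semi-supervised CRF on the combined set T.
            h = train_semi_supervised_crf(labeled=T_s, unlabeled=T_t)
            # Steps 5-6: drop source samples the hypothesis mispredicts,
            # since they are assumed to conflict with the target domain.
            consistent = [(x, y) for (x, y) in T_s if h.predict(x) == y]
            # Steps 8-10: stop early when no sample can be removed.
            if len(consistent) == len(T_s):
                break
            T_s = consistent
        return h  # the final analyzer h_f

Removing a conflicting sample outright, rather than down-weighting it, matches the choice made in the description above.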
4 EXPERIMENTS
This section examines the efficiency and effectiveness of the proposed method by evaluating its accuracy in labeling unlabeled data that have a new layout.
4.1 Dataset
For this experiment, we used the same three journals as in our previous study (M. Ohta and Takasu, 2010), as follows:
- Journal of Information Processing by the Information Processing Society of Japan (IPSJ): We used papers published in 2003 in this experiment. This dataset contains 479 papers, most of which are written in Japanese.
- English IEICE Transactions by the Institute of Electronics, Information and Communication Engineers in Japan (IEICE-E): We used papers pub-