document d. Further, we assume that the distributions
of the labels for a given document are the same among
the domains on a latent subspace S obtained with a
projection matrix P_S, hence p_i(y | P_S · d) = p_j(y | P_S · d).
After initially training an SVM on the source-domain
documents nearest to the target domain, we apply the
model to all documents from the source domains. For
these documents we estimate a confidence value for
the prediction based on the probabilistic outputs as
described above. Based on this value we choose the
documents whose predictions have low confidence for
retraining our SVM model.
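As a concrete sketch, such a confidence value can be read off the probabilistic outputs of an SVM (Platt scaling in scikit-learn); the data, the threshold, and all variable names below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: confidence of SVM predictions from probabilistic outputs.
# Data and threshold are toy assumptions for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))          # stand-in for initial training documents
y_train = (X_train[:, 0] > 0).astype(int)    # toy binary labels

clf = SVC(probability=True).fit(X_train, y_train)

X_source = rng.normal(size=(10, 5))          # stand-in for source-domain documents
proba = clf.predict_proba(X_source)          # class probabilities per document
confidence = proba.max(axis=1)               # confidence = probability of predicted class

low_conf = X_source[confidence < 0.7]        # candidates for retraining
```

Documents falling below the (assumed) threshold would then be considered for the retraining step described above.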
Since we now have different domains, we expect
that not all domains or documents are similarly use-
ful for retraining. Even within one domain we ex-
pect some documents to be more useful for the train-
ing than others. That is why we include the distance
measure in our Active Learning strategy.
We use two criteria for choosing the documents
from the different source domains for the training.
First, as in confidence-based Active Learning, we esti-
mate a confidence value for the prediction of a trained
model on the documents from the source domains
projected onto the latent space of the target domain.
Second, we incorporate the distance of the documents
from the source domains to the latent space to estimate
their potential value for the training when we apply
the SVM to the target domain.
The equation σ(d) = (λ · γ(f, d) + (1 − λ) · δ(d, D_t))^{−1}
defines the selection factor as the inverse
of the weighted sum of the confidence value γ and the
distance δ of the corresponding document to the target
domain. The larger this value, the more similar the
document is to the target domain and, at the same time,
the more uncertain the SVM is in its prediction for this
document. Among all documents from the domains
other than the target domain, the closest ones that are
predicted with the least confidence are chosen for retraining.
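The selection factor can be written down directly; the value of λ and the toy inputs below are illustrative assumptions.

```python
# Sketch of the selection factor sigma(d) = (lambda*gamma + (1-lambda)*delta)^-1.
# gamma: confidence of the SVM prediction for document d.
# delta: distance of d to the target domain's latent subspace.
# lam (lambda) is an assumed weighting, not a value from the paper.
def selection_factor(gamma: float, delta: float, lam: float = 0.5) -> float:
    return 1.0 / (lam * gamma + (1.0 - lam) * delta)

# A close, uncertain document gets a larger factor than a distant, confident one.
close_uncertain = selection_factor(gamma=0.2, delta=0.1)
far_confident = selection_factor(gamma=0.9, delta=0.8)
```

Here a small γ (low confidence) and a small δ (close to the target domain) both drive the factor up, matching the selection rule above.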
6 EXPERIMENTS
In this section we perform extensive experiments on
benchmark data sets to validate our proposed method.
In the first experiment we test how well the la-
tent subspace representations, i.e. the em-
beddings of documents from the different domains
into a latent subspace, can be used for training a clas-
sifier. In the vector space model, each document
is represented by a large vector, with each com-
ponent counting the number of times a certain word appears in
the document. LSA is used to find a latent subspace
that covers the most important parts of the documents.
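The LSA step can be sketched as a truncated SVD fitted on the target domain's term-document matrix, onto which source documents are then projected; the toy corpus, the dimensionality, and the use of scikit-learn are illustrative assumptions.

```python
# Sketch: extract a latent subspace from the target domain with LSA
# (truncated SVD) and project source-domain documents onto it.
# Corpus and n_components are toy assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

target_docs = ["stocks rose after the merger",
               "the company reported earnings",
               "shares of the firm fell",
               "quarterly profit beat estimates"]
source_docs = ["the mayor visited the city",
               "profit and shares were discussed"]

vec = CountVectorizer()
X_target = vec.fit_transform(target_docs)         # bag-of-words counts
lsa = TruncatedSVD(n_components=2).fit(X_target)  # subspace from target domain

X_source = vec.transform(source_docs)
Z_source = lsa.transform(X_source)                # source docs in target subspace
```

A classifier trained on `Z_source` then operates in the target domain's latent space, which is the setting compared against the Bag-of-Words baseline in Table 1.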
Table 1: Accuracy on separating documents about organiza-
tions from documents about people and about places, respec-
tively. We use only data from the source subcategories for
the training. The Baseline is an SVM trained on the whole
vector space of the source domain. LSA means the SVM is
trained on the projection of the source domain onto the sub-
space extracted from the target domain by LSA.

Data sets      Baseline   LSA
Org−People     80.3       82.0
Org−Places     61.7       70.8
We use the Reuters data set in the same config-
uration as (DXYY07). The documents are
about organizations, people, and places. The task is to
distinguish the documents about organizations from
the documents about people and places, respectively. The
training is done on a subset of subcategories and the
testing on different subcategories.
Table 1 shows the results of an SVM classifier
trained on a source domain of subcategories that
contains only documents about organizations. For
comparison we use a simple baseline method that
uses no latent representation, but simply the
Bag-of-Words representation. The performance of the
SVM in these subspaces is much better than that of
the baseline. This already shows a potential benefit
of a projection onto a latent subspace for making the do-
mains more similar.
Next, we validate our Active Learning strategy.
We train the SVM on an initial 300 documents from the
source domains that are closest to the latent subspace
extracted from the target domain. Next, we apply
the SVM to all remaining documents in the source do-
mains and calculate the selection factor σ defined
above. The documents with the highest selection factors
are chosen for the retraining of the SVM.
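Selecting the documents with the highest selection factors amounts to a simple top-k selection; the values of σ and k below are toy assumptions.

```python
# Sketch: pick the k documents with the highest selection factor
# for retraining. sigma values and k are toy assumptions.
import numpy as np

sigma = np.array([0.8, 2.5, 1.1, 3.0, 0.4])   # selection factors of source docs
k = 2
chosen = np.argsort(sigma)[::-1][:k]          # indices of the k largest factors
# The documents at these indices are added to the retraining set.
```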
Figure 1 shows the increase in accuracy of an SVM
trained with our proposed Active Learning strategy.
We see that already after 600 and 900 documents, re-
spectively, we get better results than the baseline that
has been trained on the whole source data set.
In our second experiment, we investigate how
well our proposed method performs when we have
more than one source domain. Here, we expect that
some domains might be better suited for the training
than others. Besides the Bag-of-Words representa-
tion together with LSA, we also tested the representa-
tion as a sequence of words together with LDA. As dis-
tance measure we again used the Euclidean distance
for the Bag-of-Words representation, and the KL-diver-
gence for the representation as a sequence of words.
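The KL-divergence between two topic distributions, as produced by LDA, can be computed with `scipy.stats.entropy`; the toy distributions below are assumptions for illustration.

```python
# Sketch: KL-divergence between LDA topic distributions of a source
# document and the target domain. Distributions are toy assumptions.
import numpy as np
from scipy.stats import entropy

p_source = np.array([0.6, 0.3, 0.1])   # topic mixture of a source document
p_target = np.array([0.5, 0.3, 0.2])   # topic mixture of the target domain

kl = entropy(p_source, p_target)       # KL(p_source || p_target), 0 iff equal
```

This distance plays the same role for the sequence-of-words representation as the Euclidean distance does for the Bag-of-Words representation.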
We used the Amazon review data set in the same
configuration as Blitzer et al. (BMP06). The re-
view documents are about books (B), DVDs (D), elec-
ICPRAM 2015 - International Conference on Pattern Recognition Applications and Methods