as well as the proposed method:
• journal: 10 training samples are chosen according
to the normalized likelihood of the CRF for journal
in the initial phase, and 10 training samples are
chosen according to the normalized likelihood in
each update phase.
Because we obtained similar results with the average
entropy metric, we show only the results for
normalized likelihood in this section.
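The selection step used by these strategies can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that the normalized likelihood is the log-likelihood of the best label sequence divided by the sequence length, and that the samples with the lowest values are the ones sent for labeling. The function names and the tuple layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def normalized_log_likelihood(log_likelihood: float, length: int) -> float:
    """Per-token log-likelihood of the best label sequence (assumed definition)."""
    if length <= 0:
        raise ValueError("sequence length must be positive")
    return log_likelihood / length


def select_least_confident(
    scored: Sequence[Tuple[str, float, int]], k: int = 10
) -> List[str]:
    """Return the ids of the k samples with the lowest normalized log-likelihood.

    Each item is (sample_id, log_likelihood, sequence_length).
    Assumption: a low value means the CRF is unsure about the page.
    """
    ranked = sorted(
        scored, key=lambda item: normalized_log_likelihood(item[1], item[2])
    )
    return [sample_id for sample_id, _, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy pool of four title pages with made-up scores.
    pool = [("p1", -2.0, 10), ("p2", -9.0, 10), ("p3", -3.0, 30), ("p4", -12.0, 20)]
    print(select_least_confident(pool, k=2))  # ['p2', 'p4']
```

A selector based on average entropy could reuse the same ranking logic with a different scoring function, although the sort direction would then depend on how that metric is defined.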
Figures 3 (a), (b), and (c) show the accuracy of the
CRFs for the journals IPSJ, IEICE-E, and IEICE-J,
respectively. Each graph plots the accuracy of the
CRF against the number of training samples for the
three sampling strategies.
First, both the proposed strategy and nlh obtained
accurate CRFs with fewer samples than random. This
indicates that the sampling strategy for the update
phase is effective.
Second, comparing the proposed strategy with nlh,
the proposed strategy obtains a slightly better
initial CRF; its accuracy is plotted at the training data
size of 10. This indicates that a sampling strategy
using a CRF designed for another journal can improve
the active learning process.
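The two-phase procedure discussed here can be outlined as a short loop. The sketch below is illustrative only: the scorer and trainer are passed in as plain callables, a lower score is assumed to mean lower confidence, and all names (pick, active_sampling, initial_scorer, train) are hypothetical rather than taken from the experimental code.

```python
from typing import Callable, List, Set

Scorer = Callable[[str], float]  # assumption: lower score = less confident


def pick(pool: Set[str], scorer: Scorer, k: int) -> List[str]:
    """Pick the k pool items with the lowest score; ties are broken by id."""
    return sorted(pool, key=lambda s: (scorer(s), s))[:k]


def active_sampling(
    pool: Set[str],
    initial_scorer: Scorer,
    train: Callable[[List[str]], Scorer],
    batch_size: int = 10,
    updates: int = 2,
) -> List[str]:
    """Return the labeled sample ids in the order they were selected."""
    remaining = set(pool)
    # Initial phase: rank with a scorer that exists before any labeling,
    # for example a CRF already trained for another journal.
    labeled = pick(remaining, initial_scorer, batch_size)
    remaining -= set(labeled)
    for _ in range(updates):
        if not remaining:
            break
        # Update phase: retrain on everything labeled so far, then rank again.
        scorer = train(labeled)
        batch = pick(remaining, scorer, batch_size)
        labeled.extend(batch)
        remaining -= set(batch)
    return labeled


if __name__ == "__main__":
    # Made-up scores standing in for normalized likelihoods.
    other = {"a": -0.9, "b": -0.1, "c": -0.8, "d": -0.2, "e": -0.5, "f": -0.3}
    current = {"a": 0.0, "b": -0.7, "c": 0.0, "d": -0.1, "e": -0.2, "f": -0.6}
    order = active_sampling(
        set(other),
        initial_scorer=lambda s: other[s],
        train=lambda labeled: (lambda s: current[s]),
        batch_size=2,
        updates=1,
    )
    print(order)  # ['a', 'c', 'b', 'f']
```

Keeping the initial scorer separate from the retrained one makes it easy to swap only the initial phase, which is the point of comparison between the strategies.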
5 CONCLUSIONS
We have examined two statistical measures obtained
using a linear-chain CRF for detecting layout changes
of title pages of academic papers and obtaining new
CRFs for extracting information from academic
title pages. The experiments revealed that both
statistical measures are very effective at detecting
layout changes. We also showed that the measures can
be used for active sampling to reduce the labeling
cost of training data.
We plan to extend this study in several directions.
First, it is unknown how the CRF’s sequence labeling
accuracy affects the change detection accuracy.
To study this problem, we plan two kinds of
experiments: (1) controlling the labeling accuracy by
the size of the training data, obtaining CRFs with
various labeling accuracies, and comparing them for
change detection, and (2) applying our approach to
more complex sequence labeling problems.
Second, in this paper we used only datasets that we
prepared ourselves. To make comparison easier, we
plan to evaluate the method using other open datasets
such as the ICDAR 2009 layout dataset
(Antonacopoulos et al., 2009).
REFERENCES
Antonacopoulos, A., Bridson, D., Papadopoulos, C., and
Pletschacher, S. (2009). A realistic dataset for
performance evaluation of document layout analysis.
In ICDAR 2009, pages 296–300.
Councill, I. G., Giles, C. L., and Kan, M.-Y. (2008). ParsCit:
An open-source CRF reference string parsing package.
In LREC, page 8.
Krishnamoorthy, M., Nagy, G., and Seth, S. (1992).
Syntactic segmentation and labeling of digitized pages
from technical journals. IEEE Computer, 25(7):10–22.
Kudo, T., Yamamoto, K., and Matsumoto, Y. (2004).
Applying conditional random fields to Japanese
morphological analysis. In EMNLP 2004.
Lafferty, J., McCallum, A., and Pereira, F. (2001).
Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In 18th ICML,
pages 282–289.
Nicolas, S., Dardenne, J., Paquet, T., and Heutte, L. (2007).
Document image segmentation using a 2D conditional
random field model. In ICDAR 2007, pages 407–411.
Ohta, M., Inoue, R., and Takasu, A. (2010). Empirical
evaluation of active sampling for CRF-based analysis of
pages. In IEEE IRI 2010, pages 13–18.
Ohta, M. and Takasu, A. (2008). CRF-based authors’ name
tagging for scanned documents. In JCDL ’08, pages
272–275.
Pan, S. J. and Yang, Q. (2010). A survey on transfer
learning. IEEE Transactions on Knowledge and Data
Engineering, 22(10):1345–1359.
Peng, F. and McCallum, A. (2004). Accurate information
extraction from research papers using conditional
random fields. In HLT-NAACL, pages 329–336.
Saar-Tsechansky, M. and Provost, F. (2004a). Active
sampling for class probability estimation and ranking.
Machine Learning, 54(2):153–178.
Saar-Tsechansky, M. and Provost, F. (2004b). Active
sampling for class probability estimation and ranking.
Machine Learning, 54(2):153–178.
Takasu, A. (2003). Bibliographic attribute extraction from
erroneous references based on a statistical model. In
JCDL ’03, pages 49–60.
Wang, Y., Phillips, I. T., and Haralick, R. M.
(2004). Table structure understanding and its
performance evaluation. Pattern Recognition,
37(7):1479–1497.
Zhu, J., Nie, Z., Wen, J.-R., Zhang, B., and Ma, W.-Y.
(2005). 2D conditional random fields for web
information extraction. In ICML 2005.
ICPRAM 2014 - International Conference on Pattern Recognition Applications and Methods