Table 1: Extraction accuracy.
Method   IPSJ    IEICE-E   IEICE-J
Chain    0.938   0.949     0.798
2D       0.962   0.964     0.855
• number of characters in the cell,
• proportion of alphanumerics,
• proportion of hiragana and katakana,
• proportion of symbols, and
• presence of predefined keywords.
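As an illustration only, the cell features listed above could be computed along the following lines. This is a minimal sketch, not the implementation used in the experiments: the function names, the `KEYWORDS` list, the restriction of alphanumerics to ASCII, and the use of Unicode general categories to detect symbols are all assumptions made for the example.

```python
import unicodedata

# Hypothetical keyword list; the paper does not enumerate its predefined keywords.
KEYWORDS = ("abstract", "keywords", "university")


def is_kana(ch):
    # Assumption: kana means the Hiragana (U+3040-U+309F) and
    # Katakana (U+30A0-U+30FF) Unicode blocks.
    return "\u3040" <= ch <= "\u30ff"


def is_symbol(ch):
    # Assumption: a symbol is any character whose Unicode general
    # category is punctuation (P*) or symbol (S*).
    return unicodedata.category(ch)[0] in ("P", "S")


def cell_features(text):
    """Return the five cell features for the text of one cell."""
    n = len(text)

    def ratio(pred):
        # Proportion of characters satisfying pred; 0.0 for an empty cell.
        return sum(1 for ch in text if pred(ch)) / n if n else 0.0

    lowered = text.lower()
    return {
        "num_chars": n,
        "alnum_ratio": ratio(lambda ch: ch.isascii() and ch.isalnum()),
        "kana_ratio": ratio(is_kana),
        "symbol_ratio": ratio(is_symbol),
        "has_keyword": any(k in lowered for k in KEYWORDS),
    }
```

In a CRF these values would typically be discretized or used directly as feature-function inputs; how they are binned is not specified here.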
3.4 Experimental Results
For comparison, we applied the chain-model CRF
examined in (Ohta et al., 2010). The OCR system used
in this experiment produced a character sequence from
each scanned document image according to the result of
its layout analysis. We converted each character
sequence into a word sequence and applied the
chain-model CRF to it.
Table 1 shows the extraction accuracy. “Chain”
stands for the result obtained with the chain model,
whereas “2D” stands for the result of the proposed
method. As shown in the table, the two dimensional CRF
achieved higher accuracy than the chain model on all
three data sets.
The improvement was largest for the “IEICE-J” data
set (0.798 to 0.855). This is because the OCR often
analyzed the layout of “IEICE-J” pages incorrectly,
which produced incorrectly ordered sequences and
degraded the accuracy of the chain-model CRF. In
contrast, the two dimensional CRF is not affected by
the order in which the OCR outputs cells, so it
achieves higher extraction accuracy on such pages.
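The comparison can be made concrete by computing the absolute accuracy gain of the 2D model over the chain model for each data set. The sketch below only restates the values of Table 1; the names `ACCURACY` and `absolute_gains` are introduced for illustration.

```python
# Accuracy values copied from Table 1.
ACCURACY = {
    "IPSJ": {"chain": 0.938, "2d": 0.962},
    "IEICE-E": {"chain": 0.949, "2d": 0.964},
    "IEICE-J": {"chain": 0.798, "2d": 0.855},
}


def absolute_gains(accuracy):
    """Return the 2D-minus-chain accuracy difference per data set."""
    # Rounded to three decimals, the precision reported in Table 1.
    return {
        name: round(acc["2d"] - acc["chain"], 3)
        for name, acc in accuracy.items()
    }


gains = absolute_gains(ACCURACY)
largest = max(gains, key=gains.get)  # data set with the largest gain
```

The gains are 0.024 for IPSJ, 0.015 for IEICE-E, and 0.057 for IEICE-J.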
4 CONCLUSIONS
This paper examined a two dimensional CRF for
extracting bibliographic components from scanned
page images of academic papers. We showed
experimentally that the proposed method is effective,
especially for pages whose layout is analyzed
incorrectly.
Currently we use a two dimensional CRF that handles
matrices. With this model, we can assign a label to
each cell, but we need a post-processing step that
extracts logical components by merging cells. We plan
to extend the model to handle tree-structured data
such as XY-trees, which would enable us to label cells
and extract logical components simultaneously. In this
paper we manually determined the augmented labels for
merging cells into logical components. We are
interested in designing the augmented labels
systematically.
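To make the idea of the merging post-process concrete, one simple possibility is to group neighbouring cells that received the same label. The sketch below is an assumption for illustration and not the procedure used in this paper: it takes a hypothetical matrix of predicted labels, merges 4-connected cells with equal labels by breadth-first search, and does not use the augmented labels.

```python
from collections import deque


def merge_cells(labels):
    """Group 4-connected cells that share a label into components.

    labels: a list of rows, each a list of label strings.
    Returns a list of (label, sorted list of (row, col)) pairs.
    """
    rows = len(labels)
    cols = len(labels[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            label = labels[r][c]
            seen[r][c] = True
            queue = deque([(r, c)])
            cells = []
            # Breadth-first search over neighbours with the same label.
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and labels[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((label, sorted(cells)))
    return components
```

A plain same-label merge of this kind cannot separate two adjacent components of the same type, so it should be read only as a baseline for the post-processing step.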
REFERENCES
Councill, I. G., Giles, C. L., and Kan, M.-Y. (2008). ParsCit:
An open-source CRF reference string parsing package.
In Intl. Conf. on Language Resources and Evaluation
(LREC 2008), pages 661–667.
Lafferty, J., McCallum, A., and Pereira, F. (2001).
Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In International
Conference on Machine Learning (ICML 2001), pages
282–289.
Montreuil, F., Grosicki, E., Heutte, L., and Nicolas, S.
(2007). Unconstrained handwritten document layout
extraction using 2D conditional random fields. In
International Conference on Document Analysis and
Recognition (ICDAR 2009), pages 407–411.
Nicolas, S., Dardenne, J., Paquet, T., and Heutte, L. (2007).
Document image segmentation using a 2D conditional
random field model. In International Conference on
Document Analysis and Recognition (ICDAR 2007),
pages 407–411.
Ohta, M., Inoue, R., and Takasu, A. (2010). Empirical
evaluation of active sampling for CRF-based analysis
of pages. In International Conference on Information
Reuse and Integration (IEEE IRI2010), pages
13–18.
Takasu, A. (2008). Information extraction by two
dimensional parser. In Proc. IEEE Intl. Conf. on Tools with
Artificial Intelligence, pages 333–340.
Zhu, J., Nie, Z., Wen, J.-R., Zhang, B., and Ma, W.-Y.
(2005). 2D conditional random fields for web information
extraction. In International Conference on Machine
Learning (ICML 2005).