rithms CutEveryWhere and SimpleDocumentTypeClassification.
The forward versions of these merging algorithms
performed better than the backward variations. The
best-performing setup achieved a cut F-score of 86%
and a V-measure of 0.91. These results satisfy the
business needs of the banking sector, and PaperClip
is already being used in practice.
ICPRAM 2017 - 6th International Conference on Pattern Recognition Applications and Methods