Indexation of Document Images Using Frequent Items

Eugen Barbu; Pierre Heroux; Sebastien Adam; Eric Trupin

doi:10.5220/0002576001640174

2005

Abstract

Documents exist in different formats. To access some, and preferably all, of the information contained in document images, we have to deploy a document image analysis application. Document images can be mostly textual or mostly graphical. If a user's task is to retrieve, from a set, the document images relevant to a query, we must use indexing techniques. The documents and the query are translated into a common representation. Using a dissimilarity measure between the query and the document representations, together with a method to speed up the search, we may find documents that are relevant to the query from the user's point of view. The semantic gap between a document representation and the user's implicit representation can lead to unsatisfactory results. If we want to access objects in document images that are relevant to the document semantics, we must enter a document understanding cycle. Document image understanding is usually performed by domain-dependent systems that are not applicable in the general case (both textual and graphical document classes). In this paper we present a method to describe and then index document images using frequently occurring items. The intuition is that frequent items represent symbols in a certain domain, so that this document description can be related to the domain knowledge in an unsupervised manner. The novelty of our method consists in using graph summaries as a description for document images: each document image is described by a bag (multiset) of graphs. From the document images we extract a graph-based representation. On these graphs we apply graph mining techniques in order to find frequent and maximal subgraphs. For each document image we construct a bag of all frequent subgraphs found in its graph-based representation. This bag of “symbols” constitutes the description of the document.
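The bag-of-symbols description and the query step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that subgraph mining has already been run and that every subgraph occurrence has been reduced to a hypothetical string identifier (such as "door"); the support threshold, the sample data, and the choice of a generalized Jaccard dissimilarity over multisets are likewise assumptions, since the abstract does not name a specific measure.

```python
from collections import Counter

def frequent_patterns(docs, min_support):
    """Return pattern ids that occur in at least min_support documents."""
    support = Counter()
    for occurrences in docs.values():
        support.update(set(occurrences))  # count each document once per pattern
    return {p for p, n in support.items() if n >= min_support}

def bag_of_symbols(occurrences, frequent):
    """Describe one document as a multiset of its frequent patterns."""
    return Counter(p for p in occurrences if p in frequent)

def dissimilarity(a, b):
    """1 - generalized Jaccard similarity between two multisets (assumed measure)."""
    union = sum((a | b).values())  # element-wise max of counts
    if union == 0:
        return 0.0
    inter = sum((a & b).values())  # element-wise min of counts
    return 1.0 - inter / union

def rank(query_bag, bags):
    """Order document names from most to least similar to the query."""
    return sorted(bags, key=lambda name: dissimilarity(query_bag, bags[name]))

# Hypothetical pattern occurrences per document (stand-ins for mined subgraphs).
docs = {
    "plan_a": ["door", "door", "window", "stair"],
    "plan_b": ["door", "window", "window", "logo"],
    "letter": ["logo", "stamp"],
}

frequent = frequent_patterns(docs, min_support=2)
bags = {name: bag_of_symbols(occ, frequent) for name, occ in docs.items()}
query = Counter({"door": 2, "window": 1})
print(rank(query, bags))
```

In this toy setting, patterns seen in only one document ("stair", "stamp") are dropped from the description, and documents are ranked by how much their bags overlap with the query bag.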

Paper Citation

in Harvard Style

Barbu E., Heroux P., Adam S. and Trupin E. (2005). Indexation of Document Images Using Frequent Items. In Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005) ISBN 972-8865-28-7, pages 164-174. DOI: 10.5220/0002576001640174

in Bibtex Style

@conference{pris05,
author={Eugen Barbu and Pierre Heroux and Sebastien Adam and Eric Trupin},
title={Indexation of Document Images Using Frequent Items},
booktitle={Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)},
year={2005},
pages={164-174},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002576001640174},
isbn={972-8865-28-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)
TI - Indexation of Document Images Using Frequent Items
SN - 972-8865-28-7
AU - Barbu E.
AU - Heroux P.
AU - Adam S.
AU - Trupin E.
PY - 2005
SP - 164
EP - 174
DO - 10.5220/0002576001640174