AColDSS: Robust Unsupervised Automatic Color
Segmentation System for Noisy Heterogeneous Document
Images
Louisa Kessi¹,², Frank Lebourgeois¹,² and Christophe Garcia¹,²
¹ Université de Lyon, CNRS, France
² INSA-Lyon, LIRIS, UMR5205, F-69621, France
{louisa.kessi, franck.lebourgeois, christophe.garcia}@liris.cnrs.fr
Abstract. We present the first fully automatic color analysis system suited for
noisy heterogeneous documents. We developed a robust color segmentation
system adapted to business documents and old handwritten documents with
significant color complexity and dithered backgrounds. It is the first fully
data-driven, pixel-based approach that needs no a priori information, training
or manual assistance. The system performs several operations: it automatically
segments color images, separates text from noise and graphics, and provides
information about the color of the text. The contribution of our work
is four-fold. Firstly, it does not require any connected component analysis and
simplifies the extraction of the layout and the recognition step undertaken by
the OCR. Secondly, it uses color morphology to segment both text and inverted
text simultaneously, using conditional color dilation and erosion, even in
cases where the two overlap. Thirdly, our system efficiently removes noise
and speckles from dithered backgrounds and automatically suppresses graphical
elements using geodesic measurements. Fourthly, we develop a method that
splits overlapping characters and separates characters from graphics if they
have different colors. The proposed Automatic Color Document Processing System
correctly segments 99% of the documents and has the potential to be adapted
to different document images. The system outperforms the classical approach
based on binarization of the grayscale image.
1 Introduction
Color document processing is an active research area with significant applications. In
recent years, there has been an increasing need for systems that can automatically
convert pre-printed color documents into digital format. Most of the time, the
color image is first converted into a grayscale image. However, performance decreases
when the segmentation of this grayscale image fails. Nowadays, digitization systems
must cope with dithered documents, complex color backgrounds and linear color
variations (so that it is not known whether the text is darker or lighter than the
background), highlighted regions, corrective red overwriting on black text, and
non-uniform color overlaps between text and graphics. Indeed, some dithered documents
may not lead to a correct automatic analysis. Smoothing usually reduces dithering
significantly but can also seriously damage the text. Color information is therefore
significant, and a color-based segmentation could improve the process.
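To make the loss of information under grayscale conversion concrete, the following minimal sketch shows two clearly distinct colors that map to the same gray level. It is an illustration only, not part of the proposed system: the Rec. 601 luma weights are one common conversion assumed here, and the two color values and the helper names `luma` and `color_distance` are hypothetical.

```python
def luma(rgb):
    """Rec. 601 luma of an 8-bit (R, G, B) triple, rounded to an integer gray level.

    Assumption: the paper does not specify which grayscale conversion is meant;
    Rec. 601 weights are used here only as a common example.
    """
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def color_distance(a, b):
    """Euclidean distance between two colors in RGB space."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical example colors: dark red text on a dark green background.
red_text = (200, 0, 0)
green_background = (0, 102, 0)

# Both colors fall on the same gray level, so a grayscale threshold cannot
# separate them, although they are far apart in RGB space.
same_gray = luma(red_text) == luma(green_background)
rgb_gap = color_distance(red_text, green_background)
print(same_gray, round(rgb_gap, 1))
```

In such a case, any binarization of the grayscale image merges text and background, whereas a segmentation operating on the color values can still distinguish them.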