and Belaid, 2016). The descriptor extraction process is
tricky because it relies on a set of regular expressions
and rules; in addition, prior knowledge of the processed
documents is required. The second family, based on
deep learning, tries to learn the descriptors directly.
CNNs (Convolutional Neural Networks) have been used
to advantage in this case because of their efficiency
in learning visual features from images. For textual
descriptors, the literature offers many techniques,
such as BoW (Bag of Words), Word2Vec, Doc2Vec
and word embeddings (Wiki, 2017). These models have
proved effective for sentiment analysis and text classification.
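As a quick illustration of how such textual descriptors can be obtained, here is a minimal sketch using gensim's Word2Vec, where a page vector is formed by averaging its word vectors; the toy corpus, hyper-parameters and averaging strategy are assumptions for illustration only, not choices from the cited works.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus: each page is a list of OCRed tokens (hypothetical data).
pages = [["invoice", "total", "amount", "due"],
         ["page", "two", "of", "the", "invoice"]]

model = Word2Vec(sentences=pages, vector_size=50, window=3, min_count=1)

def page_vector(tokens, model):
    # Average the word vectors of a page (one common pooling choice).
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

descriptor = page_vector(pages[0], model)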
In all our previous work, we classified page pairs
into continuities and breaks, and we continue to do
so here because this formulation outperforms page-level
classification. However, we reinforce the technique
in several ways: first, by representing the content
properly with language models such as Doc2Vec,
Word2Vec and word embeddings; then, by using deep
learning models such as GRUs reinforced by an attention
mechanism.
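To make this concrete, below is a minimal Keras sketch of such a page-pair classifier combining a GRU encoder with a simple attention mechanism. The layer sizes, input shapes and fusion by concatenation are illustrative assumptions, not the exact architecture used in this paper.

from tensorflow.keras import layers, Model

SEQ_LEN, EMB_DIM = 100, 128  # hypothetical page length and embedding size

def attentive_gru_encoder(inp):
    # Per-token hidden states from a GRU.
    h = layers.GRU(64, return_sequences=True)(inp)
    # Score each time step, normalize over time, pool a weighted page vector.
    scores = layers.Dense(1, activation="tanh")(h)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Dot(axes=1)([weights, h])  # weighted sum over time
    return layers.Flatten()(context)

page1 = layers.Input(shape=(SEQ_LEN, EMB_DIM))
page2 = layers.Input(shape=(SEQ_LEN, EMB_DIM))
pair = layers.Concatenate()([attentive_gru_encoder(page1),
                             attentive_gru_encoder(page2)])
out = layers.Dense(1, activation="sigmoid")(pair)  # continuity vs. rupture
model = Model([page1, page2], out)
model.compile(optimizer="adam", loss="binary_crossentropy")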
The rest of the paper is organized as follows. Section
2 describes the main techniques reported in the
literature. The proposed approach and its different
models are described in section 3. Section 4 gives a
description of the datasets used, while section 5
summarizes the experimental protocol and the results
obtained. We conclude and give future perspectives in
section 6.
2 RELATED WORKS
The first works reported in the literature consider
the flow as a sequence of pages, and documents are
found by sub-sequence analysis. Probabilistic models
are used to model and recognize these sub-sequences.
In our research team, Meilender (Meilender and
Belaid, 2009) used a method similar to the Variable
Horizon Models (VHM), or multi-grams, used in speech
recognition. It consists of maximizing the probability
of the flow using the Markov models of all the
constituent elements (pages). Since computing the
probability over all pages is NP-complete, windows
were used to reduce the number of observations. In
Schmidtler and Amtrup (Schmidtler and Amtrup, 2017),
single pages are characterized by bags of words.
According to the authors, the discriminating features
are located in the first and last pages of a document,
so they model document types with three symbols:
start, middle and end. Multi-class SVMs are used and
their scores are mapped into probabilities. The most
probable sequence of documents is then extracted with
an algorithm similar to beam search (Furcy and Koenig,
2005); a simplified sketch of this idea is given after
this paragraph. Gordo et al. (Gordo et al., 2013)
combine the multiple pages of a document into a single
feature vector representing the document as a whole.
The most plausible segmentation of a page flow into a
sequence of multi-page documents is then obtained by
optimizing a statistical model of the probability that
each segmented multi-page document belongs to a
particular class.
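The following Python sketch illustrates the start/middle/end labeling idea: an SVM with Platt-scaled probabilities scores each page, and a small beam search keeps the most probable label sequences. The feature pipeline and scoring are simplified assumptions, not the authors' exact system.

import numpy as np
from sklearn.svm import SVC

LABELS = ["start", "middle", "end"]
clf = SVC(probability=True)  # Platt scaling maps SVM scores to probabilities
# clf.fit(X_train, y_train)  # assumed trained on labeled page features

def beam_search(page_probs, beam_width=3):
    # page_probs: array of shape (n_pages, 3) with P(label | page).
    beams = [([], 0.0)]  # (label sequence, log-probability)
    for probs in page_probs:
        candidates = [(seq + [lab], score + np.log(probs[i]))
                      for seq, score in beams
                      for i, lab in enumerate(LABELS)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # best label sequence found

# labels = beam_search(clf.predict_proba(X_stream))  # hypothetical stream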
The second wave of work focuses on page pairs
and tries to determine whether they represent document
boundaries. In (Daher and Belaid, 2014), a feature
extraction process is used to construct a page-pair
descriptor that summarizes the relation between the
two pages in terms of rupture and continuity. The
system classifies this descriptor as rupture or
continuity using a Support Vector Machine (SVM) or a
Multi-Layer Perceptron (MLP). Continuing (Daher and
Belaid, 2014), (Karpinski and Belaid, 2016) used a
rule-based system to detect ruptures and continuities
in a hierarchy of documents, from records (single
pages) to technical documents, fundamental documents
and cases (sets of documents belonging to the same
person). At each level, descriptors are first
extracted and then compared between pairs of pages
or documents. These descriptors can be section
numbers, page numbers, dates, or salutation and
closing formulas. The technique in (Hamdi et al.,
2017) and (Hamdi et al., 2018) uses a Doc2Vec model
to perform the segmentation task. First, Doc2Vec is
trained to learn representations of document pages.
While sweeping through the stream, the system computes
the cosine distance between each page pair and
compares it with a fixed threshold to decide whether
the pair represents a rupture or a continuity.
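The following sketch reproduces the gist of this Doc2Vec-plus-threshold approach with gensim; the toy stream, hyper-parameters and threshold value are illustrative assumptions (in practice the threshold would be tuned on validation data).

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

# Toy stream: each page is a list of tokens (hypothetical data).
stream = [["first", "page", "tokens"], ["second", "page", "tokens"]]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(stream)]
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)

THRESHOLD = 0.5  # hypothetical value

for i in range(len(stream) - 1):
    v1 = model.infer_vector(stream[i])
    v2 = model.infer_vector(stream[i + 1])
    dist = cosine(v1, v2)  # cosine distance = 1 - cosine similarity
    label = "rupture" if dist > THRESHOLD else "continuity"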
More recently, the state of the art includes deep
learning techniques with convolutional neural models
for document classification, such as (Gallo et al.,
2016; Harley et al., 2015; Noce et al., 2016;
Wiedemann and Heyer, 2017). While the first two use
only textual information, the last two use both
textual and visual information. In (Wiedemann and
Heyer, 2017), VGG16 is employed to learn visual
features of the documents, while a CNN over text data
(Kim, 2014) handles the textual content; both results
are then combined to decide the segmentation type.
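A minimal Keras sketch of this kind of two-branch fusion is given below, with VGG16 as the visual encoder and a small 1D text CNN in the spirit of (Kim, 2014); input sizes, layer widths and the fusion head are illustrative assumptions, not the authors' exact configuration.

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

# Visual branch: pretrained VGG16 as a feature extractor.
vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")
img_in = layers.Input(shape=(224, 224, 3))
visual = vgg(img_in)

# Textual branch: a 1D CNN over word embeddings.
txt_in = layers.Input(shape=(200, 128))  # hypothetical tokens x embedding dim
conv = layers.Conv1D(64, kernel_size=3, activation="relu")(txt_in)
textual = layers.GlobalMaxPooling1D()(conv)

# Late fusion: concatenate both representations and decide the boundary type.
merged = layers.Concatenate()([visual, textual])
out = layers.Dense(1, activation="sigmoid")(merged)
model = Model([img_in, txt_in], out)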
These last methods naturally lead one to reflect on
how documents should be represented to feed such
models. Word embedding (the mapping of words into
numerical vector spaces) has proved to be an
extremely important method enabling various machine