vowel and other vowels are denoted by diacritics
(Ding, et al., 2018). In Javanese, diacritics can be
written above, below, or to the left or right of the
main characters. Hence, the segmentation challenges
lie not only in the overlapping character strokes and
diacritics but also in identifying which diacritic
belongs to which character. In our preliminary study,
we experimented with segmentation by Projection
Profile (PP). However, Projection Profile on its own
gave unsatisfactory results.
Therefore, we propose to refine it by
detecting outliers of pixel density. The hybrid of pixel
outlier detection and PP increased the
line segmentation accuracy by 32.84%.
2 RELATED WORKS
Various character segmentation methods have
been proposed for OCR and for text recognition
in images (Asif, et al., 2017; Karthikeyan, et al.,
2013). Among them, Projection Profile methods and
Connected Component Analysis (CCA) are
frequently applied owing to their low
computational cost and simplicity. Their application
differs slightly in technical detail
depending on the text medium, for example texts written on
palm leaves (Kesiman, et al., 2016),
books/manuscripts (Mei, et al., 2013), or car license
plates (Karthikeyan, et al., 2013), and on the
characteristics of the characters themselves. Apart
from PP and CCA, which base their
segmentation on pixel intensity, some studies
use character features as the basis of
segmentation (Budhi & Adipranata, 2015;
Lacerda & Mello, 2013), or a hybrid of feature-based
and Projection Profile segmentation (Lue, et al., 2010).
A more intricate approach is recognition-based
segmentation, which uses prior knowledge to screen
all possible segmentation schemes (Mei, et al., 2013;
Inkeaw, et al., 2018; Dhali, et al., 2017).
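As an illustration of the Connected Component Analysis mentioned above, the following is a minimal sketch, not the implementation of any cited work. It assumes a binary image stored as a list of rows with foreground pixels equal to 1 and uses 4-connectivity; the function name `label_components` and the sample image are hypothetical.

```python
from collections import deque


def label_components(binary):
    """Label 4-connected foreground regions (pixels equal to 1).

    Returns a label map of the same shape (0 = background) and the
    number of components found. Assumes a non-empty rectangular image.
    """
    height, width = len(binary), len(binary[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or labels[r][c] != 0:
                continue
            # Unlabelled foreground pixel: start a new component.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical sample: an L-shaped blob and a separate vertical bar.
sample = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
labels, count = label_components(sample)
```

Each labelled region could then be treated as a character or diacritic candidate, although touching strokes would end up in a single component.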
Owing to their writing systems, segmenting Asian
scripts presents its own challenges and therefore
needs additional techniques. In many cases of
segmenting Devanagari characters, the text is
decomposed into lines, words and
characters (Srivastav & Sahu, 2016; Mehul, et al.,
2014) because of the shirorekha, a
straight line connecting the characters in a word. In
segmenting Chinese characters, Mei et al.
implemented three stages of segmentation (Mei, et al.,
2013). In the first stage, vertical white space is used
as a delimiter between characters. Then, two or more
segmented characters are merged if they are
identified as having a connection region, which is
recognized by means of a vertical projection. The last
stage is fine-grained segmentation, in which
under-segmented cases are split based on defined
rules (Mei, et al., 2013).
To segment Javanese characters, Widiarti et al.
applied the projection profile for line segmentation
and a Moving Average Algorithm to refine the vertical
projection (Widiarti, et al., 2014). Meanwhile, Budhi
and Adipranata (Budhi & Adipranata, 2015) made
use of attributes of the character image and
skeletonizing to segment Javanese characters.
Working with Balinese characters written on palm
leaves, Kesiman et al. divided their segmentation system
into four subtasks. The first subtask brushes the character
area of gray-level images with minimum filtering to
overcome the problem of semi-space areas (Kesiman,
et al., 2016). The second subtask applies the
average block projection profile, and the third
selects the candidate areas for the segmentation path.
The fourth subtask constructs a
non-linear segmentation path by implementing a
multistage graph search algorithm (Kesiman, et al.,
2016).
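The idea of refining a projection with a moving average, mentioned above, can be sketched as follows. This is a generic centred moving average and not necessarily the exact procedure of the cited work; the function name `moving_average`, the window size, and the sample profile are assumptions made for illustration.

```python
def moving_average(profile, window=3):
    """Smooth a projection profile with a centred moving average.

    The window is truncated at both ends so that the output has the
    same length as the input. `window` is assumed to be odd.
    """
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        lo = max(0, i - half)
        hi = min(len(profile), i + half + 1)
        smoothed.append(sum(profile[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical vertical projection with spurious zero-valued gaps.
profile = [0, 6, 0, 6, 0, 0, 9]
smoothed = moving_average(profile, window=3)
```

Smoothing of this kind fills isolated gaps in the profile, so that a single thin break inside a character is less likely to be taken as a cut position.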
3 METHODS
The proposed algorithm consists of two main
steps, i.e. background removal and text
segmentation. Background removal was
performed through image binarisation. We simply
applied a binarisation function available in Python
with threshold t=160 for the current data samples.
This process separates the pixels
forming the foreground (the text, which is
commonly marked as 1) from the background
(marked as 0). It is followed by the
segmentation process, which was applied to the
foreground pixels only.
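A minimal sketch of such a thresholding step is given below. The text does not name the library function that was used, so this plain-Python version is only an assumption; it further assumes an 8-bit grayscale image stored as a list of rows, dark ink on a lighter background, and that pixels strictly below the threshold count as foreground. The name `binarise` and the sample values are hypothetical.

```python
THRESHOLD = 160  # threshold t reported in the text


def binarise(gray, threshold=THRESHOLD):
    """Return a binary image: 1 for foreground (ink), 0 for background.

    Assumption: ink is darker than the background, so intensities
    strictly below the threshold are treated as foreground.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# Hypothetical 2x3 grayscale patch (0 = black, 255 = white).
gray = [
    [250, 40, 200],
    [159, 160, 30],
]
binary = binarise(gray)
```

For a manuscript with light ink on a dark background, the comparison would have to be inverted.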
The segmentation process was designed to
comprise three stages of line segmentation and one stage
of character segmentation. This segmentation model
aims to adapt Projection-Profile
Cutting (PPC) to the nature of the manuscript image
and the Javanese script. From this point forward, we
use the term ‘pass’ to refer to a ‘stage’ in our
line segmentation.
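In its generic form, Projection-Profile Cutting can be sketched as below. This is a plain illustration of the basic technique and does not include the refinements of the proposed passes; the function names `row_profile` and `cut_runs`, the `min_count` parameter, and the sample page are assumptions.

```python
def row_profile(binary):
    """Count the foreground pixels (value 1) in each image row."""
    return [sum(row) for row in binary]


def cut_runs(profile, min_count=1):
    """Return (start, end) pairs, end exclusive, for each run of
    consecutive profile entries that are at least `min_count`."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_count and start is None:
            start = i  # a run of text rows begins
        elif value < min_count and start is not None:
            runs.append((start, i))  # a blank row ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


# Hypothetical binarised page: two text bands separated by a blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
profile = row_profile(page)
lines = cut_runs(profile)
```

Applying the same cut to column sums inside each band would give character candidates, which is where diacritics placed above or below a main character make a plain cut insufficient.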
3.1 Vertical Pass 1: Finding the Line
Candidates
The first pass is to find the possible text line
candidates from the binarised image. PPC is used here
by calculating the binary vertical histogram of
horizontal foreground pixel projection along the