periods of years: the middle of the Meiji era (1883-1897),
the late Meiji era (1898-1912), and the Taisho era
(1912-1925). In total, we have nine classes, such as
"Shunyodo in the middle of Meiji". Training sets are
prepared for each class with total sizes of 10, 50, 100,
200, 300, 400, and 500 rows. When the training set size
exceeds 100 rows, no significant difference in fitness
values is observed. Therefore, we use 100 rows (10 rows
from each of 10 books) for each learning phase.
In this method, the parameters for genetic programming
include the number of individuals, the maximum number of
generations, the crossover rate, and the mutation rate.
When the number of individuals is varied from 1,000 to
5,000 in steps of 1,000, the fitness values almost
converge once 3,000 or more individuals are used.
Therefore, we use 3,000 individuals. As for the other
parameters, the upper limit on the number of generations
is 200, the crossover rate is 0.8, and the mutation rate
is 0.2; these values were chosen empirically.
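The settings above can be collected in one place, as in the following minimal sketch. The class and field names (`GPSettings`, `population_size`, and so on) are illustrative assumptions and are not taken from the original implementation; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPSettings:
    """Genetic programming settings reported in the text (names are hypothetical)."""
    population_size: int = 3000   # number of individuals
    max_generations: int = 200    # upper limit on the number of generations
    crossover_rate: float = 0.8
    mutation_rate: float = 0.2

    def __post_init__(self):
        # Basic sanity checks on the values.
        if self.population_size <= 0 or self.max_generations <= 0:
            raise ValueError("population size and generation limit must be positive")
        for rate in (self.crossover_rate, self.mutation_rate):
            if not 0.0 <= rate <= 1.0:
                raise ValueError("rates must lie in [0, 1]")


# Population sizes compared in the text: 1,000 to 5,000 in steps of 1,000.
CANDIDATE_POPULATION_SIZES = list(range(1000, 5001, 1000))

settings = GPSettings()
```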
We perform the experiment 10 times for each class.
Table 1 shows, for each class, the best agreement rate
of pixel values between the training image and the
image from which ruby characters have been removed.
Table 1: The best agreement rate for each class (%).

Publisher    Middle Meiji Era   Late Meiji Era   Taisho
Shunyodo     98.8               98.9             98.8
Hiyosido     98.0               98.5             97.5
Shinshindo   98.5               98.6             98.5
In all classes, the agreement rates are around 98%.
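As an illustration of how such a rate could be computed, the sketch below assumes that the agreement rate is the percentage of pixel positions at which two equally sized binary images hold the same value. That definition and the function name `agreement_rate` are assumptions made here for illustration, not details confirmed by the text.

```python
def agreement_rate(target, result):
    """Percentage of pixels with equal values in two same-sized images.

    Both images are given as lists of rows, each row a list of pixel values
    (for example 0 for white and 1 for black).
    """
    if len(target) != len(result) or any(
        len(a) != len(b) for a, b in zip(target, result)
    ):
        raise ValueError("images must have the same shape")
    total = sum(len(row) for row in target)
    if total == 0:
        raise ValueError("images must not be empty")
    matches = sum(
        p == q
        for row_a, row_b in zip(target, result)
        for p, q in zip(row_a, row_b)
    )
    return 100.0 * matches / total


# Toy example: the two images differ in one of eight pixels.
target_image = [[0, 1, 1, 0],
                [0, 1, 0, 0]]
removed_image = [[0, 1, 1, 0],
                 [0, 1, 1, 0]]
rate = agreement_rate(target_image, removed_image)
```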
Furthermore, we compare the ruby character removal
ratio of the proposed method with that of the linearly
separating method using black pixel projection
histograms. The cutting positions on the black pixel
projection histogram are determined by discriminant
analysis. The ruby removal ratio averaged over the 9
classes is about 98.5% for the proposed method and
about 82.9% for the linearly separating method. This
result indicates that the proposed method is superior
to the linearly separating method using black pixel
projection histograms. Formula (2) is an example
generated for "Shunyodo in the middle of the Meiji era".
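\[
\begin{aligned}
y = \frac{8}{3} &+ \mathit{width\ average}
 - \frac{\cos\left(\dfrac{2\pi x}{\left(4 - \cos\left(\dfrac{2\pi x}{\sin\left(\frac{2\pi x}{(5+3)/2} - \pi\right)/2} - \dfrac{\pi}{2}\right)/1\right)/2} - \dfrac{\pi}{2}\right)}{8/3} \\
 &- \frac{\cos\left(\dfrac{2\pi x}{(\mathit{width\ average} + 4)/2} - \dfrac{\pi}{2}\right)}{7/5}
\end{aligned}
\tag{2}
\]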
Figure 12 shows the curve represented by formula (2)
together with the image from which ruby characters have
been removed.
Figure 12: The curve by formula (2) and the ruby character
removal result.
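To make the shape of the generated curve easier to inspect, formula (2) can be evaluated numerically as in the sketch below. The function names `formula_2` and `pdiv` are hypothetical. The text does not say how division by zero is handled, so the sketch assumes the common genetic programming convention of protected division (returning 1 when the divisor is zero); this matters here because the inner sine term is exactly zero for some x, such as x = 2. Here x is presumably the position along the row and y the value of the curve plotted in Figure 12.

```python
import math


def pdiv(a, b):
    """Protected division (an assumption): return 1.0 when the divisor is zero."""
    return a / b if b != 0 else 1.0


def formula_2(x, width_average):
    """Evaluate formula (2) term by term, following its nesting."""
    half_pi = math.pi / 2
    two_pi_x = 2 * math.pi * x

    # Innermost term: sin(2*pi*x / ((5 + 3) / 2) - pi) / 2
    inner_sin = math.sin(two_pi_x / ((5 + 3) / 2) - math.pi) / 2
    # cos(2*pi*x / inner_sin - pi/2) / 1
    inner_cos = math.cos(pdiv(two_pi_x, inner_sin) - half_pi) / 1
    # (4 - inner_cos) / 2, used as the period of the first cosine term
    period = (4 - inner_cos) / 2

    term1 = math.cos(pdiv(two_pi_x, period) - half_pi) / (8 / 3)
    term2 = math.cos(
        pdiv(two_pi_x, (width_average + 4) / 2) - half_pi
    ) / (7 / 5)
    return (8 / 3) + ((width_average - term1) - term2)


# Sample the curve for an illustrative average width of 20 pixels.
curve = [formula_2(x, 20.0) for x in range(0, 100)]
```

Because each cosine lies in [-1, 1], the curve stays within width average + 8/3 ± (3/8 + 5/7), that is, it oscillates in a narrow band slightly above the average width.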
We have investigated the publishers of early-modern
Japanese printed books and found that their number
exceeds ten thousand. Most of them are so small that
printing offices appear to have been shared among many
small publishers. This means that publisher-based
classification is not efficient when applied to all
early-modern Japanese printed books, since the huge
number of publishers would make the classification
extremely difficult.
4.3.2 Classification by Row Characteristic
Instead of using the publisher and year information
attached to the books, we make use of row
characteristics computed from the books. We focus on
the ratio between the width and height of Kanji
characters, calculated from the average width and
average height of the Kanji characters in each row. Let
f denote width/height; we classify the books using f.
The training sets comprise 900 rows in total, the same
as in 4.2.1. Values of f lie approximately between 1.4
and 1.8, with most values between 1.5 and 1.7. We divide
the range of f into eleven overlapping intervals:
[-:1.4], [1.35:1.45], [1.4:1.5], [1.45:1.55],
[1.5:1.6], [1.55:1.65], [1.6:1.7], [1.65:1.75], [1.7:1.8],
[1.75:1.85], and [1.8:-]. Although the classes
[-:1.4], [1.35:1.45], [1.7:1.8], [1.75:1.85], and
[1.8:-] do not have 100 rows of training data, each has
more than 50 rows, and we judge that this does not
significantly affect the experiment. Table 2 shows the
best agreement rate between the target images and the
images after ruby character removal for each class.
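The interval assignment described above can be sketched as follows. The names `INTERVALS`, `row_ratio`, and `matching_classes` are illustrative. Treating the interval endpoints as inclusive, and returning every interval that contains f when intervals overlap, are assumptions: the text does not state how rows falling in an overlap are handled.

```python
# The eleven overlapping intervals of f; None marks an open end.
INTERVALS = [
    (None, 1.40), (1.35, 1.45), (1.40, 1.50), (1.45, 1.55),
    (1.50, 1.60), (1.55, 1.65), (1.60, 1.70), (1.65, 1.75),
    (1.70, 1.80), (1.75, 1.85), (1.80, None),
]


def row_ratio(widths, heights):
    """f = average width / average height of the Kanji characters in one row."""
    if not widths or not heights:
        raise ValueError("a row needs at least one character measurement")
    return (sum(widths) / len(widths)) / (sum(heights) / len(heights))


def label(interval):
    """Format an interval in the notation of the text, e.g. '[1.35:1.45]'."""
    low, high = interval
    return "[{}:{}]".format(
        "-" if low is None else low, "-" if high is None else high
    )


def matching_classes(f):
    """Return the labels of all intervals that contain f (endpoints inclusive)."""
    return [
        label((low, high))
        for low, high in INTERVALS
        if (low is None or f >= low) and (high is None or f <= high)
    ]


# A row with f = 1.52 falls into two neighbouring overlapping classes.
classes_for_row = matching_classes(1.52)
```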
The agreement rates are at least 99% for all classes.
Close inspection of all the generated formulas shows
that the formulas generated for the classes [-:1.4],
[1.35:1.45], [1.4:1.5], [1.45:1.55], [1.5:1.6],
[1.55:1.65], and [1.6:1.7] are the same. Formula (3) is
the mathematical expression obtained from this
inspection.
\[
y = \mathit{width\ average} + 6 \tag{3}
\]
By contrast, the formulas generated for classes
of [1.65:1.75], [1.7:1.8], [1.75:1.85], and [1.8:-] are
ICPRAM 2014 - International Conference on Pattern Recognition Applications and Methods