Japanese Scene Character Recognition using Random Image Feature and Ensemble Scheme

Fuma Horie and Hideaki Goto

Graduate School of Information Sciences, Tohoku University, Sendai, Japan

Cyberscience Center, Tohoku University, Sendai, Japan

Keywords: Random Image Feature, Japanese Scene Character Recognition, Synthetic Scene Character Data, Ensemble Voting Classifier, Multi-Layer Perceptron.

Abstract: Scene character recognition is challenging owing to various environmental factors at image capture and the complex design of characters. Japanese character recognition requires a large number of scene character images for training, since the language has thousands of character classes. To enhance Japanese scene character recognition, we utilized a data augmentation method and an ensemble scheme in our previous work. In this paper, the Random Image Feature (RI-Feature) method is newly proposed to improve the ensemble learning. Experimental results show that adding the RI-Feature method to the ensemble learning improves the accuracy from 65.57% to 78.50%. It is also shown that the HOG feature outperforms CNN in Japanese scene character recognition.

1 INTRODUCTION

Recognition of text information in scenes, often referred to as scene character recognition, has important applications such as automatic driving systems and automatic translation. Scene character recognition is more difficult than printed character recognition because scene images are affected by various factors such as rotation, geometric distortion, uncontrolled lighting, blur, noise, and complex character designs. Japanese scene character recognition requires a large amount of training data, since the language has thousands of character classes. However, collecting a large number of character image samples from real scenes is a hard task.

Some previous studies introduced a data augmentation method using Synthetic Scene character Data (SSD), which is randomly generated from the font sets of printed characters by algorithms such as filter processing, morphology operations, color change, and geometric distortion (Jaderberg et al., 2014)(Ren et al., 2016)(Jiang and Goto, 2017)(Horie and Goto, 2018). Jaderberg et al. and Ren et al. have shown that the accuracy of a deep neural network model can be improved by adding SSD to the training data. These results indicate that augmentation methods are effective for improving the accuracy of scene character recognition. Figure 1 shows some examples of Japanese characters in natural scenes. In our previous work (Jiang and Goto, 2017)(Horie and Goto, 2018), we developed a training dataset consisting of both Real Scene character Data (RSD) and SSD. An ensemble scheme was used to improve the generalization ability of the classifier.

To further improve the generalization ability, the Random Image Feature (RI-Feature) method is newly proposed in this paper. The RI-Feature method randomly processes an image before character features are extracted, and it is applied to each classifier with different parameters. The RI-Feature method is expected to raise the generalization ability. Moreover, we propose a new ensemble scheme using Multi-Layer Perceptron (MLP). Experimental results show the effectiveness of the RI-Feature method and MLP.

Convolutional Neural Network (CNN) has achieved remarkable performance in various image recognition tasks, including scene character recognition. However, a CNN needs a large amount of high-quality training data, and it is thought that a CNN cannot achieve high accuracy in scene character recognition when training data are scarce. In particular, the Japanese scene character datasets currently available are far from sufficient to train a CNN. Some previous scene character recognition systems use the HOG feature, since it has been found to outperform other features (SIFT, DAISY, SURF, etc.) (Tian et al., 2016). In this paper, we compare the performance of CNN and HOG through experiments using SSD.

Horie, F. and Goto, H.

Japanese Scene Character Recognition using Random Image Feature and Ensemble Scheme.

DOI: 10.5220/0007341904140420

In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 414-420

ISBN: 978-989-758-351-3

This paper is organized as follows. Section 2 describes the ensemble scheme and the RI-Feature method. Section 3 presents the experiments and their results. Conclusions and future work are given in Section 4.

2 JAPANESE SCENE CHARACTER RECOGNITION USING SSD AND ENSEMBLE SCHEME

2.1 Flow of the Recognition System

SSD and an ensemble scheme were utilized in our previous work (Horie and Goto, 2018). The new system proposed in this paper extends the previous system by newly introducing the Random Image Feature (RI-Feature) method. Figure 2 shows the flow of our recognition system. Let T be the number of classifiers. T subsets are created from the original font dataset by bootstrap sampling (Breiman, 1996). Each subset consists of K images. These subsets are converted to SSD sets. RI-Feature sets are extracted from the generated SSD. Finally, T classifiers are created by learning the RI-Feature sets. At the recognition stage, T RI-Features are extracted from a query image, and the i-th RI-Feature is fed to the i-th classifier. The answers obtained from all classifiers are combined by plurality voting.

2.2 Synthetic Scene Character Generator

Synthetic Scene character Data (SSD) is used in this paper to increase the training data and to enhance the recognition system. Previous studies have shown the effectiveness of SSD for scene character recognition (Jaderberg et al., 2014)(Ren et al., 2016)(Jiang and Goto, 2017)(Horie and Goto, 2018). SSD is generated through processes such as distortion, color change, morphology operations, background blending, and various filters.

The SSD sets are created from image subsets. The $i$-th subset $S_i$ is sampled from an original image set $S$ by the bootstrap method. Namely,

\[ S_i = \{ s_{i,1}, s_{i,2}, \ldots, s_{i,K} \}, \quad s_{i,k} \in S = \{ s_1, s_2, \ldots \}, \]

where each $s_{i,k}$ is randomly sampled with replacement.
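The bootstrap step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bootstrap_subsets`, the fixed seed, and the use of integer IDs in place of font images are hypothetical and not taken from the paper.

```python
import random

def bootstrap_subsets(font_images, num_classifiers, subset_size, seed=0):
    """Draw T bootstrap subsets of K items each, sampling with replacement."""
    rng = random.Random(seed)
    return [rng.choices(font_images, k=subset_size)
            for _ in range(num_classifiers)]

# Hypothetical stand-in for the font image set: integer IDs instead of images.
font_images = list(range(100))
subsets = bootstrap_subsets(font_images, num_classifiers=3, subset_size=10)
for i, subset in enumerate(subsets, start=1):
    print(f"S_{i}: {subset}")
```

Because the sampling is done with replacement, the same font image may appear several times within one subset, and different subsets generally differ from each other.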

The SSD are generated by the following process-

ing in this paper.

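• Affine Transformation.

The process is defined by the following matrix:

\[ \begin{pmatrix} 1 & 0 & C_x \\ 0 & 1 & C_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & -C_x \\ 0 & 1 & -C_y \\ 0 & 0 & 1 \end{pmatrix}, \quad a_{11}, a_{22} \in [0.9, 1.1], \quad a_{12}, a_{21} \in [-0.1, 0.1], \]

where $(C_x, C_y)$ is the center coordinate of the image. The parameters $a_{11}$, $a_{12}$, $a_{21}$, and $a_{22}$ are drawn from uniformly distributed random numbers.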

• Gaussian Filter.

3×3 matrices are used as the kernels of the Gaussian filter, and they are defined by the following formula:

\[ K(x, y) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right), \quad \sigma \in [0, 10]. \tag{1} \]

$\sigma$ is drawn from uniformly distributed random numbers.

• Morphology Operation.

3×3 matrices are used as kernels of morphology

operation. The operation mode is selected ran-

domly from dilation, erosion, and none.

• Color Change.

We chose 20 colors that frequently appear in various scene character images. Two colors are selected randomly as the background and foreground. The channel intensities of the output image are calculated by the following formula:

\[ \begin{aligned} L'_R(i, j) &= \left\lfloor L(i, j) \times \frac{R_f - R_b}{255} + R_b + 0.5 \right\rfloor \\ L'_G(i, j) &= \left\lfloor L(i, j) \times \frac{G_f - G_b}{255} + G_b + 0.5 \right\rfloor \\ L'_B(i, j) &= \left\lfloor L(i, j) \times \frac{B_f - B_b}{255} + B_b + 0.5 \right\rfloor \end{aligned} \tag{2} \]

where $L(i, j)$ is the character image before processing, $L'_R(i, j)$, $L'_G(i, j)$, and $L'_B(i, j)$ are the matrices representing the processed images, $R_f$, $G_f$, and $B_f$ are the foreground colors, and $R_b$, $G_b$, and $B_b$ are the background colors.

• Random Filter (RF).

The random filter was proposed in our previous paper (Horie and Goto, 2018). The kernels of the random filter are defined by the following formula:

\[ K = (k_{m,n})_{1 \le m,n \le 3}, \quad k_{m,n} \in \mathbb{R}, \quad \sum_{m=1}^{3} \sum_{n=1}^{3} k_{m,n} = 1, \tag{3} \]

where $k_{m,n}$ is randomly selected.
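The two kernel definitions in the list above (the Gaussian kernel of Eq. (1) and the random kernel of Eq. (3)) can be sketched as follows. Several choices here are assumptions made only for illustration: the Gaussian is sampled at integer offsets in {-1, 0, 1} and renormalized to sum to 1, σ is drawn from [0.1, 10] to avoid division by zero at σ = 0, and the random kernel entries are drawn from a uniform distribution on [-1, 1] before rescaling. The paper does not state these details.

```python
import math
import random

def gaussian_kernel(sigma):
    """3x3 kernel sampled from Eq. (1) at offsets x, y in {-1, 0, 1}."""
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            / (2.0 * math.pi * sigma ** 2)
            for x in (-1, 0, 1)] for y in (-1, 0, 1)]
    total = sum(sum(row) for row in raw)
    # Assumption: renormalize so the weights sum to 1 (keeps brightness).
    return [[v / total for v in row] for row in raw]

def random_kernel(rng):
    """3x3 kernel of random real entries rescaled to sum to 1, as in Eq. (3)."""
    while True:
        raw = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3)]
        total = sum(sum(row) for row in raw)
        if abs(total) > 0.1:  # skip draws whose sum is too close to zero
            return [[v / total for v in row] for row in raw]

rng = random.Random(0)
sigma = rng.uniform(0.1, 10.0)  # assumed lower bound instead of 0
print(gaussian_kernel(sigma))
print(random_kernel(rng))
```

A kernel that sums to 1 leaves flat image regions unchanged, which is why both kernels are normalized in this sketch.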


Figure 1: Example of Japanese scene characters.

Figure 2: Flow of ensemble scheme (panels: training stage and recognition stage).

Each process is applied to each image using different parameters. As a result, the generated SSD images normally differ from one another.

Figure 3 shows the ﬂow of the above SSD gener-

ation. Figure 4 shows some examples of SSD gener-

ated by the above processing.
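Two of the SSD processing steps, the affine transformation and the color change of Eq. (2), can be sketched as below. This is a hedged illustration, not the authors' implementation. It assumes that the middle matrix of the affine product holds two scale-like diagonal entries in [0.9, 1.1] and two shear-like off-diagonal entries in [-0.1, 0.1], and that intensity 255 maps to the foreground color while intensity 0 maps to the background color. The function names are hypothetical.

```python
import random

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def random_affine(cx, cy, rng, diag=(0.9, 1.1), off_diag=(-0.1, 0.1)):
    """Compose translate(C) * linear * translate(-C); the centre stays fixed."""
    a11, a22 = rng.uniform(*diag), rng.uniform(*diag)
    a12, a21 = rng.uniform(*off_diag), rng.uniform(*off_diag)
    to_centre = [[1, 0, cx], [0, 1, cy], [0, 0, 1]]
    linear = [[a11, a12, 0], [a21, a22, 0], [0, 0, 1]]
    from_centre = [[1, 0, -cx], [0, 1, -cy], [0, 0, 1]]
    return matmul3(matmul3(to_centre, linear), from_centre)

def colorize(gray, fg, bg):
    """Eq. (2) for one pixel: blend linearly from background to foreground.

    Adding 0.5 before truncation rounds to the nearest integer.
    """
    return tuple(int(gray * (f - b) / 255 + b + 0.5) for f, b in zip(fg, bg))

rng = random.Random(0)
print(random_affine(32.0, 32.0, rng))
print(colorize(0, fg=(255, 0, 0), bg=(255, 255, 255)))    # background pixel
print(colorize(255, fg=(255, 0, 0), bg=(255, 255, 255)))  # foreground pixel
```

Wrapping the linear part between the two translations means the distortion is applied about the image center rather than about the top-left corner.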

2.3 Random Image Feature

Ensemble scheme has been used to improve the gen-

eralization ability of classiﬁers in some previous work

(Jiang and Goto, 2017)(Horie and Goto, 2018). We

have shown that the ensemble scheme effectively im-

proves the accuracy of scene character recognition.

For further improvement of the recognition accuracy,

we propose RI-Feature method.

Figure 3: Flow of SSD generation.

The RI-Feature method applies random image processing to the character image before the character features are extracted. The random processing is applied for each classifier using different parameters. Let $D_i$ be the $i$-th SSD set. The processing is as follows.

• Multi-Scale Resizing (MSR).

Images of $D_i$ are resized to the following size:

\[ \begin{cases} 16 & (1 \le i \le \frac{T}{3}) \\ 32 & (\frac{T}{3} < i \le \frac{2T}{3}) \\ 64 & (\frac{2T}{3} < i \le T) \end{cases}, \tag{4} \]

where $T$ is the number of classifiers.

MSR was proposed in the previous paper (Jiang and Goto, 2017). It was demonstrated that MSR effectively improves the ensemble recognition accuracy.

Figure 4: Examples of SSD used in the training.

• Random Filter (RF).

Images of $D_i$ are filtered by RF using the following kernel:

\[ K_i = (k_{m,n})_{1 \le m,n \le 3}, \quad k_{m,n} \in \mathbb{R}, \quad \text{where} \quad \sum_{m=1}^{3} \sum_{n=1}^{3} k_{m,n} = 1. \tag{5} \]

RF is applied more than once and in combination with the Mean Filter (MF). We expect RF to add variation to the features, as it introduces various effects to the image. MF is expected to simulate image blur and also to reduce image noise.

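• Random Affine (RA).

Images of $D_i$ are deformed by an affine transformation using the following matrix:

\[ \begin{pmatrix} 1 & 0 & C_x \\ 0 & 1 & C_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & -C_x \\ 0 & 1 & -C_y \\ 0 & 0 & 1 \end{pmatrix}, \tag{6} \]

\[ a_{11}, a_{22} \in [0.8, 1.2], \quad a_{12}, a_{21} \in [-0.2, 0.2], \]

where $(C_x, C_y)$ is the center coordinate of the image. The parameters $a_{11}$, $a_{12}$, $a_{21}$, and $a_{22}$ are randomly determined.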
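The per-classifier processing listed above can be sketched as follows for the MSR and RF/MF steps. This is a minimal sketch under assumptions: the three image sizes are assigned to equal thirds of the T classifiers, borders are handled by edge replication, and random kernel entries are drawn uniformly from [-1, 1] before being rescaled to sum to 1. None of these details is confirmed by the paper, and the function names are hypothetical.

```python
import random

MEAN_KERNEL = [[1.0 / 9.0] * 3 for _ in range(3)]

def msr_size(i, num_classifiers):
    """Image side length for the i-th classifier (1-based), by thirds."""
    if i <= num_classifiers / 3:
        return 16
    if i <= 2 * num_classifiers / 3:
        return 32
    return 64

def random_kernel(rng):
    """Random 3x3 kernel rescaled so that its entries sum to 1."""
    while True:
        raw = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3)]
        total = sum(map(sum, raw))
        if abs(total) > 0.1:  # skip draws whose sum is too close to zero
            return [[v / total for v in row] for row in raw]

def filter3x3(image, kernel):
    """Apply a 3x3 kernel with edge replication (a border assumption)."""
    h, w = len(image), len(image[0])
    def px(y, x):
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]
    return [[sum(kernel[dy + 1][dx + 1] * px(y + dy, x + dx)
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(w)] for y in range(h)]

def rf_mf_chain(image, rng):
    """RF-MF-RF-MF-RF: random filters interleaved with mean filters."""
    for step in ("RF", "MF", "RF", "MF", "RF"):
        kernel = random_kernel(rng) if step == "RF" else MEAN_KERNEL
        image = filter3x3(image, kernel)
    return image

rng = random.Random(0)
tiny = [[0.0, 50.0, 100.0], [50.0, 100.0, 150.0], [100.0, 150.0, 200.0]]
print(msr_size(45, 90), rf_mf_chain(tiny, rng))
```

In an ensemble, each classifier would hold its own random generator state or stored kernels, so that the same processing is applied at training and at recognition time.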

The RI-Feature method is considered to make the overall correlation of classifiers smaller. Ensemble learning theory has been studied in several works. Tumer and Ghosh derived the following formula (Tumer and Ghosh, 2016):

\[ err_{add}(H) = \frac{1 + \theta (T - 1)}{T} \, err_{add}(h), \tag{7} \]

where $err_{add}(H)$ is the overall error rate of the classifiers, $err_{add}(h)$ is the mean of the error rates of the classifiers, $\theta$ is the overall correlation of the classifiers, and $T$ is the number of classifiers. The overall error rate becomes smaller by making $err_{add}(h)$ or $\theta$ smaller, or by increasing $T$. For example, Random Forest is an ensemble learning method that takes the overall correlation of classifiers into account (Breiman, 2001). It is expected that the RI-Feature method makes $\theta$ smaller by adding some fluctuations to the input data.
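To make the effect of the correlation term concrete, the relation of Eq. (7), read as err_add(H) = (1 + θ(T − 1)) / T × err_add(h), can be evaluated for a few values of θ. The mean error of 0.30 used below is a hypothetical number chosen only for illustration; T = 90 matches the ensemble size used in the experiments.

```python
def ensemble_added_error(mean_error, correlation, num_classifiers):
    """Eq. (7): err_add(H) = (1 + theta * (T - 1)) / T * err_add(h)."""
    factor = (1.0 + correlation * (num_classifiers - 1)) / num_classifiers
    return factor * mean_error

# Hypothetical mean error of 0.30 with T = 90 classifiers.
for theta in (1.0, 0.5, 0.1, 0.0):
    print(theta, ensemble_added_error(0.30, theta, 90))
```

With θ = 1 the classifiers are fully correlated and the ensemble gains nothing over a single classifier. With θ = 0 the error term is divided by T. This is the motivation for decorrelating the classifiers through random image processing.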

2.4 Recognition Stage

To recognize a query image, $T$ RI-Features are first extracted from the query image. A different combination of random parameters and kernels, such as the kernel $K_i$ and the affine parameters, is used to extract the $i$-th RI-Feature. Second, each RI-Feature is fed to its corresponding classifier, and each classifier outputs a class label. When the $i$-th classifier outputs an answer $r_i$, an answer vector is created as follows:

\[ A_i = (a_{i,1}, a_{i,2}, \ldots, a_{i,N}), \quad a_{i,j} = \begin{cases} 1 & \text{if } j = r_i \\ 0 & \text{if } j \neq r_i \end{cases}, \tag{8} \]

where $N$ is the number of classes. The final answer $R$ is calculated by plurality voting as follows:

\[ A = \sum_{i=1}^{T} A_i, \quad A = (a_1, a_2, \ldots, a_N), \quad R = \operatorname*{argmax}_{x} \{ f(x) \mid f(x) = a_x \}. \tag{9} \]
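The plurality voting of the recognition stage can be sketched as follows: each classifier's answer is turned into a one-hot answer vector, the vectors are summed, and the class with the largest total is returned. Class labels are assumed here to be integers from 0 to N − 1, and ties are broken in favor of the lowest class index. Both are assumptions of this sketch, as the paper does not specify tie handling.

```python
def plurality_vote(answers, num_classes):
    """Sum one-hot answer vectors and return (winning class, vote totals)."""
    totals = [0] * num_classes
    for r in answers:
        totals[r] += 1  # same as adding the one-hot vector for answer r
    # max() returns the first maximum, so ties go to the lowest index.
    winner = max(range(num_classes), key=lambda j: totals[j])
    return winner, totals

# Five hypothetical classifier outputs over three classes.
winner, totals = plurality_vote([2, 0, 2, 1, 2], num_classes=3)
print(winner, totals)
```

The vote totals also provide a ranking of candidates, which is useful when the second or third candidate is to be passed on to later processing.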

3 PERFORMANCE EVALUATION

OF ENSEMBLE SCHEME

3.1 Experimental Environment

Experimental environment is as follows:

• CPU: Intel Core i7-3770 (3.4 GHz)

• Memory: 16 GB

• Development language: C/C++, Python

3.2 Dataset

We have created a new Japanese scene character dataset for testing, based on the dataset compiled in (Horie and Goto, 2018). The dataset consists of Hiragana, Katakana, and Kanji characters (1,400 images and 523 classes) taken in real scenes. All character images are in color and of arbitrary size. Seven Japanese fonts (3,107 classes, 21,749 characters in total) are used for training. The training dataset does not include real scene characters, since it is difficult to collect characters of all classes in Japanese.

Table 1: Parameters of a HOG feature.

Image size   Cell size   Block size   Orientation   Dimension
16×16        2           16           5             320
32×32        4           32           5             320
64×64        8           64           5             320

Figure 5: The architecture of CNN and HOG-3LP.

Table 2: Results of comparison between CNN and HOG.

Method     Recognition accuracy [%]
CNN        63.21
HOG-3LP    66.21

3.3 Comparison of CNN and HOG

Although CNN has been reported to achieve high-

level accuracy in character recognition, it is thought

that CNN is not effective in a situation of learning

only SSD. In order to conﬁrm it, the following two

architectures are compared.

• CNN: A CNN based on LeNet (LeCun et al., 1998), with 512 nodes in the F6 layer and 3,107 nodes in the output layer.

• HOG-3LP: A Three-Layer Perceptron (3LP) with 320 nodes in the input layer, 512 in the hidden layer, and 3,107 in the output layer, which learns the 320-dimensional HOG feature. ReLU is used as the activation function in the hidden layer (Glorot et al., 2011).

Figure 5 shows the architectures. The parameters of the HOG feature are shown in Table 1. In both cases, 434,980 grayscale synthetic scene character images are used as training data.
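The 320-dimensional feature length listed in Table 1 can be checked with a short calculation. The sketch assumes that the cell and block sizes are given in pixels and that blocks tile the image without overlap, so that each setting uses a single block covering the whole image. This reading is consistent with the table but is not stated explicitly in the paper.

```python
def hog_dimension(image_size, cell_size, block_size, orientations):
    """Feature length when non-overlapping square blocks tile the image."""
    cells_per_block = (block_size // cell_size) ** 2
    blocks = (image_size // block_size) ** 2
    return blocks * cells_per_block * orientations

# (image size, cell size, block size) for the three rows of Table 1.
for image_size, cell_size, block_size in ((16, 2, 16), (32, 4, 32), (64, 8, 64)):
    print(image_size, hog_dimension(image_size, cell_size, block_size, 5))
```

In every row the image is divided into 8 × 8 cells with 5 orientation bins each, which gives 8 × 8 × 5 = 320. The feature length therefore stays the same across the three image sizes, and the same 320-input classifier architecture can be used at every scale.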

Table 2 shows the results. It is shown that CNN

is inferior to HOG. Thus, we use the HOG feature

hereinafter.

3.4 Evaluation of RI-Feature

The following RI-Feature structures are compared in

order to evaluate the effects of the RI-Feature method.

• MSR (Horie and Goto, 2018)

• MSR-RF

• MSR-RA

• MSR-RF-MF-RF-MF-RF

• MSR-RA-RF-MF-RF-MF-RF

Regarding the parameters of the ensemble learning,

T = 90 and K = 9000 are used in all cases. Nearest

Neighbor Search (NNS) is utilized as the classiﬁer of

the ensemble scheme. We have chosen NNS in order

to see the system’s behavior in an environment which

is as simple as possible in this early stage of devel-

opment. Although using some other classiﬁers would

be quite interesting, it should be included in our future

work.

Table 3 shows the results. Particularly, the combi-

nation of RF and MF greatly improves the accuracy.

This is probably because the MF effectively decreases

the image noise. Moreover, it is thought that the RA

makes the ensemble learning robust against geometric

distortions.

Figure 6 shows how the recognition accuracy varies with the number of classifiers. Our system using the RI-Feature outperforms the previous system when T > 15.

3.5 Improvement of Classiﬁers

Table 3: Comparison among methods of different RI-Feature.

RI-Feature method               Recognition accuracy [%]
MSR (Horie and Goto, 2018)      65.57
MSR-RF                          70.14
MSR-RA                          71.64
MSR-RF-MF-RF-MF-RF              76.00
MSR-RA-RF-MF-RF-MF-RF           78.50

Figure 6: The evaluation about the number of classifiers (x-axis: the number of classifiers T, from 0 to 90; y-axis: recognition accuracy [%]; series: MSR and MSR-RA-RF-MF-RF-MF-RF).

Table 4: Evaluation of different classifiers in the ensemble learning.

Classifier   Recognition accuracy [%]
NNS          78.50
SVM          77.93
3LP          80.71

It is expected that more SSD will improve the recognition system. However, the time complexity and space complexity of NNS at recognition time are both O(n), where n is the number of image samples. On the other hand, the complexities of Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) are O(1) with respect to n. Thus, SVM and MLP are able to learn from a large number of training data. The previous methods (Jiang and Goto, 2017)(Horie and Goto, 2018) utilized SVM in the ensemble learning. We have introduced MLP in order to improve the recognition accuracy.
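The difference in recognition cost can be illustrated with a brute-force nearest neighbor search next to a linear scoring function. This is a simplified sketch with hypothetical toy data, not the classifiers used in the experiments. It only shows that the nearest neighbor search visits every stored sample, whereas a trained linear model performs a fixed amount of work per query.

```python
def nearest_neighbor(query, samples, labels):
    """Brute-force 1-NN: one distance per stored sample, so O(n) per query."""
    best_label, best_dist = None, float("inf")
    for sample, label in zip(samples, labels):
        dist = sum((q - s) ** 2 for q, s in zip(query, sample))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def linear_scores(query, weights, biases):
    """Linear model: cost depends on classes x dimensions, not on n."""
    return [sum(w * q for w, q in zip(row, query)) + b
            for row, b in zip(weights, biases)]

# Hypothetical 2-D toy data standing in for HOG features.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labels = ["a", "b", "c"]
print(nearest_neighbor([0.9, 0.1], samples, labels))
print(linear_scores([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]))
```

The nearest neighbor search must also keep all n training features in memory, while the linear model stores only its weights, so larger SSD sets are easier to exploit with SVM or MLP.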

The following three classifiers are compared.

• NNS: Nearest Neighbor Search (K = 9,000, T = 90)

• SVM: Linear Support Vector Machine (C = 1) (K = 200,000, T = 90)

• 3LP: Three-Layer Perceptron (K = 200,000, T = 90)

Figure 7 shows the architecture of 3LP. ReLU is used

as the activation function in the hidden layer (Glorot

et al., 2011). MSR-RF-MF-RF-MF-RF-MF-RF-RA

is used as the RI-Feature method in all cases.

Table 4 shows the results. 3LP is superior to NNS and SVM. This suggests that the number of training data and the kind of classifier are important for the ensemble scheme. Figure 8 shows that our proposed system is able to recognize characters with rotation, blur, lighting variation, noise, and various fonts. It also shows that the correct answers for some incorrectly recognized characters are included in the second or third candidate. We expect that some of these characters could be correctly recognized by combining the system with natural language processing or other processes. Our system cannot recognize some characters that have large geometric distortion, a complex background, or an extraordinary design.

4 CONCLUSION

We have proposed the RI-Feature method to improve the ensemble scheme proposed in our previous work. The RI-Feature method randomly processes an image before the character features are extracted. We have also proposed introducing MLP into the ensemble.

Experimental results have shown that HOG outperforms CNN when only SSD is used for training. It is also shown that the newly introduced RI-Feature method improves the accuracy of the ensemble scheme from 65.57% to 78.50%.

Our future work includes examining appropriate features for the ensemble scheme that learns Japanese synthetic scene characters.

REFERENCES

Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140.

Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.

Glorot, X. et al. (2011). Deep sparse rectifier neural networks. In Proceedings of Machine Learning Research, volume 15, pages 315–323.

Horie, F. and Goto, H. (2018). High-accuracy Japanese scene character recognition using synthetic scene characters and multi-scale voting classifier. In DAS2018 Short Paper.

Jaderberg, M. et al. (2014). Synthetic data and artificial neural networks for natural scene text recognition. Workshop on Deep Learning, NIPS.

Jiang, L. and Goto, H. (2017). Ensemble classifier with dividing training scheme for Chinese scene character recognition. In 2017 International Conference on Image and Vision Computing New Zealand (IVCNZ).

LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition. In Proc. of the IEEE.


Figure 7: The architecture of 3LP.

Figure 8: Recognition examples (panels: correctly recognized characters and incorrectly recognized characters).

Ren, X. et al. (2016). A CNN based scene Chinese text recognition algorithm with synthetic data engine. arXiv preprint arXiv:1604.01891.

Tian, S. et al. (2016). Multilingual scene character recognition with co-occurrence of histogram of oriented gradients. Pattern Recognition, 51:125–134.

Tumer, K. and Ghosh, J. (2016). Theoretical foundations of linear and order statistics combiners for neural pattern classifiers. In Technical Report TR-95-02-98, Computer and Vision Research Center, University of Texas, Austin.
