dimensional, non-linear mappings from collections of examples makes them candidates for image recogni-
tion tasks. In the traditional model of pattern recog-
nition, a hand-designed feature extractor gathers rel-
evant information from the input and eliminates ir-
relevant variabilities. A trainable classifier then cat-
egorizes the resulting feature vectors into classes. In
this scheme, standard fully connected multi-layer networks can be used as classifiers. A potentially more interesting scheme is to rely as much as possible on learning in the feature extractor itself.
A CNN combines three architectural ideas to ensure some degree of shift, scale and distortion invariance: local receptive fields, shared weights and spatial sub-sampling. The CNN architecture used in our experiments is inspired by Yann LeCun (Y. LeCun and Haffner, 1998). The CNN output layer consists of Euclidean radial basis function (RBF) neurons. The reason for using RBF neurons is to connect the distributed-codes layer with the classification classes in the output layer.
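As a minimal sketch of such an RBF output layer (assuming the preceding layer produces a distributed code vector and each class is represented by a fixed target code w_i; the function name and array shapes are illustrative, not taken from our implementation):

    import numpy as np

    def rbf_output_layer(code, class_codes):
        # Euclidean RBF output layer: y_i = ||code - w_i||^2 for each class i.
        # code:        distributed code from the previous layer, shape (d,)
        # class_codes: per-class target codes w_i, shape (n_classes, d)
        diff = class_codes - code
        return np.sum(diff * diff, axis=1)  # smaller y_i means the code is closer to class i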
A classification confidence score is very important in decision support systems. We can determine it in the following way. Neurons in the output layer can be considered as centers of individual class clusters defined through the values of the synaptic weights. We can use the following Gaussian function for this purpose:
ϕ(y_i) = e^(−y_i / σ),    (1)
where y_i is the output of the i-th neuron in the output layer and σ is the radius of the cluster defined through the distributed codes. It is clear that the values of the ϕ(y_i) function are from the interval [0, 1] and we can consider these values as a confidence score. For an output that is close to the distributed code of the corresponding class, the confidence will be close to 1 and vice versa.
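A minimal sketch of this confidence computation (assuming the RBF outputs y_i and the radius σ are available; the numeric values below are purely illustrative):

    import numpy as np

    def confidence_scores(rbf_outputs, sigma):
        # phi(y_i) = exp(-y_i / sigma); a small distance gives a confidence close to 1
        return np.exp(-np.asarray(rbf_outputs, dtype=float) / sigma)

    y = np.array([0.4, 3.7, 12.9])          # hypothetical outputs of three RBF neurons
    phi = confidence_scores(y, sigma=2.0)   # approx. [0.82, 0.16, 0.002]
    predicted_class = int(np.argmax(phi))   # the most confident class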
Boosting
Boosting refers to a general and provably effective
method of producing a very accurate prediction rule
by combining rough and moderately inaccurate rules
of thumb in a manner similar to that suggested above.
Boosting has its roots in a theoretical framework for studying machine learning called the "Probably Approximately Correct (PAC)" learning model due to Valiant (Valiant, 1984). Valiant was the first to pose the question of whether a "weak" learning algorithm, which performs just slightly better than random guessing in the PAC model, can be boosted into an accurate "strong" learning algorithm. Our inspiration was the AdaBoost algorithm introduced by Freund and Schapire in 1995 (Freund and Schapire, 1999). Prac-
tically, AdaBoost has many advantages. It requires
no prior knowledge about the weak learner and so
can be flexibly combined with any method for finding
weak hypotheses. Finally, it comes with a set of the-
oretical guarantees given sufficient data and a weak
learner that can reliably provide only moderately ac-
curate weak hypotheses. On the other hand, the ac-
tual performance of boosting on a particular problem
is clearly dependent on the data and the weak learner.
Consistent with theory, boosting can fail to perform
well given insufficient data, overly complex weak hy-
potheses or weak hypotheses which are too weak.
Boosting seems to be especially susceptible to noise.
The boosting approach was also used to improve the classification performance of convolutional neural networks (Y. LeCun and Haffner, 1998).
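To make the weight-update idea behind AdaBoost concrete, the following sketch assumes binary labels in {-1, +1} and a user-supplied train_weak_learner routine that accepts sample weights and returns a predictor; it is a generic illustration, not the exact configuration used in our experiments:

    import numpy as np

    def adaboost(X, y, train_weak_learner, T=10):
        m = len(y)
        D = np.full(m, 1.0 / m)                        # uniform distribution over samples
        hypotheses, alphas = [], []
        for _ in range(T):
            h = train_weak_learner(X, y, D)            # returns a callable h(X) -> {-1, +1}
            pred = h(X)
            eps = np.sum(D[pred != y])                 # weighted training error
            if eps >= 0.5:                             # no better than random guessing: stop
                break
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
            D *= np.exp(-alpha * y * pred)             # emphasize misclassified samples
            D /= D.sum()
            hypotheses.append(h)
            alphas.append(alpha)
        # strong classifier: weighted majority vote of the weak hypotheses
        return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hypotheses)))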
Ensemble of Classifiers
The proposed incremental learning system using Convolutional Neural Networks described in this section was inspired by the AdaBoost algorithm (Freund and Schapire, 1999) and the Learn++ algorithm (R. Polikar and Udpa, 2001). The Learn++ algorithm was designed for incremental learning of supervised neural networks, such as multilayer perceptrons, to accommodate new data without access to previously used training data in the learning phase. The algorithm generates an ensemble of weak classifiers, each trained using a different distribution of training samples. The outputs of these classifiers are then combined using Littlestone's majority-voting scheme to obtain the final classification rule. An ensemble of classifiers can be optimized for improving classifier accuracy or for incremental learning of new data (Polikar, 2007). Combining an ensemble of classifiers is geared towards achieving incremental learning in addition to improving the overall classification performance, as in boosting. The proposed architecture is based on this intuition: each new classifier added to the ensemble is trained using a set of examples drawn according to a distribution which ensures that examples misclassified by the current ensemble have a high probability of being sampled (examples with high error rates are precisely those that are unknown to the ensemble).
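A minimal sketch of this sampling scheme (the update rule shown is a generic Learn++/AdaBoost-style choice and the function names are illustrative, not necessarily the exact procedure of the proposed system):

    import numpy as np

    def update_sampling_distribution(D, correct):
        # correct: boolean array, True where the current ensemble classifies correctly
        eps = np.sum(D[~correct])              # weighted error of the current ensemble
        beta = eps / max(1.0 - eps, 1e-12)     # beta < 1 when the ensemble beats chance
        D = D.copy()
        D[correct] *= beta                     # down-weight examples the ensemble already handles
        return D / D.sum()

    def sample_training_set(X, y, D, m, rng=np.random.default_rng(0)):
        # draw m examples according to the current distribution D
        idx = rng.choice(len(y), size=m, replace=True, p=D)
        return X[idx], y[idx]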
Incremental Learning
We can describe the learning algorithm inspired by (R. Polikar and Udpa, 2001) by assuming the following inputs. Denote the training data S_k = [(x_1, y_1), ..., (x_m, y_m)], where x_i are the training samples and y_i are the corresponding correct labels for the m samples randomly selected from the database Ω_k, where k = 1, 2, ..., K. The algorithm calls the weak learner repeatedly to generate multiple hypotheses using different