To combat overfitting, we could simply seek more training data, but suppose that we only have access to the data that has already been collected for us. In that case, instead of collecting more data, we can try to create new data from the data we already have: if we want to learn to recognize cars, we can slightly rotate an image, zoom in or out, make small changes to the white balance, or mirror the photo in order to obtain new samples that we can still reasonably regard as images of cars. We then speak of data augmentation.
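As an illustrative sketch (ours, not from any cited work; it assumes the torchvision library, and the specific parameter values are arbitrary), such transformations can be composed as follows:

    from torchvision import transforms

    # Each transform yields a slightly altered image that we can still
    # reasonably regard as an image of a car.
    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),               # mirror image
        transforms.RandomRotation(degrees=10),                # slight rotation
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # zoom in/out
        transforms.ColorJitter(brightness=0.1, hue=0.02),     # white-balance-like shift
        transforms.ToTensor(),
    ])

    # Applying `augment` to the same PIL image repeatedly produces a new,
    # randomly perturbed training example on every call.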
In this article, we take a closer look at two techniques for addressing the overfitting problem: early-stopping and dropout. The rest of this paper is organized as follows. In Section 2 we briefly compare and position our solution with respect to other proposals found in the literature. Section 3 describes the problem addressed. In Section 4, we describe our proposed method, which can potentially be applied to convolutional neural networks. Section 5 describes the Dropout Neural Network Model. Finally, we carry out extensive experiments with standard datasets and different network architectures to validate the effectiveness of these techniques.
2 STATE-OF-THE-ART
Without any attempt at being exhaustive, here we point out a few connections between dropout and previous literature (Shaeke and Xiuwen, 2019; Senen-Cerda and Sanders, 2020):
Stochastic Gradient Descent Method: The first description of a dropout algorithm was in (Hinton et al., 2012) as an effective heuristic for algorithmic regularization. The authors used the standard stochastic gradient descent procedure to train dropout neural networks on mini-batches of training cases, but they modified the penalty term that was normally used to keep weights from getting too big. At test time, they used the “average network”, which contains all of the hidden units but with their outgoing weights halved to compensate for the fact that twice as many of them are active during training. In practice, this gives very similar performance to averaging over a large number of dropout networks.
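As a minimal sketch of this scheme (the function is ours, assuming PyTorch; this is classic non-inverted dropout, not a library API), note that scaling a layer's activations by 1 - p at test time is equivalent to halving its outgoing weights when p = 0.5:

    import torch

    def hidden_layer(x, W, b, training, p=0.5):
        # One hidden layer with classic (non-inverted) dropout as in
        # (Hinton et al., 2012): during training, each hidden unit is
        # dropped with probability p; at test time all units are kept
        # and their activations are scaled by (1 - p), which matches
        # halving the outgoing weights for p = 0.5.
        h = torch.relu(x @ W + b)
        if training:
            mask = (torch.rand_like(h) > p).float()  # random drop mask
            return h * mask
        return h * (1.0 - p)  # the "average network" at test time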
Regularized Dropout for Neural Networks: R-Drop is a simple yet very effective regularization method built upon dropout that minimizes the bidirectional KL-divergence between the output distributions of any pair of submodels sampled from dropout during model training (Liang et al., 2021). Concretely, in each mini-batch training step, each data sample goes through the forward pass twice, and each pass is processed by a different submodel obtained by randomly dropping out some hidden units. R-Drop forces the two distributions produced for the same data sample by the two submodels to be consistent with each other by minimizing the bidirectional Kullback-Leibler (KL) divergence between them (Liang et al., 2021).
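The following sketch (ours, assuming PyTorch; the weighting coefficient alpha is an illustrative hyperparameter) shows how such an R-Drop-style objective can be assembled from two forward passes of the same batch:

    import torch
    import torch.nn.functional as F

    def r_drop_loss(model, x, labels, alpha=1.0):
        # Requires model.train() so that the two forward passes use
        # different random dropout masks (i.e., two distinct submodels).
        logits1 = model(x)  # first submodel
        logits2 = model(x)  # second submodel
        # Standard cross-entropy, averaged over the two passes.
        ce = 0.5 * (F.cross_entropy(logits1, labels)
                    + F.cross_entropy(logits2, labels))
        # Bidirectional KL divergence between the two output distributions.
        log_p1 = F.log_softmax(logits1, dim=-1)
        log_p2 = F.log_softmax(logits2, dim=-1)
        kl = 0.5 * (F.kl_div(log_p1, log_p2, reduction="batchmean", log_target=True)
                    + F.kl_div(log_p2, log_p1, reduction="batchmean", log_target=True))
        return ce + alpha * kl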
Implicit and Explicit Regularization: In a recent work, (Wei et al., 2020) disentangle the explicit and implicit regularization effects of dropout, i.e., the regularization due to the expected bias induced by dropout versus the regularization induced by the noise due to the randomness in dropout. They propose an approximation of the explicit regularizer for deep neural networks and show it to be effective in practice. Their generalization bounds, however, are limited to linear models and require weights to be norm-bounded (Arora et al., 2020).
3 PROBLEM DESCRIPTION
Overfitting is a concept in data science that occurs when a statistical model fits its training data exactly but performs badly on the test set. When this happens, the algorithm cannot perform accurately against unseen data, defeating its purpose (IBM Cloud Education, 2021). When machine learning algorithms are constructed, they leverage a sample dataset to train the model. However, when the model trains for too long on the sample data or when the model is too complex, it can start to learn the “noise,” or irrelevant information, within the dataset. When the model memorizes the noise and fits too closely to the training set, the model becomes “overfitted,” and it is unable to generalize well to new data. If a model cannot generalize well to new data, then it will not be able to perform the classification or prediction tasks it was intended for. A low training error rate combined with high variance is a good indicator of overfitting. To check for this type of behavior, part of the available dataset is typically set aside as the “test set”: if the training data has a low error rate while the test data has a high error rate, this signals overfitting. Overfitting typically arises when there are too many feature dimensions, model assumptions, and parameters, too much noise, and too little training data. As a result, the fitted function predicts the training set perfectly, while its predictions on the new data of the test set are poor.
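As a toy sketch of this check (ours; the synthetic data and the scikit-learn model are purely illustrative), an unconstrained model fitted to pure noise shows exactly this signature: near-zero training error and a much higher test error:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Random features and labels: there is nothing to learn but noise.
    X = np.random.rand(500, 20)
    y = np.random.randint(0, 2, 500)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    model = DecisionTreeClassifier()  # fully grown tree: memorizes the noise
    model.fit(X_tr, y_tr)
    train_err = 1 - accuracy_score(y_tr, model.predict(X_tr))
    test_err = 1 - accuracy_score(y_te, model.predict(X_te))
    print(f"train error: {train_err:.3f}, test error: {test_err:.3f}")
    # Expected: train error near 0.0, test error near 0.5 -> overfitting.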
4 HOW TO AVOID OVERFITTING
Much has been written about overfitting and the
bias/variance tradeoff in neural nets and other ma-