estimated descent direction is not, in fact, an ascent
direction. The difference between descent and ascent
may easily be within the gradient estimation error —
the batch-based gradient is always a sample estimate, whose standard deviation depends on the unknown variance of the individual derivatives within the training set. By contrast, when optimizing over the training set itself, the training set gradient is computed deterministically, with zero deviation. The descent direction is then certain to lead to a decrease in loss.
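This difference can be illustrated with a small numerical sketch (not taken from the paper; the linear least-squares model, the synthetic data, and the batch size of 32 are assumptions made purely for illustration). The training set gradient is identical on every evaluation, while the batch gradients scatter around it with a nonzero standard deviation:

import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 10_000, 5, 32
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
w = rng.normal(size=d)                        # arbitrary current parameters

def gradient(Xs, ys, w):
    # Gradient of the mean squared error 0.5 * mean((Xs @ w - ys)**2).
    return Xs.T @ (Xs @ w - ys) / len(ys)

full_grad = gradient(X, y, w)                 # deterministic: same data, same value

# Batch gradients are sample estimates scattering around the training set gradient.
batch_grads = np.array([
    gradient(X[idx], y[idx], w)
    for idx in (rng.choice(n, batch_size, replace=False) for _ in range(1000))
])
print("training set gradient:", full_grad)
print("mean batch gradient:  ", batch_grads.mean(axis=0))
print("batch gradient std:   ", batch_grads.std(axis=0))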
The explicit task of the optimization algorithm is
to minimize the loss over the training set. If the goal
of optimizing over the whole (explicitly unknown)
population is adopted, the appropriate means would
be biased estimates that can have lower errors over the
population, such as ridge regression for linear prob-
lems (van Wieringen, 2023). The theory of biased estimation provides substantial results concerning this goal but also shows that it is difficult to reach because of the unknown regularization parameters, which can only be determined through computationally expensive experiments with validation data sets.
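As a rough illustration of that selection procedure (a minimal sketch with synthetic data and an assumed grid of candidate regularization strengths, not taken from the cited work), the ridge parameter below is chosen by fitting on the training set and comparing errors on a held-out validation set:

import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def ridge_fit(Xs, ys, lam):
    # Closed-form ridge solution; a biased estimate for lam > 0.
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ ys)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]        # assumed candidate grid
val_errors = [np.mean((X_val @ ridge_fit(X_train, y_train, lam) - y_val) ** 2)
              for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_errors))]
print("validation errors:", dict(zip(lambdas, val_errors)))
print("selected lambda:  ", best_lam)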
Even if the loss is extended with regularization
terms to enhance the model’s performance on the
whole population (represented by a validation set), the
optimized regularized fit is reached at the minimum of
the extended loss function once more over the given
training set. Thus, as mentioned above, it is incorrect
from the optimization algorithm’s viewpoint to com-
pare the precision of the training set gradient with that
of the batches, which are subsamples drawn from the
training set. The former is precise, while the latter are
approximations.
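For concreteness, the regularized objective discussed above can be written as follows (the symbols ℓ for the per-example loss, f for the model, Ω for the penalty, and λ for its weight are illustrative and not taken from the text); the sum still runs over the fixed training set of N examples, so the extended loss and its gradient remain deterministic:

\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(y_i, f(x_i; w)\bigr) + \lambda \, \Omega(w)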
A related argument, frequently cited, is that what is genuinely sought is the minimum for the population and not for the training set. However, this argument is somewhat misleading. There is no method for finding the true, exact minimum for the population based only on a subsample such as the training set; the training set is the best and only information available. Also, the loss function values used in the algorithm to decide whether to accept or reject a solution are values for the given training
set. Examples in (Hrycej et al., 2023) show that no
law guarantees computing time savings through in-
cremental learning for the same performance.
2.3 Convexity Around the Minimum Is Not Exploited
Another problem is that in a certain neighborhood of a local minimum, every smooth function is convex; this follows directly from the definition of a minimum. There, the location of the minimum is not determined by the gradient alone; the Hessian matrix of second derivatives also plays a role. Although using
an explicit estimate of the Hessian is infeasible for
large problems with millions to billions of parame-
ters, there are second-order algorithms that exploit the
curvature information implicitly. One of them is the
well-known conjugate gradient algorithm (Hestenes
and Stiefel, 1952; Fletcher and Reeves, 1964), thor-
oughly described in (Press et al., 1992), which re-
quires only the storage of an additional vector with
a dimension equal to the length of the plain gradi-
ent. However, batch sampling distorts the second-order information substantially more than it distorts the gradient (Goodfellow et al., 2016). This leads to a considerable loss of efficiency and of the convergence guarantees of second-order algorithms, which is why they are scarcely used in the neural network community, possibly sacrificing the benefits of their computational efficiency.
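A minimal sketch of the Fletcher-Reeves update (not the implementation from the cited references; the quadratic test loss and the crude backtracking line search are assumptions for illustration) shows that, besides the current gradient, only one additional vector of the same dimension, the search direction, has to be stored:

import numpy as np

def conjugate_gradient_fr(loss, grad, w, n_iter=50):
    # Fletcher-Reeves nonlinear conjugate gradient; stores one extra vector
    # (the search direction) of the same length as the gradient.
    g = grad(w)
    direction = -g
    for _ in range(n_iter):
        step = 1.0                            # crude backtracking line search
        while loss(w + step * direction) > loss(w) and step > 1e-12:
            step *= 0.5
        w = w + step * direction
        g_new = grad(w)
        beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves coefficient
        direction = -g_new + beta * direction
        g = g_new
    return w

# Deterministic quadratic loss over a fixed "training set".
rng = np.random.default_rng(2)
A, b = rng.normal(size=(100, 10)), rng.normal(size=100)
loss = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)
print("final loss:", loss(conjugate_gradient_fr(loss, grad, np.zeros(10))))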
Second-order algorithms cannot be used with the batch scheme for another reason. They are usually designed for a monotonic descent of the loss values. Reaching a specific loss value with one batch cannot guarantee that this value will not become worse with another batch. This violates some of the assumptions under which second-order algorithms have been developed. Mediocre computing results with these algorithms in the batch scheme seem to confirm this hypothesis.
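The effect can be reproduced with a few lines (synthetic data and a single gradient step; all numerical choices are assumptions for illustration): a step that lowers the loss on the batch it was computed from gives no guarantee about the loss value measured on another batch at the same parameters, so acceptance tests based on these values are unreliable:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10_000, 5))
y = X @ np.ones(5) + rng.normal(size=10_000)
batch_loss = lambda idx, w: 0.5 * np.mean((X[idx] @ w - y[idx]) ** 2)

w_old = np.zeros(5)
idx1 = rng.choice(len(y), 32, replace=False)
g = X[idx1].T @ (X[idx1] @ w_old - y[idx1]) / 32      # gradient on batch 1
w_new = w_old - 0.05 * g                              # one descent step for batch 1

idx2 = rng.choice(len(y), 32, replace=False)
print("batch 1 loss:", batch_loss(idx1, w_old), "->", batch_loss(idx1, w_new))
print("batch 2 loss at the new parameters:", batch_loss(idx2, w_new))
# The batch 2 value may well exceed the batch 1 value just reached, although
# the training set loss need not have changed for the worse at all.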
3 SUBSTITUTING THE TRAINING SET BY A SUBSET
To summarize the arguments in favor of batch-
oriented training, the batch-based procedure is justi-
fied by the assumption that the gradients for individ-
ual batches are roughly consistent with the gradient
over the whole training set (epoch). Thus, a batch-based improvement is frequently, although not always (depending on the choice of metaparameters), also an improvement for the epoch. This is also consistent with recorded computing experience. On the other
hand, one implicitly insists on optimizing over the
whole training set to find an optimum, as one batch
is not expected to represent the training set fully.
Batch-oriented gradient optimization hypothe-
sizes that the batch-loss gradient approximates the
training set gradient and the statistical population gra-
dient well enough.
By contrast, the hypothesis followed here is related but essentially different. It is assumed that the optimum of the loss over a training set subset is close to the optimum over the whole training set.
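A minimal numerical sketch of this hypothesis (ordinary least squares on synthetic data as a stand-in for neural network training; the set sizes are arbitrary assumptions) compares the training set loss at the subset optimum with that at the full training set optimum:

import numpy as np

rng = np.random.default_rng(4)
n, d, subset_size = 50_000, 20, 2_000
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

lsq_fit = lambda Xs, ys: np.linalg.lstsq(Xs, ys, rcond=None)[0]
train_loss = lambda w: 0.5 * np.mean((X @ w - y) ** 2)

w_full = lsq_fit(X, y)                                # optimum over the training set
idx = rng.choice(n, subset_size, replace=False)
w_subset = lsq_fit(X[idx], y[idx])                    # optimum over a fixed subset

print("training set loss at the full optimum:  ", train_loss(w_full))
print("training set loss at the subset optimum:", train_loss(w_subset))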