Efficient Neural Network Training via Subset Pretraining
Jan Spörer^{1,a}, Bernhard Bermeitinger^{2,b}, Tomas Hrycej^{1}, Niklas Limacher^{1,**} and Siegfried Handschuh^{1,c}
^{1} Institute of Computer Science, University of St.Gallen (HSG), St.Gallen, Switzerland
^{2} Institute of Computer Science in Vorarlberg, University of St.Gallen (HSG), Dornbirn, Austria
{firstname.lastname}@unisg.ch, ^{**} {firstname.lastname}@student.unisg.ch
^{a} https://orcid.org/0000-0002-9473-5029, ^{b} https://orcid.org/0000-0002-2524-1850, ^{c} https://orcid.org/0000-0002-6195-9034
Keywords:
Deep Neural Network, Convolutional Network, Computer Vision, Efficient Training, Resource Optimization,
Training Strategies, Overdetermination Ratio, Stochastic Approximation Theory.
Abstract:
In training neural networks, it is common practice to use partial gradients computed over batches, mostly
very small subsets of the training set. This approach is motivated by the argument that such a partial gradient
is close to the true one, with precision growing only with the square root of the batch size. A theoretical
justification is provided by stochastic approximation theory. However, the usual learning rate schedules do not satisfy the conditions required for this theory to hold. Batch processing is also difficult to combine
with efficient second-order optimization methods. This proposal is based on another hypothesis: the loss
minimum of the training set can be expected to be well-approximated by the minima of its subsets. Such
subset minima can be computed in a fraction of the time necessary for optimizing over the whole training set.
This hypothesis has been tested with the help of the MNIST, CIFAR-10, and CIFAR-100 image classification
benchmarks, optionally extended by training data augmentation. The experiments have confirmed that results
equivalent to conventional training can be reached. In summary, even small subsets are representative if the
overdetermination ratio for the given model parameter set sufficiently exceeds unity. The computing expense
can be reduced to a tenth or less.
1 INTRODUCTION
Neural networks as forecasting models learn by fit-
ting the model forecast to the desired reference out-
put (e.g., reference class annotations) given in the data
collection called the training set. The fitting algorithm changes the model parameters in the descent direction of the loss function, which measures the forecast deviation. This descent direction is determined using the loss function gradient.
The rapidly growing size of neural networks (such
as those used for image or language processing) mo-
tivates striving for a maximum computing economy.
One widespread approach is determining the loss
function gradient from subsets of the training set,
called batches (or mini-batches). Different batches
are alternately used to cover the whole training set
during the training.
There are some arguments supporting this proce-
dure. (Goodfellow et al., 2016, Section 8.1.3) refers
to the statistical fact that the standard deviation of a random estimate decreases with the square root of the number of samples. Consequently, the gradient elements computed from a fraction of 1/K of the training samples (with a given positive integer K) have a standard deviation only √K times larger than those computed over the whole training set, which seems to be a good deal.
Another frequent justification relies on stochastic approximation theory. The stochastic ap-
proximation principle applies when drawing training
samples from a stationary population generated by a
fixed (unknown) model. (Robbins and Monro, 1951)
discovered this principle in the context of finding the
root (i.e., the function argument for which the func-
tion is zero) of a function g(x) that cannot be directly
observed. What can be observed are randomly fluc-
tuating values h(x) whose mean value is equal to the
value of the unobservable function, that is,
E[h(x)] = g(x)     (1)
The task is to fit an input/output mapping to data
by gradient descent. For the parameter vector w of
this mapping, the mean of the gradient h(w) with re-
spect to the loss function computed for a single train-
ing sample is expected to be equal to the gradient g(w)
over the whole data population. The local minimum
of the loss function is where the gradient g(w) (i.e.,
the mean value of h(w)) is zero. (Robbins and Monro,
1951) have proven that, under certain conditions, the
root is found with probability one (but without a con-
crete time upper bound).
However, this approach has some shortcomings.
For different batches, the gradient points in different
directions. So, the descent for one batch can be an
ascent for another. To cope with this, (Robbins and
Monro, 1951) formulated the convergence conditions.
They require that if the update rule for the parameter
vector is
w_{t+1} = w_t - c_t h(w_t)     (2)

which corresponds to the usual gradient descent with step size c_t, the step size sequence c_t has to satisfy the following conditions:

\sum_{t=1}^{\infty} c_t = \infty     (3)

and

\sum_{t=1}^{\infty} c_t^2 < \infty     (4)
Condition Eq. (3) is necessary for the step not to
vanish prematurely before reaching the optimum with
sufficient precision. Condition Eq. (4) provides for
decreasing step size. With a constant step size, the so-
lution would infinitely fluctuate around the optimum.
This is because, in the context of error minimization, its random instance h(w) will not diminish for individual samples, although the gradient g(w) = E[h(w)] will gradually vanish as it approaches the minimum. Finally, the gradients of individual samples will not vanish even if their mean over the training set is zero. At the minimum, g(w) = 0 results from a balance between individual nonzero vectors h(w) pointing in various directions.
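To make these conditions concrete: a harmonic schedule c_t = c_0/t satisfies both Eq. (3) and Eq. (4), whereas a constant step size satisfies only Eq. (3). The following minimal NumPy sketch (not taken from the paper; the one-dimensional toy problem, the constant c_0, and the sample count are illustrative assumptions) contrasts the two schedules on a noisy quadratic whose population minimum is w = 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: minimize E[(w - x)^2] / 2 with x ~ N(2, 1), so the population
# minimum is w = 2 and a single-sample gradient is h(w) = w - x.
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)

def run(schedule: str, w: float = 0.0, c0: float = 0.5) -> float:
    for t, x in enumerate(samples, start=1):
        h = w - x                                         # noisy gradient h(w)
        c = c0 / t if schedule == "robbins_monro" else c0
        w -= c * h                                        # update w_{t+1} = w_t - c_t h(w_t), Eq. (2)
    return w

print("decreasing step size c_t = c0/t:", run("robbins_monro"))  # typically settles close to 2
print("constant step size c_t = c0:    ", run("constant"))       # ends at a random point near 2
```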
2 SHORTCOMINGS OF THE BATCH-ORIENTED APPROACH
The concept of gradient determination using a subset
of the training set is mostly satisfactory. However, several deficiencies from a theoretical viewpoint suggest potential for improvement.
2.1 Violating the Conditions of the
Stochastic Approximation
The conditions Eq. (3) and Eq. (4) for convergence
of the stochastic approximation procedure to a global
(or at least local) minimum result from the stochas-
tic approximation theory. Unfortunately, they are al-
most always neglected in the neural network train-
ing practice. This may lead to a bad convergence (or
even divergence). The common Stochastic Gradient
Descent (SGD) with a fixed learning step size vio-
lates the stochastic approximation principles. How-
ever, even popular sophisticated algorithms do not
satisfy the conditions. The widespread (and success-
ful) Adam (Kingma and Ba, 2015) optimizer uses a
weight consisting of the quotient of the exponential moving average of the derivative and the exponential moving average of the square of the derivative:

w_{t+1,i} = w_{t,i} - \frac{c \, m_{t,i}}{\sqrt{d_{t,i}}} \frac{\partial E(w_{t,i})}{\partial w_{t,i}}

m_{t,i} = \beta_1 m_{t-1,i} + (1 - \beta_1) \frac{\partial E(w_{t-1,i})}{\partial w_{t-1,i}}

d_{t,i} = \beta_2 d_{t-1,i} + (1 - \beta_2) \left( \frac{\partial E(w_{t-1,i})}{\partial w_{t-1,i}} \right)^2     (5)
with metaparameters c, β_1, and β_2, network weights w_{t,i}, and the loss function E. β_1 is the decay factor of the exponential mean of the error derivative, β_2 is the decay factor of the square of the error derivative, and c is the step length scaling parameter. Their values have been set to the framework's sensible defaults c = 0.001, β_1 = 0.9, and β_2 = 0.999 in the following computing experiments.
Normalizing the gradient components by the moving average of their square via √(d_{t,i}) is the opposite of the decreasing step size required by stochastic approximation theory. If the gradients become small (as expected in the proximity of the minimum), the normalization increases them. This may or may not be traded off by the average gradient m_{t,i}.
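For reference, a minimal NumPy sketch of one step of the original Adam algorithm of (Kingma and Ba, 2015) follows; it uses the same moving averages as Eq. (5) but, unlike the simplified notation above, includes the bias-correction terms and the small ε in the denominator found in common implementations. The fixed step size c and the division by the square root of the second-moment estimate are exactly the features discussed in this subsection.

```python
import numpy as np

def adam_step(w, grad, m, v, t, c=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the standard Adam update (Kingma and Ba, 2015).

    The step size c stays fixed over the whole training run, and the update is
    rescaled by 1/sqrt(v): near a minimum, shrinking gradients are blown back up
    by this normalization instead of the steps decaying as Eq. (4) would require.
    """
    m = beta1 * m + (1 - beta1) * grad        # exponential moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # exponential moving average of its square
    m_hat = m / (1 - beta1 ** t)              # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)              # bias correction for the second moment
    w = w - c * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```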
2.2 “Good” Approximation of Gradient
Is not Sufficient
The quoted argument of (Goodfellow et al., 2016, Section 8.1.3), that the standard deviations of the gradient components decrease with the square root √K of the number of samples K used while the computing expense increases with K, is, of course, accurate for independent samples drawn from a population.
However, this relationship is only valid if the
whole statistical population (from which the training
set is also drawn) is considered. This does not account
for the nature of numerical optimization algorithms.
The more sophisticated among them follow the de-
scent direction. The gradient’s statistical deviation is
relatively small with respect to the statistical popula-
tion, but this does not guarantee that a so-determined
estimated descent direction is not, in fact, an ascent
direction. The difference between descent and ascent
may easily be within the gradient estimation error: the batch-based gradient is always a sample estimate,
with standard deviation depending on the unknown
variance of the individual derivatives within the train-
ing set. By contrast, when optimizing over the training set itself, the training set gradient is computed deterministically, with zero deviation. Then, the descent direc-
tion is certain to lead to a decrease in loss.
The explicit task of the optimization algorithm is
to minimize the loss over the training set. If the goal
of optimizing over the whole (explicitly unknown)
population is adopted, the appropriate means would
be biased estimates that can have lower errors over the
population, such as ridge regression for linear prob-
lems (van Wieringen, 2023). The biased estimate the-
ory provides substantial results concerning this goal
but also shows that it is difficult to reach because of
unknown regularization parameters, which can only
be determined with computationally expensive exper-
iments using validation data sets.
Even if the loss is extended with regularization
terms to enhance the model’s performance on the
whole population (represented by a validation set), the
optimized regularized fit is reached at the minimum of
the extended loss function once more over the given
training set. Thus, as mentioned above, it is incorrect
from the optimization algorithm’s viewpoint to com-
pare the precision of the training set gradient with that
of the batches, which are subsamples drawn from the
training set. The former is precise, while the latter are
approximations.
The related additional argument frequently cited
is that what is genuinely sought is the minimum for
the population and not for the training set. However,
this argument is somewhat misleading. There is no
method for finding the true, exact minimum for the
population based only on a subsample such as the training set: the training set is the best and only
information available. Also, the loss function val-
ues used in the algorithm to decide whether to accept
or reject a solution are values for the given training
set. Examples in (Hrycej et al., 2023) show that no
law guarantees computing time savings through in-
cremental learning for the same performance.
2.3 Convexity Around the Minimum Is
not Exploited
Another problem is that in a certain neighborhood of the local minimum, every smooth function is convex; this directly results from the definition of a minimum. Then, the location of the minimum is not determined solely by the gradient but also by the second derivatives captured in the Hessian matrix.
an explicit estimate of the Hessian is infeasible for
large problems with millions to billions of parame-
ters, there are second-order algorithms that exploit the
curvature information implicitly. One of them is the
well-known conjugate gradient algorithm (Hestenes
and Stiefel, 1952; Fletcher and Reeves, 1964), thor-
oughly described in (Press et al., 1992), which re-
quires only the storage of an additional vector with
a dimension equal to the length of the plain gradi-
ent. However, batch sampling distorts the second-order information substantially more than the gradient (Goodfellow et al., 2016). This leads to a con-
siderable loss of efficiency and convergence guaran-
tee of second-order algorithms, which is why they are
scarcely used in the neural network community, pos-
sibly sacrificing the benefits of their computing effi-
ciency.
Second-order algorithms cannot be used with the
batch scheme for another reason. They are usually de-
signed for continuous descent of loss values. Reach-
ing a specific loss value with one batch cannot guar-
antee that this value will not become worse with
another batch. This violates some assumptions for
which the second-order algorithms have been devel-
oped. Mediocre computing results with these algo-
rithms in the batch scheme seem to confirm this hy-
pothesis.
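For illustration, the conjugate gradient method mentioned above can be sketched as a nonlinear Fletcher-Reeves iteration with a crude backtracking line search; the line search, tolerances, and the toy quadratic below are assumptions of this sketch, while production implementations such as the one described in (Press et al., 1992) use more careful line searches and restarts. Note that only one additional direction vector is stored and that the method relies on deterministic loss and gradient evaluations, which is precisely what batch sampling undermines.

```python
import numpy as np

def conjugate_gradient(f, grad, w, n_iter=100, tol=1e-10):
    """Minimal nonlinear conjugate gradient (Fletcher-Reeves) sketch."""
    g = grad(w)
    d = -g                                   # initial search direction: steepest descent
    for _ in range(n_iter):
        # Crude backtracking line search along d (an assumption of this sketch).
        step, f0 = 1.0, f(w)
        while f(w + step * d) > f0 and step > 1e-12:
            step *= 0.5
        w = w + step * d
        g_new = grad(w)
        if np.dot(g_new, g_new) < tol:       # stop when the gradient vanishes
            break
        beta = np.dot(g_new, g_new) / np.dot(g, g)   # Fletcher-Reeves coefficient
        d = -g_new + beta * d                # new direction reuses curvature information
        g = g_new
    return w

# Toy usage: minimize the convex quadratic 0.5 * w^T A w - b^T w.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
w_min = conjugate_gradient(lambda w: 0.5 * w @ A @ w - b @ w,
                           lambda w: A @ w - b,
                           np.zeros(2))
print(w_min)   # close to the solution of A w = b
```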
3 SUBSTITUTING THE
TRAINING SET BY A SUBSET
To summarize the arguments in favor of batch-
oriented training, the batch-based procedure is justi-
fied by the assumption that the gradients for individ-
ual batches are roughly consistent with the gradient
over the whole training set (epoch). So, a batch-based improvement is frequently (although not always, depending on the choice of the metaparameters) also an improvement for the epoch. This is also consistent with practical computing experience. On the other
hand, one implicitly insists on optimizing over the
whole training set to find an optimum, as one batch
is not expected to represent the training set fully.
Batch-oriented gradient optimization hypothesizes that the batch-loss gradient approximates both the training set gradient and the statistical population gradient well enough.
By contrast, the hypothesis followed here is related but essentially different. It is assumed that the loss optimum over a subset is close to the loss optimum over the whole training set.
Figure 1: Optima for a subset and the whole training set.
Even if the minimum approximation is imperfect,
it can be expected to be a very good initial point for
fine-tuning the whole training set so that a few iter-
ations may suffice to reach the true minimum. This
principle is illustrated in Fig. 1. The subset loss func-
tion (red, dotted) is not identical to the training set
loss function (blue, solid). Reaching the minimum
of the subset loss function (red cross) delivers an ini-
tial point for fine-tuning on the training set (blue cir-
cle). This initial point is close to the training set loss
minimum (blue cross) and is very probably within a
convex region around the minimum. This motivates
using fast second-order optimization methods such as
the conjugate gradient method (Press et al., 1992).
Another benefit of such a procedure is that for a
fixed subset, both the gradient and the intermediary
loss values are exact. This further justifies the use of
second-order optimization methods.
Of course, the question is how large the subset has
to be for the approximation to be sufficiently good.
As previously noted, a smooth function is always locally convex around a minimum. If the approximate minimum over the training subset lies in this convex region, the conditions for efficient minimization with the help of second-order algorithms are satisfied. Then,
a fast convergence to the minimum over the whole
training set can be expected.
Consequently, it would be desirable for the mini-
mum of the subset loss to be within the convex region
of the training set loss.
4 SETUP OF COMPUTING
EXPERIMENTS
The following computing experiments investigate how well these hypotheses are supported. The experimental method substitutes the training set with representative subsamples of various sizes. Subsequently, a short fine-tuning on the whole training set is performed to finalize the optimum solution. The model is trained on the subsamples for exactly 1,000 epochs, while the fine-tuning on the whole training set is limited to 100 epochs.
4.1 Benchmark Datasets Used
The benchmarks for the evaluation have been cho-
sen from the domain of computer vision. They are
medium-sized problems that can be run for a suffi-
cient number of experiments. This would not be pos-
sible with large models such as those used in language
modeling.
For the experiments, three well-known image
classification datasets MNIST (LeCun et al., 1998),
CIFAR-10, and CIFAR-100 (Krizhevsky, 2009) were
used. MNIST contains grayscale images of handwrit-
ten digits (0–9) while CIFAR-10 contains color im-
ages of exclusively ten different mundane objects like
“horse”, “ship”, or “dog”. CIFAR-100 contains the
same images; however, they are classified into 100
fine-grained classes. They contain 60,000 (MNIST)
and 50,000 (CIFAR-10 and CIFAR-100) training ex-
amples. Their respective preconfigured test splits of 10,000 examples each are used as validation sets. While both CIFAR-10 and CIFAR-100 are evenly distributed among the classes, MNIST is only roughly evenly distributed. We opted to proceed without mitigating
the slight class imbalance.
4.2 The Model Architecture
The model is a convolutional network inspired by the
VGG architecture (Simonyan and Zisserman, 2015).
It uses three consecutive convolutional layers with
the same kernel size of 3 × 3, 32/64/64 filters, and the ReLU activation function. Each is followed by a maximum pooling layer of size 2 × 2. The last fea-
ture map is flattened, and after a classification block
with one layer of 64 units and the ReLU activation
function, a linear dense layer classifies it into a soft-
max vector. All trainable layers’ parameters are ini-
tialized randomly from the Glorot uniform distribu-
tion (Glorot and Bengio, 2010) with a unique seed
per layer such that all trained models throughout the
experiments have an identical starting point. The bi-
ases of each layer are initialized with zeros. The num-
ber of parameters for the models differs only because
MNIST has one input channel, while CIFAR-10 and
CIFAR-100 have three, and CIFAR-100 has 100 class
output units instead of 10.
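For concreteness, a minimal Keras sketch of the described architecture follows. The input shape, the concrete per-layer seed values, the valid padding, and the absence of any regularization are assumptions of this sketch; the paper text specifies only the layer types, sizes, activations, and initializers.

```python
import tensorflow as tf

def build_model(input_shape=(32, 32, 3), num_classes=10):
    """VGG-inspired network as described in Section 4.2 (details assumed where unspecified)."""
    glorot = lambda seed: tf.keras.initializers.GlorotUniform(seed=seed)
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                               kernel_initializer=glorot(1), bias_initializer="zeros"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu",
                               kernel_initializer=glorot(2), bias_initializer="zeros"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu",
                               kernel_initializer=glorot(3), bias_initializer="zeros"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu",
                              kernel_initializer=glorot(4), bias_initializer="zeros"),
        tf.keras.layers.Dense(num_classes, activation="softmax",
                              kernel_initializer=glorot(5), bias_initializer="zeros"),
    ])
```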
4.3 Preventing Underdetermination of
Model Parameters
An important criterion is that the training set size is
sufficient for this procedure. The size of the training
subsets (as related to the number of model parame-
ters) must be large enough for the model not to be
underdetermined. This should be true for most of the
subsets tested so that we can fairly compare subsets
that are a relatively small fraction of the training set.
As a criterion for this, the overdetermination ratio of
each benchmark candidate has been evaluated (Hrycej
et al., 2023):
Q = \frac{KM}{P}     (6)
with K being the number of training examples, M be-
ing the output vector length (usually equal to the num-
ber of classes), and P being the number of trainable
model parameters.
This formula justifies itself by ensuring that the
numerator KM equals the number of constraints to be
satisfied (the reference values for all training exam-
ples). This product must be larger than the number of
trainable parameters for the system to be sufficiently
determined. (Otherwise, there are infinitely many solutions,
most of which do not generalize.) This is equivalent
to the requirement for the overdetermination ratio Q
to be larger than unity. It is advisable that this is sat-
isfied for the training set subsets considered, although
subsequent fine-tuning on the whole training set can
“repair” a moderate underdetermination.
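As a quick numerical illustration of Eq. (6): the function below evaluates Q for the benchmark sizes of Section 4.1 and a few subset fractions 1/b (b is defined in Section 4.4). The parameter count P = 100,000 is an assumed round figure of the order of the described network, not a value reported in the text.

```python
def overdetermination_ratio(K: int, M: int, P: int) -> float:
    """Q = K * M / P from Eq. (6): constraints per trainable parameter."""
    return K * M / P

# Illustrative only: K and M follow Section 4.1, P is an assumed parameter count.
for name, K, M in [("MNIST", 60_000, 10), ("CIFAR-10", 50_000, 10), ("CIFAR-100", 50_000, 100)]:
    for b in (1, 8, 128):   # subset fraction 1/b
        Q = overdetermination_ratio(K // b, M, P=100_000)
        print(f"{name:9s} fraction 1/{b:<3d} Q = {Q:6.2f}")
```

With M = 100 output units, CIFAR-100 keeps an overdetermination ratio roughly an order of magnitude higher than the ten-class datasets at the same subset fraction (under the assumed P), consistent with the observation below that it needs no augmentation.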
The two datasets MNIST and CIFAR-10 have ten
classes. This makes the number of constraints KM
in Eq. (6) too small for subsets with b > 4. This
is why these two datasets have been optionally ex-
panded by image augmentation. This procedure applies slight random rotations, shifts, and contrast variations. So, the number of training examples has been increased tenfold by augmenting the training data.
With CIFAR-100 containing 100 classes, this prob-
lem does not occur, and it was not augmented.
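A possible Keras realization of the described augmentation is sketched below. The transformation magnitudes, the seeds, and the reading of "tenfold" as the originals plus nine augmented copies are assumptions, since the text only characterizes the variations as slight.

```python
import tensorflow as tf

# Slight random rotations, shifts, and contrast variations (magnitudes assumed).
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.03, seed=1),          # roughly +/- 10 degrees
    tf.keras.layers.RandomTranslation(0.1, 0.1, seed=2),   # shift up to 10 % of height/width
    tf.keras.layers.RandomContrast(0.1, seed=3),           # mild contrast variation
])

def expand_tenfold(images):
    """One reading of the tenfold increase: originals plus nine augmented copies.

    The corresponding labels would be tiled ten times in the same order.
    """
    copies = [images] + [augment(images, training=True) for _ in range(9)]
    return tf.concat(copies, axis=0)
```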
4.4 Processing Steps
The processing steps for every given benchmark task
and a tested algorithm have been the following:
- The number of subsets b, such that a subset is the fraction 1/b of the training set, has been defined. These numbers have been b ∈ B with B = {2, 4, 8, 16, 32, 64, 128}. With a training set size K, a subset contains K/b samples. For example, a value of b = 2 results in a subset with half of the samples from the original training set.
- All b subsets of size K/b have been built to support the results statistically. Each subset B_{bi}, i = 1, ..., b, consists of training samples selected so that the subsets partition the entire training set. The number of experiments is excessive for larger values of b, so only five random subsets are selected. All randomness is seeded such that each experiment receives the same subsets.
- For every b ∈ B and every i = 1, ..., b, the subset loss E_{bi} has been minimized using the selected training algorithm. The number of epochs has been set to 1,000. Additionally, the losses for the whole training set (E_{BT,bi}) and the validation set (E_{BV,bi}) have been evaluated. Subsequently, a fine-tuning on the whole training set for 100 epochs has been performed, and the metrics for the training set (E_{T,bi}) and the validation set (E_{V,bi}) have been evaluated. In summary, the set of loss characteristics E_{bi}, E_{BT,bi}, E_{BV,bi}, E_{T,bi}, and E_{V,bi} represents the final results.
- For comparison, the typical training on the original training set is given by choosing b = 1.
The conjugate gradient algorithm would be the fa-
vorite for optimizing the subset (because of its rela-
tively small size) and fine-tuning (because of its ex-
pected convexity in the region containing the initial
point delivered by the subset training). Unfortunately,
this algorithm is unavailable in deep learning frame-
works like Keras. This is why the popular Adam algo-
rithm has been used. For reproducibility and to avoid additional hyperparameters, a fixed learning rate of 0.001 was employed for all training steps.
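The two training phases can be sketched as follows with Keras and the Adam settings stated above. The build_model constructor refers to the sketch in Section 4.2; the batch size, the shuffling seed, and one-hot encoded labels are assumptions of this illustration.

```python
import numpy as np
import tensorflow as tf

def subset_pretrain_then_finetune(x_train, y_train, b, i, build_model,
                                  pre_epochs=1000, fine_epochs=100, batch_size=32):
    """Phase 1: train on the i-th of b disjoint subsets; phase 2: fine-tune on the full set."""
    # Disjoint partition of the training set into b subsets of size K/b (seeded shuffling assumed).
    perm = np.random.default_rng(0).permutation(len(x_train))
    subset_idx = np.array_split(perm, b)[i]

    model = build_model(input_shape=x_train.shape[1:], num_classes=y_train.shape[1])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])

    # Phase 1: subset pre-training, 1,000 epochs on the fraction 1/b of the training set.
    model.fit(x_train[subset_idx], y_train[subset_idx],
              epochs=pre_epochs, batch_size=batch_size, verbose=0)

    # Phase 2: fine-tuning, 100 epochs on the whole training set.
    model.fit(x_train, y_train, epochs=fine_epochs, batch_size=batch_size, verbose=0)
    return model
```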
5 RESULTS
5.1 Dataset MNIST
The results for the non-augmented MNIST dataset are
depicted in Figs. 2 and 3. The training has two phases:
1. the subset training phase, in which only a fraction
of the training set is used; and
2. the fine-tuning phase, in which the optimized pa-
rameters from the subset training phase are fine-
tuned by a (relatively short) training over the
whole training set.
On the x-axis, fractions of the complete training set
are shown as used for the subset training. The axis is
logarithmic, so the variants are equally spaced. These
are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128, as well as the
baseline (fraction equal to unity). This baseline cor-
responds to the conventional training procedure over
the full training set.
The plotted magnitudes in Figs. 2 and 3 refer to
- the loss or accuracy reached for the given subset (Subset pre-training);
- the loss or accuracy over the whole training set in the subset training optimum (Tr.set pre-training);
- the loss or accuracy over the validation set in the subset training optimum (Valid.set pre-training);
- the loss or accuracy over the whole training set attained through fine-tuning (Tr.set fine-tuning); and
- the loss or accuracy over the validation set in the fine-tuning optimum (Valid.set fine-tuning).
All of them are average values over the individual
runs with disjoint subsets.
The dotted vertical line marks the subset fraction
with overdetermination ratio Eq. (6) equal to unity. To
the left of this line, the subsets are underdetermined;
to the right, they are overdetermined.
Both loss (Fig. 2) and accuracy (Fig. 3) suggest
similar conjectures:
- The subset training with small subsets leads to poor training set and validation set losses. This gap diminishes with the growing subset fraction.
- Fine-tuning largely closes the gap between the training and validation sets. The optimum value for the training set tends to be lower for large fractions (as they have an “advance” from the subset training), but this does not lead to a better validation set performance. The baseline loss (the rightmost point) exhibits the highest validation set loss.
The overdetermination ratio delivers, together with
the mentioned vertical lines in Fig. 2 and Fig. 3, an
additional finding: The gap between the performance
on the subset and on the whole training set after the
subset training is very large for Q < 1 (the left side of
the plot) and shrinks for Q > 1 (the right side).
The results for the augmented data are depicted in
the plots Fig. 4 and Fig. 5. As the augmented data
are more challenging to fit, their performance char-
acteristics are generally worse than those of the non-
augmented dataset. However, an important point can
be observed: the performance after the pre-training
(particularly for the validation set) does not differ to
the same extent as it did with non-augmented data.
As with the non-augmented dataset, the baseline loss
(the rightmost point) exhibits the highest validation
set loss. There, the difference between the lowest and
the highest subset losses has been tenfold, while it is
roughly the same for all subset fractions with the aug-
mented data.
Figure 2: Dataset MNIST (not augmented), loss optima for a pre-trained subset and the whole training set as a function of the subset size (as a fraction of the training set).
Figure 3: Dataset MNIST (not augmented), accuracy optima for a pre-trained subset and the whole training set as a function of the subset size (as a fraction of the training set).
Figure 4: Dataset MNIST (augmented), loss optima for a pre-trained subset and the whole training set as a function of the subset size (as a fraction of the training set).
The ten times larger size of the augmented dataset leads to overdetermination ratios Q mostly (except for the fraction 1/128) over unity. Then, even the small-fraction subsets generalize acceptably (which is the goal of sufficient overdetermination).
Figure 5: Dataset MNIST (augmented), accuracy optima for a pre-trained subset and the whole training set as a function of the subset size (as a fraction of the training set).
Figure 6: Dataset CIFAR-10 (non-augmented), loss optima for a pre-trained subset and the whole training set as a function of the subset size (as a fraction of the training set).
Figure 7: Dataset CIFAR-10 (augmented), loss optima for a pre-trained subset and the whole training set as a function of the subset size (as a fraction of the training set).
5.2 Dataset CIFAR-10
The results for non-augmented CIFAR-10 data are de-
picted in Fig. 6, those for augmented data in Fig. 7.
Due to the size of CIFAR-10 being close to MNIST,
the overdetermination ratios are also very similar.
Figure 8: Dataset CIFAR-100 (non-augmented), loss optima for a pre-trained subset and the whole training set as a function of the subset size (as a fraction of the training set).
Figure 9: Training time relative to the conventional training as a function of the subset size (in percent).
Since CIFAR-10 is substantially harder to classify,
losses and accuracies are worse.
The accuracy is a secondary characteristic (since
the categorical cross-entropy is minimized), and its
explanatory power is limited. For this reason and for space reasons, accuracies will not be presented for CIFAR-10 and CIFAR-100.
Nevertheless, the conclusions are similar to those
from MNIST. The gap between the subset’s and the
entire training set’s fine-tuning performance dimin-
ishes as the subset grows. This gap is large with non-augmented data in Fig. 6 because of low overdetermination ratios Q, but it is substantially smaller for augmented data in Fig. 7, where the overdetermination ratios are sufficient. The validation set performance after both training phases is typically better with subsets of most sizes than with the whole training set.
5.3 Dataset CIFAR-100
The results for (non-augmented) CIFAR-100 data are
depicted in Fig. 8. This classification task differen-
tiates 100 classes so that there are only 500 exam-
ples per class. Optimum losses for this benchmark
are higher than for the previous ones.
For small subset fractions, the representation of
the classes is probably insufficient. This may explain
the large gap between the subset loss and the train-
ing set loss after the subset training with small subset
fractions. These may contain, on average, as few as four examples per class. Nevertheless, the loss for
the validation set with various subset sizes is close to
the baseline loss for the conventional full-size train-
ing.
6 CONCLUSION
The experiments presented support the concept of
subset training. We demonstrated the following ele-
ments.
- The subset training leads to results comparable with conventional training over the whole training set.
- The overdetermination ratio Q (preferably above unity) should determine the subset size. Nevertheless, even underdetermined subsets may lead to a good fine-tuning result, although they put more workload on the fine-tuning (to close the large generalization gap).
- To summarize, even small subsets can be representative enough to approximate the training set loss minimum well whenever the overdetermination ratio sufficiently exceeds unity.
The most important achievement is the reduction of
computing expenses. Most optimizing iterations are
done on the subset, where the computational time per
epoch is a fraction of that for the whole training set. In
our experiments with ten times more subset training
epochs than fine-tuning epochs, the relative comput-
ing time in percent of the baseline is shown in Fig. 9.
Computational resource savings of 90 % and more are
possible.
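A back-of-the-envelope estimate reproduces this order of magnitude, under the simplifying assumptions that the time per epoch scales linearly with the number of samples and that the baseline also runs 1,000 full-set epochs:

```python
# Relative cost of subset pretraining plus fine-tuning versus conventional training.
PRE_EPOCHS, FINE_EPOCHS, BASELINE_EPOCHS = 1000, 100, 1000

for b in (2, 4, 8, 16, 32, 64, 128):
    relative = (PRE_EPOCHS / b + FINE_EPOCHS) / BASELINE_EPOCHS
    print(f"fraction 1/{b:<3d} -> about {100 * relative:.0f} % of the baseline compute")
```

For the smallest fractions, this simple count comes close to the reported savings of 90 % and more.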
This empirical evaluation using five benchmarks
from the CV domain is insufficient for making gen-
eral conclusions. Large datasets such as ImageNet
are to be tested in the future. They have been omit-
ted because many experiments are necessary to pro-
duce sufficient statistics. Furthermore, these experi-
ments can be extended to language benchmarks and
language models.
It is also important to investigate the behavior
of the second-order optimization algorithms such
as conjugate gradient (Hestenes and Stiefel, 1952;
Fletcher and Reeves, 1964). Their strength can de-
velop only with a sufficient number of iterations. This
is an obstacle if very large training sets are a part of
the task. Appropriately chosen subsets can make such
training feasible and help to reach good performance
even with models of moderate size.
REFERENCES
Fletcher, R. and Reeves, C. M. (1964). Function minimiza-
tion by conjugate gradients. The Computer Journal,
7(2):149–154.
Glorot, X. and Bengio, Y. (2010). Understanding the diffi-
culty of training deep feedforward neural networks. In
Proceedings of the 13th International Conference on
Artificial Intelligence and Statistics, volume 9, pages
249–256, Sardinia, Italy. PMLR.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. Adaptive computation and machine learn-
ing. The MIT Press, Cambridge, Massachusetts.
Hestenes, M. R. and Stiefel, E. (1952). Methods of conju-
gate gradients for solving linear systems, volume 49.
NBS Washington, DC.
Hrycej, T., Bermeitinger, B., Cetto, M., and Handschuh,
S. (2023). Mathematical Foundations of Data Sci-
ence. Texts in Computer Science. Springer Interna-
tional Publishing, Cham.
Kingma, D. P. and Ba, J. (2015). Adam: A Method for
Stochastic Optimization. 3rd International Confer-
ence on Learning Representations.
Krizhevsky, A. (2009). Learning Multiple Layers of Fea-
tures from Tiny Images. Dataset, University of
Toronto.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 86(11):2278–2324.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flan-
nery, B. P. (1992). Numerical recipes in C (2nd ed.):
the art of scientific computing. Cambridge University
Press, USA.
Robbins, H. and Monro, S. (1951). A Stochastic Approximation Method. Annals of Mathematical Statistics, 22(3):400–407.
Simonyan, K. and Zisserman, A. (2015). Very Deep Con-
volutional Networks for Large-Scale Image Recogni-
tion. ICLR.
van Wieringen, W. N. (2023). Lecture notes on ridge re-
gression. arXiv:1509.09169 [stat].