To demonstrate the usefulness of CrossSiam, we compared it with ParaSiam, a network that simply applies k-fold cross validation to representation learning. We conducted two experiments for evaluation. The first was linear evaluation, which measures classification accuracy when the embedding space is fixed and only one fully connected layer is trained with supervision. The experimental results showed that the proposed method achieved higher accuracy than the baseline for both 2-fold and 5-fold cross validation. In particular, the accuracy of 2-fold CrossSiam was much higher than that of 2-fold ParaSiam.
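For concreteness, a minimal sketch of this linear evaluation protocol is given below, assuming a PyTorch setup; encoder, train_loader, feat_dim, and num_classes are hypothetical placeholders, not our exact training script.

```python
import torch
import torch.nn as nn

# Linear evaluation sketch: the encoder is frozen and only a single
# fully connected layer is trained with the supervised labels.
# `encoder`, `train_loader`, `feat_dim`, and `num_classes` are placeholders.
def linear_evaluation(encoder, train_loader, feat_dim, num_classes, epochs=90):
    encoder.eval()                      # freeze the embedding network
    for p in encoder.parameters():
        p.requires_grad = False

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():       # embeddings stay fixed
                z = encoder(x)
            loss = loss_fn(head(z), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```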
The second involved the Fréchet distance (FD), which measures the difference between the distributions of the embeddings of the training, validation, and test sets.
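For reference, the FD between two embedding sets can be computed from Gaussians fitted to each set, using the closed form of Dowson and Landau (1982): d^2 = ||mu_1 - mu_2||^2 + tr(S_1 + S_2 - 2 (S_1 S_2)^(1/2)). The sketch below is illustrative and is not our exact evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Uses the closed form of Dowson and Landau (1982):
        d^2 = ||mu_a - mu_b||^2 + tr(S_a + S_b - 2 (S_a S_b)^{1/2})
    emb_a and emb_b are arrays of shape (num_samples, embedding_dim).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Matrix square root of the covariance product; discard the small
    # imaginary parts that arise from numerical error.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```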
The experimental results show that the distance between the embeddings of the training data and the validation data is smaller, and the distance between the embeddings of the validation data and the test data is larger, for CrossSiam than for ParaSiam. This suggests that, undesirably, CrossSiam may leak validation data to the network to a greater extent than ParaSiam.
However, it is unclear whether such leakage actually occurs. In future work, we will conduct experiments to examine whether the validation data of each fold can be used to suppress overfitting. We will also show that CrossSiam can be trained on datasets with a high percentage of out-of-distribution samples, and we will conduct experiments to demonstrate the suitability of this approach for the autonomous control of multiple drones.
REFERENCES
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020).
A simple framework for contrastive learning of visual
representations. In III, H. D. and Singh, A., editors,
Proceedings of the 37th International Conference on
Machine Learning, volume 119 of Proceedings of Ma-
chine Learning Research, pages 1597–1607. PMLR.
Chen, X., Yuan, Y., Zeng, G., and Wang, J. (2021). Semi-
supervised semantic segmentation with cross pseudo
supervision. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Deng, W. and Zheng, L. (2021). Are labels always necessary for classifier accuracy evaluation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Dowson, D. and Landau, B. (1982). The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 21271–21284. Curran Associates, Inc.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020).
Momentum contrast for unsupervised visual represen-
tation learning. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C. J. C., Bottou,
L., and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems, volume 25, pages
1097–1105. Curran Associates, Inc.
Loshchilov, I. and Hutter, F. (2017). SGDR: stochastic gra-
dient descent with warm restarts. In 5th International
Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track
Proceedings. OpenReview.net.
Shamir, O. and Zhang, T. (2013). Stochastic gradient de-
scent for non-smooth optimization: Convergence re-
sults and optimal averaging schemes. In Dasgupta,
S. and McAllester, D., editors, Proceedings of the
30th International Conference on Machine Learning,
volume 28 of Proceedings of Machine Learning Re-
search, pages 71–79, Atlanta, Georgia, USA. PMLR.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Suzuki, K., Matsuzawa, T., Takimoto, M., and Kambayashi,
Y. (2021). Vector quantization to visualize the detec-
tion process. In Rocha, A., Steels, L., and van den
Herik, J., editors, ICAART 2021 - Proceedings of the
13th International Conference on Agents and Artifi-
cial Intelligence, pages 553–561. SciTePress.
Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114, Long Beach, California, USA. PMLR.
APPENDIX
Architecture
In this study, we use f_θ as the ResNet-18 encoder, i.e., ResNet-18 without the last fully connected layer (Table 4), the projector g_θ constructed as in Table 6, and the predictor h_θ constructed as in Table 5, as in BYOL (Grill et al., 2020).
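As a concrete illustration, the three components can be assembled as in the following PyTorch sketch; the hidden and output dimensions here are assumptions in the spirit of BYOL, and the exact configurations are those given in Tables 4–6.

```python
import torch.nn as nn
import torchvision

# Sketch of the encoder / projector / predictor stack, following BYOL.
# Hidden and output sizes are illustrative assumptions; see Tables 4-6
# for the exact configurations used in this study.
def make_encoder():
    resnet = torchvision.models.resnet18()
    resnet.fc = nn.Identity()           # drop the last fully connected layer
    return resnet                       # f_theta: outputs 512-d features

def make_mlp(in_dim, hidden_dim=1024, out_dim=256):
    # The same MLP shape is used for projector g_theta and predictor h_theta.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

f_theta = make_encoder()
g_theta = make_mlp(512)       # projector on the 512-d encoder features
h_theta = make_mlp(256)       # predictor operates on projected embeddings
```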
Training Details for Representation
Learning
We used the SGD optimizer (Shamir and Zhang, 2013) to train the model parameters θ in representation learning.
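A minimal sketch of this optimizer setup is shown below; the model, learning rate, momentum, weight decay, and the cosine schedule (Loshchilov and Hutter, 2017) are illustrative assumptions rather than the exact hyperparameters of this study.

```python
import torch
import torch.nn as nn

# Sketch of the SGD optimizer setup for representation learning.
# The stand-in model and all hyperparameter values below are
# illustrative assumptions, not those used in this study.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 256))
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=5e-4)
# Cosine learning-rate schedule in the spirit of SGDR
# (Loshchilov and Hutter, 2017), which this paper cites.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)

for epoch in range(800):
    # ... one epoch of representation learning over theta ...
    scheduler.step()
```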