Authors:
Qin Wang ¹; Kai Krajsek ² and Hanno Scharr ¹
Affiliations:
¹ IAS-8: Data Analytics and Machine Learning, Forschungszentrum Jülich, Germany
² Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Germany
Keyword(s):
Self-Supervised Learning, Augmentation, Vision Transformer.
Abstract:
Many recent self-supervised pretraining methods use augmented versions of the same image as samples for their learning schemes. We observe that 'easy' samples, i.e. samples that remain too similar to each other after augmentation, have only limited value as a learning signal. We therefore propose to rescue easy samples and make them harder. To do so, we select the top-k easiest samples using cosine similarity, strongly augment them, forward-pass them through the model, calculate the cosine similarity of the output as a loss, and add it to the original loss in a weighted fashion. This method can be adapted to any contrastive or other augmented-pair-based learning method, whether it involves negative pairs or not, as it only changes the handling of easy positives. This simple but effective approach introduces greater variability into such self-supervised pretraining processes, significantly increasing performance on various downstream tasks, as observed in our experiments. We pretrain models of different sizes, i.e. ResNet-50, ViT-S, ViT-B, or ViT-L, on ImageNet with the SimCLR, MoCo v3, or DINOv2 training schemes. Here, e.g., we consistently find improved ImageNet top-1 accuracy with a linear classifier, establishing a new SOTA for this task.
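The following is a minimal sketch of the "rescue easy samples" idea as described in the abstract, assuming a PyTorch setup. The names `encoder`, `strong_augment`, `x_raw`, and the parameters `k` and `lambda_hard` are illustrative assumptions, not the authors' exact implementation; the base loss (SimCLR, MoCo v3, or DINOv2) is left to the caller.

```python
import torch
import torch.nn.functional as F

def rescue_easy_samples_loss(encoder, strong_augment, x_raw, z1, z2,
                             k=8, lambda_hard=0.5):
    """Extra loss term for the k 'easiest' positive pairs in a batch (sketch).

    encoder:        the model being pretrained
    strong_augment: an augmentation pipeline stronger than the default one
    x_raw:          unaugmented images of the current batch, shape (B, C, H, W)
    z1, z2:         embeddings of the two standard augmented views, shape (B, D)
    """
    # 1. Measure how 'easy' each positive pair is: high cosine similarity
    #    between the two views means the pair carries little learning signal.
    with torch.no_grad():
        pair_sim = F.cosine_similarity(z1, z2, dim=-1)   # (B,)
        easy_idx = pair_sim.topk(k).indices               # top-k easiest pairs

    # 2. Re-augment the selected images with the stronger augmentation and
    #    forward-pass the new, harder views through the model.
    x_easy = x_raw[easy_idx]
    h1 = encoder(strong_augment(x_easy))
    h2 = encoder(strong_augment(x_easy))

    # 3. Use the cosine similarity of the new views as an additional loss
    #    (negative similarity, so the model is pushed to align them again).
    hard_loss = -F.cosine_similarity(h1, h2, dim=-1).mean()

    # 4. The caller adds this term to the original pretraining loss:
    #    total_loss = base_loss + lambda_hard * hard_loss
    return lambda_hard * hard_loss
```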