
5.1 RP2K Results
The RP2K dataset paper (Peng et al., 2020) proposes fully finetuning ResNet and other backbones on the dataset one category at a time, reaching up to 95% accuracy on some categories; on other, harder categories, the accuracy can be lower than 90%. Owing to a language barrier, our team was unable to separate the categories, so we evaluate on all categories combined, which makes the problem harder for RetailKLIP. In addition, RetailKLIP uses just one image per class from the train set to classify test images, unlike full finetuning, which uses all the available images. Zero-shot classification on RP2K reaches 87.7% accuracy, which is close to what the full-finetuning approach of (Peng et al., 2020) achieves on its hardest categories.
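To make the evaluation protocol concrete, the sketch below shows one plausible implementation of this one-reference-image setup with the open_clip library: embed a single gallery image per product, then classify a query by cosine similarity. The checkpoint name, file paths, and helper names are illustrative placeholders, not the exact RetailKLIP configuration.

import torch
import open_clip
from PIL import Image

# Placeholder weights: in our method this would be the OpenCLIP encoder
# after finetuning; the model name and tag here are illustrative only.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

@torch.no_grad()
def embed(path):
    """Return the L2-normalised embedding of one image file."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)

# Gallery: exactly one reference image per product (hypothetical paths).
reference_images = {
    "sku_0001": "train/sku_0001/ref.jpg",
    "sku_0002": "train/sku_0002/ref.jpg",
}
skus = list(reference_images)
gallery = torch.cat([embed(p) for p in reference_images.values()])

def classify(query_path):
    """Nearest neighbour by dot product (cosine, since embeddings are unit-norm)."""
    sims = embed(query_path) @ gallery.T   # shape: (1, num_skus)
    return skus[sims.argmax().item()]

Because both query and gallery embeddings are unit-normalised, the dot product equals cosine similarity, and the predicted label is simply the nearest gallery embedding.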
6 DISCUSSION
Our work proposes a method for building a zero-shot classifier for retail product images by finetuning OpenCLIP on a single GPU. Its accuracy is competitive with, and sometimes better than, that of fully finetuning large ConvNet backbones on the same GPU. This enables real-world retail computer vision systems to integrate new products quickly and to avoid repeated resource-intensive training runs, as illustrated by the sketch below.
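Concretely, under this setup, enrolling a new product is a single forward pass rather than a training run; a minimal sketch, reusing the hypothetical embed(), skus, and gallery from the earlier listing:

# Enrol a new SKU without any retraining: one forward pass to embed
# its reference image, then extend the nearest-neighbour index.
new_sku, new_path = "sku_9999", "train/sku_9999/ref.jpg"  # hypothetical
skus.append(new_sku)
gallery = torch.cat([gallery, embed(new_path)])  # (num_skus + 1, dim)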
REFERENCES
Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699.

Dong, X., Bao, J., Zhang, T., Chen, D., Gu, S., Zhang, W., Yuan, L., Chen, D., Wen, F., and Yu, N. (2022). CLIP itself is a strong fine-tuner: Achieving 85.7% and 88.0% top-1 accuracy with ViT-B and ViT-L on ImageNet. arXiv preprint arXiv:2212.06138.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.

Geng, W., Han, F., Lin, J., Zhu, L., Bai, J., Wang, S., He, L., Xiao, Q., and Lai, Z. (2018). Fine-grained grocery product recognition by one-shot learning. In Proceedings of the 26th ACM International Conference on Multimedia, pages 1706–1714.

Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., and Schmidt, L. (2021). OpenCLIP. Zenodo.

Kumar, A., Shen, R., Bubeck, S., and Gunasekar, S. (2022). How to fine-tune vision models with SGD. arXiv preprint arXiv:2211.09359.

Leutenegger, S., Chli, M., and Siegwart, R. Y. (2011). BRISK: Binary robust invariant scalable keypoints. In 2011 International Conference on Computer Vision, pages 2548–2555.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.

Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. (2018). Exploring the limits of weakly supervised pretraining. In Ferrari, V., Hebert, M., Sminchisescu, C., and Weiss, Y., editors, Computer Vision – ECCV 2018, pages 185–201, Cham. Springer International Publishing.

Merler, M., Galleguillos, C., and Belongie, S. (2007). Recognizing groceries in situ using in vitro training data. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8.

Musgrave, K., Belongie, S. J., and Lim, S.-N. (2020). PyTorch Metric Learning. arXiv preprint arXiv:2008.09164.

Peng, J., Xiao, C., and Li, Y. (2020). RP2K: A large-scale retail product dataset for fine-grained image classification. arXiv preprint arXiv:2006.12634.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Srivastava, M. M. (2020). Bag of tricks for retail product image classification. In Campilho, A., Karray, F., and Wang, Z., editors, Image Analysis and Recognition, pages 71–82, Cham. Springer International Publishing.

Srivastava, M. M. (2022). Using contrastive learning and pseudolabels to learn representations for retail product image classification.

Tonioni, A. and Stefano, L. D. (2019). Domain invariant hierarchical embedding for grocery products recognition. Computer Vision and Image Understanding, 182:81–92.

Zhang, Y. and Deng, W. (2020). Class-balanced training for deep face recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3594–3603.