To improve our model and reach state-of-the-art results, we identify three main directions: the online estimation of positive and negative examples at the batch level, the pre-training of the patch embedder, and an improved model for clustering the patch embedding space with respect to the labels.