
Aghajanyan, A., Zettlemoyer, L., and Gupta, S. (2020).
Intrinsic dimensionality explains the effectiveness
of language model fine-tuning. arXiv preprint
arXiv:2012.13255.
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr,
I., Hasson, Y., Lenc, K., Mensch, A., Millican, K.,
Reynolds, M., et al. (2022). Flamingo: a visual lan-
guage model for few-shot learning. NIPS, 35:23716–
23736.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-
shot learners. NIPS, 33:1877–1901.
Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J.,
and Luo, P. (2022). Adaptformer: Adapting vision
transformers for scalable visual recognition. NIPS,
35:16664–16678.
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S.,
Dollár, P., and Zitnick, C. L. (2015). Microsoft coco
captions: Data collection and evaluation server. arXiv
preprint arXiv:1504.00325.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Floridi, L. and Chiriatti, M. (2020). Gpt-3: Its nature,
scope, limits, and consequences. Minds and Ma-
chines, 30:681–694.
Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang,
Y., Li, H., and Qiao, Y. (2023). Clip-adapter: Better
vision-language models with feature adapters. IJCV,
pages 1–15.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B.,
De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and
Gelly, S. (2019). Parameter-efficient transfer learning
for nlp. In ICML, pages 2790–2799. PMLR.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., and Chen, W. (2021). Lora: Low-rank
adaptation of large language models. arXiv preprint
arXiv:2106.09685.
Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S.,
Hariharan, B., and Lim, S.-N. (2022). Visual prompt
tuning. In ECCV, pages 709–727. Springer.
Karimi Mahabadi, R., Henderson, J., and Ruder, S. (2021).
Compacter: Efficient low-rank hypercomplex adapter
layers. NIPS, 34:1022–1035.
Khattak, M. U., Rasheed, H., Maaz, M., Khan, S., and
Khan, F. S. (2023). Maple: Multi-modal prompt learn-
ing. In CVPR, pages 19113–19122.
Lester, B., Al-Rfou, R., and Constant, N. (2021). The power
of scale for parameter-efficient prompt tuning. arXiv
preprint arXiv:2104.08691.
Li, J., Li, D., Savarese, S., and Hoi, S. (2023). Blip-
2: Bootstrapping language-image pre-training with
frozen image encoders and large language models.
arXiv preprint arXiv:2301.12597.
Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang,
L., Hu, H., Dong, L., Wei, F., et al. (2020). Os-
car: Object-semantics aligned pre-training for vision-
language tasks. In ECCV, pages 121–137. Springer.
Li, X. L. and Liang, P. (2021). Prefix-tuning: Optimiz-
ing continuous prompts for generation. arXiv preprint
arXiv:2101.00190.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig,
G. (2023a). Pre-train, prompt, and predict: A system-
atic survey of prompting methods in natural language
processing. ACM Computing Surveys, 55(9):1–35.
Liu, X., Delany, S. J., and McKeever, S. (2023b). Applying
positional encoding to enhance vision-language trans-
formers.
Liu, X., Ji, K., Fu, Y., Tam, W. L., Du, Z., Yang, Z., and
Tang, J. (2021). P-tuning v2: Prompt tuning can
be comparable to fine-tuning universally across scales
and tasks. arXiv preprint arXiv:2110.07602.
OpenAI (2023). Gpt-4 technical report. arXiv preprint
arXiv:2303.08774.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In ACL, pages 311–318.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., et al. (2021). Learning transferable visual models
from natural language supervision. In ICML, pages
8748–8763. PMLR.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. NIPS, 28.
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S.,
Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F.,
Gallé, M., et al. (2022). Bloom: A 176b-parameter
open-access multilingual language model. arXiv
preprint arXiv:2211.05100.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro,
E., Azhar, F., et al. (2023). Llama: Open and
efficient foundation language models. arXiv preprint
arXiv:2302.13971.
Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015).
Cider: Consensus-based image description evalua-
tion. In CVPR, pages 4566–4575.
Yang, X., Cheng, W., Zhao, X., Petzold, L., and Chen, H.
(2023). Dynamic prompting: A unified framework for
prompt tuning. arXiv preprint arXiv:2303.02909.
Zaken, E. B., Ravfogel, S., and Goldberg, Y. (2021).
Bitfit: Simple parameter-efficient fine-tuning for
transformer-based masked language-models. arXiv
preprint arXiv:2106.10199.
Zang, Y., Li, W., Zhou, K., Huang, C., and Loy, C. C.
(2022). Unified vision and language prompt learning.
arXiv preprint arXiv:2210.07225.
Zhang, Y., Zhou, K., and Liu, Z. (2022). Neural prompt
search. arXiv preprint arXiv:2206.04673.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022a). Condi-
tional prompt learning for vision-language models. In
CVPR, pages 16816–16825.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022b). Learn-
ing to prompt for vision-language models. IJCV,
130(9):2337–2348.