AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning
Abduljalil Radman, Jorma Laaksonen
2024
Abstract
Fine-tuning has become a widely used transfer learning technique for adapting pre-trained vision transformers to various downstream tasks. However, its success relies on updating a large number of trainable parameters, which can incur substantial costs in both model training and storage. In audio-visual multimodal learning, a further challenge is to effectively incorporate both audio and visual cues into the transfer learning process, especially when the original model has been trained on unimodal samples only. This paper introduces a novel audio-visual parameter-efficient adapter (AV-PEA) designed to improve multimodal transfer learning for audio-visual tasks. By integrating AV-PEA into a frozen vision transformer, such as the Vision Transformer (ViT), the transformer becomes able to process audio inputs without any audio pre-training. AV-PEA also enables the exchange of essential cues between the audio and visual modalities, while introducing only a limited set of trainable parameters into each block of the frozen transformer. Experimental results demonstrate that AV-PEA consistently achieves performance superior or comparable to state-of-the-art methods across a range of audio-visual tasks, including audio-visual event localization (AVEL), audio-visual question answering (AVQA), audio-visual retrieval (AVR), and audio-visual captioning (AVC). Furthermore, it distinguishes itself from competing methods by integrating seamlessly into these tasks while keeping the number of trainable parameters consistent, typically accounting for less than 3.7% of the total parameters per task.
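The abstract outlines the general recipe: a small trainable adapter is attached to each block of a frozen ViT, letting audio and visual tokens exchange cues while the backbone stays fixed. The sketch below illustrates that recipe in PyTorch under assumed design choices (a bottleneck projection with cross-modal attention); the class names, dimensions, and exact fusion mechanism are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): a bottleneck audio-visual adapter
# attached to a frozen transformer block. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class AudioVisualAdapter(nn.Module):
    """Small trainable module added to each frozen transformer block.

    Down-projects tokens to a narrow bottleneck, exchanges cues between the
    audio and visual streams via cross-attention, and up-projects back, so
    only a few parameters per block are trained.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.down_v = nn.Linear(dim, bottleneck)
        self.down_a = nn.Linear(dim, bottleneck)
        # cross-modal attention in the bottleneck space (assumed design choice)
        self.v_from_a = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up_v = nn.Linear(bottleneck, dim)
        self.up_a = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, vis_tokens, aud_tokens):
        v = self.act(self.down_v(vis_tokens))   # (B, Nv, bottleneck)
        a = self.act(self.down_a(aud_tokens))   # (B, Na, bottleneck)
        v2, _ = self.v_from_a(v, a, a)          # visual queries attend to audio
        a2, _ = self.a_from_v(a, v, v)          # audio queries attend to visual
        # residual outputs added to the frozen block's features
        return vis_tokens + self.up_v(v2), aud_tokens + self.up_a(a2)


class AdaptedBlock(nn.Module):
    """Wraps one frozen ViT block and adds the trainable adapter after it."""

    def __init__(self, frozen_block: nn.Module, dim: int = 768):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False             # keep the backbone frozen
        self.adapter = AudioVisualAdapter(dim)

    def forward(self, vis_tokens, aud_tokens):
        vis_tokens = self.block(vis_tokens)     # frozen unimodal processing
        aud_tokens = self.block(aud_tokens)     # same frozen weights reused for audio
        return self.adapter(vis_tokens, aud_tokens)
```

In such a setup only the adapter parameters are passed to the optimizer, which is what keeps the trainable share small relative to the full backbone.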
Paper Citation
in Harvard Style
Radman A. and Laaksonen J. (2024). AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning. In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-679-8, SciTePress, pages 730-737. DOI: 10.5220/0012431500003660
in Bibtex Style
@conference{visapp24,
author={Abduljalil Radman and Jorma Laaksonen},
title={AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning},
booktitle={Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2024},
pages={730-737},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012431500003660},
isbn={978-989-758-679-8},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning
SN - 978-989-758-679-8
AU - Radman A.
AU - Laaksonen J.
PY - 2024
SP - 730
EP - 737
DO - 10.5220/0012431500003660
PB - SciTePress