Interactive Video Saliency Prediction: The Stacked-convLSTM Approach
N. Wondimu, U. Visser, C. Buche
2023
Abstract
Research in the cognitive science and neuroscience of attention suggests the use of spatio-temporal features for efficient video saliency prediction, owing to how well such features represent data collected across space and time, such as videos. Video saliency prediction aims to find visually salient regions in a stream of images. Many video saliency prediction models have been proposed over the past couple of years. Because videos differ fundamentally from static images, the earliest efforts to apply static image saliency prediction models to the video saliency prediction task yielded reduced performance. Consequently, dynamic video saliency prediction models that use spatio-temporal features were introduced. These models, especially deep learning based ones, substantially advanced the state of the art in video saliency prediction. However, video saliency prediction remains a considerable challenge, mainly due to its inherent complexity and the scarcity of representative saliency benchmarks. Given the importance of saliency identification for various computer vision tasks, revising and enhancing the performance of video saliency prediction models is crucial. To this end, we propose a novel interactive video saliency prediction model that employs a stacked-ConvLSTM based architecture along with a novel XY-shift frame differencing custom layer. Specifically, we introduce an encoder-decoder based architecture with a prior layer undertaking XY-shift frame differencing, a residual layer fusing spatially processed (VGG-16 based) features with the XY-shift differenced frames, and a stacked-ConvLSTM component. Extensive experiments on the largest video saliency dataset, DHF1K, show the competitive performance of our model against state-of-the-art models.
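To make the described pipeline concrete, the sketch below wires up the three components named in the abstract in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the one-pixel shift-and-subtract reading of "XY-shift frame differencing", the conv4_3 cut of VGG-16, the channel widths, the 1x1 fusion projection, and all class names are guesses for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class XYShiftDiff(nn.Module):
    # Hypothetical reading of "XY-shift frame differencing": the previous
    # frame is shifted one pixel along both spatial axes and subtracted
    # from the current frame, crudely highlighting moving edges. The shift
    # size and absolute difference are assumptions, not the paper's spec.
    def forward(self, frames):                        # (B, T, C, H, W)
        prev = torch.roll(frames[:, :-1], shifts=(1, 1), dims=(-2, -1))
        diff = (frames[:, 1:] - prev).abs()
        return torch.cat([diff[:, :1], diff], dim=1)  # keep T steps

class ConvLSTMCell(nn.Module):
    # Standard convolutional LSTM cell (Shi et al., 2015): one convolution
    # produces all four gates over the concatenated input and hidden state.
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class StackedConvLSTMSaliency(nn.Module):
    # Illustrative wiring: VGG-16 spatial features are residually fused
    # with projected XY-shift differences, then passed through a two-layer
    # ConvLSTM stack and a small decoder head. Depth and widths are guesses.
    def __init__(self, hid=(256, 128)):
        super().__init__()
        self.diff = XYShiftDiff()
        self.backbone = vgg16(weights=None).features[:23]  # conv4_3, stride 8
        self.proj = nn.Conv2d(3, 512, 1)   # map diff frames to feature space
        self.cells = nn.ModuleList([ConvLSTMCell(512, hid[0]),
                                    ConvLSTMCell(hid[0], hid[1])])
        self.head = nn.Conv2d(hid[1], 1, 1)

    def forward(self, frames):                        # (B, T, 3, H, W)
        B, T, _, H, W = frames.shape
        diffs = self.diff(frames)
        states, maps = None, []
        for t in range(T):
            feat = self.backbone(frames[:, t])        # (B, 512, H/8, W/8)
            d = F.interpolate(diffs[:, t], size=feat.shape[-2:])
            x = feat + self.proj(d)                   # residual fusion
            if states is None:                        # zero-init hidden states
                states = [(torch.zeros(B, cell.hid_ch, *feat.shape[-2:],
                                       device=frames.device),) * 2
                          for cell in self.cells]
            for i, cell in enumerate(self.cells):     # stacked ConvLSTM
                h, c = cell(x, *states[i])
                states[i], x = (h, c), h
            sal = torch.sigmoid(self.head(x))         # per-frame saliency map
            maps.append(F.interpolate(sal, size=(H, W), mode="bilinear",
                                      align_corners=False))
        return torch.stack(maps, dim=1)               # (B, T, 1, H, W)
```

As a usage note, `StackedConvLSTMSaliency()(torch.rand(1, 8, 3, 224, 224))` would return an `(1, 8, 1, 224, 224)` tensor of per-frame saliency maps. Feeding the frame differences alongside the VGG-16 features, rather than replacing them, keeps the static spatial cues intact while the ConvLSTM stack models temporal dynamics.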
Paper Citation
in Harvard Style
Wondimu N., Visser U. and Buche C. (2023). Interactive Video Saliency Prediction: The Stacked-convLSTM Approach. In Proceedings of the 15th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-623-1, pages 157-168. DOI: 10.5220/0011664600003393
in Bibtex Style
@conference{icaart23,
author={N. Wondimu and U. Visser and C. Buche},
title={Interactive Video Saliency Prediction: The Stacked-convLSTM Approach},
booktitle={Proceedings of the 15th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2023},
pages={157-168},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011664600003393},
isbn={978-989-758-623-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 15th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Interactive Video Saliency Prediction: The Stacked-convLSTM Approach
SN - 978-989-758-623-1
AU - Wondimu N.
AU - Visser U.
AU - Buche C.
PY - 2023
SP - 157
EP - 168
DO - 10.5220/0011664600003393