
Further investigations through a visual dialogue show that BakLLaVA can provide even more detailed information about object attributes and locations. On the other hand, all of the evaluated models struggle with information encoded in temporal sequences of images, such as detailed maneuvers or object movement directions embedded in layer 4. This is likely related to the static nature of the inputs, since the models were fed single images rather than videos.
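To make the dialogue setup concrete, the following is a minimal sketch of such a two-turn visual dialogue, assuming the publicly released BakLLaVA checkpoint on the Hugging Face Hub and a standard LLaVA-style prompt template; the checkpoint name, image path, and questions are illustrative assumptions, not the exact setup used in the evaluation.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/bakLlava-v1-hf"  # assumed community checkpoint of BakLLaVA
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)  # add dtype/device_map as needed

image = Image.open("scene_frame.png")  # placeholder path: one camera frame from a recorded drive

def ask(prompt: str) -> str:
    """Run one generation turn and return only the assistant's reply."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    text = processor.decode(output[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

# Turn 1: coarse description of the scene.
turn_1 = "USER: <image>\nDescribe the objects in this driving scene. ASSISTANT:"
answer_1 = ask(turn_1)

# Turn 2: follow-up on object attributes and locations, carrying the dialogue history.
turn_2 = (turn_1 + " " + answer_1
          + " USER: What color is the closest vehicle and where is it located? ASSISTANT:")
answer_2 = ask(turn_2)
print(answer_2)
```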
7 CONCLUSION
The presented method demonstrates the potential of using pre-trained LVLMs for semantic enrichment and retrieval of real driving data with natural language queries in the form of functional scenario descriptions. Specifically, BakLLaVA, which combines an image encoder with Mistral 7B as the LLM backbone, achieves accurate query results even for detailed specifications such as the location and color of objects encoded in the images.
Future work should focus on several key areas. One priority is to create a dataset tailored to SR with LVLMs that includes multi-modal driving data such as time series or point clouds in addition to images. Incorporating external data sources such as map and weather data can provide additional semantic structure for producing meaningful joint embeddings. The ability of LVLMs to support further SR tasks, such as querying abstract scenario descriptions from concrete, logical, and functional scenarios, offers potential for more efficient and effective SBT. Metrics like recall at k (recall@k) should be evaluated in addition to prec@k to ensure the relevance of the retrieved scenarios (see the sketch below). Furthermore, future research should investigate prompt engineering techniques, incorporate taxonomies for different use cases, and explore the temporal domain using video language models. The impact of fine-tuning compared to in-context learning, and the associated trade-off in computational cost for the SR task, may have important implications for future research directions. User studies with domain experts querying scenarios can be conducted to explore the feasibility of the concept and the ability of the models to cope with domain-specific language. Analysing combined queries that jointly integrate different scenario layers can provide a more comprehensive understanding of the SR capability. Besides retrieval performance, additional metrics such as computational efficiency, storage requirements, and retrieval time should be considered. These efforts will advance SR methods in the automotive domain for V&V tasks.
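As a minimal sketch of the retrieval metrics mentioned above, the following Python snippet computes prec@k and recall@k over a ranked list of scenario IDs; the IDs and relevance labels are illustrative placeholders, not results from the evaluation in this paper.

```python
from typing import Sequence, Set

def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved scenarios that are relevant."""
    return sum(1 for s in retrieved[:k] if s in relevant) / k

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant scenarios that appear within the top-k results."""
    return sum(1 for s in retrieved[:k] if s in relevant) / len(relevant)

# Toy example: scenario IDs ranked by similarity to one functional query.
retrieved = ["s3", "s7", "s1", "s9", "s4"]   # ranked retrieval result (placeholder IDs)
relevant = {"s3", "s4", "s8"}                # ground-truth matches for the query (placeholder)

print(precision_at_k(retrieved, relevant, k=5))  # 2 of 5 retrieved are relevant -> 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2 of 3 relevant are retrieved -> ~0.67
```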