MEFA: Multimodal Image Early Fusion with Attention Module for
Pedestrian and Vehicle Detection

Yoann Dupas 1,2, Olivier Hotel 2, Grégoire Lefebvre 2 and Christophe Cérin 1,3
1 Datamove, Inria, LIG, University Grenoble-Alpes, France
2 Orange Innovation, Meylan, France
3 LIPN, University Sorbonne Paris Nord, France

Keywords: Image Fusion, Multimodal Fusion, Early-Fusion, Attention, Pedestrian and Vehicle Detection, Adverse Weather.
Abstract:
Pedestrian and vehicle detection represents a significant challenge in autonomous driving, particularly in ad-
verse weather conditions. Multimodal image fusion addresses this challenge. This paper proposes a new
early-fusion attention-based approach from visible, infrared, and LiDAR images, designated as MEFA (Multi-
modal image Early Fusion with Attention). In this study, we compare our MEFA proposal with a channel-wise
concatenation early-fusion approach. When coupled with YOLOv8 or RT-DETRv1 for pedestrian and vehicle
detection, our contribution is promising in adverse weather conditions (i.e. rainy days or foggy nights). Fur-
thermore, our MEFA proposal demonstrated superior mAP scores on the DENSE dataset.
1 INTRODUCTION
Deep learning fusion techniques are significantly im-
pacting a number of fields, including autonomous
driving (AD) and autonomous driver assistance sys-
tems (ADAS). In particular, they address the chal-
lenge of perceiving the world and the challenge of
decision-making.
World perception systems extract essential infor-
mation from raw image data for decision making.
They include three tasks: localization, detection, and
tracking (Martínez-Díaz and Soriguera, 2018). Effec-
tive performance requires accuracy, weather robust-
ness, efficiency with imprecise sensors, real-time pro-
cessing (Xiang et al., 2023), and reduced energy con-
sumption (Malawade et al., 2022).
However, such systems face many challenges, in-
cluding the variability of object shapes, potential oc-
clusions, variations in lighting, and the prevalence
of adverse weather conditions (Martínez-Díaz and
Soriguera, 2018). The process of image fusion serves
to address the limitations of perception and decision-
making systems. The fusion of multiple cameras al-
lows the acquisition of data that is both rich and high-
dimensional, and which is also complementary by nature (Xiang et al., 2023). In (Huang et al., 2022),
the authors present a taxonomy of three fusion strate-
gies. The first of these is early fusion, or data-data
fusion, which involves the merging of data that has
been prepared to be spatially homogeneous. This
strategy identifies correlations between image chan-
nels and generates a global fused image that is com-
patible with some existing object recognition mod-
els (Stahlschmidt et al., 2022), provided that the first
layer of the model can be adapted to this kind of fused
image. The second fusion strategy, known as deep fu-
sion or feature-feature fusion, involves the merging
of latent space features generated by a backbone mo-
dule into a common latent space. Finally, late fusion,
or results-results fusion, is a further approach whereby final object detections are combined using different strategies, including voting methods, stacking meth-
ods, and those based on Dempster-Shafer or Possibil-
ity Theory (Chen et al., 2022).
This article concentrates on the early fusion strat-
egy, which we consider to be a valuable approach for
generating fused images that include all the informa-
tion required by a single-modality object detection
model to perform object detection accurately. This
strategy makes it possible to utilize the most recent
single-modality models from the literature, such as
YOLO (You Only Look Once (Terven et al., 2023)) or
RT-DETR (Real-Time Detection Transformer (Zhao
et al., 2024)), and to benefit from their enhanced per-
formance. Furthermore, in our view, the use of attention mechanisms can help to improve accuracy and
robustness in the context of challenging weather con-
ditions. By assigning weights to information, they enable the neural network to identify and select relevant information from each modality and to correlate it with information from the other modalities.
The paper is structured as follows. Section 2
proposes the main related and recent work on im-
age attention-based fusion for pedestrian and vehicle
detection. Section 3 explains in detail the proposed
MEFA module. Section 4 presents the experimental
protocol. Section 5 presents the analysis of the per-
formances in global conditions and different weather
conditions. Section 6 discusses the contributions and
the results. Finally, Section 7 concludes this article
and gives some perspectives.
2 RELATED WORKS
Recent studies address the topics of multimodal fu-
sion with deep learning techniques based on attention
modules (Chaturvedi et al., 2022), (Tabassum and El-
Sharkawy, 2024).
The article (Chaturvedi et al., 2022) presents a deep fusion approach, the Global-Local Attention (GLA) framework, aimed at improving object detection in adverse weather conditions, such as light
fog, dense fog, and snow. The GLA framework uti-
lizes multimodal sensor fusion, integrating data from
cameras, gated cameras, and LiDAR at two stages:
early-stage fusion through a Local Attention Network
and late-stage fusion via a Global Attention Network.
This dual approach allows the system to adaptively
focus on the most effective sensor data based on the
specific weather conditions. The GLA framework’s
architecture enables it to extract local and global fea-
tures, addressing the shortcomings of existing meth-
ods that typically rely on simple concatenation or
element-wise addition for sensor fusion. By employ-
ing attention mechanisms, the GLA framework can
dynamically allocate higher weights to the modal-
ity that exhibits better detection capabilities, thus en-
hancing the robustness of object detection.
In (Tabassum and El-Sharkawy, 2024), the au-
thors introduce a multi-head attention approach to
enhance vehicle detection in adverse weather condi-
tions, specifically focusing on the MVDNet (Multi-
modal Vehicle Detection Network). This model in-
tegrates a multi-head attention layer to improve the
processing and fusion of multimodal sensor data, such
as LiDAR and radar. By employing a multi-head at-
tention mechanism, the MVDNet can dynamically fo-
cus on various aspects of the input data, allowing for
a more comprehensive analysis and improved detec-
tion accuracy. The methodology involves two main
stages: the Region Proposal Network (RPN) for gen-
erating initial proposals from sensor data, and the
Region Fusion Network (RFN) for integrating these
proposals. The multi-head attention layer is strategi-
cally placed within the RFN to enhance feature ex-
traction from LiDAR and radar inputs. The paper
demonstrates that the multi-head MVDNet signifi-
cantly outperforms baseline models and other sensor
fusion techniques.
These two approaches offer clear advantages. Notably, both emphasize the importance of multimodal
sensor fusion to improve object detection perfor-
mance in challenging weather conditions. The pro-
posed frameworks leverage advanced attention mech-
anisms to dynamically adjust the focus on different
sensor modalities, leading to significantly higher de-
tection accuracy compared to traditional methods that
rely on single sensors.
Nevertheless, some limitations persist. Both
frameworks are intimately related to the object de-
tection framework they use. For example, (Tabassum and El-Sharkawy, 2024) relies on a two-stage object detection model, which may not be adaptable to one-stage detection models that are better suited to real-time object detection. In (Chaturvedi et al., 2022), the GLA framework may be difficult to adapt to newer state-of-the-art models that use a transformer-based approach, such as the ViT (Dosovitskiy et al., 2020) model.
To limit these drawbacks, we propose the MEFA
module, described in the next section.
3 MEFA: MULTIMODAL EARLY FUSION WITH ATTENTION
Figure 1: Overview of the MEFA framework. The data obtained from the various sensors are transformed and subsequently transmitted to the MEFA module. This module generates an intermediate fused image, which represents the fusion of all input sensor data. This intermediate fused image is then provided to a single-modality object detection model, which can identify objects.
The pedestrian and vehicle detection framework, as illustrated in Figure 1, consists of multiple modules. Initially, the input images pass through image processing techniques to prepare them for fusion. Subsequently, an intermediate fused image is generated by the MEFA module, which is compatible with state-of-the-art image object detectors such as YOLOv8 or RT-DETRv1 to predict pedestrian and vehicle objects.

Figure 2: The MEFA module consists of three streams, where each input is initially convolved by an inception block and subsequently provided to a local attention block, whose output features are then multiplied by the output features of the inception block. The outputs of the streams are combined through an addition operation. This fused output is passed to an additional inception block, after which it is sent to a global attention block that takes input from the first inception block of its corresponding modality. The output is then multiplied by the fused output and added, forming the final fused output. In the final step, the output is sent to a convolution layer, where it is transformed into a 3-channel output. The symbol ⊗ corresponds to the multiplication operation and ⊕ corresponds to the addition operation.
3.1 Input Image Processing
The data transmitted by the vehicle sensors must
be transformed into a uniform image format. The
DENSE dataset (Bijelic et al., 2020) used in the ex-
periment provides data from three types of sensors:
an optical camera, an infrared camera, and a LiDAR
sensor. The following subsections present the transformation operations from raw to usable data.
3.1.1 Visible and Infrared Data Processing
The optical camera provides three-channel images corresponding to the red, green, and blue colors. In
the context of working with the YOLOv8 and RT-
DETRv1 models, which are dedicated to images, the
transformation operations consist of scaling to a res-
olution of 640 × 640 pixels (i.e., the size of the input
tensors), centering, and padding to fill in missing pix-
els.
The NIR (Near-InfraRed) gated camera captures
photons from a specified distance by opening and
closing the camera with a specified delay after emit-
ting a pulse of near-infrared light (Grauer, 2014). It
provides a batch of three grayscale images (i.e., one
channel) with a time difference of 100ms between
each image. The three images are merged into a single composite image in order to align with the
input specifications of the YOLOv8 and RT-DETRv1
models.
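A minimal sketch of these two preprocessing steps is given below, assuming NumPy and OpenCV; the helper names and the choice of stacking the three gated NIR frames as the channels of one composite image are illustrative assumptions rather than the paper's exact implementation.

```python
import cv2          # OpenCV, assumed here for resizing
import numpy as np


def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114) -> np.ndarray:
    """Scale an image to fit size x size while keeping its aspect ratio,
    then center it and pad the missing pixels with a constant value."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    canvas = np.full((size, size) + img.shape[2:], pad_value, dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas


def merge_nir_frames(frames: list) -> np.ndarray:
    """Stack three single-channel gated NIR frames into one 3-channel composite
    (one plausible reading of the merging step; the exact operation is not given)."""
    return np.stack([letterbox(f) for f in frames], axis=-1)
```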
3.1.2 LiDAR Data Processing
The LiDAR sensor provides raw data in the form of a
point cloud. A point cloud can be represented in three
ways: point-based, voxel-based, or 2D mapping-
based (Huang et al., 2022). The 2D mapping or
view-based representation is constructed by project-
ing points onto the camera coordinate system. This
approach allows alignment with other sensors and en-
ables the direct utilization of 2D convolutional neural
network architectures. To ensure data homogeneity, a
camera-plane map representation in the optical camera coordinate system was selected. This encoding provides
an image where the points are aligned with the optical
camera image.
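The sketch below illustrates this projection step, assuming a known 4x4 LiDAR-to-camera extrinsic matrix and a 3x3 camera intrinsic matrix (DENSE provides calibration data); the choice of encoding depths into a single-channel map and the output size are assumptions made for illustration.

```python
import numpy as np


def lidar_to_camera_map(points: np.ndarray, T_cam_lidar: np.ndarray,
                        K: np.ndarray, h: int = 640, w: int = 640) -> np.ndarray:
    """Project a LiDAR point cloud of shape (N, 3) onto the optical camera plane.

    T_cam_lidar : 4x4 extrinsic transform from the LiDAR to the camera frame (assumed known).
    K           : 3x3 camera intrinsic matrix (assumed known).
    Returns an (h, w) map whose pixel values are depths in metres (0 = no return).
    """
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # homogeneous coordinates
    pts_cam = (T_cam_lidar @ pts_h.T)[:3]                        # (3, N) in the camera frame
    pts_cam = pts_cam[:, pts_cam[2] > 0]                         # keep points in front of the camera
    uvw = K @ pts_cam                                            # pinhole projection
    u = (uvw[0] / uvw[2]).astype(int)
    v = (uvw[1] / uvw[2]).astype(int)
    depth = pts_cam[2]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth_map = np.zeros((h, w), dtype=np.float32)
    depth_map[v[valid], u[valid]] = depth[valid]                 # last hit wins per pixel
    return depth_map
```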
3.2 MEFA Module
As illustrated in Figure 2, we used approaches similar
to those utilized in deep fusion. In the first step, each
image stream is processed through convolution layers,
followed by attention layers. In the second step, the
feature maps from each stream are fused before pass-
ing through a new stage of convolution layers and at-
tention layers. The output goes through another stage
of convolution layers to parameterize the channel di-
mension of the output.
In contrast to deep-fusion approaches, convolu-
tion layers are not designed to create a high-level rep-
resentation feature space; rather, they are intended to
generate a new image representation containing all
relevant features from each modality. The attention
layer serves as a guide to select the specific informa-
tion required by the object detection model. In the
initial stage, it selects information from each stream,
regardless of the other modalities. In the subsequent
stage, a second attention layer is designed to utilize
the information from each modality to filter the fused
information from the preceding stage, thereby identifying only the relevant features.
The implementation of the MEFA module uses
the Inception Block version A from the Inception
V3 model (Szegedy et al., 2016) as convolution
layers and the Global Local Attention framework
(Chaturvedi et al., 2022) as attention layers. The local
attention network is used as the initial attention layer
for each stream. The global attention network is used
as the second attention stage with the fused feature
map of each stream. The fusion operation is the con-
catenation. The final convolution operation acts as a
channel operator, parameterizing the channel output
dimension to a three-channel image.
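For concreteness, a minimal PyTorch sketch of this wiring is given below. It is not the authors' implementation: the Inception Block A and the GLA local/global attention blocks are replaced by lightweight stand-ins (3x3 convolutions and sigmoid gates), the class names GateAttention and MEFASketch and the channel widths are assumptions, and the streams are combined by addition as in the Figure 2 caption (the text above mentions concatenation; that choice only changes the channel count seen by the second block).

```python
import torch
import torch.nn as nn


class GateAttention(nn.Module):
    """Stand-in for the GLA local/global attention blocks (Chaturvedi et al., 2022):
    a small convolutional gate producing per-pixel, per-channel weights in [0, 1]."""
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MEFASketch(nn.Module):
    """Data flow of the MEFA module; Inception Block A is approximated by a 3x3 convolution."""
    def __init__(self, in_channels=(3, 3, 3), width: int = 32):
        super().__init__()
        self.stage1 = nn.ModuleList(
            [nn.Conv2d(c, width, kernel_size=3, padding=1) for c in in_channels])
        self.local_att = nn.ModuleList([GateAttention(width) for _ in in_channels])
        self.stage2 = nn.Conv2d(width, width, kernel_size=3, padding=1)
        self.global_att = nn.ModuleList([GateAttention(width) for _ in in_channels])
        self.to_rgb = nn.Conv2d(width, 3, kernel_size=1)  # channel operator -> 3-channel image

    def forward(self, streams):
        # Stage 1: per-modality convolution, multiplied by its local attention map.
        feats = [conv(x) for conv, x in zip(self.stage1, streams)]
        gated = [f * att(f) for f, att in zip(feats, self.local_att)]
        fused = torch.stack(gated, dim=0).sum(dim=0)      # combine streams by addition (Figure 2)
        # Stage 2: second convolution block; global attention, driven by each modality's
        # stage-1 features, filters the fused map and is added back.
        fused2 = self.stage2(fused)
        out = fused2
        for f, att in zip(feats, self.global_att):
            out = out + fused2 * att(f)
        return self.to_rgb(out)  # fused 3-channel image for the single-modality detector


# Example: RGB, merged NIR, and LiDAR maps, each processed to 3 x 640 x 640.
rgb, ir, lidar = (torch.randn(1, 3, 640, 640) for _ in range(3))
fused_image = MEFASketch()([rgb, ir, lidar])  # shape (1, 3, 640, 640)
```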
3.3 Object Detection Models
The MEFA module provides a fused image as an out-
put, which can be used with any single-modality ob-
ject detection model. Here we look at two main mod-
els from the literature.
3.3.1 YOLOv8
YOLOv8 (Jocher et al., 2023) is part of the lineage of
YOLO (You Only Look Once) object detection mod-
els. The end-to-end single-shot detector architecture
offers a significant advantage for real-time applica-
tions, representing a state-of-the-art model in terms
of speed and accuracy.
The model has three main parts: the backbone
extracts image features; the neck fuses these fea-
tures; and the head predicts bounding box coordi-
nates, object presence scores, and classification prob-
abilities. The model employs the Complete Intersec-
tion over Union (CIoU) loss and the Distribution Fo-
cal loss (DFL) during training. This approach en-
ables enhanced performance, particularly in the case
of smaller objects (Terven et al., 2023). YOLOv8 is
an anchor-free model, which means that the output is bounding box coordinates rather than offsets
from existing anchors. The backbone is a modi-
fied CSPDarknet53 backbone with a new C2f module
(faster cross-stage partial bottleneck with two convo-
lutions). This module allows high-level features to be
combined with contextual information (Terven et al.,
2023).
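For reference, the CIoU loss mentioned above is commonly written as follows; this formulation comes from the general object detection literature rather than from the paper itself:

```latex
\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU}
  + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v,
\qquad
v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2,
\qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}
```

where ρ is the Euclidean distance between the predicted and ground-truth box centers b and b^gt, c is the diagonal length of their smallest enclosing box, and w, h denote box width and height.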
3.3.2 RT-DETRv1
RT-DETRv1 (Zhao et al., 2024) is a hybrid object
detection model using a convolutional neural net-
work in conjunction with Transformer layers. The end-to-end NMS-free (Non-Maximum Suppression) architecture consists of four distinct components: a backbone, a hybrid encoder with Attention-
based Intra-scale Feature Interaction (AIFI), a CNN-
based Cross-scale Feature Fusion (CCFF) layer, and
a decoder with an uncertainty-minimum query selec-
tion scheme. The function of the backbone is to ex-
tract features from the input image. The hybrid en-
coder uses these one-stage CNN features to generate
encoder features with the AIFI module. The CCFF
module merges the multi-scale features into a feature
map for the decoder. The final stage is the decoder,
which uses the output of the hybrid encoder to pre-
dict coordinates and object classes. The uncertainty-minimal query selection optimizes the query output of the encoder, selecting higher-quality features for the decoder head to predict the class and localization of each object.
4 EXPERIMENTAL SETUP
The experiment was conducted utilizing the DENSE
database (Bijelic et al., 2020). The database com-
prises approximately 13,000 images of driving scenes captured under a variety of light and weather conditions, including day and night, clear weather, snow, light fog, and dense fog. The images were captured using an
optical camera, a NIR gated camera, and a LiDAR
sensor. Two datasets were prepared for training and
testing purposes, with joint annotation from the three
modalities. The training set contains approximately
100,000 objects, while the test set contains approximately 20,000 objects. The objects included in the
datasets are of two main types: pedestrian and vehi-
cle. The vehicle category includes ridable vehicles,
large vehicles, vehicles, and passenger cars.
In this study, the proposed MEFA module is eval-
uated by comparing it with an early fusion method,
i.e., channel-wise concatenation. The MEFA module
is evaluated on the MEFA 3c version, which outputs
an image with three channels. Each approach is eval-
uated with two object detection models, namely the
YOLOv8 and the RT-DETRv1 models. For the RT-
DETRv1 model, we employed the pre-trained weights
on the COCO dataset.
To prepare the early fusion data, we combined
data from the visible, infrared, and LiDAR modal-
ities at the channel level. The labels used for the
ground truth were labels from all modalities in the
visible camera coordinate systems. The input layer of
the YOLOv8 and RT-DETRv1 models was adapted to
support a tensor of nine channels.
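The sketch below illustrates this baseline, assuming PyTorch; the paper only states that the detectors' input layers were adapted to nine channels, so the way of locating the first convolution and the weight-repetition initialization shown here are assumptions rather than the authors' procedure.

```python
import torch
import torch.nn as nn


def concat_modalities(rgb: torch.Tensor, ir: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
    """Channel-wise concatenation baseline: 3 + 3 + 3 = 9 channels."""
    return torch.cat([rgb, ir, lidar], dim=1)            # (B, 9, H, W)


def widen_first_conv(conv: nn.Conv2d, in_channels: int = 9) -> nn.Conv2d:
    """Return a copy of a detector's first convolution that accepts 9-channel input,
    repeating the pretrained 3-channel weights across the new channels (a common
    initialization trick; the paper does not specify how the adaptation was done)."""
    new_conv = nn.Conv2d(in_channels, conv.out_channels, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        repeats = in_channels // conv.in_channels
        new_conv.weight.copy_(conv.weight.repeat(1, repeats, 1, 1) / repeats)  # keep activation scale
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```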
In order to ensure a fair and accurate benchmark,
we established a well-defined experimental protocol
as follows. The initial stage of the process is to iden-
tify the optimal hyperparameters for the object detection model. The search is conducted on the model
that has been adapted to the channel-wise concatena-
tion, and the optimal hyperparameters are employed
for all of the compared approaches concerning the ob-
ject detection model. The hyperparameters are opti-
mized through a grid search on a single fold of the five folds of the training dataset. The hyperparameters include the optimizer, the weights of the box loss and of the classification loss in the total loss function, the initial and final learning rates, the optimizer momentum factor, the number of learning-rate warm-up epochs with the initial warm-up momentum, the L2 regularization term, and the hyperparameters for
image augmentation techniques. Once these hyperpa-
rameters are identified, a final model is trained for one
hundred epochs to reach the performance plateau.
The hyperparameter search indicates the use of the SGD optimizer, with a learning rate of 0.005 for YOLOv8 and 0.008 for RT-DETRv1, as well as a box loss weight of 0.03 and a classification loss weight of 0.66 for YOLOv8, and 0.18 and 0.57, respectively, for RT-DETRv1. With regard to the MEFA module, the Inception A pool features were fixed at 32, and the local and global attention intermediate and final output features at 64 and 256, respectively.
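Assuming the Ultralytics training API (the paper uses the Ultralytics YOLOv8 implementation), a training run with the reported hyperparameters could look like the sketch below; the mapping of the reported loss weights onto the box and cls arguments, the model variant, and the dataset configuration file name are assumptions made for illustration.

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")          # pretrained checkpoint; the exact variant is not stated in the paper
model.train(
    data="dense_fused.yaml",        # hypothetical dataset description pointing at the fused 3-channel images
    epochs=100,                     # final training length reported in the protocol
    imgsz=640,                      # input resolution used for all modalities
    optimizer="SGD",
    lr0=0.005,                      # initial learning rate found for YOLOv8
    box=0.03,                       # box loss weight found by the grid search (arg mapping assumed)
    cls=0.66,                       # classification loss weight found by the grid search (arg mapping assumed)
)
```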
5 EXPERIMENTAL RESULTS
In this section, we first present the quantitative results
with global object accuracy on the DENSE dataset
and for each weather condition. Secondly, the qual-
itative results are presented to offer a more detailed
explanation of the advantages and limitations of the
proposed approach.
5.1 Object Detection Accuracy
5.1.1 Overall Performances
Table 1 shows the mean average precision with an IoU threshold of 50% (mAP50) on the validation and test sets with
the two object detection models (YOLOv8 and RT-
DETRv1). It can be observed that when the MEFA
module is combined with YOLOv8, the performance
improves in comparison to the channel-wise concate-
nation approach. These improvements are on average
0.85% across the five folds and up to 1% on the test
set. The MEFA module, when combined with RT-
DETRv1, also outperforms channel-wise concatena-
tion by 1.13% on the five folds and up to 0.5% on the
test set. This observation demonstrates the benefits of
using the MEFA module to fuse and select relevant
information from multimodal images.
Table 1: mAP50 results on the DENSE dataset.

Model                              | Mean mAP50 (Validation) | Top-1 mAP50 (Validation) | Top-1 mAP50 (Test)
Channel-Concatenation + YOLOv8     | 69.03% ± 0.64           | 69.65%                   | 70.80%
MEFA + YOLOv8                      | 69.88% ± 0.70           | 70.78%                   | 71.80%
Channel-Concatenation + RT-DETRv1  | 72.55% ± 0.65           | 73.12%                   | 73.60%
MEFA + RT-DETRv1                   | 73.75% ± 0.53           | 74.42%                   | 74.30%
5.1.2 Performances According to Weather
Conditions
Figure 3 illustrates the recall values of MEFA and the
channel-wise concatenation approach with YOLOv8
and RT-DETRv1 according to the weather and time
of day on the test set. Images captured in clear and
snowy weather represent 90% of the total images
(62.7% and 27.4%, respectively). Labels in light fog,
heavy fog, and rain represent 6.1%, 3%, and 0.7% of
the images, respectively. The recall score was calculated on predictions with a confidence score of 0.5 or above, and a detection was considered a true positive if its intersection over union with a ground-truth box was 0.5 or above.
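The sketch below restates this recall computation; the per-image, class-agnostic greedy matching shown here is an assumption, since the paper only specifies the confidence and IoU thresholds.

```python
import numpy as np


def iou(a, b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)


def recall(predictions, ground_truths, conf_thr=0.5, iou_thr=0.5) -> float:
    """Fraction of ground-truth boxes matched by a prediction with confidence >= conf_thr
    and IoU >= iou_thr. `predictions` holds, per image, a list of dicts with 'box' and
    'conf'; `ground_truths` holds, per image, a list of boxes."""
    matched, total = 0, 0
    for preds, gts in zip(predictions, ground_truths):
        kept = [p for p in preds if p["conf"] >= conf_thr]
        used = set()
        total += len(gts)
        for gt in gts:
            best_j, best_iou = -1, iou_thr
            for j, p in enumerate(kept):
                v = iou(gt, p["box"])
                if j not in used and v >= best_iou:
                    best_j, best_iou = j, v
            if best_j >= 0:           # greedy one-to-one matching
                used.add(best_j)
                matched += 1
    return matched / max(total, 1)
```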
In contrast to channel-wise concatenation, the in-
tegration of MEFA with YOLOv8 yields an increase
in object detection of 1.80%, 2.91%, 1.20%, 2.78%,
and 1.33% for clear, dense fog, light fog, rain, and
snow, respectively, across both daytime and night-
time conditions. MEFA, in combination with the RT-
DETRv1 model, enhances object detection by 3.94%, 3.39%, 3.67%, 2.78%, and 4.04% for clear, dense fog, light fog, rain, and snow conditions, respectively, relative to channel-wise concatenation, across both daytime and night-time conditions. In general, MEFA improves recall across all weather conditions.
5.2 Qualitative Results
5.2.1 Intermediate Features Visualization
Figure 4 depicts the intermediate fused image of the
MEFA module output. It can be observed that features from the visible modality account for most of the resulting image. To analyze this, we inspected the feature maps of the local attention outputs during inference. In these outputs, only a small number of features are retained by the local attention for the infrared and LiDAR inputs, whereas almost all features from the visible input are retained. We then analyzed the feature maps of the global attention outputs. These outputs contain features from all input modalities and function as a filter that picks up the relevant visible features playing a determinant role during feature extraction in the object detection models.

Figure 3: Recall performance according to weather and daytime conditions. Models are RT-DETRv1 combined with the MEFA module (MEFA + RT-DETR) and with channel-wise concatenation (Conc + RT-DETR); YOLOv8 combined with the MEFA module (MEFA + YOLO) and with channel-wise concatenation (Conc + YOLO).
5.2.2 Final Visual Results
Figure 5 illustrates the results obtained in two dis-
tinct weather conditions. The first two rows present an example of the input in rainy conditions, while the subsequent two rows illustrate an input captured on a clear night. To demonstrate the potential limitations of the channel-wise concatenation approach, the confidence threshold was set to 0.25, highlighting the occurrence of intriguing false positive pedestrian detections. In the first example, a false
positive pedestrian has been identified by the channel-
wise concatenation approach in the middle of the im-
age. The second example illustrates how the model,
when combined with MEFA, is more effective in de-
tecting vehicles on the right of the image that are dis-
cernible only in the LiDAR input image. The inter-
mediate fused image output of the first example is il-
lustrated in Figure 4.
6 DISCUSSION
The MEFA module demonstrates superior accuracy
when integrated with a one-stage object detection
model, YOLOv8, or a transformer-based model, RT-
DETRv1, compared to the channel-wise concatena-
tion early fusion approach with the same model. The
module uses an attention mechanism similar to the
mid-fusion scheme. This mechanism allows the module to combine relevant features from each modality using local attention in the initial stage and then to filter these fused features with attention computed on each modality's features, which acts as global attention. We hypothesize that an attention mechanism
is essential to enable the model to distinguish which
features are critical for detection and which features
must function as filters. This design enables the creation of interactions between the multimodal features, thereby improving accuracy, particularly in adverse weather conditions. In such con-
ditions, sensors can provide noisy data leading to false
detections. It also enables robust detection when sen-
sors are unable to provide information due to time of
day or weather conditions.
The output of the MEFA module is a feature
map that functions as a single modality image.
This property allows the utilization of state-of-the-
art single-modality models that are becoming increas-
ingly prevalent in the computer vision field. This as-
pect also permits more rapid training and enhanced
performance using pre-trained weights derived from
alternative datasets. Furthermore, by parameteriz-
ing the final convolution block to output features
as a three-channel image, it is now possible to em-
ploy black-box models that cannot accept images with more than three channels. This feature has also
prompted new investigations into the potential influ-
ence of this parameter on object detection outcomes.
Nevertheless, it is essential to acknowledge the in-
herent constraints of this approach. The module in-
troduces additional memory and computational con-
straints and operates with a processing time of 200 milliseconds, compared with the object detection model alone, which runs at 5 milliseconds for YOLOv8 and 35 milliseconds for RT-DETRv1 (1). Further ablation studies
could be conducted to gain a deeper understanding
of the role of the attention mechanism and whether
alternative strategies, such as Transformer-based at-
tention, could improve performance. Additionally, it
is crucial to recognize that data transformation, par-
ticularly in the case of LiDAR sensors, can result in
information loss.
(1) The reference speed value was obtained on a Dell Precision 5770 with an Intel® Core™ i7-12800H CPU, an NVIDIA RTX A3000 12 GB GDDR6 GPU, and 32 GB of
DDR5 RAM.
Figure 4: Example of intermediate fused image output from MEFA module combined with RT-DETRv1 model. Grayscale
images correspond to a specific single channel of the color image.
Figure 5: Examples of rain and clear night images. Visible, Infrared and LiDAR indicate visible, infrared and LiDAR sensor
images respectively with their ground truth labels. Concatenation + RT-DETRv1, MEFA + RT-DETRv1 and MEFA + YOLOv8
indicate the channel-wise concatenation or the MEFA module with the RT-DETRv1 model and the MEFA module combined
with the YOLOv8 model. Light blue and dark blue indicate vehicle and pedestrian detection respectively.
7 CONCLUSION AND
PERSPECTIVES
This article presents a novel early fusion approach
based on our MEFA module. The MEFA module,
combined with state-of-the-art models, improves the accuracy of multimodal vehicle and pedestrian detection, especially in adverse weather conditions. Furthermore, the MEFA module can extend any single-modality model, especially a black-box model, to multimodal applications.
In terms of future research, we identified several
potential avenues. Firstly, optimizing the module ar-
chitecture could reduce the computational load, espe-
cially when dealing with features of large spatial di-
mensions. Integrating additional sensor types, such as radar or ultrasonic sensors, would help to investigate and improve detection robustness in challenging conditions. Secondly, further research
could be carried out on the MEFA module to better
understand the impact of modality characteristics and external factors, such as weather or visibility, on accuracy.
In light of climate change, we aim to direct our fu-
ture efforts toward enhancing the module to minimize
its energy consumption and evaluate the carbon foot-
print of our models. Furthermore, we intend to inves-
tigate the integration of our model into edge devices,
exploring innovative approaches to optimize perfor-
mance while maintaining sustainability. This would involve holistic research considering the dimensions of (a) measurements and estimations, (b) algorithms, methods, and models, (c) the extreme edge, and (d) the systemic effects of AI.
ACKNOWLEDGEMENTS
This work was carried out in part within the frame-
work of the “Edge Intelligence” Chair within MIAI of
the University of Grenoble Alpes, project referenced
ANR-19-PIA3-0003.
REFERENCES
Bijelic, M., Gruber, T., Mannan, F., Kraus, F., Ritter, W.,
Dietmayer, K., and Heide, F. (2020). Seeing Through
Fog Without Seeing Fog: Deep Multimodal Sensor
Fusion in Unseen Adverse Weather. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 11682–11692.
Chaturvedi, S. S., Zhang, L., and Yuan, X. (2022). Pay ”at-
tention” to adverse weather: Weather-aware attention-
based object detection. In 2022 26th International
Conference on Pattern Recognition (ICPR), pages
4573–4579.
Chen, Y.-T., Shi, J., Ye, Z., Mertz, C., Ramanan, D.,
and Kong, S. (2022). Multimodal object detection
via probabilistic ensembling. In Avidan, S., Brostow,
G., Cissé, M., Farinella, G. M., and Hassner, T., edi-
tors, Computer Vision – ECCV 2022, pages 139–158,
Cham. Springer Nature Switzerland.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Min-
derer, M., Heigold, G., Gelly, S., Uszkoreit, J., and
Houlsby, N. (2020). An image is worth 16x16 words:
Transformers for image recognition at scale. ArXiv,
abs/2010.11929.
Grauer, Y. (2014). Active gated imaging in driver assistance
system. Advanced Optical Technologies, 3(2):151–
160.
Huang, K., Shi, B., Li, X., Li, X., Huang, S., and Li, Y.
(2022). Multi-modal Sensor Fusion for Auto Driving
Perception: A Survey.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics
YOLO.
Malawade, A. V., Mortlock, T., and Faruque, M. A. A.
(2022). Ecofusion: Energy-aware adaptive sensor fu-
sion for efficient autonomous vehicle perception.
Martínez-Díaz, M. and Soriguera, F. (2018). Autonomous
vehicles: Theoretical and practical challenges. Trans-
portation Research Procedia, 33:275–282.
Stahlschmidt, S. R., Ulfenborg, B., and Synnergren, J.
(2022). Multimodal deep learning for biomedical
data fusion: A review. Briefings in Bioinformatics,
23(2):bbab569.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wo-
jna, Z. (2016). Rethinking the inception architecture
for computer vision. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Tabassum, N. and El-Sharkawy, M. (2024). Vehicle detec-
tion in adverse weather: A multi-head attention ap-
proach with multimodal fusion. Journal of Low Power
Electronics and Applications.
Terven, J., Córdova-Esparza, D.-M., and Romero-González, J.-A. (2023). A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Machine Learning and Knowl-
edge Extraction, 5(4):1680–1716.
Xiang, C., Feng, C., Xie, X., Shi, B., Lu, H., Lv, Y., Yang,
M., and Niu, Z. (2023). Multi-Sensor Fusion and Co-
operative Perception for Autonomous Driving: A Re-
view. IEEE Intelligent Transportation Systems Maga-
zine, 15(5):36–58.
Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu,
Y., and Chen, J. (2024). DETRs Beat YOLOs on Real-
time Object Detection.