
7 CONCLUSION AND PERSPECTIVES
This article presents a novel early fusion approach based on our MEFA module. Combined with state-of-the-art detection models, the MEFA module improves the accuracy of multimodal vehicle and pedestrian detection, especially in adverse weather conditions. Furthermore, the MEFA module can enhance any single-modality model, including black-box models, for any multimodal application.
In terms of future research, we identified several potential avenues. Firstly, optimizing the module architecture could reduce the computational load, especially when dealing with features of large spatial dimensions. Integrating additional sensor types, such as radar or ultrasonic sensors, would also help investigate and improve detection robustness in challenging conditions. Secondly, further research could examine how the characteristics of the modalities and external factors, such as weather or visibility, affect the accuracy of the MEFA module.
In light of climate change, we aim to direct our future efforts toward reducing the energy consumption of the module and evaluating the carbon footprint of our models. Furthermore, we intend to investigate the integration of our model into edge devices, exploring innovative approaches that optimize performance while maintaining sustainability. This would involve holistic research covering (a) measurements and estimations, (b) algorithms, methods, and models, (c) the extreme edge, and (d) the systemic effects of AI.
ACKNOWLEDGEMENTS
This work was carried out in part within the framework of the "Edge Intelligence" Chair within MIAI of the University of Grenoble Alpes, project referenced ANR-19-PIA3-0003.