Authors:
Aidana Nurakhmetova, Jean Lahoud and Hisham Cholakkal
Affiliation:
Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, U.A.E.
Keyword(s):
3D Point Clouds, Data-Efficient Transformer, 3D Object Detection.
Abstract:
Recent 3D detection models rely on the Transformer architecture because of its natural ability to capture global context features. One such model is 3DETR, a pure transformer-based network designed to generate 3D bounding boxes on indoor scans. Transformers are generally known to be data-hungry, yet data collection and annotation in 3D are more challenging than in 2D. Our goal is therefore to study the data hunger of the 3DETR-m model and propose a solution for its data efficiency. Our methodology is based on the observation that PointNet++ provides more locally aggregated features that can support 3DETR-m predictions when training data is scarce. We propose three backbone fusion methods based on addition (Fusion I), concatenation (Fusion II), and replacement (Fusion III), and we utilize pre-trained weights from the Group-free model trained on the SUN RGB-D dataset. The proposed 3DETR-m outperforms the original model at all data proportions (10%, 25%, 50%, 75%, and 100%), improving on the results reported in the 3DETR-m paper by 1.46% mAP@25 and 2.46% mAP@50 on the full dataset. Hence, we believe our research can provide new insights into the data hunger of 3D transformer detectors and inspire the use of pre-trained models as one path towards data efficiency in 3D.
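To make the three fusion variants named in the abstract concrete, below is a minimal PyTorch sketch, assuming both backbones emit feature tensors of shape (batch, points, channels) with the same channel width. The module name, dimensions, and the projection layer for the concatenation variant are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class BackboneFusion(nn.Module):
    """Sketch of the three fusion strategies: add, concat, replace."""

    def __init__(self, dim: int = 256, mode: str = "add"):
        super().__init__()
        assert mode in {"add", "concat", "replace"}
        self.mode = mode
        # Projection is only needed for concatenation, to bring the
        # doubled channel width back to what the transformer expects.
        self.proj = nn.Linear(2 * dim, dim) if mode == "concat" else None

    def forward(self, detr_feats: torch.Tensor, pnet_feats: torch.Tensor) -> torch.Tensor:
        # detr_feats: (B, N, dim) features from the 3DETR-m encoder path.
        # pnet_feats: (B, N, dim) locally aggregated PointNet++ features,
        # e.g. from a pre-trained Group-free backbone.
        if self.mode == "add":      # Fusion I: element-wise addition
            return detr_feats + pnet_feats
        if self.mode == "concat":   # Fusion II: channel concatenation
            return self.proj(torch.cat([detr_feats, pnet_feats], dim=-1))
        return pnet_feats           # Fusion III: replacement


# Usage: fuse two (batch, points, channels) feature tensors.
fusion = BackboneFusion(dim=256, mode="concat")
out = fusion(torch.randn(2, 1024, 256), torch.randn(2, 1024, 256))
print(out.shape)  # torch.Size([2, 1024, 256])
```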