Multimodal 6D Detection of Industrial Pallets, in Real and Virtual
Environments, with Applications in Industrial AMRs
José Lourenço, Gonçalo Arsénio, Luís Garrote and Urbano J. Nunes
University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering, Portugal
{garrote, urbano}@isr.uc.pt
Keywords:
6D Pallet Detection, Multimodal Deep Learning, RGB-D Based and Point Cloud Based.
Abstract:
In this work we propose a multimodal approach for detecting and estimating the 6D pose of pallets, to be
applied in industrial environments. The method is designed for future integration with Autonomous Mo-
bile Robots (AMRs) for enhanced warehouse automation. Using the DenseFusion framework as a basis, the
proposed approach fuses RGB and Depth data using multi-head self-attention mechanisms to improve its ro-
bustness. To test the proposed methods, three datasets were developed: two virtual and one real-world indoor
dataset, with varying degrees of occlusion and alignment challenges. Experimental results demonstrated that
the approach achieved a high accuracy in occluded virtual scenarios and a promising result in real indoor sce-
narios, with increased performance when considering higher error thresholds. The obtained results show the
potential of this system for use in AMRs to enhance the efficiency and safety of automated pallet handling in
industrial settings in the future.
1 INTRODUCTION
Mobile robotics has repeatedly revolutionized industry, reshaping material handling technology.
As industries strive for greater efficiency and automa-
tion, warehouses and other sectors rapidly transition
from traditional Automated Guided Vehicles (AGVs)
to Autonomous Mobile Robots (AMRs), marking a
paradigm shift in how tasks are executed, and goods
are managed within dynamic environments (Fraga-
pane et al., 2021). As those AMRs become an essen-
tial part of the warehouses, the importance of sophis-
ticated perception capabilities, such as multi-instance
6D object pose estimation, becomes increasingly ev-
ident. It enables the robots to classify and recognize
objects, estimate their poses, and track them over a
period of time (Chen and Guhl, 2018). In this context,
multi-instance 6D object pose estimation comprises
the detection of objects and an estimation of their 3D
translation and 3D rotation. For some detection meth-
ods, this is a single stage, while others perform ob-
ject detection and pose estimation as distinct stages
(Gorschlüter et al., 2022). For the detection step, several techniques and algorithms exist that take point clouds, RGB, Depth, or RGB-D images as inputs; however, all of them face robustness challenges in industrial environments, such as noise and occlusions in the sensor data. In-
dustrial environments, like factories and warehouses,
often exhibit a cluttered arrangement of objects and
machinery. This poses a challenge for accurate ob-
ject detection using 6D methods, as it can be intricate
to discern the object of interest. Also, dealing with
objects that have different types of textures and sym-
metries can affect the performance of the AMRs. In
some cases, the object may lack adequate texture or
distinctive features, making it more difficult to accu-
rately estimate the object’s pose. Estimating the ob-
ject’s pose in real-time can also be a challenge in in-
dustrial environments due to the large amount of data
that needs to be processed.
In this work, a multimodal approach for detecting
and estimating the 6D pose of pallets within an in-
dustrial environment is proposed. The goal with this
detection system is, in the future, to integrate it into an
autonomous forklift platform. This autonomous fork-
lift will be capable of navigating to designated drop-
off and pick-up zones, detecting pallets of interest,
and managing their transportation. The integration
of this system into the production line is expected to
significantly enhance the warehouse’s efficiency and
reinforce workplace safety by further improving the
automation of forklift maneuvers.
The proposed multimodal detection approach is
based on the DenseFusion framework (Wang et al.,
2019), with changes introduced in the feature fu-
sion and geometry feature extraction stages. Due to the difficulty of acquiring data with an AMR in running factories, and in order not to hinder research on this topic, we also prepared three datasets. Two datasets were acquired and annotated in a virtual environment containing multiple industrial shelving units and pallets, while one dataset was acquired indoors and manually annotated, with a pallet in different positions and with different levels of occlusion.
2 RELATED WORK
The problem of 6D object detection is widely studied
in different fields of robotics. It pertains to the process
of recognizing 3D objects within a 3D space and de-
termining their positioning (X, Y , Z) and orientation
(roll, pitch, yaw). The different approaches to this
problem can be divided into RGB-based approaches
and RGB-D-based approaches.
RGB-based approaches can be holistic or based on dense correspondences. Holistic approaches
involve directly extracting the pose parameters from
RGB images, as in the case of DeepIM (Li et al.,
2020), a method that takes as an input the initial 6D
pose estimation of an object in the image and out-
puts a relative SE(3) transformation that is compared
to the initial pose to improve the estimate. On the
other hand, dense correspondence approaches estab-
lish correspondences between image pixels and mesh
vertices to recover poses using Perspective-n-Point
(PnP) techniques, like the Coordinates-Based Disentangled Pose Network (CDPN) (Li et al., 2019), which separates the pose estimation process into distinct predictions for rotation and translation. The rotation es-
timation employs a carefully designed local region-
based framework, enhancing both accuracy and ef-
ficiency. For translation estimation, the network di-
rectly derives this information from localized image
patches. These distinct tasks are integrated and ad-
dressed within a single unified network. Given that
the size of an object in an image can vary signifi-
cantly with its distance from the camera, the object
is scaled to a fixed size based on the detection out-
put. Finally, 2D-keypoint-based approaches detect 2D keypoints to establish 2D-3D correspondences for pose estimation, although they may suffer from loss of geometric information due to perspective projections. This is the case of the Pixel-wise Voting Network (Peng et al., 2022), which employs re-
gression on pixel-wise vectors to infer the positions
of keypoints, which are subsequently utilized to cast
votes for keypoint localization. This methodology es-
tablishes a versatile representation capable of accu-
rately localizing keypoints, even in scenarios where
they may be occluded or truncated. Furthermore, this
approach provides a means to assess the uncertainties
associated with keypoint locations, thus offering valu-
able insights for the PnP solver.
Since RGB-D images are easy to obtain, RGB-D-based approaches are widely investigated for the problem of 6D object detection. They can be divided into different categories. Template-based methods rely on feature- and shape-based template matching to locate the object in the image and roughly estimate its pose; for example, (Cao et al., 2016) employed a 3D model to generate example poses of a textureless object and identify the closest match to the input image using a GPU implementation. Their
method involved transforming images into the Lapla-
cian of the Gaussian space to ensure invariance to
changes in illumination and appearance. To enable
real-time matching, the authors proposed modifica-
tions to the template set and the image, as well as a
restructuring of the conventional normalized cross-
correlation operation. These adjustments allowed for
the harnessing of the computational power of the
GPU to perform rapid matrix-matrix multiplication.
Feature-based methods are also used in this type of approach; they exploit the point cloud to match 3D features and fit the object models into the scene. One example is the approach proposed by (Hinterstoisser et al., 2016), which introduced a series of enhancements to the Point Pair Feature (PPF) approach (Drost et al., 2010). These advancements en-
compass sampling and voting schemes aimed at mit-
igating the influence of clutter and sensor noise. The
sampling scheme selects pairs of points that are prob-
able to belong to the same object, while deliberately
avoiding pairs considered likely to belong to differ-
ent objects on the background. The voting scheme
then consolidates the PPF of all pairs of points antic-
ipated to belong to the same object, while disregard-
ing those anticipated to belong to different objects or
the background. Finally, there are deep-learning-based methods such as DenseFusion (Wang et al., 2019), which processes RGB and depth separately in two main stages: it first performs semantic segmentation of each object in the color image, and then processes the segmentation results to estimate the object's 6D pose, using an iterative pose refinement module that increases the precision of the orientation estimate with a small inference time.
3 METHODOLOGY
Figure 1: Pipeline of the proposed framework using RGB and Depth for pallet detection and 6D pose prediction.

The pipeline of the proposed framework, illustrated in Figure 1, uses RGB and Depth as inputs and is
heavily inspired by the DenseFusion approach (Wang
et al., 2019) and reuses several of its modules, in-
cluding the RGB feature extraction network, the point
cloud feature extraction network, the pixel-wise fea-
ture fusion, and the pose predictor. Modifications on
the input of the pipeline to use object detection in-
stead of a segmentation network and multi-head self-
attention at different levels are introduced to improve
over the DenseFusion approach, creating a new ap-
proach that can be deployed in an AMR considering
a shared object detection system.
3.1 Object Detector
The initial stage takes the RGB-D information as input and performs object detection for each object of interest in the image. The object detection network
is the YOLOv8 network (Jocher et al., 2023), com-
posed of a backbone network, a neck network and a
head network. The backbone network is built upon a
custom CSP-Darknet53 network (Wang et al., 2020)
and has a Spatial Pyramid Pooling Fast (SPPF) layer.
The neck network employs a Path Aggregation Net-
work (PANet) structure, which helps the model to ef-
fectively capture features at several scales by flowing
information across different spatial resolutions. Fi-
nally, the head network is responsible for generating
the final outputs, such as bounding boxes and con-
fidence scores for each object. For each frame pro-
cessed in the object detector, a set of bounding box de-
tections is obtained. For each detection that contains
a pallet, a crop of the RGB and Depth images is per-
formed considering the bounding box shape, to guarantee that only the object's shape and texture are processed in the subsequent steps, one object at a time.
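As an illustration of this stage, the sketch below shows one way the per-detection cropping could be implemented on top of the Ultralytics YOLOv8 API; the weights file name and the pallet class index are assumptions for illustration, not values from the paper.

```python
# Hedged sketch: pallet detection and per-detection RGB/Depth cropping.
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n_pallets.pt")  # hypothetical fine-tuned pallet weights


def detect_and_crop(rgb: np.ndarray, depth: np.ndarray, pallet_cls: int = 0):
    """Run YOLOv8 on the RGB frame and crop RGB/Depth per pallet detection."""
    results = model(rgb, verbose=False)[0]
    crops = []
    for box, cls in zip(results.boxes.xyxy.cpu().numpy(),
                        results.boxes.cls.cpu().numpy()):
        if int(cls) != pallet_cls:
            continue
        x1, y1, x2, y2 = box.astype(int)
        # crop both modalities with the same bounding box, one object at a time
        crops.append((rgb[y1:y2, x1:x2], depth[y1:y2, x1:x2], (x1, y1, x2, y2)))
    return crops
```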
3.1.1 RGB and Depth Feature Extractors
The RGB feature extractor is a modified version of
the Residual Network (He et al., 2016) integrated with
the Scene Parsing Network (Zhao et al., 2017) mod-
ule. The main goal of the feature extractor is to get
relevant features from RGB images. The backbone
of the feature extractor is the ResNet, known for its
ability to train deep models effectively. The ResNet
network uses residual blocks that allow the network
to learn residual functions, which represent the dif-
ference between the input and output of a layer in a
neural network, instead of unreferenced ones. This
means that instead of attempting to learn the complete
identity mapping from the initial stages, the network
can focus on learning the changes, or "residuals", to
the input’s identity mapping. The residual block con-
sists of two or three streams of convolutional neural
networks, followed by an element-wise addition op-
eration that combines the input with the output of the
convolutional layers.
In this particular implementation, ResNet-18 serves as the backbone. ResNet architectures vary in their number of layers; in this case, the network consists of 18 layers. ResNet-18 was chosen as a good compromise between accuracy and computing resource usage.
The PSPNet module is incorporated to enhance
the feature extractor’s capability to extract quality fea-
tures. The PSPNet module utilizes a pyramid pool-
ing strategy to capture multiscale contextual informa-
tion from the input image. The feature maps are di-
vided into multiple stages, each employing adaptive
average pooling and convolution operations to extract
features at different spatial resolutions. The original
features are then concatenated with these features after being bilinearly upsampled (a process of increasing the
spatial resolution of feature maps using bilinear in-
terpolation, which estimates new pixel values based
on the linear interpolation of neighboring pixels). To
improve efficiency, a bottleneck convolutional layer
minimizes the dimensionality of the concatenated fea-
tures. The process of feature extraction starts with
the RGB image passing through the ResNet back-
bone. Then, the network extracts both low-level and
high-level features from the image. These features
are then processed by the PSPNet module, which
captures contextual information at multiple pyramid
scales and incorporates it into the feature representa-
tion. By incorporating this module, object pose estimation becomes more robust to scale variations. For the depth feature
extraction, the first layer of the ResNet was modified
to process the 1-channel depth image.
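A minimal sketch of such a feature extractor is shown below, assuming a torchvision ResNet-18 backbone; the pyramid bin sizes and channel widths are illustrative choices, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool at several scales, upsample, concatenate."""

    def __init__(self, in_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins)
        self.bottleneck = nn.Conv2d(in_ch * 2, in_ch, 1)  # reduce concatenated dims

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                align_corners=False) for s in self.stages]
        return self.bottleneck(torch.cat([x] + pooled, dim=1))


def make_depth_backbone():
    """ResNet-18 backbone whose first convolution accepts a 1-channel depth map."""
    net = resnet18(weights=None)
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # keep the residual stages, drop the classification head
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2, net.layer3, net.layer4,
                         PyramidPooling(512))
```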
3.1.2 Multimodal Fusion of RGB and Depth
Feature Maps
This stage involves selecting which features from the RGB ($F_{RGB}$) and depth ($F_D$) streams are relevant to estimate/predict the object's 6D pose. A multi-head self-attention (Srinivas et al., 2021) strategy is employed in order to capture the most relevant features from both modalities.
Attention mechanisms (Luong et al., 2015) pro-
vide the network with salient features from each
modality, which minimizes the noise and irrelevant
information. This approach enables the network to
decide when and how to integrate RGB and Depth
data. These mechanisms generate attention weights
that emphasize the most salient features from each
modality.
Given the availability of both RGB and depth features, the introduction of an attention mechanism aims to fuse the two modalities to leverage complementary information. Let $F_{RGB} \in \mathbb{R}^{d_{RGB}}$ represent the feature vector derived from the RGB modality, where $d_{RGB}$ is the dimensionality of the RGB feature space, and $F_D \in \mathbb{R}^{d_D}$ represent the corresponding depth features, where $d_D$ is the dimensionality of the depth feature space. To fuse the two modalities, we concatenate these feature vectors along the feature dimension:

$$F_F = [F_{RGB}, F_D] \in \mathbb{R}^{d_{RGB}+d_D} \qquad (1)$$
The combined feature vector $F_F$ contains information from both RGB and depth modalities for each spatial location. Next, to model the relevant feature interdependencies and relationships, we employ the multi-head self-attention mechanism, which allows each location to attend to all other locations, enabling the model to capture relevant feature interactions. The attention mechanism computes a weighted sum of all the feature representations, where the weights are determined dynamically based on the similarity between the query and key vectors. For each attention head, the query ($Q$), key ($K$), and value ($V$) matrices are computed from the combined feature representation $F_F$:

$$Q = W_Q F_F, \quad K = W_K F_F, \quad V = W_V F_F \qquad (2)$$

where $W_Q, W_K, W_V \in \mathbb{R}^{(d_{RGB}+d_D) \times d_{head}}$ are learned projection matrices and $d_{head}$ is the dimensionality of each attention head. The attention weights are computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_{head}}}\right) V \qquad (3)$$
The outputs from multiple attention heads are con-
catenated and projected back to the original feature
space, yielding a more comprehensive representation.
By applying multi-head self-attention, the model cap-
tures both spatial and cross-modal interactions be-
tween the RGB and depth features, leading to a richer
representation that combines both appearance and
depth characteristics.
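A minimal sketch of this fusion step is given below, assuming the per-pixel RGB and depth features have already been flattened into sequences of length N (one token per spatial location); the feature dimensions and the number of heads are illustrative choices.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, d_rgb: int = 32, d_depth: int = 32, n_heads: int = 4):
        super().__init__()
        d_fused = d_rgb + d_depth                      # Eq. (1): concatenation
        self.attn = nn.MultiheadAttention(d_fused, n_heads, batch_first=True)
        self.proj = nn.Linear(d_fused, d_fused)        # project heads back to d_fused

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor):
        # f_rgb: (B, N, d_rgb), f_depth: (B, N, d_depth)
        f = torch.cat([f_rgb, f_depth], dim=-1)        # F_F = [F_RGB, F_D]
        out, _ = self.attn(f, f, f)                    # Q = K = V = F_F, Eqs. (2)-(3)
        return self.proj(out)


# usage: fused = AttentionFusion()(rgb_feats, depth_feats)
```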
3.1.3 Point Cloud Feature Extraction
From the object’s cropped representation, and using
the camera’s intrinsic parameters, a 3D point cloud is
obtained. From the set of 3D points in the point cloud, $N_{PC}$ points are selected, forming the set $P$. If a mask of the object is available, the points are selected among the exported points that represent the object; otherwise, they are uniformly sampled without repetition. If the point cloud's size is below $N_{PC}$, the point cloud is oversampled to reach $N_{PC}$ points. This can occur for objects that are heavily occluded or far away from the camera but whose pose estimate is still required. The point cloud feature extrac-
tion employed is derived from the DenseFusion’s im-
plementation, as it uses a PointNet-like architecture
to extract per-point geometric features. An additional
multi-head self-attention mechanism is introduced at
the end, similarly to the approach presented in Section
3.1.2, to focus the network on the geometric features
more relevant to the 6D pose estimation task.
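A minimal sketch of this point-cloud preparation step, under a pinhole-camera assumption; the intrinsics (expected here in crop coordinates), the depth scale, and the value of N_PC are illustrative.

```python
import numpy as np


def crop_to_pointcloud(depth_crop, mask, fx, fy, cx, cy,
                       n_pc=500, depth_scale=0.001,
                       rng=np.random.default_rng(0)):
    """Back-project a depth crop to 3D and sample exactly n_pc points."""
    # use the object mask when available, otherwise all valid depth pixels
    v, u = np.nonzero(mask if mask is not None else depth_crop > 0)
    z = depth_crop[v, u].astype(np.float32) * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)              # (M, 3)
    if len(points) == 0:
        return None                                   # nothing usable in this crop
    if len(points) >= n_pc:
        idx = rng.choice(len(points), n_pc, replace=False)  # uniform, no repetition
    else:
        idx = rng.choice(len(points), n_pc, replace=True)   # oversample small clouds
    return points[idx]
```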
3.1.4 Pixel-Wise Dense Feature Fusion Network
The objective of the Pixel-wise Dense Feature Fu-
sion Network is to fuse the information obtained from
the image and the 3D point cloud. The concept be-
hind the pixel-wise dense fusion network is to move
away from relying solely on the object’s global fea-
tures to determine its pose. Instead, the DenseFusion
approach performs local per-pixel fusion so that it is
possible to make predictions based on each feature.
In more practical terms, each point of the point cloud $P$ is associated with a set of features, composed of global features, geometric features and fused features. The global features are common to every point $p \in P$, and are obtained from a Multi-Layer Perceptron (MLP) using all geometric features and fused features
as inputs. This process aims to minimize the effects
of occlusion and detection/segmentation noise. This
allows the method to select the most reliable repre-
sentations based on the visible portion of the object,
reducing the impact of issues such as objects partially
hidden from view or interference from background el-
ements.
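The sketch below illustrates this per-point fusion, following the general DenseFusion idea: per-point geometric and colour-fused features are concatenated, pooled through an MLP into a global vector, and the global vector is appended back to every point. The channel sizes and the pooling choice are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class DenseFeatureFusion(nn.Module):
    def __init__(self, d_geo: int = 128, d_rgb: int = 64, d_global: int = 256):
        super().__init__()
        # MLP (1x1 convolutions over points) producing the global descriptor
        self.mlp = nn.Sequential(nn.Conv1d(d_geo + d_rgb, d_global, 1),
                                 nn.ReLU(inplace=True))

    def forward(self, geo_feats: torch.Tensor, rgb_feats: torch.Tensor):
        # geo_feats: (B, d_geo, N), rgb_feats: (B, d_rgb, N)
        per_point = torch.cat([geo_feats, rgb_feats], dim=1)         # fused per-point features
        global_feat = self.mlp(per_point).mean(dim=2, keepdim=True)  # pool over all points
        global_feat = global_feat.expand(-1, -1, per_point.shape[2])
        # each point keeps its own features plus the shared global descriptor
        return torch.cat([per_point, global_feat], dim=1)
```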
3.1.5 Pose Estimator
The pose estimator block in the Dense Fusion archi-
tecture estimates the 6D pose of known objects from
the RGB-D images. The block takes the pixel-wise
dense feature embedding from the Pixel-wise Dense
Feature Fusion network as input and outputs the pre-
dicted pose of the object. The fused features are pro-
cessed using an MLP which outputs a 3D vector rep-
resenting the translation of the object in the 3D space,
a quaternion representing the rotation of the object
and a confidence coefficient that represents the quality
of the pose estimate. This block uses a residual-based
approach to estimate the pose, and the pose estimation
loss is calculated by measuring the distance between
the observed object's point cloud ($P$) and the corresponding object's points centered on the object's center of mass ($P^M$), transformed by the estimated pose ($T$). The loss is quantified by the distance between those points and is defined as:

$$L = \frac{1}{N_{PC}} \sum_{i=1}^{N_{PC}} \left( \left| T p_i^M - p_i \right| c_i - w \log(c_i) \right), \qquad (4)$$

where $c_i$ is the confidence coefficient, $w$ is a balancing hyperparameter used as a secondary regularization term to balance the average distance loss and confidence, and $p_i$ and $p_i^M$ are points from the sets $P$ and $P^M$, respectively.
The network's output comprises $N_{PC}$ point predic-
tions. Each prediction includes the rotation quater-
nion, translation vector, and confidence coefficient,
all contributing to the estimated pose. By incorpo-
rating the confidence coefficient, the network can au-
tonomously evaluate the quality of its predictions.
The object’s 6D pose prediction is the one associated
with the highest confidence coefficient.
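The following sketch reads Eq. (4) as one transform and one confidence per dense prediction, and shows the confidence-based selection of the final pose; the tensor shapes and the balancing weight w are illustrative assumptions.

```python
import torch


def pose_loss(T, p_model, p_obs, conf, w=0.015):
    """Eq. (4): confidence-weighted point distance plus a log-confidence regularizer.

    T:       (N_pc, 4, 4) estimated transform per dense prediction (illustrative shape)
    p_model: (N_pc, 3) model points centered on the object's center of mass
    p_obs:   (N_pc, 3) observed points sampled from the cropped point cloud
    conf:    (N_pc,)   predicted confidence coefficients
    """
    p_h = torch.cat([p_model, torch.ones_like(p_model[:, :1])], dim=1)  # homogeneous coords
    p_pred = torch.einsum("nij,nj->ni", T, p_h)[:, :3]                  # T * p^M
    dist = torch.norm(p_pred - p_obs, dim=1)
    return torch.mean(dist * conf - w * torch.log(conf))


def select_best_pose(T, conf):
    """The reported 6D pose is the prediction with the highest confidence."""
    return T[torch.argmax(conf)]
```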
4 EXPERIMENTAL VALIDATION
To validate the proposed framework, datasets tailored to the needs of the 6D pose estimation problem for pallets were needed, in particular because of the absence of realistic, readily available online datasets for validating the accuracy of pallet detection within an indoor or warehouse setting. Datasets such
as the PalLoc6D dataset (Knitt et al., 2022), which
serves as an RGB-D virtual dataset for the 6D detec-
tion of pallets, lack a realistic scenario because the
pallets are generated randomly in various locations,
surrounded by random objects, within a randomized
background. Since the dataset introduces unrealis-
tic backgrounds that do not represent real scenarios
that an AMR may encounter, in this work we propose two virtual datasets considering pallets on an industrial shelving unit. Additionally, to validate the proposed method
in a real scenario, a small indoor dataset was also ac-
quired.
4.1 Evaluation Datasets
This section presents a detailed explanation of the
three datasets created to evaluate the proposed
pipeline: two datasets generated in a virtual environment
and one dataset in an indoor setting. Samples from
the three datasets are shown in Fig. 2.
4.1.1 Virtual Pallet Dataset
The first virtual dataset was created due to the lack
of a realistic and available online dataset to vali-
date the accuracy of the detection of pallets within a
warehouse setting. The key idea revolves around an
AMR such as a forklift capable of navigating towards
designated pick-up and drop-off zones. Once posi-
tioned correctly, the robot must accurately identify the
pallet’s location, enabling seamless execution of the
loading and unloading processes. To achieve this ob-
jective, the dataset simulates a virtual warehouse envi-
ronment (see Fig. 3), consisting of carefully designed
shelves populated with pallets and boxes, capturing
data from the perspective of a robotic forklift. The
acquisition process is automated from a set of prede-
fined camera positions. The virtual dataset that was
produced comprises 816 RGB-D raw images with a
resolution of 1224x370 coupled with the correspond-
ing point clouds, 2D and 3D bounding box annota-
tions for every pallet object within the image, as well
as the essential calibration matrices. Additionally, the
system exports the masks of the objects in the scene,
in this context focusing only on the pallets.
4.1.2 Virtual Pallet Dataset with Occlusions
The second dataset was acquired on the same scenario
as the previous one (see Fig. 3), and introduces occlu-
sions to make it closer to the reality in industrial envi-
ronments. This dataset tries to simulate the scenarios
where the AMR is not completely aligned with the
pallets during the pick-up process. The acquisition
process is automated from a set of predefined cam-
era positions, but a noise factor is introduced to create
misaligned and occluded views. Different pallet loca-
tions were added, along with scenarios where the pal-
let was barely visible due to occlusions from boxes or
other elements of the warehouse.

Figure 2: Sample RGB and Depth images of the three evaluation datasets (virtual scenario, virtual scenario with occlusions, and indoor).

Figure 3: Virtual environment developed to acquire the virtual datasets.

The virtual dataset
that was produced comprises 1632 RGB-D raw im-
ages with a resolution of 1224x370 coupled with the
corresponding point clouds, 2D and 3D bounding box
annotations for every pallet object within the image,
as well as the essential calibration matrices.
4.1.3 Indoor Pallet Dataset
In order to make a first step from simulation to reality, an indoor dataset comprising 1597 RGB-D images was generated. We used a mobile robot
equipped with an Intel RealSense D435 camera. This
choice was based on its affordability and the high-
quality RGB sensor, which is capable of producing
excellent images even in low-light conditions. The
depth sensor performs well within a range of 2 to 4
meters, but its accuracy diminishes for more distant
objects, likely due to the low-light environment. A
scenario with a real pallet was created, with multi-
ple boxes stacked over the pallet. The robot would
move close to the pallet and then a box would be re-
moved, and the run replicated, until the pallet was
empty. A final run was included without the pallet
to serve as additional background. The RGB-D im-
ages were acquired in ROS, exported and processed
in the Roboflow platform (Dwyer et al., 2024). Us-
ing the Roboflow interface, the pallets were anno-
tated and the dataset created. Its interface supports
various annotation types, including bounding boxes,
polygons, and key points, allowing for precise delin-
eation of objects within images. In the context of this
work, Roboflow was used to label the pallets in the
collected 2D RGB images, preparing them for further
processing and analysis.
After the labelling process, aided by the Roboflow
interface, in-house software was used to crop the
labels and assign a 6D pose to each detection using
the point cloud obtained from the depth image (using
the camera’s intrinsic parameters).
4.2 Experimental Results
This section presents the performance and results of
the proposed approach. The evaluation metric is briefly explained first; afterwards, the validation on each dataset includes a distance-based accuracy study, to evaluate the network's ability to estimate object poses at various distances from the sensor, as well as a multimodal study to analyze how different input data can impact the model's performance.
4.2.1 Evaluation Metric
The evaluation of the method's performance is presented in terms of the Average Distance of Model Points (ADD). The ADD metric was first introduced by Hinterstoisser et al. (Hinterstoisser et al., 2012) and computes the average Euclidean distance between the observed points and the corresponding model points transformed by the estimated pose ($\hat{R}$, $\hat{t}$). A lower score indicates a greater accuracy of the pose estimation algorithm, and it is computed as follows:

$$\mathrm{ADD} = \frac{1}{N_{PC}} \sum_{p \in P} \left\| p - (\hat{R}\, p^M + \hat{t}) \right\|, \qquad (5)$$

where $\hat{R}$ is the estimated rotation, $\hat{t}$ is the estimated translation, $p$ represents one of the sampled points belonging to the point cloud $P$, and $p^M$ the corresponding point of the object with its ground-truth rotation and translation removed.
This metric can effectively function as both a loss function and a measure of accuracy. Predictions that attain a score lower than a predetermined threshold are considered correct.

Figure 4: Accuracy ADD according to the distance of the objects and distribution histogram of ADD per pallet instance, for the virtual and indoor datasets, respectively.
4.2.2 Virtual Pallet Dataset Performance
Before introducing and analyzing the results obtained, it is important to note that initial validation tests were performed on the LINEMOD dataset (Hinterstoisser et al., 2013). Although this dataset is out of the scope of pallet detection, it is still relevant to point out that, in our tests, a baseline DenseFusion architecture achieved 95.3% average accuracy using the LINEMOD dataset thresholds, while our approach achieved 98.1% under the same conditions. It is also important to note that these results were obtained with the inclusion of a refinement step that we do not include in this work, as it did not improve the results on either the virtual or the indoor datasets.
On the first virtual dataset, the method achieved 100% accuracy considering an ADD threshold of 0.05; it is important to stress that this dataset represents the ideal conditions an AMR may observe, so a near-perfect accuracy is expected. For the second virtual dataset, the obtained accuracy was approximately 88%. The results for different ADD thresholds are shown in Fig. 4, where the x axis represents the ADD threshold and the y axis the accuracy obtained. To better assess the results of the method, Fig. 4 also shows the distribution of the ADD distance per pallet instance; in this case, the majority of the pose estimates had an error distance below 0.1 meters, leaving the remaining ADD clusters at approximately 0.4 and 0.7 meters. From an analysis of the data, these clusters correspond to heavily occluded pallets where only a small number of points could be extracted and an oversampling strategy was employed. In the future, such objects may be automatically rejected, since their 6D pose is difficult to predict. Overall, the results demonstrate that the proposed framework is capable of achieving high accuracy, even in occluded scenarios, as shown by both the accuracy curve and the ADD distribution histogram.
4.2.3 Indoor Dataset Performance
On the indoor dataset, the obtained accuracy was approximately 56% for the same ADD threshold of 0.05 meters. Figure 4 shows the accuracy as a function of the ADD threshold. For a threshold of 0.1 meters, the method's accuracy rises to approximately 82%. This lower performance may be caused by the noisy nature of the real data, which was affected by motion artifacts as well as by the poor performance of the depth sensor under varying luminosity. Focusing on the ADD distribution, the majority of estimated poses have an error close to or below 0.1 meters. The ADD cluster at 0.7 meters reflects a behavior similar to that observed on the second virtual dataset. The accuracy curve shows a similar trend to the virtual dataset, but with slightly different results: it starts at around 56% accuracy for an ADD threshold of 0.05 meters, indicating that for very small errors the accuracy is lower than in the virtual dataset. The accuracy improves significantly as the error threshold increases; if we accept an error distance of 0.1-0.2 meters, accounting for annotation inaccuracies (the annotation was performed on the point cloud generated from the Depth image) and for small occlusions, the accuracy lies between 80 and 90%. The ADD distribution histogram shows that the majority of poses have an error distance of less than 0.1 meters, indicating that the framework is able to estimate most object poses with high precision.
5 CONCLUSIONS
This work presents a multimodal approach for 6D pose estimation of industrial objects in real and virtual environments, particularly aimed at future integration
with AMRs. Using the DenseFusion framework as
a basis, an enhanced version is proposed combin-
ing RGB and Depth and utilizing multi-head self-
attention mechanisms for robust feature fusion. The
method was tested on two virtual datasets, includ-
ing scenarios with occlusions, and a real-world in-
door dataset, showing promising results even under
challenging conditions such as occlusions and noise.
The proposed framework achieved, as expected, better accuracy on the occluded virtual dataset than on the
real-world indoor dataset, due to the noisy nature of
the measurements (that is not replicated in the virtual
datasets). Still, these results demonstrate the poten-
tial of the approach for future applications in indus-
trial environments, where it can significantly enhance
efficiency and safety. Future work will include the ac-
quisition of a new dataset in an industrial setting, with
further validation of the method proposed.
ACKNOWLEDGEMENTS
This work has been supported by the Por-
tuguese Foundation for Science and Technology
(FCT) through grant UIDB/00048/2020 (DOI
10.54499/UIDB/00048/2020) and by Agenda
“GreenAuto: Green innovation for the Automotive
Industry”, with reference PRR-C644867037-
00000013.
REFERENCES
Cao, Z., Sheikh, Y., and Banerjee, N. K. (2016). Real-time
scalable 6DOF pose estimation for textureless objects.
In 2016 IEEE International Conference on Robotics
and Automation (ICRA).
Chen, X. and Guhl, J. (2018). Industrial Robot Control with
Object Recognition based on Deep Learning. Proce-
dia CIRP, 76:149–154.
Drost, B., Ulrich, M., Navab, N., and Ilic, S. (2010). Model
globally, match locally: Efficient and robust 3D object
recognition. In 2010 IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition.
Dwyer, B., Nelson, J., Hansen, T., et al. (2024). Roboflow
(version 1.0) [software]. https://roboflow.com.
Fragapane, G., De Koster, R., Sgarbossa, F., and Strandha-
gen, J. O. (2021). Planning and control of autonomous
mobile robots for intralogistics: Literature review and
research agenda. European Journal of Operational
Research, 294(2):405–426.
Gorschlüter, F., Rojtberg, P., and Pöllabauer, T. (2022). A
Survey of 6D Object Detection Based on 3D Mod-
els for Industrial Applications. Journal of Imaging,
8(3):53.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Resid-
ual Learning for Image Recognition. In 2016 IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), Las Vegas, NV, USA. IEEE.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski,
G., Konolige, K., and Navab, N. (2013). Model Based
Training, Detection and Pose Estimation of Texture-
Less 3D Objects in Heavily Cluttered Scenes. In 11th
Asian Conference on Computer Vision. Springer.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Kono-
lige, K., Bradski, G., and Navab, N. (2012). Techni-
cal Demonstration on Model Based Training, Detec-
tion and Pose Estimation of Texture-Less 3D Objects
in Heavily Cluttered Scenes. In Computer Vision –
ECCV 2012. Workshops and Demonstrations, volume
7585. Springer Berlin Heidelberg, Berlin, Heidelberg.
Hinterstoisser, S., Lepetit, V., Rajkumar, N., and Konolige,
K. (2016). Going Further with Point Pair Features.
volume 9907, pages 834–848. arXiv:1711.04061 [cs].
Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics
yolov8. https://github.com/ultralytics/ultralytics. Ac-
cessed: 2024-06-5.
Knitt, M., Schyga, J., Adamanov, A., Hinckeldeyn, J., and
Kreutzfeldt, J. (2022). PalLoc6D-Estimating the Pose
of a Euro Pallet with an RGB Camera based on Syn-
thetic Training Data. https://doi.org/10.15480/336.
4470.
Li, Y., Wang, G., Ji, X., Xiang, Y., and Fox, D. (2020).
DeepIM: Deep Iterative Matching for 6D Pose Esti-
mation. International Journal of Computer Vision,
128(3):657–678. arXiv:1804.00175 [cs].
Li, Z., Wang, G., and Ji, X. (2019). CDPN: Coordinates-
Based Disentangled Pose Network for Real-Time
RGB-Based 6-DoF Object Pose Estimation. In 2019
IEEE/CVF International Conference on Computer Vi-
sion (ICCV).
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Ef-
fective approaches to attention-based neural machine
translation. arXiv preprint arXiv:1508.04025.
Peng, S., Zhou, X., Liu, Y., Lin, H., Huang, Q., and Bao, H.
(2022). PVNet: Pixel-Wise Voting Network for 6DoF
Object Pose Estimation. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 44(6):3212–
3223.
Srinivas, A., Lin, T., Parmar, N., Shlens, J., Abbeel, P., and
Vaswani, A. (2021). Bottleneck transformers for vi-
sual recognition. CoRR, abs/2101.11605.
Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C.,
Fei-Fei, L., and Savarese, S. (2019). DenseFusion:
6D Object Pose Estimation by Iterative Dense Fusion.
arXiv:1901.04780 [cs].
Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh,
J.-W., and Yeh, I.-H. (2020). CSPNet: A new back-
bone that can enhance learning capability of cnn. In
Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition workshops, pages
390–391.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyra-
mid Scene Parsing Network. arXiv:1612.01105 [cs].