Multimodal 6D Detection of Industrial Pallets, in Real and Virtual
Environments, with Applications in Industrial AMRs
José Lourenço, Gonçalo Arsénio, Luís Garrote and Urbano J. Nunes
University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering, Portugal
{garrote, urbano}@isr.uc.pt
Keywords:
6D Pallet Detection, Multimodal Deep Learning, RGB-D Based and Point Cloud Based.
Abstract:
In this work we propose a multimodal approach for detecting and estimating the 6D pose of pallets, to be
applied in industrial environments. The method is designed for future integration with Autonomous Mo-
bile Robots (AMRs) for enhanced warehouse automation. Using the DenseFusion framework as a basis, the
proposed approach fuses RGB and Depth data using multi-head self-attention mechanisms to improve its ro-
bustness. To test the proposed methods, three datasets were developed: two virtual and one real-world indoor
dataset, with varying degrees of occlusion and alignment challenges. Experimental results demonstrated that
the approach achieved a high accuracy in occluded virtual scenarios and a promising result in real indoor sce-
narios, with increased performance when considering higher error thresholds. The obtained results show the
potential of this system for use in AMRs to enhance the efficiency and safety of automated pallet handling in
industrial settings in the future.
1 INTRODUCTION
Mobile robotics has repeatedly revolutionized industry, reshaping material handling technology.
As industries strive for greater efficiency and automa-
tion, warehouses and other sectors rapidly transition
from traditional Automated Guided Vehicles (AGVs)
to Autonomous Mobile Robots (AMRs), marking a
paradigm shift in how tasks are executed, and goods
are managed within dynamic environments (Fraga-
pane et al., 2021). As those AMRs become an essen-
tial part of the warehouses, the importance of sophis-
ticated perception capabilities, such as multi-instance
6D object pose estimation, becomes increasingly ev-
ident. It enables the robots to classify and recognize
objects, estimate their poses, and track them over a
period of time (Chen and Guhl, 2018). In this context,
multi-instance 6D object pose estimation comprises
the detection of objects and an estimation of their 3D
translation and 3D rotation. For some detection meth-
ods, this is a single stage, while others perform ob-
ject detection and pose estimation as distinct stages
(Gorschlüter et al., 2022). For the detection step, several techniques and algorithms exist that take point clouds, RGB, Depth, or RGB-D images as inputs; however, all of them face robustness challenges in industrial environments, such as noise and occlusions in the sensor data. In-
dustrial environments, like factories and warehouses,
often exhibit a cluttered arrangement of objects and
machinery. This poses a challenge for accurate ob-
ject detection using 6D methods, as it can be intricate
to discern the object of interest. Also, dealing with
objects that have different types of textures and sym-
metries can affect the performance of the AMRs. In
some cases, the object may lack adequate texture or
distinctive features, making it more difficult to accu-
rately estimate the object’s pose. Estimating the ob-
ject’s pose in real-time can also be a challenge in in-
dustrial environments due to the large amount of data
that needs to be processed.
In this work, a multimodal approach for detecting
and estimating the 6D pose of pallets within an in-
dustrial environment is proposed. The goal with this
detection system is, in the future, to integrate it into an
autonomous forklift platform. This autonomous fork-
lift will be capable of navigating to designated drop-
off and pick-up zones, detecting pallets of interest,
and managing their transportation. The integration
of this system into the production line is expected to
significantly enhance the warehouse’s efficiency and
reinforce workplace safety by further improving the
automation of forklift maneuvers.
The proposed multimodal detection approach is
based on the DenseFusion framework (Wang et al.,
2019), with changes introduced in the feature fu-
sion and geometry feature extraction stages. Due to the difficulty of acquiring data with an AMR in running factories, and in order not to hinder research on this topic, we also prepared three datasets. Two datasets were acquired and annotated in a virtual environment containing multiple industrial shelving units and pallets, while one dataset was acquired indoors and manually annotated, with a pallet in different positions and with different levels of occlusion.
2 RELATED WORK
The problem of 6D object detection is widely studied
in different fields of robotics. It pertains to the process
of recognizing 3D objects within a 3D space and de-
termining their positioning (X, Y , Z) and orientation
(roll, pitch, yaw). The different approaches to this
problem can be divided into RGB-based approaches
and RGB-D-based approaches.
RGB-based approaches can be holistic or based on dense correspondences. Holistic approaches
involve directly extracting the pose parameters from
RGB images, as in the case of DeepIM (Li et al.,
2020), a method that takes as an input the initial 6D
pose estimation of an object in the image and out-
puts a relative SE(3) transformation that is compared
to the initial pose to improve the estimate. On the
other hand, dense correspondence approaches estab-
lish correspondences between image pixels and mesh
vertices to recover poses using Perspective-n-Point
(PnP) techniques, like the Coordinates-Based Disentangled Pose Network (CDPN) (Li et al., 2019), which separates the pose estimation process into distinct predictions for rotation and translation. The rotation es-
timation employs a carefully designed local region-
based framework, enhancing both accuracy and ef-
ficiency. For translation estimation, the network di-
rectly derives this information from localized image
patches. These distinct tasks are integrated and ad-
dressed within a single unified network. Given that
the size of an object in an image can vary signifi-
cantly with its distance from the camera, the object
is scaled to a fixed size based on the detection out-
put. Finally, 2D-keypoint-based approaches detect 2D keypoints to establish 2D-3D correspondences for pose estimation, although they may suffer from loss of geometric information due to perspective projections. This is the case of the Pixel-wise Voting Network (Peng et al., 2022), which employs re-
gression on pixel-wise vectors to infer the positions
of keypoints, which are subsequently utilized to cast
votes for keypoint localization. This methodology es-
tablishes a versatile representation capable of accu-
rately localizing keypoints, even in scenarios where
they may be occluded or truncated. Furthermore, this
approach provides a means to assess the uncertainties
associated with keypoint locations, thus offering valu-
able insights for the PnP solver.
Since RGB-D images are easy to obtain, RGB-D-based approaches are widely investigated for the problem of 6D object detection. They can be divided into different categories. Template-based methods rely on feature- and shape-based template matching to locate the object in the image and roughly estimate its pose; for example, (Cao et al., 2016) employed a 3D model to generate example poses of a textureless object and identify the closest match to the input image using a GPU implementation. Their
method involved transforming images into the Lapla-
cian of the Gaussian space to ensure invariance to
changes in illumination and appearance. To enable
real-time matching, the authors proposed modifica-
tions to the template set and the image, as well as a
restructuring of the conventional normalized cross-
correlation operation. These adjustments allowed for
the harnessing of the computational power of the
GPU to perform rapid matrix-matrix multiplication.
Feature-based methods are also used in this type of approach; they exploit the point cloud to match 3D features and fit the object models into the scene. One example is the approach proposed by (Hinterstoisser et al., 2016), which introduced a series of enhancements to the Point Pair Feature (PPF) approach (Drost et al., 2010). These advancements en-
compass sampling and voting schemes aimed at mit-
igating the influence of clutter and sensor noise. The
sampling scheme selects pairs of points that are prob-
able to belong to the same object, while deliberately
avoiding pairs considered likely to belong to differ-
ent objects on the background. The voting scheme
then consolidates the PPF of all pairs of points antic-
ipated to belong to the same object, while disregard-
ing those anticipated to belong to different objects or
the background. Finally, there are deep-learning-based methods such as DenseFusion (Wang et al., 2019), which processes RGB and depth separately in two main stages: it first performs semantic segmentation of each object in the color image, and then processes the segmentation results to estimate the object's 6D pose, using an iterative pose refinement module that increases the precision of the orientation estimate with a small inference time.
3 METHODOLOGY
Figure 1: Pipeline of the proposed framework using RGB and Depth for pallet detection and 6D pose prediction.

The pipeline of the proposed framework, illustrated in Figure 1, uses RGB and Depth as inputs and is
heavily inspired by the DenseFusion approach (Wang
et al., 2019) and reuses several of its modules, in-
cluding the RGB feature extraction network, the point
cloud feature extraction network, the pixel-wise fea-
ture fusion, and the pose predictor. Modifications on
the input of the pipeline to use object detection in-
stead of a segmentation network and multi-head self-
attention at different levels are introduced to improve
over the DenseFusion approach, creating a new ap-
proach that can be deployed in an AMR considering
a shared object detection system.
3.1 Object Detector
The initial stage takes the RGB-D information as input and performs object detection for each object of interest in the image. The object detection network
is the YOLOv8 network (Jocher et al., 2023), com-
posed of a backbone network, a neck network and a
head network. The backbone network is built upon a
custom CSP-Darknet53 network (Wang et al., 2020)
and has a Spatial Pyramid Pooling Fast (SPPF) layer.
The neck network employs a Path Aggregation Net-
work (PANet) structure, which helps the model to ef-
fectively capture features at several scales by flowing
information across different spatial resolutions. Fi-
nally, the head network is responsible for generating
the final outputs, such as bounding boxes and con-
fidence scores for each object. For each frame pro-
cessed in the object detector, a set of bounding box de-
tections is obtained. For each detection that contains
a pallet, a crop of the RGB and Depth images is per-
formed considering the bounding box shape, to guarantee that only the object's shape and texture are processed in the subsequent steps, one object at a time.
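As an illustration of this stage, the sketch below shows one way the per-detection cropping could be implemented on top of the Ultralytics YOLOv8 API; the weights file name and the pallet class index are assumptions for illustration, not values from the paper.

```python
# Hedged sketch: pallet detection and per-detection RGB/Depth cropping.
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n_pallets.pt")  # hypothetical fine-tuned pallet weights


def detect_and_crop(rgb: np.ndarray, depth: np.ndarray, pallet_cls: int = 0):
    """Run YOLOv8 on the RGB frame and crop RGB/Depth per pallet detection."""
    results = model(rgb, verbose=False)[0]
    crops = []
    for box, cls in zip(results.boxes.xyxy.cpu().numpy(),
                        results.boxes.cls.cpu().numpy()):
        if int(cls) != pallet_cls:
            continue
        x1, y1, x2, y2 = box.astype(int)
        # crop both modalities with the same bounding box, one object at a time
        crops.append((rgb[y1:y2, x1:x2], depth[y1:y2, x1:x2], (x1, y1, x2, y2)))
    return crops
```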
3.1.1 RGB and Depth Feature Extractors
The RGB feature extractor is a modified version of
the Residual Network (He et al., 2016) integrated with
the Scene Parsing Network (Zhao et al., 2017) mod-
ule. The main goal of the feature extractor is to get
relevant features from RGB images. The backbone
of the feature extractor is the ResNet, known for its
ability to train deep models effectively. The ResNet
network uses residual blocks that allow the network
to learn residual functions, which represent the dif-
ference between the input and output of a layer in a
neural network, instead of unreferenced ones. This
means that instead of attempting to learn the complete
identity mapping from the initial stages, the network
can focus on learning the changes, or "residuals", to
the input’s identity mapping. The residual block con-
sists of two or three streams of convolutional neural
networks, followed by an element-wise addition op-
eration that combines the input with the output of the
convolutional layers.
In this particular implementation, ResNet-18 serves as the backbone. ResNet architectures vary in their number of layers; in this case, the network consists of 18 layers. ResNet-18 was chosen as a good compromise between accuracy and computing resource usage.
The PSPNet module is incorporated to enhance
the feature extractor’s capability to extract quality fea-
tures. The PSPNet module utilizes a pyramid pool-
ing strategy to capture multiscale contextual informa-
tion from the input image. The feature maps are di-
vided into multiple stages, each employing adaptive
average pooling and convolution operations to extract
features at different spatial resolutions. The original
features are then concatenated with these features after being bilinearly upsampled (a process of increasing the
spatial resolution of feature maps using bilinear in-
terpolation, which estimates new pixel values based
on the linear interpolation of neighboring pixels). To
improve efficiency, a bottleneck convolutional layer
minimizes the dimensionality of the concatenated fea-
tures. The process of feature extraction starts with
the RGB image passing through the ResNet back-
bone. Then, the network extracts both low-level and
high-level features from the image. These features
are then processed by the PSPNet module, which
captures contextual information at multiple pyramid
scales and incorporates it into the feature representa-
tion. By incorporating this module, object pose estimation becomes more robust to scale variations. For the depth feature
extraction, the first layer of the ResNet was modified
to process the 1-channel depth image.
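A minimal sketch of such a feature extractor is shown below, assuming a torchvision ResNet-18 backbone; the pyramid bin sizes and channel widths are illustrative choices, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool at several scales, upsample, concatenate."""

    def __init__(self, in_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins)
        self.bottleneck = nn.Conv2d(in_ch * 2, in_ch, 1)  # reduce concatenated dims

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                align_corners=False) for s in self.stages]
        return self.bottleneck(torch.cat([x] + pooled, dim=1))


def make_depth_backbone():
    """ResNet-18 backbone whose first convolution accepts a 1-channel depth map."""
    net = resnet18(weights=None)
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # keep the residual stages, drop the classification head
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2, net.layer3, net.layer4,
                         PyramidPooling(512))
```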
3.1.2 Multimodal Fusion of RGB and Depth
Feature Maps
This stage involves selecting which features from the RGB ($F_{RGB}$) and depth ($F_D$) streams are relevant to estimate/predict the object's 6D pose. A multi-head self-attention (Srinivas et al., 2021) strategy is employed in order to capture the most relevant features from both modalities.
Attention mechanisms (Luong et al., 2015) pro-
vide the network with salient features from each
modality, which minimizes the noise and irrelevant
information. This approach enables the network to
decide when and how to integrate RGB and Depth
data. These mechanisms generate attention weights
that emphasize the most salient features from each
modality.
Given the availability of both RGB and depth features, the introduction of an attention mechanism aims to fuse the two modalities to leverage complementary information. Let $F_{RGB} \in \mathbb{R}^{d_{RGB}}$ represent the feature vector derived from the RGB modality, where $d_{RGB}$ is the dimensionality of the RGB feature space, and $F_D \in \mathbb{R}^{d_D}$ represent the corresponding depth features, where $d_D$ is the dimensionality of the depth feature space. To fuse the two modalities, we concatenate these feature vectors along the feature dimension:

$$F_F = [F_{RGB}, F_D] \in \mathbb{R}^{d_{RGB}+d_D} \qquad (1)$$
The combined feature vector $F_F$ contains information from both RGB and depth modalities for each spatial location. Next, to model the relevant feature interdependencies and relationships, we employ the multi-head self-attention mechanism, which allows each location to attend to all other locations, enabling the model to capture relevant feature interactions. The attention mechanism computes a weighted sum of all the feature representations, where the weights are determined dynamically based on the similarity between the query and key vectors. For each attention head, the query ($Q$), key ($K$), and value ($V$) matrices are computed from the combined feature representation $F_F$:

$$Q = W_Q F_F, \quad K = W_K F_F, \quad V = W_V F_F \qquad (2)$$

where $W_Q, W_K, W_V \in \mathbb{R}^{(d_{RGB}+d_D) \times d_{head}}$ are learned projection matrices and $d_{head}$ is the dimensionality of each attention head. The attention weights are computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_{head}}}\right) V \qquad (3)$$
The outputs from multiple attention heads are con-
catenated and projected back to the original feature
space, yielding a more comprehensive representation.
By applying multi-head self-attention, the model cap-
tures both spatial and cross-modal interactions be-
tween the RGB and depth features, leading to a richer
representation that combines both appearance and
depth characteristics.
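A minimal sketch of this fusion step is given below, assuming the per-pixel RGB and depth features have already been flattened into sequences of length N (one token per spatial location); the feature dimensions and the number of heads are illustrative choices.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, d_rgb: int = 32, d_depth: int = 32, n_heads: int = 4):
        super().__init__()
        d_fused = d_rgb + d_depth                      # Eq. (1): concatenation
        self.attn = nn.MultiheadAttention(d_fused, n_heads, batch_first=True)
        self.proj = nn.Linear(d_fused, d_fused)        # project heads back to d_fused

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor):
        # f_rgb: (B, N, d_rgb), f_depth: (B, N, d_depth)
        f = torch.cat([f_rgb, f_depth], dim=-1)        # F_F = [F_RGB, F_D]
        out, _ = self.attn(f, f, f)                    # Q = K = V = F_F, Eqs. (2)-(3)
        return self.proj(out)


# usage: fused = AttentionFusion()(rgb_feats, depth_feats)
```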
3.1.3 Point Cloud Feature Extraction
From the object’s cropped representation, and using
the camera’s intrinsic parameters, a 3D point cloud is
obtained. From the set of 3D points in the point cloud, $N_{PC}$ points are selected, forming the set $P$. If a mask of the object is available, the points are selected among the exported points that represent the object; otherwise, they are uniformly sampled without repetition. If the point cloud's size is below $N_{PC}$, the point cloud is oversampled to reach $N_{PC}$ points. This can occur for objects that are heavily occluded or far away from the camera but whose pose estimate is still required. The point cloud feature extrac-
tion employed is derived from the DenseFusion’s im-
plementation, as it uses a PointNet-like architecture
to extract per-point geometric features. An additional
multi-head self-attention mechanism is introduced at
the end, similarly to the approach presented in Section
3.1.2, to focus the network on the geometric features
more relevant to the 6D pose estimation task.
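A minimal sketch of this point-cloud preparation step, under a pinhole-camera assumption; the intrinsics (expected here in crop coordinates), the depth scale, and the value of N_PC are illustrative.

```python
import numpy as np


def crop_to_pointcloud(depth_crop, mask, fx, fy, cx, cy,
                       n_pc=500, depth_scale=0.001,
                       rng=np.random.default_rng(0)):
    """Back-project a depth crop to 3D and sample exactly n_pc points."""
    # use the object mask when available, otherwise all valid depth pixels
    v, u = np.nonzero(mask if mask is not None else depth_crop > 0)
    z = depth_crop[v, u].astype(np.float32) * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)              # (M, 3)
    if len(points) == 0:
        return None                                   # nothing usable in this crop
    if len(points) >= n_pc:
        idx = rng.choice(len(points), n_pc, replace=False)  # uniform, no repetition
    else:
        idx = rng.choice(len(points), n_pc, replace=True)   # oversample small clouds
    return points[idx]
```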
3.1.4 Pixel-Wise Dense Feature Fusion Network
The objective of the Pixel-wise Dense Feature Fu-
sion Network is to fuse the information obtained from
the image and the 3D point cloud. The concept be-
hind the pixel-wise dense fusion network is to move
away from relying solely on the object’s global fea-
tures to determine its pose. Instead, the DenseFusion
approach performs local per-pixel fusion so that it is
possible to make predictions based on each feature.
In more practical terms, each point of the point cloud $P$ is associated with a set of features, composed of global features, geometric features and fused features. The global features are common to every point $p \in P$, and are obtained from a Multi-Layer Perceptron (MLP) using all geometric features and fused features
as inputs. This process aims to minimize the effects
of occlusion and detection/segmentation noise. This
allows the method to select the most reliable repre-
sentations based on the visible portion of the object,
reducing the impact of issues such as objects partially
hidden from view or interference from background el-
ements.
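The sketch below illustrates this per-point fusion, following the general DenseFusion idea: per-point geometric and colour-fused features are concatenated, pooled through an MLP into a global vector, and the global vector is appended back to every point. The channel sizes and the pooling choice are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class DenseFeatureFusion(nn.Module):
    def __init__(self, d_geo: int = 128, d_rgb: int = 64, d_global: int = 256):
        super().__init__()
        # MLP (1x1 convolutions over points) producing the global descriptor
        self.mlp = nn.Sequential(nn.Conv1d(d_geo + d_rgb, d_global, 1),
                                 nn.ReLU(inplace=True))

    def forward(self, geo_feats: torch.Tensor, rgb_feats: torch.Tensor):
        # geo_feats: (B, d_geo, N), rgb_feats: (B, d_rgb, N)
        per_point = torch.cat([geo_feats, rgb_feats], dim=1)         # fused per-point features
        global_feat = self.mlp(per_point).mean(dim=2, keepdim=True)  # pool over all points
        global_feat = global_feat.expand(-1, -1, per_point.shape[2])
        # each point keeps its own features plus the shared global descriptor
        return torch.cat([per_point, global_feat], dim=1)
```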
3.1.5 Pose Estimator
The pose estimator block in the Dense Fusion archi-
tecture estimates the 6D pose of known objects from
the RGB-D images. The block takes the pixel-wise
dense feature embedding from the Pixel-wise Dense
Feature Fusion network as input and outputs the pre-
dicted pose of the object. The fused features are pro-
cessed using an MLP which outputs a 3D vector rep-
resenting the translation of the object in the 3D space,
a quaternion representing the rotation of the object
and a confidence coefficient that represents the quality
of the pose estimate. This block uses a residual-based
approach to estimate the pose, and the pose estimation
loss is calculated by measuring the distance between
the observed object's point cloud ($P$) and the corresponding object's points centered on the object's center of mass ($P^M$), transformed by the estimated pose ($T$). The loss is quantified by the distance between those points and is defined as:

$$L = \frac{1}{N_{PC}} \sum_{i=1}^{N_{PC}} \left( \left| T p_i^M - p_i \right| c_i - w \log(c_i) \right), \qquad (4)$$

where $c_i$ is the confidence coefficient, $w$ is a balancing hyperparameter used as a secondary regularization term to balance the average distance loss and confidence, and $p_i$ and $p_i^M$ are points from the sets $P$ and $P^M$, respectively.
The network's output comprises $N_{PC}$ point predic-
tions. Each prediction includes the rotation quater-
nion, translation vector, and confidence coefficient,
all contributing to the estimated pose. By incorpo-
rating the confidence coefficient, the network can au-
tonomously evaluate the quality of its predictions.
The object’s 6D pose prediction is the one associated
with the highest confidence coefficient.
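The following sketch reads Eq. (4) as one transform and one confidence per dense prediction, and shows the confidence-based selection of the final pose; the tensor shapes and the balancing weight w are illustrative assumptions.

```python
import torch


def pose_loss(T, p_model, p_obs, conf, w=0.015):
    """Eq. (4): confidence-weighted point distance plus a log-confidence regularizer.

    T:       (N_pc, 4, 4) estimated transform per dense prediction (illustrative shape)
    p_model: (N_pc, 3) model points centered on the object's center of mass
    p_obs:   (N_pc, 3) observed points sampled from the cropped point cloud
    conf:    (N_pc,)   predicted confidence coefficients
    """
    p_h = torch.cat([p_model, torch.ones_like(p_model[:, :1])], dim=1)  # homogeneous coords
    p_pred = torch.einsum("nij,nj->ni", T, p_h)[:, :3]                  # T * p^M
    dist = torch.norm(p_pred - p_obs, dim=1)
    return torch.mean(dist * conf - w * torch.log(conf))


def select_best_pose(T, conf):
    """The reported 6D pose is the prediction with the highest confidence."""
    return T[torch.argmax(conf)]
```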
4 EXPERIMENTAL VALIDATION
To validate the proposed framework, datasets tailored to the needs of the 6D pose estimation problem for pallets were needed, in particular because of the absence of realistic, readily available online datasets for validating the accuracy of pallet detection within an indoor or warehouse setting. Datasets such
as the PalLoc6D dataset (Knitt et al., 2022), which
serves as an RGB-D virtual dataset for the 6D detec-
tion of pallets, lack a realistic scenario because the
pallets are generated randomly in various locations,
surrounded by random objects, within a randomized
background. Since the dataset introduces unrealis-
tic backgrounds that do not represent real scenarios
that an AMR may encounter, in this work we propose two virtual datasets considering pallets on an industrial shelving unit. Additionally, to validate the proposed method
in a real scenario, a small indoor dataset was also ac-
quired.
4.1 Evaluation Datasets
This section presents a detailed explanation of the
three datasets created to evaluate the proposed
pipeline: two datasets generated in a virtual environment
and one dataset in an indoor setting. Samples from
the three datasets are shown in Fig. 2.
4.1.1 Virtual Pallet Dataset
The first virtual dataset was created due to the lack
of a realistic and available online dataset to vali-
date the accuracy of the detection of pallets within a
warehouse setting. The key idea revolves around an
AMR such as a forklift capable of navigating towards
designated pick-up and drop-off zones. Once posi-
tioned correctly, the robot must accurately identify the
pallet’s location, enabling seamless execution of the
loading and unloading processes. To achieve this ob-
jective, the dataset simulates a virtual warehouse envi-
ronment (see Fig. 3), consisting of carefully designed
shelves populated with pallets and boxes, capturing
data from the perspective of a robotic forklift. The
acquisition process is automated from a set of prede-
fined camera positions. The virtual dataset that was
produced comprises 816 RGB-D raw images with a
resolution of 1224x370 coupled with the correspond-
ing point clouds, 2D and 3D bounding box annota-
tions for every pallet object within the image, as well
as the essential calibration matrices. Additionally, the
system exports the masks of the objects in the scene,
in this context focusing only on the pallets.
4.1.2 Virtual Pallet Dataset with Occlusions
The second dataset was acquired on the same scenario
as the previous one (see Fig. 3), and introduces occlu-
sions to make it closer to the reality in industrial envi-
ronments. This dataset tries to simulate the scenarios
where the AMR is not completely aligned with the
pallets during the pick-up process. The acquisition
process is automated from a set of predefined cam-
era positions, but a noise factor is introduced to create
misaligned and occluded views. Different pallet loca-
tions were added, along with scenarios where the pal-
let was barely visible due to occlusions from boxes or
other elements of the warehouse.

Figure 2: Sample RGB and Depth images of the three evaluation datasets (virtual scenario, virtual scenario with occlusions, and indoor).

Figure 3: Virtual environment developed to acquire the virtual datasets.

The virtual dataset
that was produced comprises 1632 RGB-D raw im-
ages with a resolution of 1224x370 coupled with the
corresponding point clouds, 2D and 3D bounding box
annotations for every pallet object within the image,
as well as the essential calibration matrices.
4.1.3 Indoor Pallet Dataset
In order to make a first step from simulation to reality, an indoor dataset comprising 1597 RGB-D images was generated. We used a mobile robot
equipped with an Intel RealSense D435 camera. This
choice was based on its affordability and the high-
quality RGB sensor, which is capable of producing
excellent images even in low-light conditions. The
depth sensor performs well within a range of 2 to 4
meters, but its accuracy diminishes for more distant
objects, likely due to the low-light environment. A
scenario with a real pallet was created, with multi-
ple boxes stacked over the pallet. The robot would
move close to the pallet and then a box would be re-
moved, and the run replicated, until the pallet was
empty. A final run was included without the pallet
to serve as additional background. The RGB-D im-
ages were acquired in ROS, exported and processed
in the Roboflow platform (Dwyer et al., 2024). Us-
ing the Roboflow interface, the pallets were anno-
tated and the dataset created. Its interface supports
various annotation types, including bounding boxes,
polygons, and key points, allowing for precise delin-
eation of objects within images. In the context of this
work, Roboflow was used to label the pallets in the
collected 2D RGB images, preparing them for further
processing and analysis.
After the labelling process, aided by the Roboflow
interface, in-house software was used to crop the
labels and assign a 6D pose to each detection using
the point cloud obtained from the depth image (using
the camera’s intrinsic parameters).
4.2 Experimental Results
This section presents the performance and results of
the proposed approach. The evaluation metric is briefly explained first; afterwards, the validation on each dataset includes a distance-based accuracy study, to evaluate the network's ability to estimate object poses at various distances from the sensor, as well as a multimodal study to analyze how different input data can impact the model's performance.
4.2.1 Evaluation Metric
The evaluation of the method's performance is presented in terms of the Average Distance of Model Points (ADD). The ADD metric was first introduced by Hinterstoisser et al. (Hinterstoisser et al., 2012) and computes the average Euclidean distance between the observed points and the corresponding model points transformed by the estimated pose ($\hat{R}$, $\hat{t}$). A lower score indicates a greater accuracy of the pose estimation algorithm, and it is computed as follows:

$$\mathrm{ADD} = \frac{1}{N_{PC}} \sum_{p \in P} \left\| p - (\hat{R}\, p^M + \hat{t}) \right\|, \qquad (5)$$

where $\hat{R}$ is the estimated rotation, $\hat{t}$ is the estimated translation, $p$ represents one of the sampled points belonging to the point cloud $P$, and $p^M$ the corresponding point of the object with its ground-truth rotation and translation removed.
This metric can effectively function as both a loss function and a measure of accuracy. Predictions that attain a score lower than a predetermined threshold are considered correct.

Figure 4: Accuracy ADD according to the distance of the objects and distribution histogram of ADD per pallet instance, for the virtual and indoor datasets, respectively.
4.2.2 Virtual Pallet Dataset Performance
Before introducing and analyzing the results obtained, it is important to note that initial validation tests were performed on the LINEMOD dataset (Hinterstoisser et al., 2013). Although this dataset is out of the scope of pallet detection, it is still relevant to point out that, in our tests, a baseline DenseFusion architecture achieved 95.3% average accuracy using the LINEMOD dataset thresholds, while our approach achieved 98.1% under the same conditions. It is also important to note that these results were obtained with the inclusion of a refinement step that we do not include in this work, as it did not improve the results on either the virtual or the indoor datasets.
On the first virtual dataset, the method achieved 100% accuracy considering an ADD threshold of 0.05; it is important to stress that this dataset represents the ideal conditions an AMR may observe, so a near-perfect accuracy is expected. For the second virtual dataset, the obtained accuracy was approximately 88%. The results for different ADD thresholds are shown in Fig. 4, where the x axis represents the ADD threshold and the y axis the accuracy obtained. To better assess the results of the method, Fig. 4 also shows the distribution of the ADD distance per pallet instance; in this case, the majority of the pose estimates had an error distance below 0.1 meters, leaving the remaining ADD clusters at approximately 0.4 and 0.7 meters. From an analysis of the data, these clusters correspond to heavily occluded pallets where only a small number of points could be extracted and an oversampling strategy was employed. In the future, such objects may be automatically rejected, since their 6D pose is difficult to predict. Overall, the results demonstrate that the proposed framework is capable of achieving high accuracy, even in occluded scenarios, as shown by both the accuracy curve and the ADD distribution histogram.
4.2.3 Indoor Dataset Performance
On the indoor dataset, the obtained accuracy was approximately 56% for the same ADD threshold of 0.05 meters. Figure 4 shows the accuracy as a function of the ADD threshold. For a threshold of 0.1 meters, the method's accuracy rises to approximately 82%. This lower performance may be caused by the noisy nature of the real data, which was affected by motion artifacts as well as by the poor performance of the depth sensor under varying luminosity. Focusing on the ADD distribution, the majority of estimated poses have an error close to or below 0.1 meters. The ADD cluster at 0.7 meters reflects a behavior similar to that observed on the second virtual dataset. The accuracy curve shows a similar trend to the virtual dataset, but with slightly different results: it starts at around 56% accuracy for an ADD threshold of 0.05 meters, indicating that for very small errors the accuracy is lower than in the virtual dataset. The accuracy improves significantly as the error threshold increases; if we accept an error distance of 0.1-0.2 meters, accounting for annotation inaccuracies (the annotation was performed on the point cloud generated from the Depth image) and for small occlusions, the accuracy lies between 80 and 90%. The ADD distribution histogram shows that the majority of poses have an error distance of less than 0.1 meters, indicating that the framework is able to estimate most object poses with high precision.
5 CONCLUSIONS
This work presents a multimodal approach for 6D pose estimation of industrial objects in real and virtual environments, particularly aimed at future integration
with AMRs. Using the DenseFusion framework as
a basis, an enhanced version is proposed combin-
ing RGB and Depth and utilizing multi-head self-
attention mechanisms for robust feature fusion. The
method was tested on two virtual datasets, includ-
ing scenarios with occlusions, and a real-world in-
door dataset, showing promising results even under
challenging conditions such as occlusions and noise.
The proposed framework achieved, as expected, better accuracy on the occluded virtual dataset than on the
real-world indoor dataset, due to the noisy nature of
the measurements (that is not replicated in the virtual
datasets). Still, these results demonstrate the poten-
tial of the approach for future applications in indus-
trial environments, where it can significantly enhance
efficiency and safety. Future work will include the ac-
quisition of a new dataset in an industrial setting, with
further validation of the method proposed.
ACKNOWLEDGEMENTS
This work has been supported by the Por-
tuguese Foundation for Science and Technology
(FCT) through grant UIDB/00048/2020 (DOI
10.54499/UIDB/00048/2020) and by Agenda
“GreenAuto: Green innovation for the Automotive
Industry”, with reference PRR-C644867037-
00000013.
REFERENCES
Cao, Z., Sheikh, Y., and Banerjee, N. K. (2016). Real-time
scalable 6DOF pose estimation for textureless objects.
In 2016 IEEE International Conference on Robotics
and Automation (ICRA).
Chen, X. and Guhl, J. (2018). Industrial Robot Control with
Object Recognition based on Deep Learning. Proce-
dia CIRP, 76:149–154.
Drost, B., Ulrich, M., Navab, N., and Ilic, S. (2010). Model
globally, match locally: Efficient and robust 3D object
recognition. In 2010 IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition.
Dwyer, B., Nelson, J., Hansen, T., et al. (2024). Roboflow
(version 1.0) [software]. https://roboflow.com.
Fragapane, G., De Koster, R., Sgarbossa, F., and Strandha-
gen, J. O. (2021). Planning and control of autonomous
mobile robots for intralogistics: Literature review and
research agenda. European Journal of Operational
Research, 294(2):405–426.
Gorschlüter, F., Rojtberg, P., and Pöllabauer, T. (2022). A
Survey of 6D Object Detection Based on 3D Mod-
els for Industrial Applications. Journal of Imaging,
8(3):53.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Resid-
ual Learning for Image Recognition. In 2016 IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), Las Vegas, NV, USA. IEEE.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski,
G., Konolige, K., and Navab, N. (2013). Model Based
Training, Detection and Pose Estimation of Texture-
Less 3D Objects in Heavily Cluttered Scenes. In 11th
Asian Conference on Computer Vision. Springer.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Kono-
lige, K., Bradski, G., and Navab, N. (2012). Techni-
cal Demonstration on Model Based Training, Detec-
tion and Pose Estimation of Texture-Less 3D Objects
in Heavily Cluttered Scenes. In Computer Vision –
ECCV 2012. Workshops and Demonstrations, volume
7585. Springer Berlin Heidelberg, Berlin, Heidelberg.
Hinterstoisser, S., Lepetit, V., Rajkumar, N., and Konolige,
K. (2016). Going Further with Point Pair Features.
volume 9907, pages 834–848. arXiv:1711.04061 [cs].
Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics
yolov8. https://github.com/ultralytics/ultralytics. Ac-
cessed: 2024-06-5.
Knitt, M., Schyga, J., Adamanov, A., Hinckeldeyn, J., and
Kreutzfeldt, J. (2022). PalLoc6D-Estimating the Pose
of a Euro Pallet with an RGB Camera based on Syn-
thetic Training Data. https://doi.org/10.15480/336.
4470.
Li, Y., Wang, G., Ji, X., Xiang, Y., and Fox, D. (2020).
DeepIM: Deep Iterative Matching for 6D Pose Esti-
mation. International Journal of Computer Vision,
128(3):657–678. arXiv:1804.00175 [cs].
Li, Z., Wang, G., and Ji, X. (2019). CDPN: Coordinates-
Based Disentangled Pose Network for Real-Time
RGB-Based 6-DoF Object Pose Estimation. In 2019
IEEE/CVF International Conference on Computer Vi-
sion (ICCV).
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Ef-
fective approaches to attention-based neural machine
translation. arXiv preprint arXiv:1508.04025.
Peng, S., Zhou, X., Liu, Y., Lin, H., Huang, Q., and Bao, H.
(2022). PVNet: Pixel-Wise Voting Network for 6DoF
Object Pose Estimation. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 44(6):3212–
3223.
Srinivas, A., Lin, T., Parmar, N., Shlens, J., Abbeel, P., and
Vaswani, A. (2021). Bottleneck transformers for vi-
sual recognition. CoRR, abs/2101.11605.
Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C.,
Fei-Fei, L., and Savarese, S. (2019). DenseFusion:
6D Object Pose Estimation by Iterative Dense Fusion.
arXiv:1901.04780 [cs].
Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh,
J.-W., and Yeh, I.-H. (2020). CSPNet: A new back-
bone that can enhance learning capability of cnn. In
Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition workshops, pages
390–391.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyra-
mid Scene Parsing Network. arXiv:1612.01105 [cs].