addressed. The system combines object localization and identification to improve robotic grasping. The main contribution of this work is a framework that integrates two different ML models into a single system for more complete and autonomous grasping tasks.
This paper focused on the integration of two different ML models, one for detection and one for pose estimation. Mask R-CNN and DenseFusion were successfully integrated and show adequate results in a real testing environment. The developed system detects movement and differences between consecutive video frames, reducing the need to run inference on every frame. After an initial selection of user parameters, the proposed solution runs entirely automatically and can generate grasping outputs from video data at a rate of 2.6 fps.
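The paper does not detail the change-detection step; the following is a minimal sketch of one plausible frame-difference gate, assuming OpenCV grayscale differencing with hand-picked thresholds (PIXEL_TOL, CHANGE_FRACTION, and run_pipeline are illustrative names, not the authors' implementation):

```python
import cv2
import numpy as np

PIXEL_TOL = 25          # assumed per-pixel intensity tolerance
CHANGE_FRACTION = 0.02  # assumed fraction of changed pixels that triggers inference

def frame_changed(prev_gray, curr_gray):
    """Return True when enough pixels differ between consecutive frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    return np.count_nonzero(diff > PIXEL_TOL) / diff.size > CHANGE_FRACTION

def process_stream(frames, run_pipeline):
    """Run the detection + pose pipeline only on frames that changed."""
    prev_gray = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or frame_changed(prev_gray, gray):
            run_pipeline(frame)  # e.g. Mask R-CNN segmentation + DenseFusion pose
        prev_gray = gray
```

Gating inference this way only pays off when the scene is mostly static between grasps, which matches the reported reduction in per-frame computation.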
A positive aspect of the proposed solution is that a human remains in control of choosing the optimal grasping points for each object and recording them in a configuration file (a possible format is sketched at the end of this section). This allows for user and application adaptability while increasing the system's accuracy and speed, since these points are not estimated at run time. The main limitations of the proposed solution were the overhead required to output correct gripper rotations for certain object positions, the impact of lighting conditions, and the dependency on previously scanned 3D models of the actual objects.
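The grasp-point configuration format is not specified in the text; the following is a minimal sketch of one plausible layout, with hypothetical object and field names, together with how a stored point might be mapped into the camera frame using an estimated 6D pose (rotation R, translation t):

```python
import numpy as np

# Hypothetical per-object grasp configuration created by the user offline.
# Grasp points are expressed in the object's model frame (meters), with a
# preferred gripper approach direction; the object name is illustrative.
GRASP_CONFIG = {
    "mustard_bottle": [
        {"position": [0.00, 0.02, 0.09], "approach": [0.0, 0.0, -1.0]},
        {"position": [0.01, -0.03, 0.05], "approach": [1.0, 0.0, 0.0]},
    ],
}

def grasp_in_camera_frame(obj_name, R, t, idx=0):
    """Map a stored grasp point into the camera frame using the estimated
    6D pose: rotation R (3x3) and translation t (length 3)."""
    grasp = GRASP_CONFIG[obj_name][idx]
    position = np.asarray(R) @ np.asarray(grasp["position"]) + np.asarray(t)
    approach = np.asarray(R) @ np.asarray(grasp["approach"])
    return position, approach
```

Storing the points in the object's model frame lets a single configuration file serve any camera viewpoint, since the pose estimate carries the transformation.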
ACKNOWLEDGEMENTS
INDTECH 4.0 – New technologies for intelligent manufacturing. Support on behalf of the IS for Technological Research and Development (SI à Investigação e Desenvolvimento Tecnológico). POCI-01-0247-FEDER-026653.
REFERENCES
Bicchi, A. and Kumar, V. (2000). Robotic grasping and
contact: a review. In Proceedings 2000 ICRA. Mil-
lennium Conference. IEEE International Conference
on Robotics and Automation. Symposia Proceedings
(Cat. No.00CH37065), pages 348–353 vol.1.
Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P., and Dollar, A. M. (2015). Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set. IEEE Robotics & Automation Magazine, 22(3):36–52.
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-FCN:
Object detection via region-based fully convolutional
networks. Advances in Neural Information Processing
Systems, pages 379–387.
Farhadi, A. and Redmon, J. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Lenz, I., Lee, H., and Saxena, A. (2015). Deep learning for
detecting robotic grasps. The International Journal of
Robotics Research, 34(4-5):705–724.
Li, Y., Wang, G., Ji, X., Xiang, Y., and Fox, D. (2019). DeepIM: Deep iterative matching for 6D pose estimation. International Journal of Computer Vision.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, volume 9905, pages 21–37.
Miller, A. T., Knoop, S., Christensen, H. I., and Allen,
P. K. (2003). Automatic grasp planning using shape
primitives. In 2003 IEEE International Conference
on Robotics and Automation, pages 1824–1829 vol.2.
Peng, S., Liu, Y., Huang, Q., Zhou, X., and Bao, H. (2019). PVNet: Pixel-wise voting network for 6DoF pose estimation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788.
Redmon, J. and Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525.
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.
Rusinkiewicz, S. and Levoy, M. (2001). Efficient variants of the ICP algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pages 145–152.
Saxena, A., Driemeyer, J., and Ng, A. Y. (2008). Robotic
grasping of novel objects using vision. The Interna-
tional Journal of Robotics Research, 27(2):157–173.
Song, C., Song, J., and Huang, Q. (2020). HybridPose: 6D object pose estimation under hybrid representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-Fei, L., and Savarese, S. (2019). DenseFusion: 6D object pose estimation by iterative dense fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. (2018). PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS).
Zakharov, S., Shugurov, I., and Ilic, S. (2019). DPOD: 6D pose object detector and refiner. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).