Enhanced 3D Point Cloud Object Detection with Iterative Sampling and Clustering Algorithms

Shane Ward and Hossein Malekmohamadi

Institute of Artificial Intelligence, De Montfort University, The Gateway, Leicester, U.K.

Keywords:

mAP – Mean Average Precision, RANSAC – Random Sample Consensus, DBSCAN – Density-based Spatial Clustering of Applications with Noise, BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies, OPTICS – Ordering Points to Identify the Clustering Structure, MLCVNet – Multi-level Context VoteNet.

Abstract:

Existing state-of-the-art object detection networks for 3D point clouds provide bounding box results directly

from 3D data, without reliance on 2D detection methods. While state-of-the-art accuracy and mAP (mean-

average precision) results are achieved by GroupFree3D, MLCVNet and VoteNet methods for the SUN RGB-

D and ScanNet V2 datasets, challenges remain in translating these methods across multiple datasets for a

variety of applications. These challenges arise due to the irregularity, sparsity and noise present in point

clouds which hinder object detection networks from extracting accurate features and bounding box results.

In this paper, we extend existing state-of-the-art 3D point cloud object detection methods to include ﬁltering

of outlier data via iterative sampling and accentuate feature learning via clustering algorithms. Speciﬁcally,

the use of RANSAC allows for the removal of outlier points from the dataset scenes and the integration of

DBSCAN, K-means, BIRCH and OPTICS clustering algorithms allows the detection networks to optimise the

extraction of object features. We demonstrate a mean average precision improvement for some classes of the

SUN RGB-D validation dataset through the use of iterative sampling against current state-of-the-art methods

while demonstrating a consistent object accuracy of above 99.1%. The results of this paper demonstrate

that combining iterative sampling with current state-of-the-art 3D point cloud object detection methods can

improve accuracy and performance while reducing the computational size.

1 INTRODUCTION

For common point cloud object detection applications involving scene understanding, the accuracy and performance of the method rely heavily on pre-processing of the input data prior to training the object detection neural network. In complex real-world

applications, the scene and objects to be inspected

are susceptible to large amounts of outlier points and

noise which results in reduced accuracy and perfor-

mance. This also results in suboptimal use of com-

putational power on input data points which provide

misleading information of the objects in the scene.

Recent works related to neural networks for 3D ob-

ject detection, speciﬁcally using point cloud input,

have yielded promising results for various applica-

tions. It has also been demonstrated that the use of

purely geometric data with existing state-of-the-art

neural networks such as VoteNet (Qi et al., 2019), MLCVNet (Xie et al., 2020) and GroupFree3D (Liu et al., 2021) can produce superior results compared

to methods which utilize 2D detectors and approxi-

mate 3D bounding box proposals based on 3D region

networks. Methods heavily inﬂuenced by 2D detec-

tors become computationally expensive for deducing

3D proposals for complex scene understanding and

applications where speed is critical.

The PointNet (Qi et al., 2017) architecture was the

catalyst for the development of this new set of deep

learning methods with the objective of directly pro-

cessing point cloud data to tackle classiﬁcation, seg-

mentation, and object detection tasks. Prior to this

work, most 3D object detection methods performed

operations on 2D and 2.5D data to infer or project de-

tection algorithms onto 3D space such as Shape-based

3D matching or by transforming the 3D point cloud

data from irregular point clouds to regular 3D voxel

grids with methods based on VoxelNet. The PointNet

architecture was improved in terms of capturing local

structures in metric space, addressed by PointNet++.

Ward, S. and Malekmohamadi, H.

Enhanced 3D Point Cloud Object Detection with Iterative Sampling and Clustering Algorithms.

DOI: 10.5220/0010910600003124

In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 4: VISAPP, pages

674-681

ISBN: 978-989-758-555-5; ISSN: 2184-4321

The PointNet++ architecture is a direct extension of

PointNet using additional sampling and grouping in

conjunction with PointNet. This improved the ear-

lier method by using a hierarchical network utilizing

sampling and grouping layers which in turn improved

the model’s ability to classify and segment in met-

ric space. The current state-of-the-art works for 3D

point cloud object detection all utilize a PointNet++

backbone with additional network architectures for

each, such as deep Hough voting (Qi et al., 2019), multi-level context attention (Xie et al., 2020) and transformer-based attention (Liu et al., 2021).

In this paper, we build on the existing state-of-

the-art 3D point cloud object detection methods by

demonstrating the importance of iterative sampling

and clustering algorithms to achieve both fast and ac-

curate 3D bounding box proposals. We propose en-

hanced versions of the current state-of-the-art meth-

ods by integrating a RANSAC iterative sampling

method and combining this with multiple clustering

algorithms to serve a wide variety of applications

(DBSCAN, K-Means, BIRCH, OPTICS). The itera-

tive sampling method provides a customisable ﬁlter

for the raw input point cloud data to separate outliers

and the various clustering algorithms allow for the

early extraction of features prior to neural network

training. For fair comparison we run our enhanced

VoteNet, MLCVNet and GroupFree3D methods on

two common benchmark indoor 3D datasets, SUN-

RGBD and ScanNet. The objective of this work is to

present the following contributions:

1. Propose a novel iterative sampling and clustering framework for 3D point cloud object detection that can be applied to a wide variety of applications.

We demonstrate increased efﬁciency, accuracy and

speed through our pre-processing framework.

2. Enhanced VoteNet, MLCVNet and

GroupFree3D methods achieving state-of-the-art

results through:

- Integration of customisable iterative sampling

method for the ﬁltering of outlier points.

- Integration and comparison of four customis-

able clustering methods to allow for early feature

extraction in training phase.

3. Considerations for deployment of state-of-the-

art 3D object detection methods in real-world ap-

plications where efﬁciency, accuracy and speed are

paramount.

2 BACKGROUND

In recent times, there have been many contributions

to the state-of-the-art methods for 3D object detection

on various input data. In this section, we review the

methods most relevant to this work and speciﬁcally

for methods with point cloud input data.

PointNet. The PointNet architecture, as previously stated, was a large breakthrough in the direct processing of raw point cloud data to achieve results without the use of 2D detectors. The method has advantages, such as its processing time and its ability to process low numbers of data points, but disadvantages such as poor accuracy and the disconnect between the data representation and the actual world scene make it unusable for 3D point cloud object detection in applications where dense scene understanding is a requirement. The PointNet architecture provided an end-to-end network for the classification, part segmentation and semantic segmentation of raw point cloud data. The method, which uses sampling of point sets, is an alternative to 3D voxelization, which introduces approximation errors in applications where high accuracy is required. This work demonstrated that with a basic architecture reasonable results are achieved. For testing robustness, it was shown that with 50% of points missing from an input set, the accuracy only dropped by 2.4% and 3.8% under furthest and random input sampling respectively. Also,

the method demonstrated robustness to outlier points,

achieving greater than 80% accuracy even when 20%

of points are outliers. PointNet was the ﬁrst of its

kind in demonstrating computational cost efﬁciency

which is an important factor in industrial applications.

PointNet is capable of processing greater than 1M

points/second with 1080X GPU showing great poten-

tial for real-time applications but the method did not

capture local structures in metric space.

PointNet++. The shortcomings of the PointNet archi-

tecture in terms of capturing local structures in met-

ric space were quickly addressed with PointNet++.

The architecture is a direct extension of PointNet us-

ing additional sampling and grouping in conjunction

with PointNet. This improved the earlier method by

using a hierarchical network utilizing sampling and

grouping layers which in turn improved the model's ability to classify and segment in metric space.

The performance of the PointNet++ method on the

ModelNet40 dataset outperformed Subvolume (voxel

method), MVCNN (image method) and the earlier

PointNet method (Point clouds) with an accuracy of

91.9%. The paper acknowledged that further work in

improving inference speed (especially for MSG and

MRG layers) was a future option. It is also noted that

CNN based methods do not apply to unordered point

sets (point cloud data) and that the method can scale

well.
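The sampling layers mentioned above are commonly built on farthest point sampling, which picks a subset of points that covers the scene evenly. The following is a minimal pure-Python sketch of that idea, not the PointNet++ implementation itself; the function name `farthest_point_sampling` and the `start` argument are illustrative assumptions.

```python
def farthest_point_sampling(points, m, start=0):
    """Return m indices; each new index is the point farthest from those already chosen.

    Illustrative sketch only: `points` is a list of (x, y, z) tuples and
    m is assumed to be no larger than len(points).
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far.
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        chosen.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return chosen
```

Each iteration costs one pass over the point set, so selecting m centroids from n points takes O(n·m) distance evaluations in this form.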

VoteNet. Perhaps the biggest breakthrough related to

this work was the introduction of the Deep Hough

Voting network for object detection, also known as

VoteNet. This method utilizes a PointNet++ backbone for feature learning and couples this with Deep Hough voting in order to sample, group and propose classification. The VoteNet method utilizes 3D bounding boxes and depends solely on geometric information. As previously stated, unlike other methods VoteNet does not make use of RGB or depth images, which supports the theory that state-of-the-art object detection methods may be developed

from the processing of raw point clouds i.e., this is an

end-to-end method. In summary, the VoteNet method

learns to vote to object centroids directly from raw

point clouds and aggregates votes through their fea-

tures and local geometry to generate high-quality de-

tection proposals using only point cloud input, outper-

forming other methods where depth and colour im-

ages are also used.

MLCVNet. The objective of the MLCVNet (multi-

level context VoteNet) method is to recognize 3D

objects correlatively, building on the state-of-the-art

VoteNet. This method utilizes a self-attention mech-

anism and multi-scale feature fusion to model the

multi-level contextual information and propose three

sub-modules. The testing performed by the MLCVNet authors shows that the contextual sub-modules improve the accuracy and performance of 3D object detection. The results reported in the MLCVNet paper can be described as state-of-the-art. On the ScanNet v2 dataset, the

MLCVNet method outperformed VoteNet and 3DSIS

methods for all categories of the dataset in terms of

mAP. Also, the qualitative results of 3D object de-

tection on the SUN-RGBD dataset demonstrate state-

of-the-art results. The ground truth bounding boxes

were compared to the results of mainly the VoteNet

and MLCVNet networks.

GroupFree3D. At the time of undertaking this

work, the most recent state-of-the-art neural net-

work method for performing object detection on point

cloud data is the GroupFree3D method. The method

computes the feature of an object from all points

in the scene point cloud through the help of an at-

tention mechanism where the contribution of each

point is automatically learned during the training

phase. GroupFree3D proposes an attention mecha-

nism which utilises a Transformer decoder allowing

for all points in the input point cloud to be used during

training. Implemented on the benchmark SUN RGB-

D and ScanNet v2 datasets, the method obtained state-

of-the-art mAP results of 69.1 @ 0.25 and 52.8 @

0.50. The authors of that work also executed ablation studies on sampling strategy which demonstrated

improvements on the initial results. The objective of

this work is to build on recent state-of-the-art devel-

opments through implementation and evaluation of

enhanced versions of the identiﬁed current state-of-

the-art end to end 3D object detection methods on a

benchmark 3D point cloud dataset.

3 METHODOLOGY

We present a framework for performing iterative sam-

pling and clustering of point cloud data for 3D object

detection methods. The desired outcome of combining iterative sampling and clustering methods is to reduce the number of points in the input point

cloud. As a result of iterative sampling, the input

point cloud will have outlier points ﬁltered which im-

proves the neural network’s ability to accurately de-

tect objects in a dense or noisy scene. Adding cluster-

ing methods in combination with sampling will allow

for early extraction of key features and the identiﬁ-

cation of point clusters, the building blocks of each

object present in a scene. We recognise that a wide

variety of applications may be served by such a frame-

work for 3D point cloud object detection and we

therefore include several clustering algorithm options

in the framework to cater for this.

3.1 Iterative Sampling

Iterative sampling algorithms have existed for

decades and have proven to be powerful tools in the

pre-processing and ﬁltering of input data prior to neu-

ral network training. Perhaps the most common and

effective iterative sampling method is the Random

sample consensus (RANSAC) which estimates pa-

rameters of a mathematical model from a set of ob-

served data that contains outliers. A basic assumption

is that the data consists of inlier data points whose

distribution can be described by a model, and out-

liers which are data points which do not ﬁt the model.

These outlier points in point cloud data can result in

incorrect detection approximations about the interpre-

tation of the point set.

Figure 1: Segmentation of inlier and outlier points using

RANSAC method on industrial MVTec ITODD dataset.

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications

This outlier detection method applies to a wide

range of data science applications, but in this context

applies to dense point clouds for real-world applica-

tions. The removal of points in input point clouds

which provide no contextual information of objects

to be detected will reduce bounding box detection in-

accuracies and result in increased computational ef-

ﬁciency. The issue of computational size in training

neural networks with point cloud data remains one of

the most prevalent and RANSAC allows for signiﬁ-

cant reductions in non-contextual points in the input

data. The most relevant purpose of the RANSAC

method is to provide a robust method for the segmen-

tation and removal of planes from point cloud scenes

which is important in many applications where base

planes are present with the objects to be inspected on

top of the base plane.
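The plane segmentation described above can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used for the experiments: it assumes a single dominant plane, and the names `fit_plane` and `ransac_plane`, the distance `threshold` and the `iterations` count are all hypothetical choices.

```python
import random


def fit_plane(p1, p2, p3):
    """Plane through three points as (unit normal, offset d), or None if collinear."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    # Normal is the cross product of the two edge vectors.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    if norm < 1e-12:
        return None
    n = (nx / norm, ny / norm, nz / norm)
    d = -(n[0] * p1[0] + n[1] * p1[1] + n[2] * p1[2])
    return n, d


def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Return (inlier_indices, outlier_indices) for the best plane found.

    threshold (assumed, in scene units) is the maximum point-to-plane
    distance for a point to count as an inlier of a candidate plane.
    """
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        model = fit_plane(*rng.sample(points, 3))
        if model is None:
            continue  # degenerate sample, draw again
        (a, b, c), d = model
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= threshold]
        if len(inliers) > len(best):
            best = inliers
    keep = set(best)
    return best, [i for i in range(len(points)) if i not in keep]
```

Which of the two index sets is passed on to the detector depends on the application: when the fitted plane is a base plane, the plane inliers are the points to discard and the remaining points contain the objects.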

3.2 Clustering

Similar to iterative sampling methods, clustering al-

gorithms allow data points to be grouped into clus-

ters in an unsupervised manner. However clustering

methods allow for further subdivision of point sets

into several groups as opposed to just inlier and out-

lier groups with the RANSAC method. Multiple clus-

tering algorithms exist and are widely used in data

science applications. Relevant to this work on clus-

tering point cloud data for 3D object detection, we

have included four of the most common algorithms as

options within our proposed framework for training

neural networks.

3.2.1 DBSCAN

Density-based spatial clustering of applications with

noise (DBSCAN) is a density-based clustering algo-

rithm. DBSCAN is one of the most common clus-

tering algorithms and most cited in scientiﬁc litera-

ture, hence our selection for our proposed framework.

Given a set of points in space, DBSCAN groups to-

gether points that are closely packed together i.e.

points with multiple nearest neighbors. DBSCAN

marks points that lie alone in low-density regions as

outliers i.e., points whose nearest neighbors are too

far away.
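The grouping rule described above can be sketched in a few lines of Python. This is an illustrative brute-force version with assumed parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense region, counting the point itself); it is not the implementation used in the framework.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # Brute-force radius query; includes point i itself.
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels
```

The radius query here is quadratic in the number of points, so for scenes of the size used later in this paper a spatial index such as a k-d tree would normally replace it.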

3.2.2 K-means

K-means clustering is another popular unsupervised

clustering algorithm, which aims to group a number

of observations n into a target number of clusters k.

Each observation belongs to the cluster with the nearest mean (cluster centroid), which serves as a prototype of the cluster. The overall effect is to minimize the average of the squared distances between the data points in the same cluster. The pseudo-code for the K-means clustering algorithm is described in Fig. 4 below.

Figure 2: DBSCAN clustering algorithm diagram.

Figure 3: DBSCAN clusters example using industrial MVTec ITODD dataset.

Algorithm 1 k-means algorithm

1: Specify the number k of clusters to assign.

2: Randomly initialize k centroids.

3: repeat

4: expectation: Assign each point to its closest centroid.

5: maximization: Compute the new centroid (mean) of each cluster.

6: until the centroid positions do not change.

Figure 4: Pseudo-code for K-means clustering algorithm.

The K-means clustering method provides a useful alternative to DBSCAN which focuses on the centroids of clusters. This method aligns with the overall desired outcome of 3D point cloud object detection and pointwise networks due to the use of centroids. The size of the clusters must be set and for this, the mean average size of each object class is used.
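The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under assumed names (`kmeans`, `max_iter`, `seed`); the iteration cap and the handling of empty clusters are additions that Algorithm 1 does not specify.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) following the steps of Algorithm 1."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: random initial centroids
    labels = []
    for _ in range(max_iter):  # step 3: repeat (capped, as an added safeguard)
        # Step 4 (expectation): assign each point to its closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Step 5 (maximization): recompute each centroid as the cluster mean.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # assumption: keep empty cluster in place
        if new_centroids == centroids:  # step 6: centroids no longer change
            break
        centroids = new_centroids
    return centroids, labels
```

Because the initial centroids are drawn at random, different seeds can converge to different local optima on less well separated data; running several seeds and keeping the lowest within-cluster distance is a common remedy.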

3.2.3 BIRCH

Balanced iterative reducing and clustering using hier-

archies (BIRCH) is another commonly used unsuper-

vised data science algorithm used to perform hierar-

chical clustering over particularly large datasets. The

main advantage of the BIRCH clustering algorithm is

its ability to incrementally cluster multi-dimensional

metric data points in a given point set to produce the

best quality clustering for a given set of memory and

time resource constraints. As a result of the efﬁ-

ciency of the BIRCH clustering algorithm we imple-

ment BIRCH as another option in the proposed frame-

work. BIRCH has been successfully implemented

in several related works for the clustering of multi-

dimensional point sets.
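BIRCH achieves its incremental behaviour by summarising each sub-cluster with a clustering feature: the point count, the linear sum of the points and the sum of their squared norms. The sketch below illustrates only that summary and how it is updated and merged; it omits the CF tree and its thresholds, and the class and method names are illustrative rather than taken from the paper's implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringFeature:
    """Summary (N, LS, SS) of a sub-cluster of 3D points."""
    n: int = 0
    ls: tuple = (0.0, 0.0, 0.0)  # linear sum of the points
    ss: float = 0.0              # sum of squared norms of the points

    def add_point(self, p):
        """Absorb one point without storing it."""
        return ClusteringFeature(
            self.n + 1,
            tuple(a + b for a, b in zip(self.ls, p)),
            self.ss + sum(x * x for x in p),
        )

    def merge(self, other):
        """Clustering features are additive, so merging is component-wise."""
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root mean squared distance of the summarised points to the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))
```

Since the centroid and radius are recovered from the three stored quantities alone, a sub-cluster can grow or be merged in constant time per update, which is what keeps memory use bounded on large point sets.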

3.2.4 OPTICS

Ordering points to identify the clustering structure

(OPTICS) is the ﬁnal commonly used clustering al-

gorithm implemented in the framework for enhanc-

ing 3D point cloud object detection. The OPTICS

clustering algorithm is also used for ﬁnding density-

based clusters in spatial data. The principle of OP-

TICS is similar to DBSCAN, however it addresses

the main DBSCAN weakness: the detection of mean-

ingful clusters in data of varying density. In order to

achieve this, each point in the point set is ordered such that the spatially closest points are neighbors in the ordered structure. A special distance, the reachability distance, is also stored for each point; it represents the density that must be accepted for a cluster so that both points belong to the same cluster.
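The special distance stored by OPTICS is built from two quantities, which can be sketched as follows. This illustration assumes an unbounded search radius and counts a point as its own first neighbour when computing the core distance; both are simplifying conventions chosen here, and the function names are hypothetical.

```python
import math


def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def core_distance(points, i, min_pts):
    """Distance from point i to its min_pts-th nearest neighbour (itself included)."""
    d = sorted(_dist(points[i], q) for q in points)
    return d[min_pts - 1] if len(d) >= min_pts else math.inf


def reachability_distance(points, o, p, min_pts):
    """Reachability of point o from point p: max(core distance of p, dist(p, o))."""
    return max(core_distance(points, p, min_pts), _dist(points[p], points[o]))
```

Points inside a dense region have small reachability distances from their neighbours, while points in sparse regions have large ones, so clusters of different density show up as valleys of different depth when the distances are read in the OPTICS ordering.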

4 RESULTS AND DISCUSSION

In order to evaluate the performance of our iterative

sampling and clustering framework, we first integrate it into the current state-of-the-art VoteNet, MLCVNet

and GroupFree3D 3D point cloud object detection

methods. We demonstrate the ability of the itera-

tive sampling to separate outlier points and reduce the

size of the input point cloud while all relevant data

points remain. We also demonstrate and compare the

ability of each clustering algorithm to enhance feature extraction in each of the state-of-the-art methods

using the benchmark SUN RGB-D and ScanNet V2

datasets with PointNet++ backbone for fair compar-

ison. All experiments for the purposes of this paper

were run utilizing the same setup for a fair compar-

ison also. The workstation consists of an Intel i9-

10900 processor (2.8GHz) and Nvidia GeForce RTX

2060 GPU. The workstation is running Ubuntu 20.04

and we use a python 3.7 anaconda environment to in-

stall all required packages, including PyTorch 1.1 and

Cuda 10.1.

4.1 Evaluation of Iterative Sampling

Enhanced VoteNet. For the implementation of this

method, we follow the provided instructions of the

VoteNet paper. This includes the use of a PointNet++

backbone with 4 set abstraction layers and 2 feature

propagation layers for a fair comparison and the use

of the common benchmark SUN RGB-D training and

validation datasets. The integration of the iterative

sampling framework includes modifying the SUN RGB-D detection dataset class to include the option to run VoteNet with RANSAC iterative sampling. To achieve the results described in Tables 1 to 4 below, we use 20,000 points as the input for

each point cloud scene. We run 400 epochs with a

batch size of 8 and a learning rate of 0.001. A key

point to note is that for training we use only geometric

information and no image data for a fair comparison

against methods utilising image data.

Table 1: Mean average precision mAP @ 0.25 Enhanced

VoteNet comparison against current state-of-the-art meth-

ods on SUN RGB-D v1 validation set - Part 1.

Method bath bed bookshelf chair desk

VoteNet 74.4 93.0 28.8 75.3 22.0

MLCVNet 79.2 85.8 31.9 75.8 26.5

GroupFree3D 80.0 87.8 32.5 79.4 32.6

Ours RANSAC 80.4 87.4 30.2 63.2 96.0

Ours DBSCAN 61.8 65.4 21.3 48.6 63.5

Ours BIRCH 66.1 69.3 20.7 47.4 62.6

Ours KMeans 75.4 73.7 23.2 51.3 66.8

Ours OPTICS 64.3 66.5 21.3 44.7 59.3

Table 2: Mean average precision mAP @ 0.25 Enhanced

VoteNet comparison against current state-of-the-art meth-

ods on SUN RGB-D v1 validation set - Part 2.

Method dresser nightstand sofa table toilet

VoteNet 29.8 62.2 64.0 47.3 90.1

MLCVNet 31.3 61.5 66.3 50.4 89.1

GroupFree3D 36.0 66.7 70.0 53.8 91.1

Ours RANSAC 24.1 63.6 66.4 45.2 69.1

Ours DBSCAN 17.9 36.6 31.9 36.5 80.3

Ours BIRCH 18.1 42.2 32.5 39.9 84.4

Ours KMeans 21.3 49.4 34.3 43.5 91.1

Ours OPTICS 18.8 43.4 31.0 40.3 79.6

Table 3: Mean average precision mAP @ 0.5 Enhanced

VoteNet comparison against current state-of-the-art meth-

ods on SUN RGB-D v1 validation set - Part 1.

Method bath bed bookshelf chair desk

VoteNet 45.4 53.4 6.8 56.5 5.9

GroupFree3D 64.0 67.1 12.4 62.6 14.5

Ours RANSAC 37.7 18.6 14.2 37.4 56.4

Ours DBSCAN 42.1 19.7 6.9 19.3 5.8

Ours BIRCH 50.3 17.2 4.1 20.2 7.5

Ours KMeans 57.1 35.4 7.0 23.3 11.4

Ours OPTICS 52.2 17.7 6.3 18.2 6.9

Table 4: Mean average precision mAP @ 0.5 Enhanced

VoteNet comparison against current state-of-the-art meth-

ods on SUN RGB-D v1 validation set - Part 2.

Method dresser nightstand sofa table toilet

VoteNet 12.0 38.6 49.1 21.3 68.5

GroupFree3D 21.9 49.8 58.2 29.2 72.2

Ours RANSAC 13.6 34.7 14.4 16.0 61.3

Ours DBSCAN 6.4 17.0 7.2 23.5 63.6

Ours BIRCH 5.6 19.6 7.8 28.4 52.5

Ours KMeans 7.3 23.8 9.1 22.6 64.7

Ours OPTICS 6.1 21.0 8.1 28.9 50.7

The integration of the RANSAC iterative sampling method to remove outlier points yielded promising results for the bath and desk classes at mAP @ 0.25 and for the bookshelf and desk classes at mAP @ 0.5, improving on the current state-of-the-art GroupFree3D results shown in Tables 1 to 4 above. However, the improvement was not consistent across all classes.
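The thresholds 0.25 and 0.5 attached to the mAP values are taken here to be 3D intersection-over-union (IoU) thresholds, following the usual convention for these benchmarks: a predicted box counts as a correct detection only if its IoU with a ground truth box of the same class reaches the threshold. The sketch below illustrates the overlap test for axis-aligned boxes given as (min corner, max corner); this is a simplifying assumption, since SUN RGB-D boxes also carry a heading angle, and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1)); 0 if empty."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0) * max(z1 - z0, 0.0)


def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    lo = tuple(max(p, q) for p, q in zip(a[0], b[0]))
    hi = tuple(min(p, q) for p, q in zip(a[1], b[1]))
    inter = box_volume((lo, hi))
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """True if the predicted box overlaps the ground truth box enough to count."""
    return iou_3d(pred, truth) >= threshold
```

For example, two 2 × 2 × 2 boxes offset by 1 along one axis share a volume of 4 out of a union of 12, giving an IoU of 1/3: the prediction counts as correct at the 0.25 threshold but not at 0.5, which is why the mAP @ 0.5 columns are uniformly lower.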

Figure 5: Average Object Accuracy of 99.27% and 99.84%

on SUN RGB-D training and validation datasets.

Figure 6: Input point cloud scene (20k points), Enhanced VoteNet bounding box prediction, Ground truth comparison vs prediction.

Enhanced MLCVNet. For the implementation of

this method, we follow the provided instructions of

the MLCVNet paper. This includes the use of a Point-

Net++ backbone with 4 set abstraction layers, 2 fea-

ture propagation layers and three sub-modules (patch-

patch context, object-object context and global-scene

context) to support a multi-level context attention

mechanism for a fair comparison and the use of the

common benchmark SUN RGB-D training and validation datasets. The integration of the iterative sampling framework includes modifying the SUN RGB-D detection dataset class to include the option to run MLCVNet with RANSAC iterative sampling. To achieve the results described in Tables 5 to 8, we again use 20,000 points as the input for each point cloud scene. We run 400 epochs with a batch size of 8 and a learning rate of 0.001 due to time constraints. Additionally, a key point to note is that for training we use only geometric information and no image data, for a fair comparison against methods utilising image data.

Table 5: Mean average precision mAP @ 0.25 En-

hanced MLCVNet comparison against current state-of-the-

art methods on SUN RGB-D v1 validation set - Part 1.

Method bath bed bookshelf chair desk

VoteNet 74.4 93.0 28.8 75.3 22.0

MLCVNet 79.2 85.8 31.9 75.8 26.5

GroupFree3D 80.0 87.8 32.5 79.4 32.6

Ours RANSAC 76.7 79.6 15.6 60.9 9.8

Ours DBSCAN 23.8 74.0 5.6 55.4 7.6

Ours BIRCH 26.5 76.7 4.6 56.4 8.5

Ours KMeans 31.2 76.7 4.6 56.4 8.5

Ours OPTICS 25.6 79.2 12.3 43.9 6.2

Enhanced GroupFree3D. For the implementation

of this method, we follow the provided instructions

of the GroupFree3D paper. This includes the use of

a PointNet++ backbone with 4 set abstraction layers

and 2 feature propagation layers and transformer de-

Table 6: Mean average precision mAP @ 0.25 En-

hanced MLCVNet comparison against current state-of-the-

art methods on SUN RGB-D v1 validation set - Part 2.

Method dresser nightstand sofa table toilet

VoteNet 29.8 62.2 64.0 47.3 90.1

MLCVNet 31.3 61.5 66.3 50.4 89.1

GroupFree3D 36.0 66.7 70.0 53.8 91.1

Ours RANSAC 12.2 35.4 27.4 39.2 95.5

Ours DBSCAN 9.2 23.2 28.0 34.3 86.1

Ours BIRCH 8.6 13.1 21.7 24.0 72.8

Ours KMeans 9.6 26.5 32.1 36.1 87.3

Ours OPTICS 7.8 19.0 25.3 26.1 77.7

Table 7: Mean average precision mAP @ 0.5 En-

hanced MLCVNet comparison against current state-of-the-

art methods on SUN RGB-D v1 validation set - Part 1.

Method bath bed bookshelf chair desk

VoteNet 45.4 53.4 6.8 56.5 5.9

GroupFree3D 64.0 67.1 12.4 62.6 14.5

Ours RANSAC 16.6 32.2 10.0 36.2 3.9

Ours DBSCAN 14.5 29.1 4.7 23.2 4.4

Ours BIRCH 13.9 36.1 2.6 17.5 3.4

Ours KMeans 27.8 31.2 7.1 23.5 29.1

Ours OPTICS 16.5 33.4 3.8 15.9 8.7

coder module to support a multi-head attention mech-

anism for iterative object feature extraction and box

prediction for a fair comparison and the use of the

common benchmark SUN RGB-D training and validation datasets. The integration of the iterative sampling framework includes modifying the SUN RGB-D detection dataset class to include the option to run GroupFree3D with RANSAC iterative sampling. To achieve the results described in Tables 9 to 12, we again use 20,000 points as the input

for each point cloud scene. We run 400 epochs with

a batch size of 8 and a learning rate of 0.001 due to

time constraints. A key point to note is that for train-

ing we use only geometric information and no image

data for a fair comparison against methods utilising

image data.

4.2 System Performance

As demonstrated by the experimental results per-

formed for this work, there is signiﬁcant potential

to further enhance existing state-of-the-art 3D point

cloud object detection methods with the use of it-

erative sampling and clustering methods. Our pro-

posed framework demonstrates improvements on the

state-of-the-art VoteNet and MLCVNet methods for

2 classes in each evaluation run. Due to time constraints, our experimental work on clustering methods and the use of the ScanNet V2 dataset were omitted from this version of the paper. We demonstrate the

success of the evaluated RANSAC iterative sampling

method on the SUN RGB-D validation dataset.

For Enhanced VoteNet, we improve on the state-

of-the-art mAP results for the bath and desk classes @

0.25 with +0.4 and +63.4 respectively. For Enhanced

VoteNet with mAP @ 0.5 we improve on the state-of-

Table 8: Mean average precision mAP @ 0.5 En-

hanced MLCVNet comparison against current state-of-the-

art methods on SUN RGB-D v1 validation set - Part 2.

Method dresser nightstand sofa table toilet

VoteNet 12.0 38.6 49.1 21.3 68.5

GroupFree3D 21.9 49.8 58.2 29.2 72.2

Ours RANSAC 7.3 7.0 11.6 8.4 56.2

Ours DBSCAN 5.3 32.8 20.3 13.7 39.9

Ours BIRCH 7.1 37.6 42.5 17.9 11.2

Ours KMeans 8.0 39.6 21.0 14.3 45.1

Ours OPTICS 9.8 40.4 36.8 11.9 28.5

Figure 7: Average Object Accuracy of 99.18% and 99.09%

on SUN RGB-D training and validation datasets.

the-art mAP results for the bookshelf and desk classes

with +1.8 and +41.9. We also demonstrate an object

accuracy of 99.27% during training and 99.84% dur-

ing testing.

For Enhanced MLCVNet, we improve on the

state-of-the-art mAP results again for the desk classes

@ 0.25 with +25.8. For Enhanced MLCVNet with

mAP @ 0.5 we did not achieve any improvements on

the state-of-the-art mAP results for any classes in the

validation dataset. We also demonstrate an object ac-

curacy of 99.18% during training and 99.09% during

testing.

For Enhanced GroupFree3D, we do not improve

on the state-of-the-art mAP results @ 0.25 or @

0.5. Due to time constraints the number of epochs

for training this model was reduced. We do how-

ever demonstrate an object accuracy of 99.23% dur-

ing training and 99.11% during testing. Overall, it

is clear from the experimental results that the addi-

tion of the iterative sampling method to each of the

current state-of-the-art methods can achieve improved

results due to the ﬁltering of outlier points. However,

it is also clear that the improvement is not consistent across object classes in the SUN RGB-D dataset, and future work is required to fine-tune the framework and improve results on the remaining classes.

5 CONCLUSIONS

In this paper, we propose an iterative sampling and

clustering framework to enhance 3D point cloud object detection. For iterative sampling we utilize the popular RANSAC algorithm, which allows outlier points in the input point cloud to be filtered out.

For clustering, we utilize the DBSCAN, K-means,

Figure 8: Input point cloud scene (20k points), Enhanced MLCVNet bounding box prediction, Ground truth comparison vs prediction.

Table 9: Mean average precision mAP @ 0.25 Enhanced

GroupFree3D comparison against current state-of-the-art

methods on SUN RGB-D v1 validation set - Part 1.

Method bath bed bookshelf chair desk

VoteNet 74.4 93.0 28.8 75.3 22.0

MLCVNet 79.2 85.8 31.9 75.8 26.5

GroupFree3D 80.0 87.8 32.5 79.4 32.6

Ours RANSAC 30.8 38.9 10.0 35.1 58.4

Ours DBSCAN 29.7 36.6 21.8 33.7 37.2

Ours BIRCH 34.3 44.9 24.7 39.1 40.3

Ours KMeans 31.3 34.6 20.3 35.2 38.1

Ours OPTICS 33.4 43.5 23.1 37.8 39.0

Table 10: Mean average precision mAP @ 0.25 Enhanced

GroupFree3D comparison against current state-of-the-art

methods on SUN RGB-D v1 validation set - Part 2.

Method dresser nightstand sofa table toilet

VoteNet 29.8 62.2 64.0 47.3 90.1

MLCVNet 31.3 61.5 66.3 50.4 89.1

GroupFree3D 36.0 66.7 70.0 53.8 91.1

Ours RANSAC 13.2 11.3 17.9 53.6 49.6

Ours DBSCAN 16.6 22.9 23.5 49.7 50.4

Ours BIRCH 19.8 26.3 19.2 51.0 63.8

Ours KMeans 28.5 32.6 28.1 52.0 51.3

Ours OPTICS 21.8 23.7 19.0 50.5 62.1

Table 11: Mean average precision (mAP @ 0.5) of Enhanced GroupFree3D compared against current state-of-the-art methods on the SUN RGB-D v1 validation set - Part 1.

Method        bathtub  bed   bookshelf  chair  desk
VoteNet       45.4     53.4  6.8        56.5   5.9
GroupFree3D   64.0     67.1  12.4       62.6   14.5
Ours RANSAC   38.6     41.3  10.2       51.7   12.9
Ours DBSCAN   32.0     42.4  13.8       37.6   36.4
Ours BIRCH    30.8     38.9  20.0       35.1   58.4
Ours KMeans   34.3     44.9  24.7       39.1   40.3
Ours OPTICS   33.1     40.2  16.4       38.0   41.4

Table 12: Mean average precision (mAP @ 0.5) of Enhanced GroupFree3D compared against current state-of-the-art methods on the SUN RGB-D v1 validation set - Part 2.

Method        dresser  nightstand  sofa  table  toilet
VoteNet       12.0     38.6        49.1  21.3   68.5
GroupFree3D   21.9     49.8        58.2  29.2   72.2
Ours RANSAC   16.3     21.4        18.6  43.3   51.2
Ours DBSCAN   14.1     20.9        22.3  47.8   55.7
Ours BIRCH    13.2     11.3        17.9  53.6   49.6
Ours KMeans   19.8     26.3        29.2  51.0   63.8
Ours OPTICS   20.2     18.7        21.9  29.0   54.7

BIRCH and OPTICS algorithms, which are widely used for data pre-processing. We evaluate our framework by integrating it into the current state-of-the-art VoteNet, MLCVNet and GroupFree3D methods, which report leading speed and accuracy on the benchmark SUN RGB-D and ScanNet V2 point cloud datasets.
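
The sample-and-consensus idea behind RANSAC can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in this work: it assumes that a single dominant plane is fitted and that outliers are the points far from that plane, which this section does not specify. The function name `ransac_plane`, the iteration count and the distance threshold are all hypothetical choices.

```python
import numpy as np

def ransac_plane(points, n_iters=200, dist_thresh=0.02, seed=0):
    """Fit one plane to an (N, 3) array with RANSAC.

    Returns (unit normal, offset d, boolean inlier mask) for the plane
    n . x + d = 0 with the largest consensus set. Parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    best_model = None
    for _ in range(n_iters):
        # Minimal sample: three distinct points define a candidate plane.
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # degenerate (collinear) sample, skip it
            continue
        normal = normal / norm
        d = -(normal @ p0)
        # Consensus step: points within dist_thresh of the plane are inliers.
        mask = np.abs(points @ normal + d) < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (normal, d)
    if best_model is None:
        raise ValueError("no non-degenerate sample found")
    return best_model[0], best_model[1], best_mask

# Synthetic scene: 500 points near the plane z = 0 plus 50 stray points above it.
rng = np.random.default_rng(1)
plane_pts = np.column_stack([
    rng.uniform(0.0, 1.0, 500),
    rng.uniform(0.0, 1.0, 500),
    rng.normal(0.0, 0.002, 500),
])
stray_pts = rng.uniform(0.5, 1.5, size=(50, 3))
scene = np.vstack([plane_pts, stray_pts])

normal, d, inliers = ransac_plane(scene)
filtered = scene[inliers]  # points that agree with the fitted plane
print(len(scene), "points in,", len(filtered), "kept")
```

Whether the consensus set or its complement is passed on to the detector depends on the application; the sketch only shows how the inlier mask is obtained.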

The experimental results in this paper show that the RANSAC iterative sampling method

VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications


can be a useful addition to current state-of-the-art 3D point cloud object detection methods, as shown by the improvements on the state-of-the-art mean average precision values @ 0.25 and @ 0.5 for some classes. However, the experimental results also show that the gains from iterative sampling are inconsistent across classes. This indicates the limitations of applying our unsupervised iterative sampling and clustering framework to a dataset with varying classes and object shapes and sizes, and suggests that the framework may be best suited to applications with primitive shapes or similar point cloud scenes. In future work, we plan to extend and fine-tune the framework to achieve superior results on other common benchmark datasets.
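
To illustrate how the four clustering algorithms named in this work could be applied to a set of 3D points, the following sketch uses scikit-learn on synthetic data. The use of scikit-learn, the helper name `cluster_points` and every parameter value are assumptions made for illustration; this section does not state which implementation or settings were used.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, Birch, OPTICS

def cluster_points(points, n_clusters=2, eps=0.5, min_samples=5):
    """Return {algorithm name: per-point label array} for an (N, 3) point array.

    All parameter values are illustrative, not tuned settings from the paper.
    """
    models = {
        "DBSCAN": DBSCAN(eps=eps, min_samples=min_samples),
        "K-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "BIRCH": Birch(threshold=0.1, n_clusters=n_clusters),
        "OPTICS": OPTICS(min_samples=min_samples),
    }
    # Density-based methods (DBSCAN, OPTICS) label noise points as -1.
    return {name: model.fit_predict(points) for name, model in models.items()}

# Synthetic scene: two compact, well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.05, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.05, size=(50, 3))
labels = cluster_points(np.vstack([blob_a, blob_b]))

for name, lab in labels.items():
    n_found = len(set(lab.tolist()) - {-1})
    n_noise = int((lab == -1).sum())
    print(f"{name}: {n_found} clusters, {n_noise} noise points")
```

K-means and BIRCH need the number of clusters up front in this configuration, whereas DBSCAN and OPTICS infer it from point density. That difference is one plausible reason the methods behave differently on scenes with varying object shapes and sizes.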

The results show that Enhanced VoteNet and Enhanced MLCVNet achieve high object accuracy for both training and testing on the benchmark SUN RGB-D dataset, with all runs yielding object accuracy of approximately 99.1% or higher, which is promising. The objective of this work is to evaluate the above dataset and methods against key considerations of industrial applications, which has not previously been done for raw point cloud object detection methods. VoteNet and MLCVNet, which were implemented on the 3D point cloud dataset, show promising results in terms of accuracy, computation and real-time capability for industrial applications. However, one additional consideration that needs further evaluation is the process of updating the models for new object classes and for changes in ambient conditions or infrastructure in an industrial setting, as this is important in modern real-world applications.

REFERENCES

Saifullahi Aminu Bello, Shangshu Yu, and Cheng Wang. Review: Deep learning on 3d point clouds. 2020.

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. 2020.

B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. pages 510–517, 2015.

A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. 2015.

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. 2017.

Bertram Drost, Markus Ulrich, Paul Bergmann, Philipp Hartinger, and Carsten Steger. Introducing mvtec itodd - a dataset for 3d object recognition in industry. Oct 2017.

Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. 2013.

Timo Hackel, N. Savinov, L. Ladicky, Jan D. Wegner, K. Schindler, and M. Pollefeys. Semantic3d.net: A new large-scale point cloud classification benchmark. volume IV-1-W1, pages 91–98, 2017.

S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. 2012.

T. Hodaň, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis. T-less: An rgb-d dataset for 6d pose estimation of texture-less objects. 2017.

R. Larsen, H. Aanaes, and S. Gudmundsson. Fusion of stereo vision and time-of-flight imaging for improved 3d estimation. volume 1, pages 1–9, 2019.

Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. 2018.

Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. 2019.

D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. pages 922–928, 2015.

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Apr 2017.

Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in metric space. 2017.

Charles R. Qi, Or Litany, Kaiming He, and Leonidas J. Guibas. Deep hough voting for 3d object detection in point clouds. 2019.

Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. pages 567–576, 2015.

Gusi Te, Wei Hu, Zongming Guo, and Amin Zheng. Rgcnn: Regularized graph cnn for point cloud segmentation. 2018.

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. Jun 2019.

Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. 2015.

Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, and Jun Wang. MLCVNet: Multi-level context VoteNet for 3d object detection. 2020.

Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-Free 3D Object Detection via Transformers. 2021.
