gorithm by using the depth information provided by a LIDAR sensor to guide the stereo correspondence process, limiting the disparity search interval. The advantage of the method is that it generates fused, dense point clouds, which can then be used for more accurate detection of protruding obstacles in the scene. The principle of the fusion is that the more reliable but low-resolution LIDAR sensor guides the less reliable but higher-resolution stereo sensor. Obstacle detection is then performed on the fused data in two ways: by data clustering and by using a CNN.
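To make the guidance step concrete, the sketch below shows how a LIDAR depth map projected into the left image can bound the per-pixel disparity search of a simple SAD block matcher. It is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names, the SAD cost, the window size, and the margin around the LIDAR-predicted disparity are all assumptions.

```python
import numpy as np

def lidar_to_disparity(lidar_depth, focal_px, baseline_m):
    """Convert a sparse LIDAR depth map (metres, 0 = no return) projected
    into the left image into disparity values: d = f * B / Z."""
    disp = np.zeros_like(lidar_depth, dtype=np.float32)
    valid = lidar_depth > 0
    disp[valid] = focal_px * baseline_m / lidar_depth[valid]
    return disp, valid

def guided_block_matching(left, right, guide_disp, guide_valid,
                          max_disp=64, margin=4, win=5):
    """SAD block matching whose search is limited to the interval
    [guide - margin, guide + margin] wherever a LIDAR-based guide
    disparity exists; elsewhere the full [0, max_disp] range is used."""
    h, w = left.shape
    r = win // 2
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            patch_l = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.float32)
            if guide_valid[y, x]:
                lo = int(np.clip(guide_disp[y, x] - margin, 0, max_disp))
                hi = int(np.clip(guide_disp[y, x] + margin, 0, max_disp))
            else:
                lo, hi = 0, max_disp
            best_d, best_cost = 0, np.inf
            for d in range(lo, hi + 1):
                patch_r = right[y - r:y + r + 1,
                                x - d - r:x - d + r + 1].astype(np.float32)
                cost = np.abs(patch_l - patch_r).sum()  # SAD cost
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

Restricting the search interval both speeds up the matching and suppresses spurious correspondences, which is the intuition behind the denser and cleaner fused point clouds reported above.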
The results presented in Subsection 6.1 show that the fused point clouds lead to better 2D clustering results, with a detection accuracy up to 27% higher than that obtained with the Block Matching point clouds. The simple clustering, however, cannot determine the type of an obstacle, only its presence.
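As an illustration of the clustering-based detector, obstacle candidates can be obtained by keeping the points of the fused cloud that protrude above the ground and grouping their 2D projections by Euclidean distance. The sketch below uses scikit-learn's DBSCAN as a stand-in for such a 2D clustering step; the library choice, the height filter, and the distance threshold are illustrative assumptions rather than the parameters used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_obstacles(points_xyz, ground_z=0.0, min_height=0.3,
                      eps=0.5, min_samples=10):
    """Cluster protruding obstacles in a fused point cloud.

    points_xyz : (N, 3) array of fused stereo/LIDAR points.
    Points higher than `min_height` above the ground are projected onto the
    x-y plane and grouped with DBSCAN; each cluster is returned as an
    axis-aligned 2D bounding box (xmin, ymin, xmax, ymax).
    """
    above = points_xyz[points_xyz[:, 2] > ground_z + min_height]
    if len(above) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(above[:, :2])
    boxes = []
    for lbl in set(labels) - {-1}:  # label -1 marks noise points
        cluster = above[labels == lbl]
        boxes.append((cluster[:, 0].min(), cluster[:, 1].min(),
                      cluster[:, 0].max(), cluster[:, 1].max()))
    return boxes
```

Each cluster is reported only as an anonymous box, which matches the observation above that clustering signals the presence of an obstacle without identifying its class.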
The CNN, in contrast, can detect different classes of obstacles. Our results show that it accurately detects and distinguishes humans and trees in the fused point clouds. The guidance provided by the LIDAR information significantly improves the detection F1 score of the trained networks.
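For reference, the F1 score used in this comparison is the harmonic mean of precision and recall; a minimal computation from per-class true-positive, false-positive, and false-negative counts is sketched below (the helper name is a placeholder, and no values from the paper are implied).

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```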
ACKNOWLEDGEMENTS
This work was supported by the project SAFE Per-
ception, funded by the Danish Innovation Fund.