cost systems on the market, such as the Intel RealSense T265 tracking camera, which estimates 6-DOF motion (Intel, 2019). It is therefore worth investigating such commercially available visual sensors and discussing their usability for reliable tracking of a person, for instance in the context of industrial environments.
This paper aims at evaluating the tracking performance of the Intel RealSense T265, a new imaging and tracking system released in 2019, by comparing it to ORB-SLAM2 (Mur-Artal and Tardos, 2016). ORB-SLAM2 was chosen because it is one of the most accurate open-source V-SLAM algorithms and integrates the majority of state-of-the-art techniques, including multi-threading, loop-closure detection, relocalization, bundle adjustment and pose-graph optimization. As previously stated, the benchmarking context involves a camera held by a person moving in an industrial environment.
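For context, the following is a minimal sketch of how the T265's on-board 6-DOF pose estimates can be streamed with Intel's librealsense Python bindings (pyrealsense2); it is an illustrative snippet, not the acquisition pipeline used in our experiments:

import pyrealsense2 as rs

# Configure the T265 to stream its on-board 6-DOF pose estimates.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.pose)
pipeline.start(config)

try:
    for _ in range(200):  # read a short burst of pose samples
        frames = pipeline.wait_for_frames()
        pose_frame = frames.get_pose_frame()
        if not pose_frame:
            continue
        data = pose_frame.get_pose_data()
        # Translation in meters and rotation as a unit quaternion (x, y, z, w),
        # both expressed in the T265's own world frame.
        t, q = data.translation, data.rotation
        print(f"t = ({t.x:.3f}, {t.y:.3f}, {t.z:.3f}), "
              f"q = ({q.x:.3f}, {q.y:.3f}, {q.z:.3f}, {q.w:.3f})")
finally:
    pipeline.stop()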
The paper is organized as follows: Section 2 presents related work on V-SLAM and new imaging systems. Section 3 gives details about the sensors used in this study as well as the evaluation metrics used to assess their performance. Section 4 presents the calibration method used to express the camera's pose estimates and the VICON's in a common reference frame. Finally, Section 5 presents a comparative study between the RealSense T265 tracking camera and stereo ORB-SLAM2, followed by a discussion of the findings and conclusions.
2 RELATED WORK
Our work relates to a fundamental and heavily researched problem in computer vision, visual SLAM, through a comparison of the performance of the new low-cost RealSense T265 tracking sensor against the RealSense D435 coupled with the ORB-SLAM2 algorithm running in stereo mode. SLAM has been an active research topic for over 30 years, and the models for solving the SLAM problem can be divided into two main categories: filtering-based methods and graph-optimization-based methods. Filtering-based methods usually use the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF) or a Particle Filter (PF). These methods first predict both the pose and the 3D features in the map and then update these estimates when a new measurement is acquired. The key state-of-the-art filtering-based methods are MonoSLAM (Davison et al., 2007), which uses an EKF, and FastSLAM (Montemerlo et al., 2002), which uses a PF.
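As a minimal illustration, this predict-update cycle can be written in the standard EKF form (a generic textbook formulation, not the exact one used in MonoSLAM), with state $x_k$ stacking the camera pose and the map features, motion model $f$, measurement model $h$, and their Jacobians $F_k$ and $H_k$:

\[
\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_k), \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^\top + Q_k
\]
\[
K_k = P_{k|k-1} H_k^\top \big(H_k P_{k|k-1} H_k^\top + R_k\big)^{-1}
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \big(z_k - h(\hat{x}_{k|k-1})\big), \qquad P_{k|k} = (I - K_k H_k)\, P_{k|k-1}
\]

where $u_k$ is the control or odometry input, $z_k$ the measurement, and $Q_k$, $R_k$ the process and measurement noise covariances.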
Graph-optimization-based methods generally use bundle adjustment to simultaneously optimize the camera poses and the 3D points of the map, which corresponds to a non-linear least-squares optimization problem.
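In its standard form, this problem can be stated as the minimization of the total reprojection error over the camera poses $T_i$ and map points $X_j$:

\[
\min_{\{T_i\},\{X_j\}} \sum_{(i,j)} \rho\Big( \big\| z_{ij} - \pi(T_i X_j) \big\|^2_{\Sigma_{ij}} \Big)
\]

where $z_{ij}$ is the observation of point $j$ in image $i$, $\pi(\cdot)$ the camera projection function, $\Sigma_{ij}$ the measurement covariance and $\rho$ a robust kernel (e.g., Huber).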
A key method is PTAM, proposed by Klein and Murray (Klein and Murray, 2009), which introduced the separation of the localization and mapping tasks into different threads and performs bundle adjustment on keyframes only, in order to meet the real-time constraint. ORB-SLAM also uses multi-threading and keyframes (Mur-Artal et al., 2015) and can be considered an extension of PTAM. On top of these functionalities, ORB-SLAM performs loop closing and pose-graph optimization. ORB-SLAM was first introduced for monocular cameras and was subsequently extended to stereo and RGB-D cameras in (Mur-Artal and Tardos, 2016). It therefore represents the most complete approach among the state-of-the-art methods and has been used as a reference method in several works.
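As a sketch of the pose-graph optimization mentioned above (a standard formulation, not specific to ORB-SLAM), the keyframe poses $T_i \in \mathrm{SE}(3)$ are refined after a loop closure by minimizing the residuals of the relative-pose constraints $\Delta T_{ij}$:

\[
\min_{\{T_i\}} \sum_{(i,j)\in\mathcal{E}} \big\| \log\big(\Delta T_{ij}^{-1}\, T_i^{-1} T_j\big)^{\vee} \big\|^2_{\Lambda_{ij}}
\]

where $\mathcal{E}$ is the set of graph edges, $\log(\cdot)^{\vee}$ maps the pose error to its minimal 6-DOF representation and $\Lambda_{ij}$ weights each constraint.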
Moreover, a popular research axis in SLAM is visual-inertial SLAM (VI-SLAM), which fuses vision sensor measurements with an Inertial Measurement Unit (IMU). Like visual SLAM, VI-SLAM methods can be divided into filtering-based and optimization-based approaches. A review of the main VI-SLAM methods is presented in (Chang et al., 2018).
In addition, new camera technologies have been investigated in the context of visual SLAM. RGB-D cameras have been used extensively in recent years, and several works document their performance. In (Weng Kuan et al., 2019), a comparison of three RGB-D sensors that use near-infrared (NIR) light projection to obtain depth data is presented. The sensors are evaluated outdoors, where strong sunlight interferes with the NIR light. Three kinds of sensors were used: a time-of-flight (TOF) RGB-D sensor (the Microsoft Kinect v2), a structured-light (SL) RGB-D sensor (the Asus Xtion Pro Live) and an active stereo vision (ASV) sensor (the Intel RealSense R200). The same three sensors were also compared in the context of indoor 3D reconstruction, where the authors concluded that the Kinect v2 performs best, returning less noisy points and denser depth data.
In (Yao et al., 2017), a spatial-resolution comparison between the Asus Xtion Pro, the Kinect v1, the Kinect v2 and the R200 is presented. This comparison showed that, indoors, the Kinect v2 performs better than both PrimeSense-based sensors and the Intel R200. In (Halmetschlager-Funek et al., 2019), ten depth cameras were evaluated. The experiments were performed in terms of several evaluation metrics, including bias, precision, lateral noise, dif-