Many of the steps of this monocular map initialization
process are well known; however, fast automatic
selection of the image pairs to use for the
initialization process remains an open problem. If
the image pairs do not have a sufficient translation
component in the camera movement or if the corre-
spondences are not sufficiently spread across the im-
ages, then the pose estimate is likely to be inaccurate
and the resulting map initialization will cause the SLAM
system to fail immediately. This work presents a fast
approach for automatically determining whether an image
pair is likely to result in an accurate reconstruction
by extracting information from the collection
of correspondences and feeding it to
trained logistic regression models.
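The idea of scoring an image pair from its correspondences can be sketched as follows. This is only an illustration under stated assumptions: the two features (normalized median displacement and bounding-box coverage), the weights, the bias, and all function names are hypothetical placeholders, not the features or trained models used in this work.

```python
import math
from statistics import median

def correspondence_features(matches, width, height):
    """Compute two illustrative (hypothetical) features from pixel matches.

    matches: list of ((x1, y1), (x2, y2)) correspondences between two images.
    """
    # Median displacement of the matches, normalized by the image diagonal.
    diag = math.hypot(width, height)
    disp = median(
        math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in matches
    ) / diag
    # Fraction of the image covered by the bounding box of first-image points.
    xs = [p[0] for p, _ in matches]
    ys = [p[1] for p, _ in matches]
    spread = ((max(xs) - min(xs)) * (max(ys) - min(ys))) / (width * height)
    return [disp, spread]

def logistic_score(features, weights, bias):
    """Standard logistic regression output: sigmoid(bias + w . f)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder parameters; a real model would learn these from labeled pairs.
WEIGHTS = [40.0, 6.0]
BIAS = -5.0

def is_good_pair(matches, width, height, threshold=0.5):
    """Accept the pair if the logistic score reaches the threshold."""
    feats = correspondence_features(matches, width, height)
    return logistic_score(feats, WEIGHTS, BIAS) >= threshold
```

Because evaluating a logistic model is a single dot product followed by a sigmoid, a decision of this kind costs very little per image pair, which is what makes the approach attractive for resource-limited platforms.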
1.1 Contribution
Previous approaches to selecting appropriate image
pairs from a video sequence for visual SLAM
map initialization are too slow for
resource-limited platforms, lack robustness, only
work for non-planar scenes, or struggle in their
decision-making when observing pure rotations. This
work attempts to address all of these issues.
Specifically, it contributes an
algorithm for selecting good image pairs from a video
sequence for visual SLAM map initialization that:
• is fast enough to be practical for real time use on
resource-limited platforms,
• provides decisions for both planar and non-planar
scenes,
• accurately rejects image pairs that demonstrate
pure rotational movement,
• and yields higher precision than the current state-
of-the-art approach.
2 RELATED WORK
A number of visual SLAM systems for real
time use have been developed over the past several
years. MonoSLAM (Davison et al., 2007) is the
first successful application of purely visual monocular
SLAM, but its map initialization is aided by the
use of a known fiducial marker. During initialization,
markers can be used to act as a pre-existing map with
known 3D point locations. This enables the system to
accurately estimate the camera pose each frame be-
fore the system has sufficient baseline length to trian-
gulate new points into the map. Markers also enable
the system to register an accurate scale for its map
points during initialization, as demonstrated in (Xiao
et al., 2017). The use of markers is further expanded
upon in systems such as (Korah et al., 2011), (Maidi
et al., 2011), (Ufkes and Fiala, 2013), (Kobayashi
et al., 2013), and (Arth et al., 2015) which utilize fidu-
cials for pose estimation throughout the entire runtime
of the system, rather than just using them for map ini-
tialization.
Though the use of markers often makes camera
pose estimation more reliable and straightforward, it
also makes pose estimation systems less versatile, as
the systems break down once the fiducial markers fall
out of frame and leave them with no reference points
to measure. (Berenguer et al., 2017) proposes a SLAM
system that heavily utilizes global image descriptors
to perform localization, which frees the system from
any use of fiducial markers. However, the global image
descriptors used in that work depend
on omnidirectional imaging and can take more
than half a second to perform localization, making
them unsuitable for systems that require rapid,
real time tracking (such as AR/VR systems or
autonomous automobiles). Faster and more robust
pose estimation systems eliminate the need for markers
by tracking somewhat arbitrary patches of the image
frame, without any predefined knowledge of what the
patches look like or where they should be located in
3D space (Klein and Murray, 2007), (Klein and Murray,
2009), (Sun et al., 2015), (Mur-Artal et al., 2015),
(Fujimoto et al., 2016), and (Qin et al., 2018).
Without knowledge of the 3D locations of the
tracked points, structure-from-motion (SfM) algo-
rithms are often used to aid in triangulating virtual 3D
locations for these points. For example, the PTAM
system (Klein and Murray, 2007), (Klein and Mur-
ray, 2009) is a real time visual SLAM system that
initializes its map by having the user manually se-
lect two frames (from different points in time), match-
ing FAST features (Rosten and Drummond, 2006)
between the two frames, estimating the pose differ-
ence between the frames with the 5-point algorithm
(Stewenius et al., 2006) or with a homography de-
composition (Faugeras and Lustman, 1988), and fi-
nally using the estimated pose to triangulate the point
matches into a virtual 3D map. (Sun et al., 2015)
adopts the PTAM framework for its map initialization
as well. The clear drawback of this approach is that
it is not automatic; it requires the user to select the
frames that should be used for initialization. (Huang
et al., 2017) presents an approach for map initialization
that can initialize the map in a single frame without
user intervention, but it requires the system
to operate inside a typical indoor room and can
only initialize map points that coincide with the walls
of the room.
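The final step of the two-frame pipeline described above, triangulating a point match once the relative pose is known, can be illustrated with a minimal sketch. It uses midpoint triangulation, chosen here for simplicity; the source does not state which triangulation method PTAM or the other cited systems use. The function names and the assumption that both rays are already expressed in a common world frame are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def triangulate_midpoint(c1, d1, c2, d2, eps=1e-9):
    """Midpoint of the shortest segment between two viewing rays.

    c1, c2: camera centers; d1, d2: ray directions, all in one world frame.
    Returns None when the rays are (near-)parallel, as happens when the
    camera motion between the frames has no translation component.
    """
    d1, d2 = normalize(d1), normalize(d2)
    w = sub(c1, c2)
    b = dot(d1, d2)
    denom = 1.0 - b * b          # zero when the rays are parallel
    if denom < eps:
        return None
    d, e = dot(d1, w), dot(d2, w)
    s = (b * e - d) / denom      # distance along ray 1 to its closest point
    t = (e - b * d) / denom      # distance along ray 2 to its closest point
    p1 = add(c1, scale(d1, s))
    p2 = add(c2, scale(d2, t))
    return scale(add(p1, p2), 0.5)
```

With two camera centers one unit apart, both observing the point (0, 0, 5), the sketch recovers that point. With identical camera centers (pure rotation), the two rays to the same point coincide, the denominator vanishes, and no depth can be recovered, which is why an image pair without sufficient translation is unsuitable for initialization.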
Towards Fast and Automatic Map Initialization for Monocular SLAM Systems