scene, i.e. that they are acquiring information at the same time. However, multiple cameras running independently often capture images asynchronously, so this assumption no longer holds when synchronous camera systems are not used. This makes proper 3D reconstruction based on the commonly used epipolar geometry difficult with asynchronous cameras.
To solve this problem, a method has been proposed that transforms the trajectories of the corresponding points into frequency space and recovers them as points in that space (Kakumu et al., 2013). This method focuses on the trajectory as a whole, rather than on each individual 3D point, and estimates the frequency components that represent the trajectory, which enables 3D reconstruction with asynchronous cameras. However, the reconstruction in this method relies on an affine camera model so that the 2D projected points and the 3D points are related linearly. This makes it difficult to apply when the cameras are located very close to each other, as is the case in this study.
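As an illustrative sketch of the underlying idea (not the cited method itself), a trajectory coordinate sampled over time is fully characterised by its frequency coefficients, so those coefficients can serve as the unknowns of the reconstruction; assuming a NumPy environment:

```python
import numpy as np

# One coordinate of a hypothetical trajectory sampled at T time steps.
T = 128
t = np.linspace(0.0, 1.0, T, endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 3 * t) + 0.2 * np.cos(2 * np.pi * 7 * t)

# Frequency components of the trajectory: these coefficients act as the
# parameters that describe the whole curve.
coeffs = np.fft.rfft(x)

# Keeping only the low-frequency components still reconstructs the
# trajectory, so estimating a few coefficients recovers the whole motion.
k = 10
truncated = np.zeros_like(coeffs)
truncated[:k] = coeffs[:k]
x_rec = np.fft.irfft(truncated, n=T)

print(np.max(np.abs(x - x_rec)))  # near-zero reconstruction error
```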
Here, viewing the frequency components recovered by this method as parameters that parametrically construct the 3D trajectory, the reconstruction of the 3D trajectory can be regarded as the estimation of the parameters that construct it. In this case, as long as enough constraints for estimating the parameters are available, 3D reconstruction can be achieved appropriately even when images taken at the same time are not available. In this study, 3D reconstruction from asynchronous cameras is performed using such a parameter representation of the trajectory.
Figure 2: Example of the 3D trajectory and projected 3D points. X is a 3D point, x is a projection point, and t is time.
4.2 Representation of 3D Trajectories
Using Neural Networks
A typical parametric representation of a 3D trajectory is interpolation, e.g. spline interpolation. In such a method, a 3D trajectory is constructed from multiple basis points, so estimating the 3D trajectory is equivalent to estimating the basis points. However, the 3D trajectories that can be expressed are limited to those representable by the chosen interpolation method. In addition, if the corresponding points cannot be observed due to occlusion or other reasons, appropriate estimation is not possible.
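For concreteness, the following is a minimal sketch of such an interpolation-based representation, assuming SciPy's CubicSpline; the knot times and basis points below are placeholders for the quantities that would be estimated:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical basis points: K knots of a 3D trajectory at known times.
# In the interpolation-based approach these knots are the parameters
# estimated from the image observations.
K = 8
knot_times = np.linspace(0.0, 1.0, K)
knots = np.random.rand(K, 3)              # K x 3 array of 3D basis points

spline = CubicSpline(knot_times, knots, axis=0)

# The continuous trajectory: any time t, including times between the
# shutters of the asynchronous cameras, maps to a 3D point.
X_t = spline(0.37)                        # 3D point at t = 0.37
```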
Therefore, this study adopts a representation of trajectories using neural networks. This approach exploits the fact that neural networks are general-purpose function approximators capable of representing a wide variety of functions, and uses this property to represent trajectories. In other words, the neural network is trained as a function that, given a time t as input, outputs the 3D point at that time. Training is achieved by minimising the reprojection error over each camera and each time, defined as follows:
$$
E'' = \frac{1}{2}\sum_{t}\sum_{i}\sum_{j}\left[\left\{\left(u_{i}^{j,t}-\bar{u}(P_{j}^{t},X_{i}^{t})\right)^{2}+\left(v_{i}^{j,t}-\bar{v}(P_{j}^{t},X_{i}^{t})\right)^{2}\right\}+\sum_{k\neq j}\left\{\left(e_{j}^{t}-\bar{e}(P_{k}^{t},T_{j}^{t})\right)^{2}\right\}\right] \tag{8}
$$
where $X_i^t$ is the 3D point obtained when time $t$ is input to the neural network, and $\bar{u}$ and $\bar{v}$ are the coordinates of the projected point obtained by projecting the 3D point with the camera matrix $P_j^t$. By minimising this loss function, we obtain a neural network that represents the 3D trajectory of the observed points captured by asynchronous cameras.
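To make the objective concrete, the following is a minimal PyTorch sketch of the first (reprojection) term of Eq. (8); the network architecture, the pinhole projection helper, and the data layout are assumptions, a single tracked point is used, and the second (epipolar) term is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical trajectory network: time t -> 3D point X(t) of one tracked point.
trajectory_net = nn.Sequential(
    nn.Linear(1, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3),
)

def project(P, X):
    """Project 3D point X with a 3x4 camera matrix P (pinhole model)."""
    Xh = torch.cat([X, X.new_ones(1)])   # homogeneous coordinates
    x = P @ Xh
    return x[0] / x[2], x[1] / x[2]      # (u_bar, v_bar)

def reprojection_loss(observations, cameras):
    """First term of Eq. (8): squared 2D error over all cameras j and times t.

    observations: list of (t, j, u, v) tuples, one per observed 2D point.
    cameras: dict mapping (j, t) -> 3x4 camera matrix P_j^t (torch tensors).
    """
    loss = torch.zeros(())
    for t, j, u, v in observations:
        X = trajectory_net(torch.tensor([[t]]))[0]  # X^t from the network
        u_bar, v_bar = project(cameras[(j, t)], X)
        loss = loss + (u - u_bar) ** 2 + (v - v_bar) ** 2
    return 0.5 * loss
```

Minimising this loss with a standard optimiser over the observations of all asynchronous cameras trains the network to reproduce the 3D trajectory.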
Note that when a neural network is used to represent an arbitrary function, it is known that directly inputting variables such as time makes it difficult to represent high-frequency components. To avoid this, these variables must first be mapped to a higher-dimensional space using positional encoding. This technique is also used in this study, and t is input to the neural network after being mapped to such a higher-dimensional space.
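The following is a minimal sketch of such a positional encoding, assuming NeRF-style sinusoidal features; the number of frequency bands is an assumption.

```python
import math
import torch

def positional_encoding(t, num_bands=10):
    """Map a scalar time t to sinusoidal features (sin/cos at growing
    frequencies) so the network can represent high-frequency motion."""
    t = torch.as_tensor(t, dtype=torch.float32).reshape(-1, 1)
    freqs = (2.0 ** torch.arange(num_bands, dtype=torch.float32)) * math.pi
    angles = t * freqs                                  # (N, num_bands)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# The encoded time (2 * num_bands features) replaces the raw scalar input,
# so the first layer of the trajectory network takes 2 * num_bands inputs.
gamma_t = positional_encoding(0.37)                     # shape (1, 20)
```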
In addition, appropriate initial values are required for this non-linear minimisation. For this reason, in this study, the 2D points are interpolated in advance to create a set of pseudo-synchronised corresponding points, and the interpolated values are used in a synchronous bundle adjustment. The result obtained is then refined by the optimisation described above to produce the final reconstruction.
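As a minimal sketch of this initialisation step, assuming simple linear interpolation with NumPy (the text does not specify the interpolation scheme):

```python
import numpy as np

def pseudo_synchronise(times, points_2d, target_times):
    """Interpolate one camera's 2D track onto a common time grid.

    times: (N,) increasing capture times of one asynchronous camera.
    points_2d: (N, 2) corresponding 2D points observed by that camera.
    target_times: (M,) shared times at which all cameras are evaluated.
    """
    u = np.interp(target_times, times, points_2d[:, 0])
    v = np.interp(target_times, times, points_2d[:, 1])
    return np.stack([u, v], axis=-1)  # (M, 2) pseudo-synchronised points

# The pseudo-synchronised correspondences feed a standard (synchronous)
# bundle adjustment, whose result initialises the minimisation of Eq. (8).
```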
Furthermore, the parameter representation using such a neural network is applicable not only to the 3D points to be reconstructed but also to all parameters, including the camera position. Therefore, in this research, the same representation is used for these parameters, and the camera position and the 3D trajectory are estimated by minimising the reprojection error defined above.
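As a sketch of applying the same representation to the camera parameters (the pose parameterisation below, axis-angle rotation plus translation, is an assumption):

```python
import torch
import torch.nn as nn

# Hypothetical pose network: time t -> 6-DoF camera pose
# (3 axis-angle rotation parameters + 3 translation parameters).
pose_net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 6),
)

# Both networks can then be optimised jointly against the reprojection
# error, estimating camera position and 3D trajectory together.
pose = pose_net(torch.tensor([[0.37]]))  # pose parameters at t = 0.37
```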