scene, i.e. that they are acquiring information at the same time. However, multiple cameras running independently often capture images asynchronously, so this assumption no longer holds when synchronous camera systems are not used. This makes proper 3D reconstruction based on the commonly used epipolar geometry difficult with asynchronous cameras.
To solve this problem, a method has been proposed that transforms the trajectories of the corresponding points into frequency space and recovers them as points in that space (Kakumu et al., 2013). This method focuses on the trajectory as a whole, rather than on each individual 3D point, and estimates the frequency components that represent the trajectory, which enables 3D reconstruction with asynchronous cameras. However, the reconstruction in this method relies on an affine camera model so that the 2D projected points and the 3D points are related linearly. This makes it difficult to apply when the cameras are located very close to each other, as is the case in this study.
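As an illustrative sketch of the underlying idea (not the cited method itself), a trajectory coordinate sampled over time is fully characterised by its frequency coefficients, so those coefficients can serve as the unknowns of the reconstruction; assuming a NumPy environment:

```python
import numpy as np

# One coordinate of a hypothetical trajectory sampled at T time steps.
T = 128
t = np.linspace(0.0, 1.0, T, endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 3 * t) + 0.2 * np.cos(2 * np.pi * 7 * t)

# Frequency components of the trajectory: these coefficients act as the
# parameters that describe the whole curve.
coeffs = np.fft.rfft(x)

# Keeping only the low-frequency components still reconstructs the
# trajectory, so estimating a few coefficients recovers the whole motion.
k = 10
truncated = np.zeros_like(coeffs)
truncated[:k] = coeffs[:k]
x_rec = np.fft.irfft(truncated, n=T)

print(np.max(np.abs(x - x_rec)))  # near-zero reconstruction error
```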
Here, viewing the frequency components recovered by this method as parameters that parametrically construct the 3D trajectory, the reconstruction of the 3D trajectory can be regarded as the estimation of the parameters that construct it. In this case, as long as enough constraints for estimating the parameters are available, 3D reconstruction can be achieved appropriately even when images taken at the same time are not available. In this study, 3D reconstruction from asynchronous cameras is performed using such a parameter representation of the trajectory.
Figure 2: Example of the 3D trajectory and projected 3D points. X is a 3D point, x is a projection point, and t is time.
4.2 Representation of 3D Trajectories
Using Neural Networks
A typical parametric representation of a 3D trajectory is interpolation, e.g. spline interpolation. In such a method, a 3D trajectory is constructed from multiple basis points, so estimating the 3D trajectory is equivalent to estimating the basis points. However, the 3D trajectories that can be expressed are limited to those representable by the chosen interpolation method. In addition, if the corresponding points cannot be observed due to occlusion or other reasons, appropriate estimation is not possible.
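For concreteness, the following is a minimal sketch of such an interpolation-based representation, assuming SciPy's CubicSpline; the knot times and basis points below are placeholders for the quantities that would be estimated:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical basis points: K knots of a 3D trajectory at known times.
# In the interpolation-based approach these knots are the parameters
# estimated from the image observations.
K = 8
knot_times = np.linspace(0.0, 1.0, K)
knots = np.random.rand(K, 3)              # K x 3 array of 3D basis points

spline = CubicSpline(knot_times, knots, axis=0)

# The continuous trajectory: any time t, including times between the
# shutters of the asynchronous cameras, maps to a 3D point.
X_t = spline(0.37)                        # 3D point at t = 0.37
```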
Therefore, this study adopts a representation of trajectories using neural networks. This approach exploits the fact that neural networks are general-purpose function approximators capable of representing a wide variety of functions, and uses this property to represent trajectories. In other words, the neural network is trained as a function that, given a time t as input, outputs the 3D point at that time. Training is achieved by minimising the reprojection error over each camera and each time, defined as follows:
$$
E'' = \frac{1}{2}\sum_{t}\sum_{i}\sum_{j}\left[\left\{\left(u_{i}^{j,t}-\bar{u}(P_{j}^{t},X_{i}^{t})\right)^{2}+\left(v_{i}^{j,t}-\bar{v}(P_{j}^{t},X_{i}^{t})\right)^{2}\right\}+\sum_{k\neq j}\left\{\left(e_{j}^{t}-\bar{e}(P_{k}^{t},T_{j}^{t})\right)^{2}\right\}\right] \tag{8}
$$
where $X_i^t$ is the 3D point obtained when time $t$ is input to the neural network, and $\bar{u}$ and $\bar{v}$ are the coordinates of the projected point obtained by projecting the 3D point with the camera matrix $P_j^t$. By minimising this loss function, we obtain a neural network that represents the 3D trajectory of the observed points captured by asynchronous cameras.
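To make the objective concrete, the following is a minimal PyTorch sketch of the first (reprojection) term of Eq. (8); the network architecture, the pinhole projection helper, and the data layout are assumptions, a single tracked point is used, and the second (epipolar) term is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical trajectory network: time t -> 3D point X(t) of one tracked point.
trajectory_net = nn.Sequential(
    nn.Linear(1, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3),
)

def project(P, X):
    """Project 3D point X with a 3x4 camera matrix P (pinhole model)."""
    Xh = torch.cat([X, X.new_ones(1)])   # homogeneous coordinates
    x = P @ Xh
    return x[0] / x[2], x[1] / x[2]      # (u_bar, v_bar)

def reprojection_loss(observations, cameras):
    """First term of Eq. (8): squared 2D error over all cameras j and times t.

    observations: list of (t, j, u, v) tuples, one per observed 2D point.
    cameras: dict mapping (j, t) -> 3x4 camera matrix P_j^t (torch tensors).
    """
    loss = torch.zeros(())
    for t, j, u, v in observations:
        X = trajectory_net(torch.tensor([[t]]))[0]  # X^t from the network
        u_bar, v_bar = project(cameras[(j, t)], X)
        loss = loss + (u - u_bar) ** 2 + (v - v_bar) ** 2
    return 0.5 * loss
```

Minimising this loss with a standard optimiser over the observations of all asynchronous cameras trains the network to reproduce the 3D trajectory.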
Note that when a neural network is used to represent an arbitrary function, it is known that directly inputting variables such as time makes it difficult to represent high-frequency components. To avoid this, these variables must first be mapped to a higher-dimensional space using positional encoding. This technique is also used in this study, and t is input to the neural network after being mapped to such a higher-dimensional space.
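The following is a minimal sketch of such a positional encoding, assuming NeRF-style sinusoidal features; the number of frequency bands is an assumption.

```python
import math
import torch

def positional_encoding(t, num_bands=10):
    """Map a scalar time t to sinusoidal features (sin/cos at growing
    frequencies) so the network can represent high-frequency motion."""
    t = torch.as_tensor(t, dtype=torch.float32).reshape(-1, 1)
    freqs = (2.0 ** torch.arange(num_bands, dtype=torch.float32)) * math.pi
    angles = t * freqs                                  # (N, num_bands)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# The encoded time (2 * num_bands features) replaces the raw scalar input,
# so the first layer of the trajectory network takes 2 * num_bands inputs.
gamma_t = positional_encoding(0.37)                     # shape (1, 20)
```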
In addition, appropriate initial values are required for this non-linear minimisation. For this reason, in this study, the 2D points are interpolated in advance to create a set of pseudo-synchronised corresponding points, and the interpolated values are used in a synchronous bundle adjustment. The result obtained is then refined by the optimisation described above to produce the final reconstruction.
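As a minimal sketch of this initialisation step, assuming simple linear interpolation with NumPy (the text does not specify the interpolation scheme):

```python
import numpy as np

def pseudo_synchronise(times, points_2d, target_times):
    """Interpolate one camera's 2D track onto a common time grid.

    times: (N,) increasing capture times of one asynchronous camera.
    points_2d: (N, 2) corresponding 2D points observed by that camera.
    target_times: (M,) shared times at which all cameras are evaluated.
    """
    u = np.interp(target_times, times, points_2d[:, 0])
    v = np.interp(target_times, times, points_2d[:, 1])
    return np.stack([u, v], axis=-1)  # (M, 2) pseudo-synchronised points

# The pseudo-synchronised correspondences feed a standard (synchronous)
# bundle adjustment, whose result initialises the minimisation of Eq. (8).
```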
Furthermore, the parameter representation using such a neural network is applicable not only to the 3D points to be reconstructed but also to all parameters, including the camera position. Therefore, in this research, the same representation is used for these parameters, and the camera position and the 3D trajectory are estimated by minimising the reprojection error defined above.
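As a sketch of applying the same representation to the camera parameters (the pose parameterisation below, axis-angle rotation plus translation, is an assumption):

```python
import torch
import torch.nn as nn

# Hypothetical pose network: time t -> 6-DoF camera pose
# (3 axis-angle rotation parameters + 3 translation parameters).
pose_net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 6),
)

# Both networks can then be optimised jointly against the reprojection
# error, estimating camera position and 3D trajectory together.
pose = pose_net(torch.tensor([[0.37]]))  # pose parameters at t = 0.37
```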