Proﬁciency Estimation by Motion Variability

obtained from Single Camera Input

Ayumi Matsumoto, Dan Mikami, Harumi Kawamura, and Akira Kojima

NTT Media Intelligence Laboratories, Nippon Telegraph and Telephone Corporation, Kanagawa, Japan

Keywords:

Motor Learning, Proﬁciency Estimation, Variability of Movements, Markerless Motion Capture.

Abstract:

This paper proposes a motor learning assist system that estimates the proﬁciency of trainees in making sports

motions on the basis of variability of his/her own 3D motions in trials captured by a single camera. Most

existing systems assume that a sequence of human poses must be obtained by multiple calibrated cameras

or a marker-based motion capture system. Such systems can be effectively used by professional athletes or

broadcast station personnel who specialize in sports, but not by casual sports fans who have no particular

athletic skills. We propose a method for evaluating proﬁciency on the basis of obtained for trainees in repeated

trials. Usable by not only elite athletes but also casual sports fans, the method has two important features. First,

it requires only a freely positionable single camera since its 3D pose estimation methodology is independent of

camera position. Second, it estimates proﬁciency only from a trainee’s own motions and thus does not require

any reference movements. In this paper, the golf swing is used as the target of motor learning. Experiment

results show that variability in 3D motion in trials is inversely proportional to the test subjects’ degree of

proﬁciency.

1 INTRODUCTION

Computer vision-based sports assistance has been

widely studied in recent years (Yu and Farin, 2005).

Reported techniques, include player tracking (Pin-

gali et al., 1998), semantic annotation of sports game

videos (Assfalg et al., 2002), and sports event detec-

tion (Luo et al., 2003). In addition, a real-time track-

ing technique called Hawk-eye (Owens et al., 2003)

is now being used for assisting umpires in making

judgements.

Computer vision assistance in motor learning for

casual sports fans, amateure sports teams, and artists

has also been considered. Chua et al. proposed a sys-

tem for learning sports motions by using virtual real-

ity (Chua et al., 2003). Choi et al. proposed a sys-

tem that compares the motions of a trainee with those

of an expert to provide information that is useful for

practicing sports (Choi et al., 2008). So far, however,

these systems have rarely been used in practice.

One main reason such systems have not gained

popularity is they require making troublesome prepa-

rations. Most existing systems assume that a sequence

of human poses must be obtained by multiple cali-

brated cameras or a marker-based motion capture sys-

tem. Such systems can be effectively used by pro-

fessional athletes or broadcast station personnel who

specialize in sports, but not by casual sports fans who

have no particular athletic skills.

Aiming to overcome the need for troublesome

preparations, Mikami et al. proposed a method for

analyzing human motions from a video sequence

captured by a single camera (Mikami et al., 2013;

Mikami et al., 2012). Their method enables analysis

of subtle changes in repetitive motions, but to apply

it the relative positions of the camera and the subject

should be the same when capturing motions. To ad-

dress these problems, we propose a system that can

provide assistance in motor learning to all sports fans

without troublesome preparations.

Generally speaking, it can be assumed that motor

learning is divided into three phases: cognitive, asso-

ciative, and autonomous (Schmidt and Lee, 1988). In

the cognitive phase, a trainee seeks to make correct

movements through a trial-and-error process. In the

associative phase, a trainee knows the check points to

follow to make correct movements and how to modify

incorrect movements. In the autonomous phase, the

motions become ingrained; the trainee can make them

almost automatically without thinking about them.

On the basis of this assumption, we propose a

method for evaluating proﬁciency based on the vari-

147

Matsumoto A., Mikami D., Kawamura H. and Kojima A..

Proﬁciency Estimation by Motion Variability obtained from Single Camera Input.

DOI: 10.5220/0004608301470153

In Proceedings of the International Congress on Sports Science Research and Technology Support (icSPORTS-2013), pages 147-153

ISBN: 978-989-8565-79-2

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

ability of the trial when it is subjected to repeated

trials. When the motor learning is in the cognitive

phase, we assume there is large variation in move-

ments. When the learning progresses and moves to

the associative phase, we assume the variation be-

comes rather small because a trainee gets to know the

check points and how to modify incorrect movements.

Finally, when it reaches the autonomous phase, we as-

sume that the variation in movements becomes quite

small.

To achieve our purpose, we use a markerless mo-

tion capture system proposed by Matsumoto et al.

(Matsumoto et al., 2012) that can obtain human poses

in 3D world coordinates from one camera input that

is robust under camera positions. This feature en-

ables motion variations to be compared from different

views for the same criteria. In this paper, we use the

golf swing as the motor learning target since there are

so many persons who enjoy the game of golf.

The reminder of this paper is organized as follows.

Section 2 overviews the markerless human motion

capture method proposed by Matsumoto et al., Sec-

tion 3 outlines our proﬁciency estimation method in

detail, Section 4 describes experiments we conducted,

and Section 5 concludes the paper with a brief sum-

mary.

2 GPDM-BASED MARKERLESS

MOTION CAPTURE WITH

SINGLE CAMERA

Our method uses a GPDM-based markerless motion

capture system proposed as a method for obtain-

ing human poses under robust camera position (Mat-

sumoto et al., 2012). Consisting of a training step and

a pose estimation step, it estimates the poses made

when subject movements are similar to trained move-

ments.

The method we propose has two principal fea-

tures. First, it requires only a single camera to work

well. Due to the innate characteristics of single cam-

era input, it is unable to obtain depth information.

To compensate for this, it includes a training step

in which use is made of 3D motion data captured

by a marker-based motion capture system. 3D mo-

tion data of various motions in sports can be found

in databases that are publicly available. For exam-

ple, the CMU mocap library includes motions of golf

swings, basketball dribbles, and soccer ball kicks.

Our method can employ 3D motion data from such

motion databases and thus skirt troublesome prepara-

tions. Second, it is robust against differences in the

relative positions of the camera and the target human.

The GPDM-based markerless motion capture method

learns the state dynamics of all possible views. In

the pose estimation step, it jointly estimates a 3D hu-

man pose and the camera position relative to it. This

enables robust tracking against camera position. The

following subsections describe each of these steps in

more detail.

2.1 Training Step

The GPDM-based markerless motion capture method

uses sequences of 3D human poses in the training

step. That is, it requires pose sequences obtained

by a marker-based motion capture system or a multi-

camera system. Note that since our system is de-

signed for speciﬁc motor learning, this requirement

is not especially problematic.

The training step actually comprises three steps.

The ﬁrst is estimating a view-dependent trajectory. It

should be noted that the a view-dependent observa-

tion can be virtually generated from training data be-

cause it consists of 3D motion data. The second is

reducing the dimensionality of data by using GPDM

(Wang et al., 2005). The third is learning the state dy-

namics for each view. Through this training step, the

view-dependent dynamics of human movement in a

low-dimensional feature space are obtained.

2.2 Pose Estimation Step

In the pose estimation step, state parameters, i.e., view

and pose parameters trained in the training step, are

estimated by particle ﬁltering (Isard and Blake, 1998)

of a video sequence captured by a single camera. Be-

cause this step estimates not only pose parameters but

also the view (i.e., the relative positions of human and

camera), it is robust with respect to the latter. An

HSV histogram of joints was used for the observation

model.

3 PROPOSED METHOD

This section outlines the details of the proﬁciency es-

timation system we propose for assisting motor learn-

ing. It has been said that motor learning can be di-

vided into three phases: cognitive, associative, and

autonomous (Schmidt and Lee, 1988). On the basis

of this knowledge, the proposed method estimates a

trainee’s proﬁciency in making motions from the vari-

ability of his/her own movements.

Figure 1 shows the system environment that we

assume. The proposed method uses a single cam-

icSPORTS2013-InternationalCongressonSportsScienceResearchandTechnologySupport

148

Camera

Person

Off line training

3D motion

data

Model

learning

Display

On line estimation

Video data

3D motion

estimation

Variability

calculation

Proficiency

estimation

3D motion

model

Figure 1: Assumed system environment. The proposed

method captures a trainee’s movements with a single cam-

era. The video is transferred to a PC, which then estimates

3D human poses and the view on the basis of a pre-training

3D motion model. Through multiple pose estimation trials,

the variability of movements is calculated and then proﬁ-

ciency is estimated and displayed. This simple setting is a

most important feature in practical use.

era to capture a trainee’s movements. The video is

transferred to a PC, which then uses the method de-

scribed in Sect. 2 to estimate 3D human poses and the

view on the basis of a pre-training 3D motion model.

Through multiple pose estimation trials, the variabil-

ity of movements is calculated and then proﬁciency is

estimated and displayed.

The variability estimation step consists of a phase

division step and a variability calculation step. The

proposed method ﬁrst divides the movements into

predeﬁned phases. The variability calculation step

then calculates variabilitices of duration, trajectory,

and pose at the switching point of phases. Specif-

ically, for a motion that has N phases, the method

yields 3N + 1 variabilities: N for duration, N for tra-

jectory, and N +1 for poses of terminal points. Figure

2 illustrates the proposed variability estimation.

3.1 Phase Division Step

Phase division is motion dependent and requires

knowledge of the motions being made. In this sub-

section, we use the swinging of a golf club and the

throwing of a ball to show examples of phase divi-

sion. Note that the phase deﬁnitions given are merely

Time (t)

State

Time (t)

State

Phase 1

Phase 2

Phase 3

Phase 4

Trial 1

Trial 2

(a) For each trial, a motion is divided into phases; in this ﬁg-

ure, the two trials (Trial 1 and Trial 2) displayed are divided

into four phases. The variability estimation step calculates

variabilities of duration, trajectory, and pose at the switching

point of the phases.

Phase 1

Trial 1

Trial 2

Trial 1

Trial 2

Temporal normalization

and

Resampling

(b) Duration varies by trial. Therefore, to calculate the vari-

ability of trajectory, the proposed method coordinates the du-

rations of trials. It then resamples by a speciﬁed time step.

Figure 2: Variability estimation of our method.

examples.

3.1.1 Golf Swing

As shown in Fig. 3, a golf swing consists of three

phases: backswing, downswing, and follow-through.

The backswing starts with the picture labeled “setup”

and ends with the one labeled “to”. The downswing

goes from “top” to “impac” and the follow-through

ProficiencyEstimationbyMotionVariabilityobtainedfromSingleCameraInput

149

Phase 1

Back swing

Phase 2

Down swing

Phase 3

Follow through

Setup

Top

Impact

Finish

Figure 3: Three phases of golf swing; back swing, down

swing, and follow-through.

Phase 1

Take-back

Phase 2

Cocking

Phase 3

Acceleration

Phase 4

Follow-through

Figure 4: Four phases of ball throwing: take-back, cocking,

acceleration, and follow-through.

from “impact” to “ﬁnish”.

3.1.2 Ball throwing

As shown in Fig. 4, the action of throwing a ball con-

sists of four phases: take-back, cocking, acceleration,

and follow-through.

These phase deﬁnitions are associated with body

conﬁguration parameters. Therefore, by using the se-

quence of pose estimates, automatic phase division

can be executed.

3.2 Variability Calculation Step

The variability calculation step calculates variabili-

ties of duration V(d), trajectory V(t), and pose at the

switching point of phases V(p). Because the motions

are dividedinto phases in the phase divisionstep, vari-

abilities of duration V(d) and pose V(p) can be cal-

culated by normal form.

For calculating the variability of trajectory V(t),

the proposed method uses a three-step solution. First,

it coordinates the durations of trials. Second, it re-

samples by a speciﬁed time step s. If a resampled

time does not have an observed pose data, it estimates

the pose by interpolation. Finally, it calculates the av-

erage of variabilities at resampled time steps.

4 EXPERIMENTAL

To verify the effectiveness of the proposed method,

we conducted experiments in which the golf swing

was used as the target motion. In this section we ﬁrst

show the experimental settings and then the results.

4.1 Experimental Settings

4.1.1 Subjects

In the experiments, each subject was asked to swing a

golf club ten times. The subjects varied in their expe-

rience in playing golf as shown in Table 1. Since the

main topic of this study was proﬁciency estimation,

we asked the subjects to wear colored bands on their

wrists and ankles to facilitate stable tracking.

4.1.2 Motion Data and Capturing

We used golf swing data of an adult male of average

build from the CMU motion capture library (Subject

#64) in the training step. The frame rate of this motion

data was 120 fps.

We captured videos of the subjects’ motions with

two cameras at different angles with 640 x 480 pixel

resolution. Capturing frame rate was 30 fps and each

trial had about 60 frames.

We downsampled the motion data to eliminate the

gap between its frame rate and that for the captured

videos.

4.1.3 Calculation

We used an Intel Core i7 3.20GHz CPU for calcula-

tion. For all experiments, 3000 particles were used

for the particle ﬁltering to estimate 3D human pose.

The phases were automatically divided on the basis

of the pose estimates; the phase lengths were normal-

ized in 5 samples. We caluculated the variability from

ten trials in each phase.

4.2 Result of 3D Pose Estimate

Snapshots of captured videos are shown in Fig. 5. The

overlay lines denote pose estimates. Note that the es-

timates included 3D pose information and the view;

the lines were generated from 3D poses and projected

on the basis of the estimated view. We veriﬁed that

we successfully obtained correct 3D poses despite the

difference of view point. Figure 6 (the solid lines)

shows left hand horizontal positions in 3D pose co-

ordinate. The different markers on the lines denote

camera location. The estimated positions were almost

same results in despite of a different camera position.

We also calculated 2D color tracking results to

show the beneﬁt of using 3D poses in Fig. 6 (the dot-

ted lines). The position provided at a different camera

place was different in the timing of the peak. This is

icSPORTS2013-InternationalCongressonSportsScienceResearchandTechnologySupport

150

Table 1: Subject s golf experiences.

Subjects Experience years Frequency Average score

A 18 Once a week 95

B 7 Five times a year 118

0 None None

Figure 5: Results of 3D pose estimation shown by red lines.

Top: Camera 1, Subject C. Bottom: Camera 2, Subject A.

because that information is degenerated nonlinearly

by obtaining 3D position with a camera. This result

affects subsequent calculation, for example, the divi-

sion for a phase. This means that using 3D poses al-

lows an equal trajectory to be obtained independent

of the camera position and that the proposed method

is not affected by the camera position relative to the

trainee; this is an important feature for practical use.

4.3 Result of Proﬁciency estimation on

the Basis of Motion Variability

Figure 7 shows the estimated variability of motions.

The variability was calculated from results obtained

in seven trials (out of ten, after removal of unstable

tracking). As Fig. 7 shows, the well experienced sub-

ject A showed very small variability of motions, while

the beginner subject C showed large variability. The

tendency that dispersion became small so that an ac-

100

150

200

250

300

350

400

450

-4

-3

-2

-1

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67

3D pose, Camera 1 3D pose, Camera 2

2D Tracking, Camera 1 2D Tracking, Camera 2

Hand positionin 2D image

sample

Hand position in 3D pose space

space

Figure 6: Results of left hand horizontal position calculated

from 3D estimated pose in 3D pose coordinate and 2D color

tracking results in image coordinate. Solid lines denote 3D

estimated pose, dotted lines denote 2D color tracking, and

different markers on lines denote camera location. The po-

sition provided at a different camera place was the same by

3D pose evaluation substantially, but the timing of the peak

is different in the 2D tracking.

quisition degree became higher was provided.

4.4 Discussion

In these experiments, the elapsed time for proﬁciency

calculation from ten trials was about ﬁfty minutes.

Since we will attempt to use this system to obtain

real-time feedback for assisting motor learning, we

assume we need to obtain about a 50-fold reduction in

the computational time. This would appear to be quite

difﬁcult but we believe it is not unachievable. Particle

ﬁlters are very suitable for use in parallel computing

techniques. In recent research there have been many

reports of using GP-GPU to achieve a more than 10-

fold increase in particle ﬁlter computation speed.

To enable the proposed method to estimate proﬁ-

ciency with greater precision, we consider it needs to

be able to deal with two important types of informa-

tion that it currently does not. One is the variability

of motions within a phase. The method currently cal-

culates the variability of trajectory on the basis of the

average of pose variabilities within a phase. How-

ever, the variability of motions within a phase may

also be informative. An example illustrating this point

is shown in Fig. 8. The results show that the well

experienced subject A showed small variability at all

times, while subjects B and C showed signiﬁcant vari-

ProficiencyEstimationbyMotionVariabilityobtainedfromSingleCameraInput

151

V(p) V(d) V(t) V(p) V(d) V(t) V(p) V(d) V(t) V(p)

Set-up Back-swing Top Down-swing Impact Follow-through Finish

Subject A

0.33 2.12 0.69 0.14 2.49 0.44 0.70 1.06 2.07 0.81

Subject B

18.61 26.48 8.41 18.12 34.94 9.37 33.96 15.11 10.33 64.37

Subject C

34.67 107.14 13.63 37.25 18.08 14.10 74.38 28.35 16.66 80.07

100

Variance

Figure 7: Estimated variability of motions. V(d) denotes variabilities of duration, V(t) denotes the average of variability of

trajectory at each time step, and V(p) denotes the average of variability of pose at each time step. Experienced subject showed

small variability of motions, while the beginner subject showed large variability.

1 2 3 4 5 6

Phase 1

1 2 3 4 5 6

Phase 2

1 2 3 4 5 6

Phase 3

Normalized time

Sub.A

Sub.B

Sub.C

Figure 8: Variability of motions within an experiment

phase. Results show that the well experienced subject A

showed small variability at all times, while subjects B and

C showed signiﬁcant variability with a similar tendency.

ability with a similar tendency. We only show these

results at this time, but more detailed results may be

obtained by using a more precise analysis method in

the acquisition process. The other is coordination be-

tween body parts. It may be considered that the mo-

tion of one part is changed to compensate for that of

another part, so that the total performance is adjusted

well. Therefore, the method’s effectiveness could be

enhanced by enabling it to handle this kind of coordi-

nation between body parts.

5 CONCLUSIONS

We have proposed a motor learning assist system that

estimates the proﬁciency of trainees in performing

motions such as those made in swinging a golf club or

throwing a ball. In this paper, we proposed a method

for evaluating proﬁciency on the basis of variability

in 3D motion obtained for trainees in repeated trials,

under the assumption that less variability over a num-

ber of trials correlates with the trainees’ proﬁciency.

Its principal features are that ﬁrst, it requires only a

single camera and can be used not only by elite ath-

letes but also by casual sports fans, and second, it esti-

mates proﬁciency only from a trainee’s own motions

and thus does not require any reference movements.

We found there was a tendency for dispersion over 10

trials to fall in proportion to the increased degree of

proﬁciency the trainees acquired.

In future work we will need to face two chal-

lenges. One is to validate our assumption that

small/large variability leads to good/poor perfor-

mance. The other is to enhance our system’s effec-

tiveness. We will examine means to accelerate its

calculation speed, perform experiments on it with an

increased number of subjects including experts, and

conﬁrm its applicability to other types of motions. In

addition, we want to inspect vision-feedback methods

to be effective in the improvement of the athletic abil-

ity.

icSPORTS2013-InternationalCongressonSportsScienceResearchandTechnologySupport

152

REFERENCES

Assfalg, J., Bertini, M., Colombo, C., and Bimbo, A. D.

(2002). Semantic annotation of sports videos. IEEE

Multimedia, 9(2):52–60.

Choi, W., Mukaida, S., Sekiguchi, H., and Hachimura, K.

(2008). Quantitative analysis of Iaido proﬁciency by

using motion data. In ICPR, pages 1–4.

Chua, P. T., Crivella, R., Daly, B., Hu, N., Schaaf, R.,

Ventura, D., Camill, T., Hodgins, J., and Pausch, R.

(2003). Training for physical tasks in virtual environ-

ments: Tai Chi. In IEEE Virtual Reality 2003, pages

87–94.

Isard, M. and Blake, A. (1998). CONDENSATIN-

conditional densitiy propagation for visual tracking.

IJCV, 29(1):5–29.

Luo, Y., Wu, T.-D., and Hwang, J.-N. (2003). Object-

based analysis and interpretation of human motion

in sports video sequences by dynamic bayesian net-

works. CVIU, 92(1-2):196–216.

Matsumoto, A., Wu, X., Kawamura, H., and Kojima, A.

(2012). 3D motion estimation of human body from

video with dynamic camera work. In MPRSS, pages

71–78.

Mikami, D., Kimura, T., Kadota, K., Kashino, M., and

Kashino, K. (2012). Inter-trial difference analysis

through appearance-based motion tracking. In 30th

International Society of Biomechanics in Sports.

Mikami, D., Kimura, T., Kadota, K., Kawamura, H., and

Kojima, A. (2013). Human motion analysis under

actual sports game situations -sequential multi-decay

motion history image matching-. In VISAPP 2013.

Owens, N., Harris, C., and Stennett, C. (2003). Hawk-eye

tennis system. In Internationl Conference on Visual

Information Engineering, pages 182–185.

Pingali, G. S., Jean, Y., and Carlbom, I. (1998). Real

time tracking for enhanced tennis broadcasts. In IEEE

CVPR, pages 260–265.

Schmidt, R. A. and Lee, T. (1988). Motor control and learn-

ing. Human Kinetics.

Wang, J. M., Fleet, D. J., and Hertzmann, A. (2005). Gaus-

sian process dynamical models. In NIPS, pages 1441–

1448.

Yu, X. and Farin, D. (2005). Current and emerging topics

in sports video processing. In ICME, pages 526–529.

ProficiencyEstimationbyMotionVariabilityobtainedfromSingleCameraInput

153