an alphabetic, hiragana or kanji character.
2 RELATED WORK
There has been much research into recognizing air-
drawn characters. The projects described below
aimed to recognize isolated air-drawn characters, but
recognition from a video stream of connected air-
drawn characters has not yet been investigated. Okada
and Muraoka et al. (Okada and Muraoka, 2003;
Kolsch and Turk, 2004; Yang et al., 2002) proposed a
method for extracting hand area with brightness val-
ues, together with the position of the center of the
hand, and evaluated that technique. Horo and Inaba
(Horo and Inaba, 2006) proposed a method for con-
structing a human model from images captured by
multiple cameras and obtaining the barycentric po-
sition for this model. By assuming that the finger-
tip voxels would be those furthest from this posi-
tion, they could extract the trajectory of the fingertips
and were then able to recognize characters via con-
tinuous dynamic programming (CDP) (Oka, 1998).
Sato et al. (Sato et al., 2010) proposed a method that
used a time-of-flight camera to obtain distances, ex-
tract hand areas, and calculate some characteristic
features. They then achieved recognition by compar-
ing reference features and input features via a hidden
Markov model. Nakai and Yonezawa (Nakai and Yonezawa, 2009; Gao et al., 2000) proposed a method that used an acceleration sensor (e.g., the Wii Remote Controller) to obtain a trajectory described in terms of eight stroke directions. They then recognized characters using a character dictionary. Sclaroff
et al. (Sclaroff et al., 2005; Alon, 2006; Chen et al.,
2003; Gao et al., 2000) proposed a matching method
for time-space patterns using dynamic programming
(DP). Their method used a sequence of feature vectors
to construct a model of each character. Each feature
vector was composed of four elements, namely the location (x, y) and the motion parameters (v_x, v_y) (more precisely, their mean and variance). Their method
therefore requires users to draw characters within a
restricted spatial area of a scene. Moreover, move-
ment in the background or video captured by a mov-
ing camera is not accommodated, because the motion
parameters for the feature vector of the model would
be strongly affected by any movement in the input
video.
These conventional methods (except for that of Ezaki et al. (Ezaki et al., 2010), which used an acceleration sensor) use local features such as depth, color, location parameters, and motion parameters to construct each character model. They then apply algorithms such as DP or a hidden Markov model to match the models to the input patterns. Such methods remain problematic because these local features are not robust under severe real-world conditions.
For recognizing air-drawn characters, conventional
methods perform poorly if there are occlusions, spa-
tial shifting of the characters drawn in the scene, mov-
ing backgrounds, or moving images captured by a
moving camera.
3 CDP
CDP (Oka, 1998) recognizes a temporal sequence pattern within an unbounded, non-segmented temporal input sequence. TSCDP is a version of CDP that is extended by embedding the spatial parameters (x, y) into CDP. To show how TSCDP differs from CDP,
we first explain CDP. The algorithm in eqn. (3) calcu-
lates the optimal value of the evaluation function in
eqn. (1). Define a reference sequence g(τ), 1 ≤ τ ≤ T, and an input sequence f(t), t ∈ (−∞, ∞). Define the notations P = (−∞, t], Q = [1, T], i = 1, 2, ..., T, t(i) ∈ P, τ(i) ∈ Q, a function r(i) mapping τ(i) to t(i), and a vector of functions r = (r(1), r(2), ..., r(T)).
There is a constraint between r(i) and r(i + 1) as de-
termined by the local constraint of CDP, as shown in
Figure 1(a). Then the minimum value of the evalua-
tion function is given by
D(t, T) = min_r ∑_{i=1}^{T} d(r(i), t(i))    (1)

where t(1) ≤ t(2) ≤ ··· ≤ t(T) = t.
The recursive equation in eqn. (3) gives the minimum of the evaluation function in eqn. (1) by accumulating local distances defined by

d(t, τ) = ||f(t) − g(τ)||.    (2)
The recursive equation for determining D(t, T ) is then
described by
D(t, τ) = min {
    D(t−2, τ−1) + 2d(t−1, τ) + d(t, τ),
    D(t−1, τ−1) + 3d(t, τ),
    D(t−1, τ−2) + 3d(t, τ−1) + 3d(t, τ)
}    (3)
The boundary condition is D(t, τ) = ∞ for t ≤ 0 or τ ∉ [1, T]. When accumulating local distances optimally,
CDP performs time warping to allow the input to vary from half to twice the length of the reference pattern. The selection of the best local paths is performed by the recursive equation in eqn. (3). Figure 1 shows two types of local constraints used in CDP for time normalization. In this paper, we use type (a). Other normalizations, such as from one quarter to four times, can be realized in a similar way.
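As an illustration, the recursion in eqn. (3) with the type-(a) local constraint can be sketched as follows. This is a minimal sketch, not the authors' implementation: the initialization D(t, 1) = 3 d(t, 1) (allowing a match to begin at any frame), the normalization of D(t, T) by 3T, and the detection threshold are assumptions not fixed by the text above.

```python
import numpy as np

def cdp_spot(reference, stream, threshold=0.1):
    """Continuous DP (CDP) spotting of a reference sequence g(tau),
    tau = 1..T, within an unsegmented input stream f(t), using the
    type-(a) local constraint of eqn. (3).

    Assumptions (not fixed by the text): matches may start at any
    frame via D(t, 1) = 3*d(t, 1); the accumulated distance is
    normalized by 3*T; frames scoring below `threshold` are reported.
    Returns a list of (t, normalized_distance) detections.
    """
    g = [np.atleast_1d(x).astype(float) for x in reference]
    T = len(g)
    INF = float("inf")
    D_tm2 = np.full(T, INF)   # D(t-2, .)
    D_tm1 = np.full(T, INF)   # D(t-1, .)
    d_tm1 = np.full(T, INF)   # d(t-1, .)
    detections = []
    for t, f_t in enumerate(stream, start=1):
        f_t = np.atleast_1d(f_t).astype(float)
        # local distance d(t, tau) = ||f(t) - g(tau)||   (eqn. (2))
        d_t = np.array([np.linalg.norm(f_t - g_tau) for g_tau in g])
        D_t = np.full(T, INF)
        D_t[0] = 3.0 * d_t[0]  # a match may begin at any frame
        for tau in range(1, T):
            # the three local paths of eqn. (3); the boundary
            # condition D = inf propagates automatically at borders
            c1 = D_tm2[tau - 1] + 2.0 * d_tm1[tau] + d_t[tau]
            c2 = D_tm1[tau - 1] + 3.0 * d_t[tau]
            c3 = (D_tm1[tau - 2] + 3.0 * d_t[tau - 1] + 3.0 * d_t[tau]
                  if tau >= 2 else INF)
            D_t[tau] = min(c1, c2, c3)
        score = D_t[T - 1] / (3.0 * T)  # normalized D(t, T)
        if score < threshold:
            detections.append((t, score))
        D_tm2, D_tm1, d_tm1 = D_tm1, D_t, d_t
    return detections
```

Because only the columns for t−2 and t−1 are retained, the sketch runs in O(T) memory per frame and can process an unbounded stream, which is the point of CDP over ordinary DP matching. For a scalar reference [1, 2, 3] embedded in the stream [0, 0, 1, 2, 3, 0, 0], the normalized distance reaches 0 at the frame where the match ends.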