Figure 1: Pose models in a still frame. (a) The classic
articulated limb model of (Marr and Nishihara, 1978); (b)
the model of (Yang and Ramanan, 2011).
body and a thigh are several examples of such parts.
Modern models define a human pose in terms of a
set of human body joints (fig. 1, b). A shoulder, a
knee and an elbow are examples of such joints.
These definitions are equivalent. Indeed, the location of
each part is uniquely defined by the locations of the
corresponding joints, and the location of each joint is
uniquely defined by the location of the corresponding
human body part. In this research we use the second
definition, as it makes inference easier (Yang and
Ramanan, 2011).
Most previous approaches to the human pose
estimation task work with still frames only. In such a
case, the problem is usually reduced to inference in a
tree-structured graphical model. In these models,
vertices correspond to the locations of the joints and
edges define constraints on relative joint locations
(Yang and Ramanan, 2011). Due to the structure of
the graphical model, the best configuration of joint
locations in a still frame can be obtained efficiently.
In spite of significant progress in techniques for human
pose estimation in a still frame, their accuracy is
far from ideal. Therefore, we propose to improve the
accuracy of pose estimation by considering all video
frames simultaneously. The proposed approach uses
evidence of the pose from the other frames to improve
the result in the current frame. We use the tracking
approach of (Shalnov and Konushin, 2013) to obtain an
initial estimate of the trajectory of the person. A
trajectory is an approximate location of the person in
each frame of the input video sequence. Hence, the
formal input of our algorithm consists of:
• video sequence $W = \{I_t\}$;
• trajectory of the person $\mathit{Ba} = \{B_t\}$.
The output of the algorithm is:
• human pose in the input video sequence, that is, the
locations of the joints and a scale parameter of the
person in each frame of the input video,
$\mathit{Pa} = \{P_t\}$.
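The efficient still-frame inference mentioned above can be illustrated with a minimal max-sum dynamic programming sketch on a tree of joints. This is not the implementation of (Yang and Ramanan, 2011); the function name `infer_tree_pose`, the table layout, and the assumption that every parent joint is indexed before its children are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Tuple


def infer_tree_pose(
    parent: List[int],
    unary: List[List[float]],
    pairwise: Dict[Tuple[int, int], List[List[float]]],
) -> Tuple[float, List[int]]:
    """Max-sum dynamic programming on a tree of joints (illustrative sketch).

    parent[j]   : index of the parent of joint j, -1 for the root (joint 0).
                  Assumption: parent[j] < j for every non-root joint.
    unary[j][l] : score of placing joint j at candidate location l.
    pairwise[(j, p)][l][m] : compatibility of joint j at location l
                  with its parent p at location m.
    Returns the best total score and the chosen location index per joint.
    """
    n = len(parent)
    best = [list(u) for u in unary]  # accumulated subtree scores
    argbest: List[List[int]] = [[] for _ in range(n)]

    # Leaves-to-root pass: children have larger indices than their parents.
    for j in range(n - 1, 0, -1):
        p = parent[j]
        table = pairwise[(j, p)]
        for m in range(len(unary[p])):
            scores = [best[j][l] + table[l][m] for l in range(len(unary[j]))]
            l_star = max(range(len(scores)), key=scores.__getitem__)
            argbest[j].append(l_star)
            best[p][m] += scores[l_star]

    # Root-to-leaves pass: read off the maximizing locations.
    pose = [0] * n
    pose[0] = max(range(len(best[0])), key=best[0].__getitem__)
    for j in range(1, n):
        pose[j] = argbest[j][pose[parent[j]]]
    return best[0][pose[0]], pose
```

With L candidate locations per joint and K joints, this sketch evaluates on the order of K·L² pairwise entries, which is why the tree structure keeps still-frame inference tractable.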
2.2 Basic Model
Our research was inspired by the work of (Park and
Ramanan, 2011) on human pose estimation in video
sequences. They use the mathematical model of human
pose in a still frame (Yang and Ramanan, 2011)
and extend the inference algorithm. Compared to
the previous method, the extension by (Park and
Ramanan, 2011) allows inferring the N best configurations
from the model while ensuring that they do not overlap
according to some user-provided definition of overlap.
Moreover, they include a simple temporal context
from neighboring frames in the model, which allows them
to select a better pose hypothesis in each frame of the
input video sequence. In this way they convert the
problem of human pose estimation in video into the
following maximization task:
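$$
\begin{aligned}
\mathit{Pa}^{*} &= \arg\max_{\mathit{Pa}} \mathrm{Score}(\mathit{Pa}), \\
\mathrm{Score}(\mathit{Pa}) &= \sum_{t} \Phi(P_t) + \alpha \sum_{t} \Psi(P_t, P_{t-1})
\end{aligned}
\tag{1}
$$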
where $\Phi(P_t)$ is the score of candidate pose $P_t$
computed by the proposed detector, and $\Psi(P_t, P_{t-1})$ is
the (negative of the) total squared pixel difference
between each joint in pose $P_{t-1}$ and pose $P_t$.
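As a rough illustration of the objective (1), the sketch below evaluates the score of a given pose sequence. It assumes each pose is a list of (x, y) pixel coordinates, one per joint, and that the detector scores are already available as numbers; the scale parameter is ignored, and the names `temporal_term` and `sequence_score` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a pose is one (x, y) pixel location per joint.
Pose = Sequence[Tuple[float, float]]


def temporal_term(pose_t: Pose, pose_prev: Pose) -> float:
    """Psi: negative total squared pixel distance between matching joints."""
    return -sum(
        (x1 - x0) ** 2 + (y1 - y0) ** 2
        for (x1, y1), (x0, y0) in zip(pose_t, pose_prev)
    )


def sequence_score(
    poses: List[Pose], detector_scores: List[float], alpha: float
) -> float:
    """Score(Pa) = sum_t Phi(P_t) + alpha * sum_t Psi(P_t, P_{t-1})."""
    appearance = sum(detector_scores)
    smoothness = sum(
        temporal_term(poses[t], poses[t - 1]) for t in range(1, len(poses))
    )
    return appearance + alpha * smoothness
```

For two frames with two joints each, where one joint moves by one pixel and the other by two, the temporal term is -(1 + 4) = -5, so detector scores of 2 and 3 with alpha = 0.5 give a total score of 2.5.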
The set of available inference algorithms is the key
distinction between human pose models in a still frame
and in video. A dynamic programming algorithm is
usually applied to infer the optimal pose in a still frame.
However, it cannot be applied directly to infer the
optimal set of poses in video. Indeed, the poses at times
$t_1$ and $t_2$ are conditionally independent given, at the
least, the pose at some time $t \in [t_1, t_2]$. Therefore,
inference with the dynamic programming algorithm
requires $O(L^{2K})$ elements stored in memory, where $L$ is
the number of possible joint locations in the frame and
$K$ is the number of joints in the model. The authors
therefore use an approximate algorithm: they restrict the
possible poses in each frame to the best hypotheses,
which makes dynamic programming tractable.
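The restriction to a few hypotheses per frame can be sketched as a Viterbi-style dynamic program over candidate poses. This is only an illustration under stated assumptions, not the code of (Park and Ramanan, 2011): poses are lists of (x, y) joint locations, `phi[t][i]` is an assumed detector score of candidate i in frame t, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pose = Sequence[Tuple[float, float]]  # assumed: one (x, y) per joint


def sq_dist(a: Pose, b: Pose) -> float:
    """Total squared pixel distance between matching joints of two poses."""
    return sum((xa - xb) ** 2 + (ya - yb) ** 2
               for (xa, ya), (xb, yb) in zip(a, b))


def best_pose_sequence(
    candidates: List[List[Pose]], phi: List[List[float]], alpha: float
) -> Tuple[float, List[int]]:
    """Pick one candidate per frame maximizing the objective (1)."""
    total = list(phi[0])          # best score of a path ending at each candidate
    back: List[List[int]] = []    # back-pointers for frames 1..T-1
    for t in range(1, len(candidates)):
        new_total, pointers = [], []
        for i, pose in enumerate(candidates[t]):
            trans = [total[j] - alpha * sq_dist(pose, prev)
                     for j, prev in enumerate(candidates[t - 1])]
            j_star = max(range(len(trans)), key=trans.__getitem__)
            new_total.append(phi[t][i] + trans[j_star])
            pointers.append(j_star)
        total = new_total
        back.append(pointers)
    # Backtrack from the best final candidate.
    i = max(range(len(total)), key=total.__getitem__)
    best_score, path = total[i], [i]
    for pointers in reversed(back):
        i = pointers[i]
        path.append(i)
    path.reverse()
    return best_score, path
```

With N candidates per frame and T frames, the sketch needs on the order of T·N² transition evaluations, in contrast to the $O(L^{2K})$ table required when every joint configuration is allowed.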
2.3 Proposed Model
We use the same model of human pose in a still frame
as (Yang and Ramanan, 2011), but with a different
temporal context. The temporal context of the original
model requires the shift of joint locations between
consecutive frames to be small. In practice, this constraint
is a poor motion model for the majority of body joints.
For example, Brownian motion and motion at a constant
velocity of the same magnitude have an equal impact
on the objective function.
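A small numeric check makes this point concrete. The sketch below compares two hypothetical single-joint tracks with the same per-frame step length: one moves steadily to the right, the other jitters back and forth (a deterministic stand-in for a random walk with unit steps). The function name `smoothness_penalty` and both tracks are illustrative assumptions.

```python
from typing import List, Tuple


def smoothness_penalty(track: List[Tuple[float, float]]) -> float:
    """Sum of squared displacements between consecutive frames."""
    return sum(
        (x1 - x0) ** 2 + (y1 - y0) ** 2
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    )


# Constant velocity: the joint moves one pixel to the right in every frame.
constant = [(float(t), 0.0) for t in range(5)]
# Jitter: the joint alternates between x = 0 and x = 1.
jitter = [(float(t % 2), 0.0) for t in range(5)]

print(smoothness_penalty(constant), smoothness_penalty(jitter))
```

Both tracks accumulate the same penalty because the squared-displacement term depends only on step length, not on whether successive steps are consistent in direction.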
IMTA-5 2015 - 5th International Workshop on Image Mining. Theory and Applications