
view video can be used to estimate pose and shape
more accurately because it directly contains depth
information. However, this approach requires a
multi-camera recording system and synchronization
of the videos, which increases the recording cost.
Therefore, it is desirable to recover the 3D pose
and shape from single-view video.
Furthermore, an optimization method based on
differentiable rendering has recently been proposed
as a way to restore shapes directly from images
(Jiang et al., 2020). In this method, the image
rendering process is constructed to be differentiable,
which makes it possible to minimize image errors
directly with gradient-based methods and thus to
restore shapes easily. Consequently, 3D shapes can be
estimated not only from images taken from multiple
viewpoints but also from images taken from a single
viewpoint. In addition, the method can be easily
combined with learning-based methods using neural
networks, making it applicable to a variety of scenes.
However, accurate restoration from a single-viewpoint
image requires combining the method with trained
models, which causes biases in the training data to be
reflected in the restoration results (Robinette
et al., 2002).
In addition, 3D posture and body shape estimation
from single-view images using a learned model may
produce incorrect results owing to factors such as
the race of the person to be reconstructed (Zengin
et al., 2016; Robinette et al., 2002). Our study aims
to alleviate this problem and bring the estimation
results closer to the ground truth by treating a
time series of images taken from a single viewpoint
as a set of images taken from effectively different
directions.
3 3D RECONSTRUCTION BASED ON DIFFERENTIABLE RENDERING
First, we give an overview of differentiable
rendering (Kato et al., 2020), which is used in this
research for shape restoration. Rendering in computer
graphics refers to the process of converting the
information that composes a 3D scene, such as shape,
illumination, and camera parameters, into a 2D image.
Because rendering usually involves discrete
operations, the rendered image cannot be directly
differentiated with respect to the quantities to be
estimated, such as shape. Differentiable rendering,
on the other hand, replaces the inherently discrete
operations of rendering a mesh-based 3D representation
with differentiable ones. This makes it possible to
differentiate the error between the input image and
the rendered image as well, thus enabling direct
minimization of the image error using gradient
descent and similar methods.
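The contrast between discrete and differentiable rendering can be
illustrated with a deliberately simplified sketch. It is not the mesh
renderer of Kato et al.; it assumes a hypothetical 1D "image" in which
a bar of unknown width is drawn, and it replaces the hard pixel test
with a sigmoid so that the image error has a usable gradient. All
names (`render_hard`, `render_soft`, `fit_width`) are illustrative.

```python
import math

# Pixel centres of a hypothetical 1D image with 32 pixels.
PIXELS = [i + 0.5 for i in range(32)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def render_hard(width):
    # Discrete rendering: a pixel is either inside the bar or not.
    # The output is piecewise constant in `width`, so its derivative
    # is zero almost everywhere and gives no direction for optimization.
    return [1.0 if x < width else 0.0 for x in PIXELS]

def render_soft(width, sharpness=1.0):
    # Differentiable stand-in: the bar edge is blurred by a sigmoid.
    return [sigmoid(sharpness * (width - x)) for x in PIXELS]

def loss_and_grad(width, target, sharpness=1.0):
    # Squared image error and its analytic derivative w.r.t. `width`.
    loss, grad = 0.0, 0.0
    for p, t in zip(render_soft(width, sharpness), target):
        loss += (p - t) ** 2
        grad += 2.0 * (p - t) * sharpness * p * (1.0 - p)
    return loss, grad

def fit_width(target, width=5.0, lr=0.5, steps=500):
    # Plain gradient descent on the image error.
    for _ in range(steps):
        _, grad = loss_and_grad(width, target)
        width -= lr * grad
    return width

target = render_hard(20.0)      # "observed" image of a bar of width 20
estimated = fit_width(target)   # recovered from a poor initial guess
print(round(estimated, 2))
```

In this toy setting the hard renderer returns the same image for
widths 5.0 and 5.2, so a gradient method receives no signal, whereas
the soft renderer lets gradient descent move the width toward the
observed bar.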
Differentiable rendering can be combined with
various optimization methods based on gradient
descent. For this reason, 3D restoration methods
using differentiable rendering are often combined
with 3D shape representations based on neural
networks. In this study, the posture and shape of a
person are represented with the SMPL model, whose
parameters are then optimized using differentiable
rendering to estimate the detailed shape of the
subject.
4 REPRESENTATION OF HUMAN POSTURE AND SHAPE USING SMPL
Next, we describe SMPL, the model used in this study
to represent the posture and shape of the human body.
The Skinned Multi-Person Linear model (SMPL) (Loper
et al., 2015; Pavlakos et al., 2019) can represent
various body shapes by adding shape changes, expressed
with a small number of parameters, to a standard
template mesh. By assigning posture
parameters to the shape, models with a variety of
postures can be easily generated. Let $\bar{T}$ be a
template mesh, and consider the possibility of
representing 3D models of various shapes and postures
by changing this mesh. In this case, let β be the
shape deformation parameter and θ be the posture
parameter, and let M(β, θ) be the shape represented by
these parameters.

$$M(\beta, \theta) = W(\bar{T} + B_s(\beta), \theta), \tag{1}$$

where $B_s$ is a function that represents the shape
deformation determined by β, that is, the displacement
of each vertex of the template mesh, and W is the
skinning function that poses the deformed mesh
according to θ. Various human body shapes can be
created by changing β. In the SMPL model, each
mesh representing a shape is associated with the
joints of a human body, so that various postures can
be represented by changing the posture parameter θ,
which indicates the angle of each joint. This makes it
possible to pose deformed shapes arbitrarily, and thus
to create human body models with various body shapes
and postures.
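The structure of Equation (1) can be made concrete with a minimal
sketch. This is not the actual SMPL model: it assumes a hypothetical
four-vertex 2D "limb", a single scalar shape parameter, and a single
joint angle. The template, the shape basis, and the skinning weights
are invented purely for illustration.

```python
import math

# Hypothetical template mesh T_bar: four vertices of a vertical limb.
TEMPLATE = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
# Hypothetical shape basis: a positive beta stretches the limb along y.
SHAPE_BASIS = [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.0, 1.5)]
# Hypothetical skinning weights: how strongly each vertex follows the joint.
SKIN_WEIGHTS = [0.0, 0.0, 0.5, 1.0]

def shape_offsets(beta):
    # B_s(beta): per-vertex displacement, linear in the shape parameter.
    return [(beta * dx, beta * dy) for dx, dy in SHAPE_BASIS]

def skin(vertices, theta):
    # W(., theta): linear blend skinning with a single joint.
    # The joint is placed at vertex 1 of the shaped mesh (an assumption
    # of this sketch), so the joint position follows the body shape.
    jx, jy = vertices[1]
    c, s = math.cos(theta), math.sin(theta)
    posed = []
    for (x, y), w in zip(vertices, SKIN_WEIGHTS):
        # Position of the vertex after rotating by theta about the joint.
        rx = jx + c * (x - jx) - s * (y - jy)
        ry = jy + s * (x - jx) + c * (y - jy)
        # Blend the rest position and the rotated position by the weight.
        posed.append(((1.0 - w) * x + w * rx, (1.0 - w) * y + w * ry))
    return posed

def smpl_like(beta, theta):
    # M(beta, theta) = W(T_bar + B_s(beta), theta)
    shaped = [(tx + dx, ty + dy)
              for (tx, ty), (dx, dy) in zip(TEMPLATE, shape_offsets(beta))]
    return skin(shaped, theta)

rest = smpl_like(0.0, 0.0)            # reproduces the template
taller = smpl_like(1.0, 0.0)          # shape change only
bent = smpl_like(0.0, math.pi / 2.0)  # posture change only
```

As in Equation (1), the shape offset is applied to the template
first and the posture is applied to the result, so the two parameters
can be varied independently.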
The differentiable rendering described above can
be combined with SMPL to estimate the shape and
posture that best match a given image. Let I be the
input image and R be the rendering function of the
differentiable renderer, and define the image error ε
as follows:

$$\varepsilon = |I - R(W(\bar{T} + B_s(\beta), \theta))|, \tag{2}$$

VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications