smooth surface but still allows sharp edges, so it can
generate a 3D model for objects of different shapes.
Kholgade, Simon, Efros, and Sheikh (2014) pro-
posed a method to manipulate the 3D object in a sin-
gle image. This method supports full 3D operations,
including translation, rotation, and scaling, but it
assumes that the 3D model and the texture of the object
in the image are known. This assumption limits the
application of their method to objects whose 3D model
is available. With our method, we aim to create a 3D
model with constraints provided by the user and do
not require the original 3D model of the object.
3 ALGORITHM
The appearance of the scene in an image I is determined
by its geometry X, texture T, and light source L:

I = v(L, X) · lam(L, X) · T,                (1)

where v(·) denotes the visibility of light source L at
scene X and lam(·) is the Lambertian cosine function.
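As an illustration, Equation 1 can be evaluated per pixel. The sketch below assumes precomputed unit normals for X, a binary mask for the visibility term v(·), and a directional light; all names and values are illustrative, not part of the method:

```python
import numpy as np

def render(visibility, normals, light_dir, texture):
    """Per-pixel sketch of Equation 1: I = v(L, X) * lam(L, X) * T.

    visibility : (H, W) binary mask, 1 where the light source is visible
    normals    : (H, W, 3) unit surface normals of the scene geometry X
    light_dir  : (3,) unit vector pointing toward the light source L
    texture    : (H, W, 3) scene texture T
    """
    # Lambertian cosine term: clamp the normal-light dot product at zero.
    lam = np.clip(normals @ light_dir, 0.0, None)          # (H, W)
    return visibility[..., None] * lam[..., None] * texture

# Toy scene: flat ground lit from directly above, one shadowed pixel.
H, W = 2, 2
normals = np.zeros((H, W, 3)); normals[..., 2] = 1.0       # all normals = +z
light = np.array([0.0, 0.0, 1.0])
tex = np.full((H, W, 3), 0.5)
vis = np.array([[1.0, 0.0], [1.0, 1.0]])                   # pixel (0,1) shadowed
img = render(vis, normals, light, tex)
```

With the light along the normals, lit pixels keep their full texture value and the shadowed pixel goes to zero.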
Since I is the only known variable in Equation 1, and
because estimating X, T, and L from I is highly ill-
posed, we ask the user to specify the light source L
and to label the background, the object, and its cast-shadow
pixels, as shown in Figure 1(b): pixels labeled as the object
are cyan, shadow-labeled pixels are blue, and the background
is white. Each object region must correspond to exactly one
shadow region. This user input provides powerful cues for
solving X and T. The texture
of the scene, T, is obtained by inpainting the user-
labeled object regions and shadow regions with the
PatchMatch algorithm proposed by Barnes, Shecht-
man, Finkelstein, and Goldman (2009). Figure 1(c)
shows the scene texture T, in which the objects and
their cast-shadows are removed. Typical shadow re-
moval methods, e.g., Wu et al. (2007), can be applied
to generate T, too, but these may require a greater
number of user annotations.
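The masking step can be sketched as follows. The label codes and the mean-color fill are illustrative stand-ins only; the paper fills the masked regions with PatchMatch (Barnes et al., 2009), which copies coherent patches from the background instead of averaging:

```python
import numpy as np

# Hypothetical label codes matching Figure 1(b): object pixels cyan,
# shadow pixels blue, background white.
BACKGROUND, OBJECT, SHADOW = 0, 1, 2

def inpaint_mask(labels):
    """Pixels to fill when recovering the scene texture T:
    every pixel labeled as an object or as its cast shadow."""
    return (labels == OBJECT) | (labels == SHADOW)

def naive_fill(image, mask):
    """Placeholder for PatchMatch: replace each masked pixel with the
    mean color of the unmasked (background) pixels."""
    out = image.astype(float).copy()
    out[mask] = image[~mask].mean(axis=0)
    return out
```

A real implementation would substitute a patch-based synthesis for `naive_fill`, so the filled regions inherit the background's texture rather than a flat average.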
We separate the 3D geometry of the scene into the ground
and walls, denoted as X_S, and several 3D models for the
objects, denoted as X_i. The 3D model X_i corresponds to
the i-th object region, O_i, labeled by the user. Different
approaches are used to create X_S and X_i.
Figure 2: Bumpy surfaces derived using superpixels (left)
and a bilateral filter (right).

For X_S, we assume that the walls are perpendicular to the
horizontal ground and that the camera is at the origin with
no yaw or roll rotation. We first generate a rough 3D
geometry of the ground and walls, X_S^0, using the approach
proposed by Iizuka, Kanamori, Mitani, and Fukui (2011).
This method automatically generates the 3D geometry of the
ground and walls, as well as the camera parameters, from the
edges between the ground and walls marked by the user, as
illustrated in Figure 1(a). The rough 3D geometry X_S^0 is
refined to X_S with self-shadows using the method presented
in Section 3.1.
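Under these assumptions (camera at the origin with pitch only, walls perpendicular to a horizontal ground), back-projecting a ground pixel reduces to a ray-plane intersection. A minimal sketch with hypothetical intrinsics (f, cx, cy) and camera height, not values from the paper:

```python
import numpy as np

def ground_point(u, v, f, cx, cy, pitch, cam_height):
    """Back-project pixel (u, v) onto the ground plane y = -cam_height,
    assuming a pinhole camera at the origin with pitch only (no yaw or
    roll); the y axis points up and +z is the viewing direction."""
    # Viewing ray in camera coordinates; image v grows downward,
    # so the world y component is negated.
    d = np.array([(u - cx) / f, -(v - cy) / f, 1.0])
    # Undo the camera pitch (rotation about the x axis).
    c, s = np.cos(pitch), np.sin(pitch)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, c,  -s ],
                  [0.0, s,   c ]])
    d = R @ d
    # Intersect the ray t * d with the plane y = -cam_height.
    t = -cam_height / d[1]
    return t * d
```

For example, with zero pitch a pixel below the principal point maps to a point on the ground in front of the camera; pixels approaching the horizon send t toward infinity, which is why the user-marked ground/wall edges are needed to bound the ground region.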
For X_i, our method processes one object at a time. We
first generate a rough 3D model for the current object and
then refine it using the relationships between the light
source, the object, and its cast shadow, as described in
Section 3.2.

After deriving X_S and all X_i, the final 3D geometry X is
constructed by inserting all X_i into X_S. The user may
then modify X and L by moving the foreground objects and
the light source, and the resulting image is synthesized by
Equation 1. In our implementation, the shadow mapping
algorithm proposed by Williams (1978) is used to generate
the visibility term v(·) in Equation 1.
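The visibility term v(·) via shadow mapping can be sketched as follows; `light_view` (a callable mapping a 3D point to its shadow-map texel and depth from the light) and the bias value are hypothetical stand-ins for the light's projection, not the paper's implementation:

```python
import numpy as np

def visibility(points, light_view, shadow_map, bias=1e-3):
    """Shadow-map visibility test in the spirit of Williams (1978):
    a point is lit iff its depth from the light does not exceed the
    depth stored in the shadow map at its projected texel."""
    vis = np.empty(len(points))
    for i, p in enumerate(points):
        u, v, depth = light_view(p)
        # The small bias avoids self-shadowing ("shadow acne").
        vis[i] = 1.0 if depth <= shadow_map[v, u] + bias else 0.0
    return vis
```

The shadow map itself is the first pass: the scene depth rendered from the light's point of view, so moving the light source only requires re-rendering that depth map before re-synthesizing the image.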
3.1 Bumpy Surface Recovery
In general, self-shadows appear in the concave regions of a
bumpy surface, while convex regions are usually brighter
than the rest of the surface. Khan, Reinhard, Fleming, and
Bülthoff (2006) proposed a method to derive bump heights
for a surface from its intensity. In that method, a
bilateral filter is applied to obtain the albedo component,
and the intensity of the albedo component is then directly
used as the bump height of the surface. However, if a
surface has multiple colors, this approach tends to
separate the surface into multiple pieces at different
altitudes according to color (see Figure 14).
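One way to avoid this color-dependent splitting is to normalize intensity within regions of uniform albedo. A minimal sketch, assuming each superpixel shares a single albedo (taken here as its mean intensity); this is illustrative only and omits the shadow-based refinement of Section 3.1:

```python
import numpy as np

def bump_heights(intensity, superpixels):
    """Relative bump heights per pixel: within each superpixel, divide
    the pixel intensity by the superpixel's mean intensity, so only
    shading variation (not albedo) contributes to the height."""
    heights = np.zeros_like(intensity, dtype=float)
    for label in np.unique(superpixels):
        mask = superpixels == label
        albedo = intensity[mask].mean()       # shared albedo per superpixel
        heights[mask] = intensity[mask] / albedo
    return heights
```

A uniformly dark region then yields a flat height of 1 rather than a dip, since it is normalized by its own superpixel mean instead of by the global intensity.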
We overcame this problem by considering local texture,
obtained via superpixels. A superpixel is a set of adjacent
pixels with similar color; we therefore assume that all
pixels in a superpixel share the same albedo, so that
intensity variation within a superpixel is caused mostly by
lighting. We chose not to use a bilateral filter to recover
the albedo because it can slightly blur the edges between
different colors and produce an uneven surface like the one
in Figure 2 (right). The advantage of superpixels is that
they do not cross edges between different textures,