Adaptive Reference Image Selection for Temporal Object Removal from Frontal In-vehicle Camera Image Sequences

Toru Kotsuka¹, Daisuke Deguchi², Ichiro Ide¹ and Hiroshi Murase¹
¹Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya-shi, Aichi, Japan
²Information & Communication Headquarters, Nagoya University, Furo-cho, Chikusa-ku, Nagoya-shi, Aichi, Japan

Keywords:
In-vehicle Camera, Temporal Object Removal, Adaptive Reference Image Selection.
Abstract:
In recent years, image inpainting has been widely used to remove undesired objects from an image. In particular, the removal of temporal objects, such as pedestrians and vehicles, from street-view databases such as Google Street View has many applications in Intelligent Transportation Systems (ITS). To remove temporal objects, Uchiyama et al. proposed a method that combines multiple image sequences captured along the same route. However, the output quality of this method often degrades when spatial alignment within an image group fails. For example, a large temporal object that exists in only one image creates a region that has no correspondence in the other images of the group, so the image composed from the aligned images becomes distorted. One solution to this problem is to adaptively select, as the reference image for spatial alignment, an image containing only small temporal objects. Therefore, this paper proposes a method that removes temporal objects by integrating multiple image sequences with an adaptive reference image selection mechanism.
1 INTRODUCTION
In recent years, image inpainting has been widely used to remove undesired objects from an image. In particular, there is a strong need for the removal of temporal objects (e.g. pedestrians and vehicles) from street-view databases such as Google Street View (http://www.google.co.jp/help/maps/streetview/) so that they can be used for Intelligent Transportation Systems (ITS) technologies such as geo-localization of vehicles (Matsumoto et al., 2000).
Methods to remove temporal objects in an image can be categorized into three approaches: (1) using a single image, (2) using a single image sequence, and (3) using multiple image sequences. The first approach synthesizes the background scene of a manually selected target region (Bertalmio et al., 2000). It requires only one image as an input, but it cannot restore the true background scene.
The second approach integrates frames captured as one image sequence (Kawai et al., 2014). Using the difference of appearances between frames, this method can remove temporal objects automatically and restore most of the background scene. However, some temporal objects, for example parked vehicles, cannot be removed since they are observed as static objects in the image sequence. The third approach integrates multiple image sequences captured along the same route (Uchiyama et al., 2010). By using multiple image sequences, this approach can remove temporal objects even if they are observed as static objects in a certain image sequence. Therefore, a method that removes temporal objects using multiple image sequences is the most suitable for constructing a street-view database.
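As an illustration of the third approach only, the following is a minimal sketch of how images captured at the same location could be fused once they are spatially aligned. It is not the implementation of any of the cited methods; the function name and the assumptions (NumPy arrays of identical size, alignment already done) are introduced here purely for illustration. A pixel-wise median suppresses temporal objects because such an object covers a given pixel in only a minority of the images.

import numpy as np

def fuse_image_group(aligned_images):
    """Pixel-wise median fusion of spatially aligned images captured at
    the same location (hypothetical helper, not a published method).

    A temporal object occupies a given pixel in only a minority of the
    images, so the median keeps the static background at that pixel.
    """
    stack = np.stack(aligned_images, axis=0).astype(np.float32)
    fused = np.median(stack, axis=0)
    return fused.astype(np.uint8)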
The method proposed by Uchiyama et al. uses frame alignment between image sequences and spatial alignment inside an image group as a preprocessing step to integrate multiple image sequences. Through frame alignment, all images captured at the same location are grouped. Then, spatial alignment is performed by aligning all images with a reference image selected from each image group. Finally, an image without temporal objects is generated by fusing the images in each image group. However, this method can produce a poor output image when spatial alignment inside an image group fails. For example, a temporal object region that exists in only one image cannot be aligned with the other images. Generally, in such cases, it is necessary to estimate correspondences from the surroundings of the temporal object. However, if the temporal object in the reference image is large, this estimation will not work correctly.
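The spatial alignment step is where the choice of reference image matters. As an illustration only (it does not reproduce the concrete alignment procedure of Uchiyama et al.), a feature-based homography alignment in OpenCV could look as follows. When a large temporal object covers the reference image, many matched features lie on that object, so the estimated homography, and hence the warped image, becomes unreliable; this is what motivates the adaptive reference image selection proposed in this paper.

import cv2
import numpy as np

def align_to_reference(image, reference):
    """Warp `image` onto the coordinate frame of `reference` using a
    homography estimated from ORB feature matches (illustrative sketch)."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(image, None)
    kp2, des2 = orb.detectAndCompute(reference, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    # If a large temporal object covers the reference, many matches lie on
    # it and the RANSAC homography below is estimated unreliably.
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = reference.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))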