TempSeg-GAN: Segmenting Objects in Videos Adversarially using

Temporal Information

Saptakatha Adak

and Sukhendu Das

Visualization and Perception Lab,

Indian Institute of Technology Madras, Chennai - 600036, India

Keywords:

Video Object Segmentation, Generative Adversarial Network (GAN), Deep Learning.

Abstract:

This paper studies the problem of Video Object Segmentation, which aims at segmenting objects of interest throughout entire videos when provided with an initial ground-truth annotation. Although a variety of works in this field have utilized Convolutional Neural Networks (CNNs), adversarial training techniques have not been used in spite of their effectiveness as a holistic approach. Our proposed architecture consists of a Generative Adversarial framework for foreground object segmentation in videos, coupled with Intersection-over-Union and temporal information based loss functions for training the network. The main contribution of the paper lies in the formulation of two novel loss functions: (i) Inter-frame Temporal Symmetric Difference Loss (ITSDL) and (ii) Intra-frame Temporal Loss (IFTL), which not only enhance the segmentation quality of the predicted mask but also maintain the temporal consistency between subsequent generated frames. Our end-to-end trainable network exhibits an impressive performance gain compared to state-of-the-art models when evaluated on three popular real-world Video Object Segmentation datasets, viz. DAVIS 2016, SegTrack-v2 and YouTube-Objects.

1 INTRODUCTION

Video Object Segmentation has emerged as a popular field of research in Computer Vision over the past decade. Its popularity stems mainly from its impact on bio-medical research, self-driving cars, video editing, robotics, etc. Recent years have seen dramatic advances in Autonomous Driving, where instance segmentation in videos has found wide scope through tasks like segmenting lanes on roads, detecting obstacles in the path of motion, segmenting street signs, etc. With the recent advancement of deep learning techniques, many works based on Convolutional Neural Networks (CNNs) have improved performance not only on problems like image classification (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014) and object detection (Ren et al., 2015; Redmon et al., 2016), but also in the field of image segmentation (Maninis et al., 2016; Caelles et al., 2017; Voigtlaender and Leibe, 2017), using weights of image recognition models pre-trained on ImageNet (Deng et al., 2009). The major disadvantage of these CNNs is their hunger for large amounts of training data. Recently, approaches based on Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) have also been used for the task of image segmentation (Souly et al., 2017; Luc et al., 2016).

In Video Object Segmentation (VOS), the introduction of the pixel-level annotated DAVIS 2016 (Perazzi et al., 2016) dataset has brought a shift from segmentation using bounding box tracking (Li et al., 2016; Wang and Yeung, 2013; Perazzi et al., 2017) to pixel-level tracking (Caelles et al., 2017; Voigtlaender and Leibe, 2017). Deep CNNs pre-trained on large image classification datasets show decent performance in VOS (Caelles et al., 2017; Perazzi et al., 2017) and other related tasks like single-object tracking (Bertinetto et al., 2016; Nam and Han, 2016) and background modeling (Braham and Van Droogenbroeck, 2016; Wang et al., 2017). Among these, Caelles et al. (2017) show promising results in VOS by fine-tuning the pre-trained CNN on the initial frame of the target video. However, this method fails to adapt to drastic appearance changes in the subsequent frames of the video sequences. To cope with this problem, some recent methods have solved the task in the temporal domain using optical flow (Tsai et al., 2016; Khoreva et al., 2017), spatio-temporal MRFs (Bao et al., 2018), guided mask propagation (Wug Oh et al., 2018; Cheng et al., 2018), etc. In our work, we have followed

Adak, S. and Das, S.

TempSeg-GAN: Segmenting Objects in Videos Adversarially using Temporal Information.

DOI: 10.5220/0007254302210232

In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), pages 221-232

ISBN: 978-989-758-354-4


a GAN-based approach for semi-supervised instance segmentation of an object of interest throughout the video, along with two temporal information based objective functions for end-to-end training of the network.

To summarize, the novelty of this paper lies in

• an adversarial training based framework (TempSeg-GAN) to generate segmentation masks using the initial ground-truth and generated masks in videos (refer to section 2),

• use of an encoder-decoder model with skip connections as the generator network in the proposed architecture,

• a variation in discriminator training by adding an encoder module between generator and discriminator, such that the encoded output of the generator is used as input for the discriminator,

• incorporation of two temporal information based loss functions which not only enhance the segmentation quality of predicted masks but also stabilize the motion features between them.

The proposed Inter-frame Temporal Symmetric Difference Loss (ITSDL) is calculated between the predicted mask and the optical flow warped target mask, formed from the ground-truth mask of the previous time step. Thus, it not only captures motion features between consecutive frames but also enhances the segmentation quality by minimizing the erroneously identified pixels in the predicted masks (section 3). The Intra-frame Temporal Loss (IFTL), along with its long-range variant (L-IFTL), preserves the temporal relationship between the generated masks (section 4).

2 TEMPORALLY AIDED

SEGMENTATION NETWORK

The proposed network for Video Object Segmentation

consists of two sub-networks: (i) the Generator (G)

and (ii) the Discriminator (D). The generator G generates images close to the ground-truth by extracting features from the true data distribution $p_{data}$, thereby making it difficult for the discriminator to differentiate between generated and real images. The discriminator D, in turn, is optimized to predict whether the generated output is synthetic or real. This process

of alternate learning of the two sub-networks in this framework is similar to a two-player min-max game (Goodfellow et al., 2014). The overall objective function for simultaneously minimizing the loss at G and maximizing it at the discriminator D is as follows:

$$\min_{G} \max_{D} \; u(G, D) = \mathbb{E}_{x \sim p_{data}}[\log(D(x))] + \mathbb{E}_{z \sim p_{z}}[\log(1 - D(G(z)))] \qquad (1)$$

Figure 1: Proposed TempSeg-GAN architecture. GT denotes the ground-truth and the dashed arrow refers to the Inter-frame Temporal Symmetric Difference Loss ($L_{ITSDL}$) estimation using GT mask at time t, predicted mask at time (t − 1) and optical flow vectors between RGB input images at time (t − 1) and t.

where x is a real image from the true distribution $p_{data}$ and z is a vector sampled from a uniform or Gaussian random distribution $p_{z}$. Since our work is based on videos, a sequence of video frames is provided as input to the network, which is trained using two novel loss functions that provide temporal assistance. The adversarial loss employed in this paper is a variant of that in equation 1, as an additional encoder module is implemented between the generator and the discriminator.

The proposed architecture of the Temporally aided Segmentation GAN (TempSeg-GAN) is illustrated in figure 1. The generator of the model is an encoder-decoder structure consisting of convolutional modules coupled with pooling and unpooling layers (Zeiler and Fergus, 2014) with ReLU non-linearity. Batch-normalization (Ioffe and Szegedy, 2015) and dropout have also been incorporated in this network. The contracting path of the encoder captures the context, while the symmetric expanding path of the decoder localizes the information precisely. Following "U-net" (Ronneberger et al., 2015), skip connections are added between each layer i and (n − i), where n denotes the total number of layers. Each skip connection concatenates all feature maps of layer i with those of the (n − i)-th layer and aids in sharing low-level information, such as prominent edge details, between the initial and final layers of the generator network. The input to the encoder-decoder generator consists of video frames of dimension (W × H × 3), passed independently in a sequential manner, while the output is a segmentation map of dimension (W × H × 1) corresponding to each input frame. The predicted output and the ground-truth mask are then encoded with two separate encoders of the same configuration (see figure 1). Each encoder is formed of convolutional layers combined with pooling modules to down-sample the input. The encoded predicted and ground-truth masks obtained as output from the encoders form the input for the discriminator, which is a collection of convolutional modules with fully-connected layers at the end that generate 0 or 1, denoting synthetic or real data respectively.

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications

The main difference between the proposed TempSeg-GAN model and the traditional one (Goodfellow et al., 2014) is that encoder modules are used to encode the output of the generator and the ground-truth before passing them into the discriminator, instead of using the generator output directly. The motivation for encoding the masks is that the discriminator can distinguish between real and synthetic outputs more efficiently in the projected feature space than in the RGB image space. The details of the network architecture are given in section A of the Appendix.

2.1 TempSeg-GAN training

Our proposed GAN framework follows the training of conventional generative adversarial networks with a few variations. The generator G in the adversarial network is a segmentation model that predicts the mask considering the joint data distribution of the input video frame (I) and its corresponding ground-truth mask (Y). On the other hand, the discriminator D distinguishes between the predicted and the original mask, thus facilitating the training procedure by minimizing the dissimilarity between the prediction and the ground-truth. The objective function of the adversarial training is as follows:

$$\min_{\theta_G} \max_{\theta_D} \sum L_{bce}(Y, O_G(I; \theta_G)) - \lambda \Big[ L_{bce}(1, O_D(O_E(Y); \theta_D)) + L_{bce}(0, O_D(O_E(O_G(I; \theta_G)); \theta_D)) \Big] \qquad (2)$$

where $\theta_G$ and $\theta_D$ are the generator and discriminator parameters respectively; $L_{bce}$ represents the binary cross-entropy loss; $O_G$ and $O_D$ denote the generator and discriminator outputs respectively; $O_E(X)$ refers to the encoded segmented mask output of the input X; 1 and 0 denote the discriminator labels when the input is from the ground-truth Y and the generator $O_G(I; \theta_G)$ respectively; and λ is a regularization parameter.

Thus, the adversarial objective function for the generator, obtained by minimizing equation 2 w.r.t. $\theta_G$, is as follows

$$L^{G}_{adv}(I) = \min_{\theta_G} \sum L_{bce}(Y, O_G(I; \theta_G)) + \lambda L_{bce}(1, O_D(O_E(O_G(I; \theta_G)); \theta_D)) \qquad (3)$$

where the first term enforces consistency of the predicted segmentation with the target mask at each pixel position, while the second term penalizes structural mismatch between the output and the ground-truth.

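Similarly, maximizing equation 2 w.r.t. $\theta_D$ amounts to minimizing the adversarial discriminator loss, which is obtained as follows

$$L^{D}_{adv} = \min_{\theta_D} \sum \Big[ L_{bce}(1, O_D(O_E(Y); \theta_D)) + L_{bce}(0, O_D(O_E(O_G(I; \theta_G)); \theta_D)) \Big] \qquad (4)$$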

where $O_E(O_G(I; \theta_G))$ and $O_E(Y)$ are the encoded predicted and target masks fed into the discriminator, which is trained in such a way that it classifies $(O_E(I), O_E(Y))$ into class 1 and $(O_E(O_G(I; \theta_G)), O_E(Y))$ into class 0.
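The two adversarial losses in equations 3 and 4 can be illustrated with a minimal sketch. It assumes that the pixel-wise term is averaged over the mask, that the discriminator emits a single probability per encoded mask, and that `lam` stands for λ; the function names and the default value of `lam` are hypothetical and are not taken from the paper.

```python
import math

def bce(target, prob, eps=1e-7):
    """Binary cross-entropy between a 0/1 target and a probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def mean_pixel_bce(mask, pred):
    """Mean BCE over two equally sized flat lists (ground truth, prediction)."""
    return sum(bce(y, p) for y, p in zip(mask, pred)) / len(mask)

def generator_loss(mask, pred, d_on_pred, lam=1.0):
    """Equation 3 (sketch): pixel-wise BCE plus lam * BCE(1, D(E(G(I)))).

    lam=1.0 is an assumed placeholder, not a value from the paper.
    """
    return mean_pixel_bce(mask, pred) + lam * bce(1, d_on_pred)

def discriminator_loss(d_on_real, d_on_pred):
    """Equation 4 (sketch): BCE(1, D(E(Y))) + BCE(0, D(E(G(I))))."""
    return bce(1, d_on_real) + bce(0, d_on_pred)
```

In this sketch the generator is rewarded both for matching the ground-truth pixel by pixel and for making the discriminator output close to 1 on its encoded prediction, while the discriminator is rewarded for separating the two encoded inputs.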

Though the theoretical foundation of this alternating optimization of generator and discriminator is well-established, in practice it is susceptible to mode collapse, leading to instabilities in training. To overcome this instability and produce better predicted segmentation masks, two novel loss functions based on Intersection-over-Union (IoU) and optical flow vectors (refer to sections 3 and 4) have been formulated and are used in addition to the conventional adversarial losses described in eqns. 3 and 4. The flow vectors are obtained by passing consecutive RGB input frames through the FlowNet 2.0 (Ilg et al., 2017) optical flow generation module during training.

3 INTER-FRAME TEMPORAL

SYMMETRIC DIFFERENCE

LOSS

Unlike images, videos have the advantage of providing a latent space of data distribution that combines temporal information with spatial information. CNNs are capable of capturing short-range consistencies in the spatial domain, which form only a small part of the rich input data. Thus, to maintain the temporal coherency between the masks while enhancing the segmentation quality, an Intersection-over-Union (IoU) based temporal objective function has been incorporated. It measures the region of similarity between the predicted mask ($\hat{Y}$) and the ground-truth mask (Y) by computing the number of overlapping pixels. In other words, it gives an idea of the


Algorithm 1: optical_flow_warp($P_{t-1}$, $W^*_{t-1}$)

Input: Ground-truth mask patch ($P_{t-1}$) at time t − 1, Predicted mask patch ($\hat{P}_t$) at time t, Optical flow vector map patch ($W^*_{t-1}$)

Output: Optical flow warped ground-truth mask patch ($P^*_t$) at time t, of the same dimension as $\hat{P}_t$

// s = height and width of $\hat{P}_t$
// s + 4 = height and width of $P_{t-1}$ and $W^*_{t-1}$

Initialize: patch $P^*_t$ with each pixel value equal to 0
for u = 0 to s + 4, u ← u + 1 do
    for v = 0 to s + 4, v ← v + 1 do
        /* (u, v) is the spatial location of a pixel at time (t − 1). */
        u′ ← u + $V_x$ ∆t
        v′ ← v + $V_y$ ∆t
        /* $V_x$ and $V_y$ are the horizontal and vertical flow vectors of (u, v) obtained from $W^*_{t-1}$; (u′, v′) is the new spatial location of (u, v) at time t. Here, ∆t = 1. */
        Label(u′, v′) ← Label(u, v)
        /* The label of (u′, v′) is updated with that of (u, v). */
        if (u′, v′) lies within the patch $P^*_t$ then
            Update the corresponding pixel value of $P^*_t$ with Label(u′, v′)
        end
    end
end

number of mispredicted pixels present in the estimated mask compared to the ground-truth. To formulate the function in a patch-wise manner, we calculate the number of mispredicted pixels ($|M_c|$) in a patch of the segmented mask. The set of mispredicted pixels ($M_c$) for class c ∈ {0, 1}, obtained by using the symmetric difference of the two vectors $p$ and $\hat{p}$, is as follows:

$$M_c(p, \hat{p}) = \{p_k = c, \hat{p}_k \neq c\} \cup \{p_k \neq c, \hat{p}_k = c\}, \quad \forall k = 1, \cdots, s^2 \qquad (5)$$

where $\hat{p}$ denotes the vector of predicted labels in the patch $\hat{P}$, with top-left pixel index (i, j), of the estimated mask ($\hat{Y}$) and $p$ is the vector of ground-truth labels in the corresponding patch P of the target mask (Y); $p, \hat{p} \in \{0, 1\}^{s^2}$; s is the height and width of the patches P and $\hat{P}$.

In Video Object Segmentation, we implement the symmetric difference by extracting non-overlapping patches of dimension s × s (1 < s ≤ 4), represented as $\hat{P}_t\{i, j, s\}$, where (i, j) is the top-left index of the patch, from the predicted mask at time t and then evaluating the number of mislabeled pixels against the corresponding target patch $P^*_t\{i, j, s\}$ at the same time t. The formation of the target patch from the ground-truth patch at time (t − 1), denoted by $P_{t-1}\{i − 2, j − 2, s + 4\}$, by warping with the optical flow vector patch $W^*_{t-1}\{i − 2, j − 2, s + 4\}$ at time (t − 1), is explained step-wise in Algorithm 1.
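The warping step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: patches are nested lists, the flow is stored as a (row, column) displacement per pixel, displaced coordinates are rounded to the nearest pixel, and when two source pixels land on the same target pixel the later one overwrites the earlier one. None of these details are specified in the text, and the `margin` parameter (2 pixels on each side, giving the s + 4 patch) is a hypothetical name.

```python
def optical_flow_warp(gt_patch, flow_patch, s, margin=2):
    """Warp an enlarged ground-truth patch at t-1 into an s x s target patch at t.

    gt_patch[u][v]   : 0/1 label at row u, column v of the (s + 2*margin) patch
    flow_patch[u][v] : (du, dv) displacement of that pixel, in (row, column) order
    """
    size = s + 2 * margin
    target = [[0] * s for _ in range(s)]      # initialise the target patch with zeros
    for u in range(size):
        for v in range(size):
            du, dv = flow_patch[u][v]
            u_new = int(round(u + du))        # new location at time t (delta t = 1)
            v_new = int(round(v + dv))
            # express the new location in the coordinates of the inner s x s patch
            r, c = u_new - margin, v_new - margin
            if 0 <= r < s and 0 <= c < s:     # keep only pixels that land inside
                target[r][c] = gt_patch[u][v]
    return target
```

The enlarged source patch lets foreground pixels that start just outside the s × s window move into it, which is presumably why the ground-truth and flow patches are extracted with a 2-pixel border.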

In simpler terms, we calculate the symmetric difference score between small portions of the predicted mask and the corresponding optical flow warped ground-truth mask. We assume that the motion features are effectively transferred from the ground-truth mask of the previous time step to the warped target mask of the current time step by the optical flow vectors. Thus, the motion related features can be well approximated with low resolution patches in both the spatial and temporal domains. The smoothness of the features is also guaranteed, unless there is a sudden change of scene or rapid movement in the videos. Computing the loss function in a patch-wise manner increases the attention paid to small disjoint sections of the segmented masks formed as a result of occlusions, which are often ignored when the loss is estimated globally.

The Inter-frame Temporal Symmetric Difference score for each patch ($ITSD_{patch}$) of a predicted mask with top-left index (i, j) is denoted as:

$$ITSD_{patch}(\hat{p}, p^*) = \frac{|M^*|}{|\{\hat{p} = 1\} \cup M^*|} \qquad (6)$$

which estimates a ratio of mispredicted pixels to the total number of pixels for the patch having top-left pixel index (i, j) (refer to line 11 in algorithm 2).
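A small worked sketch may help here. It assumes equation 6 is the ratio of the mislabeled set M* to the set of pixels that are either predicted as foreground or mislabeled. For binary labels that second set is exactly the union of predicted and target foreground, so the score equals 1 − IoU for the patch. The function name and the handling of an all-background patch are assumptions.

```python
def itsd_patch(pred, target):
    """Symmetric-difference score for two flat lists of 0/1 labels.

    pred   : predicted labels of the patch
    target : labels of the optical flow warped ground-truth patch
    """
    # |M*|: pixels whose labels disagree
    mislabeled = sum(1 for p, t in zip(pred, target) if p != t)
    # |{pred = 1} U M*|: for binary labels this is the union of both foregrounds
    union = sum(1 for p, t in zip(pred, target) if p == 1 or t == 1)
    if union == 0:
        return 0.0  # assumption: two all-background patches count as a perfect match
    return mislabeled / union
```

For example, with `pred = [1, 1, 0, 0]` and `target = [1, 0, 1, 0]` two pixels disagree and three pixels are foreground in at least one mask, so the score is 2/3, while the IoU of the two patches is 1/3.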

The step-by-step procedure of evaluating the ITSD score by matching the corresponding local patches of the estimated and optical flow warped masks is explained in algorithm 2. The objective function modeled by calculating the ITSD score for the generator (G) network not only minimizes the score for


Algorithm 2: Inter-frame Temporal Symmetric Difference score to estimate the similarity between optical flow warped ground-truth mask(s) and predicted mask(s).

extract_patch($X_t$, a, b, d) is used to extract a patch of dimension d × d starting from the top-left pixel index (a, b) of the frame $X_t$ at time t.

Input: Ground-truth masks (Y), Predicted masks ($\hat{Y}$), Optical flow vector maps (W)

Output: Inter-frame Temporal Symmetric Difference score ($Score_{ITSD}$)

// s = height and width of a patch on the mask
// S = height and width of the masks
// t = current time
// T = number of masks predicted

1  Initialize: $Score_{ITSD}$ = 0
2  for t = 1 to T do
3      for i = 0 to S, i ← i + s do
4          for j = 0 to S, j ← j + s do
5              $\hat{P}_t$ ← extract_patch($\hat{Y}_t$, i, j, s)
6              $P_{t-1}$ ← extract_patch($Y_{t-1}$, i − 2, j − 2, s + 4)
7              $W^*_{t-1}$ ← extract_patch($W_{t-1}$, i − 2, j − 2, s + 4)
8              $P^*_t$ ← optical_flow_warp($P_{t-1}$, $W^*_{t-1}$) (refer to Algo. 1)
9              $\hat{p}$ ← vector of predicted labels of pixels obtained from the patch $\hat{P}_t$
10             $p^*$ ← vector of labels of pixels obtained from the patch $P^*_t$
11             $M^*$ ← {$p^*_k$ = 1, $\hat{p}_k$ = 0} ∪ {$p^*_k$ = 0, $\hat{p}_k$ = 1}, ∀k = 1, ···, $s^2$ (see eqn. 5)
               /* $M^*$ denotes the set of mislabeled pixels in $\hat{p}$ compared with $p^*$ */
12             $Score_{ITSD}$ ← $Score_{ITSD}$ + $|M^*|$ / $|\{\hat{p} = 1\} \cup M^*|$ (see eqn. 6)
13         end
14     end
15     $Score_{ITSD}$ ← $Score_{ITSD}$ / (number of patches)   // Average over all the patches
16 end
17 $Score_{ITSD}$ ← $Score_{ITSD}$ / T   // Average over all the masks

batch inputs but also maintains the temporal data distribution by stabilizing the motion features generated by the network with the help of flow vectors. The loss function $L_{ITSDL}$ is defined as:

$$L_{ITSDL}(Y, \hat{Y}) = Score_{ITSD}(Y, \hat{Y}) \qquad (7)$$

where $\hat{Y}$ and Y represent the predicted and optical flow warped masks, and $Score_{ITSD}$ denotes the mean symmetric difference score over all the masks, obtained using the process described in algorithm 2.

4 INTRA-FRAME TEMPORAL

LOSS

The ITSD Loss, described in section 3, estimates the motion features that change slowly with respect to time using local symmetric difference measures, which in turn also enhances the segmentation quality. To maintain the temporal relationship between the frames globally, we introduce the idea of an Intra-frame Temporal Loss over the network output masks. A few works (Goroshin et al., 2015; Mobahi et al., 2009) in the recent past exploit the idea of temporal coherence to learn motion features. Assuming

slow variation of motion features over time, we consider two consecutive frames $\hat{Y}_t$ and $\hat{Y}_{t+1}$ as a temporal pair, where $\hat{Y}_t$ and $\hat{Y}_{t+1}$ are TempSeg-GAN generated output masks at time t and t + 1 respectively, with the discriminator (D) outputs $q_t$ and $q_{t+1}$ being equal to 1 for both masks. The slow variation of motion features is modeled through an objective function as

$$L_{IFTL}(\hat{Y}, \vec{q}) = \sum_{t=0}^{T-1} D_{\delta}(\hat{Y}_t, \hat{Y}_{t+1}, q_t \times q_{t+1}) = \sum_{t=0}^{T-1} \Big[ q_t \times q_{t+1} \times d(\hat{Y}_t, \hat{Y}_{t+1}) + (1 - q_t \times q_{t+1}) \times \max(0, \delta - d(\hat{Y}_t, \hat{Y}_{t+1})) \Big] \qquad (8)$$

where T is the total time duration of the masks generated by the network, $q_t$ ∈ {0, 1} gives the value of the discriminator output, d(x, y) is the Euclidean distance and δ is a positive constant. In simpler terms, equation 8 minimizes the


distance between consecutive predicted masks that the discriminator labels as real, while penalizing pairs containing an incorrectly predicted mask whenever their distance falls below a positive margin δ.
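The structure of this contrastive-style loss can be sketched as follows. The sketch assumes masks are flattened to equal-length lists of numbers and that `q` holds the 0/1 discriminator decision for each mask; the function names and the default margin are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two flattened masks."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_term(mask_a, mask_b, q_a, q_b, delta):
    """One summand of equation 8 for a pair of masks."""
    d = euclidean(mask_a, mask_b)
    q = q_a * q_b  # 1 only if both masks are judged real
    return q * d + (1 - q) * max(0.0, delta - d)

def iftl(masks, q, delta=1.0):
    """Sum of pair terms over consecutive masks (t, t+1); delta=1.0 is an assumed value."""
    return sum(
        pair_term(masks[t], masks[t + 1], q[t], q[t + 1], delta)
        for t in range(len(masks) - 1)
    )
```

When both masks of a pair are judged real, only their distance contributes, so the loss pulls them together. Otherwise the hinge term is active and contributes only while the pair is closer than `delta`.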

Long-range Intra-frame Temporal Loss: Though the IFT Loss maintains the temporal consistency between consecutive frames, it does not guarantee the same over longer ranges of frames. Thus, to keep the stability intact in the spatio-temporal feature space, a Long-range Intra-frame Temporal Loss (L-IFTL) is incorporated by extending the IFT Loss to an estimation of

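the distance between the initial predicted mask ($\hat{Y}_0$) and all other predicted masks ($\hat{Y}_t$) at time t (> 0). The proposed loss is defined as

$$L_{L-IFTL}(\hat{Y}, \vec{q}) = \sum_{t=1}^{T} D_{\delta}(\hat{Y}_0, \hat{Y}_t, q_0 \times q_t) = \sum_{t=1}^{T} \Big[ q_0 \times q_t \times d(\hat{Y}_0, \hat{Y}_t) + (1 - q_0 \times q_t) \times \max(0, \delta - d(\hat{Y}_0, \hat{Y}_t)) \Big] \qquad (9)$$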

where the symbols have the same meaning as in equation 8.

Thus, L-IFTL preserves the temporal coherency among distant frames by estimating the distance between the initial frame and the rest of the generated frames.
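The long-range variant differs from the consecutive-pair loss only in which masks are paired. A minimal sketch, assuming flattened masks, 0/1 discriminator decisions in `q`, and a hypothetical default margin:

```python
import math

def l_iftl(masks, q, delta=1.0):
    """Long-range loss sketch: pair the initial mask with every later mask."""
    total = 0.0
    for t in range(1, len(masks)):
        # Euclidean distance between the initial mask and the mask at time t
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(masks[0], masks[t])))
        w = q[0] * q[t]  # 1 only if both masks are judged real
        total += w * d + (1 - w) * max(0.0, delta - d)
    return total
```

Because every term is anchored at the first mask, drift that accumulates slowly over many frames is penalized even when each consecutive pair stays close.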

5 MULTI-COMPONENT

OBJECTIVE FUNCTION

Finally, the overall objective function is formed by combining the loss functions given in eqns. 7 - 9 with the adversarial loss (refer to eqn. 3) and the traditional objective $L$, with respective weights, as follows

$$L_{combined} = \alpha_{adv} L^{G}_{adv}(I) + \alpha L(Y, \hat{Y}) + \alpha_{ITSDL} L_{ITSDL}(Y, \hat{Y}) + \alpha_{IFTL} L_{IFTL}(\hat{Y}, \vec{q}) + \alpha_{L-IFTL} L_{L-IFTL}(\hat{Y}, \vec{q}) \qquad (10)$$

where the weights $\alpha$, $\alpha_{ITSDL}$, $\alpha_{IFTL}$ and $\alpha_{L-IFTL}$ are set to 0.25 while $\alpha_{adv}$ is kept at 0.1. This combined loss is minimized during the training of TempSeg-GAN using the Adam optimizer (Kingma and Ba, 2014).
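As a small arithmetic illustration of the weighting, the sketch below combines five already-computed scalar loss values with the weights stated above (0.1 for the adversarial term, 0.25 for each of the others). The function and argument names are hypothetical; `trad` stands for the traditional objective term.

```python
def combined_loss(adv, trad, itsdl, iftl, l_iftl, w_adv=0.1, w_other=0.25):
    """Weighted sum of the five loss terms of the combined objective."""
    return w_adv * adv + w_other * (trad + itsdl + iftl + l_iftl)
```

If all five terms were equal to 1, the four non-adversarial terms would contribute 1.0 in total and the adversarial term 0.1, so under these weights the adversarial signal acts as a comparatively small correction.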

6 EXPERIMENTAL RESULTS

AND DISCUSSIONS

In this section, we discuss the performance of our proposed model for video object segmentation on three popular real-world benchmark datasets. Sequences of video frames with corresponding ground-truth masks are used to train the model. During validation, the network generates segmented masks of the object of interest when provided with frames of video sequences as input. To specify the particular object to be segmented, the model is fine-tuned with the first two annotated frames of the videos. The network-generated segmented masks are in turn used by our proposed architecture as reference masks to produce the predicted masks of the next time steps. Optical flow vectors are also calculated using FlowNet 2.0 (Ilg et al., 2017) between consecutive frames, which helps in formulating the ITSD Loss (described in section 3). Three metrics, Region similarity (J), Contour accuracy (F) and Temporal (in-)stability (T) (Perazzi et al., 2016), have been used to compare the results of our network with the existing state-of-the-art techniques.

6.1 Datasets

Evaluation of the proposed adversarial training based method is done on three benchmark datasets with challenging characteristics like occlusion, motion blur, background clutter, change of appearance, etc.

DAVIS 2016 (Perazzi et al., 2016) consists of 50 high resolution video sequences, with 30 used for training and the remaining 20 for validation. Single or multiple connected objects in each of the 3,455 frames of the dataset are provided with pixel-level segmentation.

SegTrack-v2 (Li et al., 2013) contains 14 videos with a total of 947 frames. Sequences with multiple objects are annotated with instance-level segmentation where each annotation is treated as an individual object.

YouTube-Objects Dataset (Prest et al., 2012) includes 126 videos with 10 object classes. The ground-truth segmentation masks with pixel-level accuracy for ∼4,250 frames have been obtained from (Jain and Grauman, 2014).

6.2 Evaluation Metric for Segmentation

Three measures (Perazzi et al., 2016) are used for quantitative assessment of the predicted masks in comparison with the ground-truth masks: (a) Region similarity (J), (b) Contour accuracy (F) and (c) Temporal (in-)stability (T).

Region similarity (J) or intersection-over-union (IoU) measures the similarity in segmentation by computing the region overlap between the estimated ($\hat{Y}$) and ground-truth (Y) masks and is defined as:

$$J = \frac{|\hat{Y} \cap Y|}{|\hat{Y} \cup Y|}$$
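For concreteness, the region similarity of two binary masks can be computed as in the following sketch, which assumes flattened 0/1 masks of equal length. Returning 1.0 when both masks are empty is an assumption, since the ratio is undefined in that case.

```python
def region_similarity(pred, gt):
    """Jaccard index |pred AND gt| / |pred OR gt| for flat 0/1 masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    union = sum(1 for p, g in zip(pred, gt) if p == 1 or g == 1)
    return inter / union if union else 1.0  # assumed convention for two empty masks
```

With `pred = [1, 1, 0, 0]` and `gt = [1, 0, 1, 0]`, one pixel is shared and three pixels are foreground in at least one mask, giving J = 1/3.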

Contour accuracy (F ) gives a measure based on the

precision and recall of the contour points forming the

segmentation boundary.

Temporal (in-)stability (T ) evaluates how smooth

are the motion features in a video, thus measuring the

consistency of sequences of frames over time.

6.3 Performance Analysis of Video

Object Segmentation

The proposed TempSeg-GAN model is first pre-trained on the ImageNet (Deng et al., 2009) dataset. During training on DAVIS 2016 (Perazzi et al., 2016), the 30 sequences of annotated video frames pre-allocated for training have been used. The frames are also augmented by random rotation, flipping and zooming, in such a way that the temporal consistency among the frames is not disturbed. While validating, the first two frames of each of the remaining 20 video sequences, along with their ground-truth masks, are fed into the GAN to fine-tune the generator before the remaining video frames are provided sequentially to the network as input to generate the corresponding segmented masks. Fine-tuning aids in capturing the appearance of the specific object of interest required for foreground segmentation. Also, to maintain the temporal relationship between the predicted masks, the framework uses the network-output masks of previous time steps as reference when generating the masks of the next time step. A well-tuned Conditional Random Field (CRF) (Krähenbühl and Koltun, 2011) is used as a post-processing module on top of our method for a fair comparison with recent state-of-the-art methods. Augmentation of the frames at the fine-tuning stage (test-time augmentation) is also done to improve the quality of segmentation. TempSeg-GAN++ refers to the modified version of our baseline model with the aforementioned add-ons attached.

Ablation Studies on DAVIS 2016

Variations have been made in the proposed architecture to study its performance, and the results obtained in the different cases are shown in table 1. First, we remove the fine-tuning based on the initial frames of the videos and then the output on the validation set is studied in an unsupervised setup.

Table 1: Ablation study of our proposed method on the DAVIS 2016 dataset. Keeping the entire system intact, one variation is made at a time to see the contribution of each module. The last row consists of the result after adding test-time augmentation and CRF on top of our base method. The result of the best configuration is in bold. The right-most column gives the $J_{Mean}$ difference ($\Delta J_{Mean}$) of different settings in comparison with the baseline method (in row 5).

Aspect     System variant          J_Mean   ∆J_Mean
---------  ----------------------  -------  --------
Training   w/o fine-tune           76.8     -8.3
Training   w/o DAVIS training      68.7     -16.4
Training   w/o ImageNet weights    79.3     -5.8
Loss       L loss                  81.2     -3.9
Loss       + ITSDL                 84.6     -0.5
Loss       TempSeg-GAN             85.1     -
Add-ons    TempSeg-GAN++           86.3     +1.2

A substantial decrease in $J_{Mean}$ was noticed when relying only on pre-trained ImageNet (Krizhevsky et al., 2012) weights and DAVIS training data, making fine-tuning indispensable for expanding the tracking capabilities in the video sequences. Likewise, relying only on pre-trained ImageNet weights and fine-tuning, while skipping the training on the DAVIS 2016 (Perazzi et al., 2016) dataset, shows a drastic drop (68.7 $J_{Mean}$) in the performance of the model. Removing the pre-trained ImageNet weights results in a decrease in $J_{Mean}$, owing to the loss of scale information. We argue that tracking a specific object in a video requires a reasonable amount of prior knowledge, which comes from pre-training the network on the ImageNet (Deng et al., 2009) dataset consisting of ∼10 M objects belonging to 1000 categories. Thus, these pre-trained weights help the model learn a general objectness prior, while the training set of DAVIS 2016 provides an advantage in evaluation by helping the model adapt to the characteristics of the dataset. The initial frames of the validation set videos guide the network to track the specific object of interest throughout the sequences.

The proposed objective functions play an important role in the generation of segmentation masks in the sequence of video frames. Using only the L loss produces holes in the segmented masks, and some of them contain small blobs generated outside the region of interest, causing inaccurate segmentation. On the other hand, the proposed Inter-frame Temporal Symmetric Difference Loss (ITSDL), combined with L and the Intra-frame Temporal Loss (IFTL), produces impressive results (see figure B.2 in the Appendix), where the ITSDL helps in removing the blob-like artifacts, thereby improving the contours of the output masks. We also add a well-tuned post-processing CRF on top of our proposed method along with augmentation of initial frames during


Table 2: Quantitative analysis of TempSeg-GAN in comparison with other existing semi-supervised methods on DAVIS 2016, YouTube-Objects and SegTrack-v2 datasets. Other results used for comparison are from the respective papers. Best results are in bold. Underlined values represent the next best results. ↑ = 'higher the value better'; ↓ = 'lower the value better'.

Method                                    DAVIS 2016                    YouTube-Objects  SegTrack-v2
                                          J_Mean ↑  F_Mean ↑  T_Mean ↓  J_Mean ↑         J_Mean ↑
----------------------------------------  --------  --------  --------  ---------------  -----------
BVS (Märki et al., 2016)                  60.0      58.8      34.7      59.7             58.4
OFL (Tsai et al., 2016)                   68.0      63.4      22.2      70.1             67.5
OSVOS (Caelles et al., 2017)              79.8      80.6      37.8      72.5             65.4
Masktrack (Perazzi et al., 2017)          80.3      75.8      18.6      72.6             70.3
RGMP (Wug Oh et al., 2018)                81.5      82.0      13.3      -                71.1
LucidTracker (Khoreva et al., 2017)       80.5      -         -         76.2             77.6
FAVOS (Cheng et al., 2018)                82.4      79.5      26.3      -                -
OnAVOS (Voigtlaender and Leibe, 2017)     85.7      84.2      18.5      77.4             -
CINM (Bao et al., 2018)                   83.4      85.0      28.0      78.4             77.1
TempSeg-GAN (ours)                        85.1      83.3      15.1      77.6             76.8
TempSeg-GAN++ (ours)                      86.3      85.2      14.2      78.4             77.9

fine-tuning stage to boost the $J_{Mean}$ value further. It is evident from table 1 that each of the above factors is important and that removing any one of them causes deterioration in both quantitative and qualitative outputs.

Quantitative Analysis with Existing Methods

Most of our experiments are performed on the DAVIS 2016 (Perazzi et al., 2016) dataset, which consists of high-resolution video sequences with all of their frames annotated with pixel-level segmentation. For DAVIS, three metrics are used for evaluation: (i) region similarity in terms of the mean Jaccard index ($J_{Mean}$), (ii) mean contour accuracy ($F_{Mean}$) and (iii) mean temporal (in-)stability of the segmented masks ($T_{Mean}$). The validation set of DAVIS 2016 is used for computation and comparison.
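As an illustration of the region similarity measure named above, the following minimal sketch computes the Jaccard index (intersection over union) of a predicted and a ground-truth binary mask and averages it over the frames of one sequence. It is not the official DAVIS evaluation code; the function names, the toy masks and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
import numpy as np


def jaccard_index(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J = |pred AND gt| / |pred OR gt| for two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks are treated as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)


def mean_jaccard(pred_masks, gt_masks) -> float:
    """Average of the per-frame Jaccard index over one video sequence."""
    scores = [jaccard_index(p, g) for p, g in zip(pred_masks, gt_masks)]
    return float(np.mean(scores))


# Toy example: a 2x2 ground-truth object and a prediction that is one column too wide.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 1:4] = 1

# 4 shared pixels out of 6 pixels in the union.
print(jaccard_index(pred, gt))
print(mean_jaccard([pred, gt], [gt, gt]))
```

The values in table 2 appear to be such scores expressed as percentages.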

We compare our work with a number of recent state-of-the-art semi-supervised methods: OnAVOS (Voigtlaender and Leibe, 2017), MaskTrack (Perazzi et al., 2017), OSVOS (Caelles et al., 2017), LT (Khoreva et al., 2017), CINM (Bao et al., 2018), RGMP (Wug Oh et al., 2018), FAVOS (Cheng et al., 2018), OFL (Tsai et al., 2016) and BVS (Märki et al., 2016). The quantitative results of our method in comparison with these techniques are shown in table 2. In terms of region similarity $J_{Mean}$, our baseline TempSeg-GAN w/o adapt model (85.1) beats all other existing techniques except OnAVOS (85.7) (Voigtlaender and Leibe, 2017), which uses online adaptation, test-time augmentation and a CRF. On using CRFs and test-time augmentation on top of our base network (TempSeg-GAN++), the result obtained (86.3) surpasses OnAVOS. In terms of contour accuracy $F_{Mean}$, TempSeg-GAN++ (85.2) outperforms all other methods, though the base model (83.3) falls short of CINM (Bao et al., 2018) and OnAVOS. On the temporal (in-)stability measure $T_{Mean}$, both TempSeg-GAN (15.1) and TempSeg-GAN++ (14.2) outperform all the recent state-of-the-art semi-supervised methods except RGMP (13.3) (Wug Oh et al., 2018), which uses guided mask propagation as a part of the model (refer to column 3 of the results under DAVIS 2016 in table 2). Thus, the overall performance of the TempSeg-GAN base model and its modified variant (TempSeg-GAN++) is better than that of most existing methods, with small exceptions in a few cases.

The Inter-frame Temporal Symmetric Difference Loss (ITSDL) used in our network accounts for the success of our base model, which outperforms the majority of the existing state-of-the-art techniques by minimizing the number of mispredicted pixels in the segmentation. It thus not only increases the $J_{Mean}$ value, but also improves the segmented contour by working on small patches. In addition, ITSDL, together with the Intra-Frame Temporal Loss (IFTL) and its long-range variant, contributes to the temporal stability between the generated masks. The quantitative results in table 2 demonstrate the effectiveness of our proposed ITSDL and IFTL objective functions.
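The link between mispredicted pixels and the Jaccard index can be made concrete with a small example. The sketch below is not the paper's ITSDL, which is defined in section 3. It only illustrates the set identity behind the argument: the symmetric difference of a predicted and a ground-truth mask counts the mispredicted pixels, and $J = 1 - |P \,\triangle\, G| / |P \cup G|$, so fewer mispredicted pixels give a higher $J$. The toy masks and the function name are assumptions made for this example.

```python
import numpy as np


def symmetric_difference_count(pred: np.ndarray, gt: np.ndarray) -> int:
    """Number of pixels that are foreground in exactly one of the two binary masks."""
    return int(np.logical_xor(pred.astype(bool), gt.astype(bool)).sum())


# Ground truth: a 3x3 object inside a 5x5 frame (9 foreground pixels).
gt = np.zeros((5, 5), dtype=bool)
gt[1:4, 1:4] = True

# Prediction with two typical errors: an isolated blob outside the object
# and a hole inside it.
pred = gt.copy()
pred[0, 0] = True   # false-positive blob
pred[2, 2] = False  # false-negative hole

mispredicted = symmetric_difference_count(pred, gt)
union = int(np.logical_or(pred, gt).sum())
intersection = int(np.logical_and(pred, gt).sum())

# |P AND G| = |P OR G| - |P XOR G|, hence J = 1 - |P XOR G| / |P OR G|.
jaccard = intersection / union
print(mispredicted, union, intersection)
print(jaccard, 1 - mispredicted / union)
```

Removing either the blob or the hole lowers the symmetric-difference count by one and raises the Jaccard index accordingly.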

For a complete evaluation, experiments are also carried out on the SegTrack-v2 (Li et al., 2013) and YouTube-Objects (Prest et al., 2012) datasets, and our results are compared (refer table 2) with recent state-of-the-art methods. Due to the lack of a proper training set in the YouTube-Objects dataset, the same parameters as for DAVIS 2016 are used, and the pre-training step on the DAVIS training set is removed to evaluate the generalization capability of our method. While evaluating on this dataset, we are consistent with (Khoreva et al., 2017), i.e. the frames in which the object of interest is absent are also included. Both the TempSeg-GAN base network (77.6 $J_{Mean}$) and its modified variant (78.4 $J_{Mean}$) give better results than OnAVOS (refer table 2). TempSeg-GAN++ performs on par with the state-of-the-art CINM model.

Figure 2: Qualitative results on three benchmark real-world datasets exhibit that our proposed method gives impressive results in challenging situations like change of appearance, occlusions, background clutter and motion blur, when compared to OnAVOS (Voigtlaender and Leibe, 2017) and MSK (Perazzi et al., 2017) (best viewed in color).

Assessment on the SegTrack-v2 dataset follows the same choice of settings as recent existing methods (Perazzi et al., 2017; Khoreva et al., 2017). The same protocol as for the DAVIS 2016 evaluation is followed, with fine-tuning on the initial frame of the test video sequences. Table 2 shows that the modified version of our base network (TempSeg-GAN) exhibits promising results in comparison with the recent state-of-the-art models. The comparative study of the qualitative results of our proposed network with OnAVOS (Voigtlaender and Leibe, 2017) and MaskTrack (Perazzi et al., 2017) on three popular real-world datasets is shown in Figure 2, where our method performs well in segmenting the specific object of interest under difficult conditions like background clutter, viewpoint change, motion blur, occlusions and shape deformation of the object. More visual results of TempSeg-GAN on the three real-world datasets are shown in figure B.1 of the Appendix.

7 CONCLUSION

The paper proposes a temporally aided Generative Adversarial Network for the purpose of Video Object Segmentation. The generator of the model is modified by implementing an encoder-decoder type architecture with skip connections, along with a variation in the discriminator training by introducing an additional encoder module. The introduction of the Inter-frame Temporal Symmetric Difference Loss (ITSDL) and the Intra-frame Temporal Loss (IFTL) not only provides a significant improvement in the segmentation results over the existing state-of-the-art techniques, but also preserves the motion features among the generated masks. Quantitative results on three benchmark datasets reveal the superiority of TempSeg-GAN over other recent state-of-the-art methods. This work can be effectively applied to segment traffic signs, vehicles and other obstacles in the context of autonomous cars.


REFERENCES

Bao, L., Wu, B., and Liu, W. (2018). CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5977–5986.

Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., and Torr, P. H. (2016). Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision (ECCV), pages 850–865. Springer.

Braham, M. and Van Droogenbroeck, M. (2016). Deep background subtraction with scene-specific convolutional neural networks. In IEEE International Conference on Systems, Signals and Image Processing (IWSSIP), pages 1–4. IEEE.

Caelles, S., Maninis, K.-K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., and Van Gool, L. (2017). One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

Cheng, J., Tsai, Y.-H., Hung, W.-C., Wang, S., and Yang, M.-H. (2018). Fast and accurate online video object segmentation via tracking parts.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680.

Goroshin, R., Bruna, J., Tompson, J., Eigen, D., and LeCun, Y. (2015). Unsupervised learning of spatiotemporally coherent metrics. In IEEE International Conference on Computer Vision (ICCV), pages 4086–4093.

Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., and Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 6.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456.

Jain, S. D. and Grauman, K. (2014). Supervoxel-consistent foreground propagation in video. In European Conference on Computer Vision (ECCV), pages 656–671. Springer.

Khoreva, A., Benenson, R., Ilg, E., Brox, T., and Schiele, B. (2017). Lucid data dreaming for object tracking. In The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krähenbühl, P. and Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems (NIPS), pages 109–117.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105.

Li, F., Kim, T., Humayun, A., Tsai, D., and Rehg, J. M. (2013). Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2192–2199.

Li, H., Li, Y., and Porikli, F. (2016). DeepTrack: Learning discriminative feature representations online for robust visual tracking. IEEE Transactions on Image Processing (TIP), 25(4):1834–1848.

Luc, P., Couprie, C., Chintala, S., and Verbeek, J. (2016). Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Training.

Maninis, K.-K., Pont-Tuset, J., Arbeláez, P., and Van Gool, L. (2016). Convolutional oriented boundaries. In European Conference on Computer Vision (ECCV), pages 580–596. Springer.

Märki, N., Perazzi, F., Wang, O., and Sorkine-Hornung, A. (2016). Bilateral space video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 743–751.

Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In International Conference on Machine Learning (ICML), pages 737–744. ACM.

Nam, H. and Han, B. (2016). Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4293–4302.

Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., and Sorkine-Hornung, A. (2017). Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2.

Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., and Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 724–732.

Prest, A., Leistner, C., Civera, J., Schmid, C., and Ferrari, V. (2012). Learning object class detectors from weakly annotated video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3282–3289. IEEE.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99.


Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Souly, N., Spampinato, C., and Shah, M. (2017). Semi supervised semantic segmentation using generative adversarial network. In IEEE International Conference on Computer Vision (ICCV), pages 5689–5697. IEEE.

Tsai, Y.-H., Yang, M.-H., and Black, M. J. (2016). Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3899–3908.

Voigtlaender, P. and Leibe, B. (2017). Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference (BMVC).

Wang, N. and Yeung, D.-Y. (2013). Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems (NIPS), pages 809–817.

Wang, Y., Luo, Z., and Jodoin, P.-M. (2017). Interactive deep learning method for segmenting moving objects. Pattern Recognition Letters, 96:66–75.

Wug Oh, S., Lee, J.-Y., Sunkavalli, K., and Joo Kim, S. (2018). Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7376–7385.

Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer.

APPENDIX

A Network Architecture Details

The details of the generator (G), discriminator (D) and encoder (E) networks of our proposed TempSeg-GAN, as used in the experimental analysis, are presented in table A.1. All the convolution layers of G are followed by a ReLU non-linearity, while batch normalization and dropout at a rate of 50% are also included in the network. G is an encoder-decoder network in which skip connections connect layer i with layer (n − i) by concatenating the feature maps of the former with those of the latter. Unpooling layers are also used to upsample the image by a factor of 2 to a higher resolution in both width and height. G is initialized with weights pre-trained on ImageNet, and its learning rate is initially set to 0.002 and decreases gradually over time to 0.0004. The discriminator (D) network uses a learning rate of 0.01 and also uses ReLU non-linearities. For training the entire network, mini-batches of 50 frames of video sequences were used.
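The skip-connection and upsampling scheme described above can be sketched on plain arrays as follows. This is a minimal illustration and not the authors' implementation: nearest-neighbour repetition stands in for the unpooling layer, n is assumed to be the layer index used in the (i, n − i) pairing, and the helper names and toy tensor shapes are hypothetical.

```python
import numpy as np


def upsample2x(feat: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map.

    Assumption: used here as a simple stand-in for the unpooling layer.
    """
    return feat.repeat(2, axis=1).repeat(2, axis=2)


def skip_concat(early: np.ndarray, late: np.ndarray) -> np.ndarray:
    """Concatenate the feature maps of an early layer with those of a late layer
    along the channel axis."""
    if early.shape[1:] != late.shape[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    return np.concatenate([early, late], axis=0)


def skip_pairs(n: int):
    """Layer pairs (i, n - i) with i < n - i, for layers numbered 1 .. n - 1."""
    return [(i, n - i) for i in range(1, n) if i < n - i]


rng = np.random.default_rng(0)
early = rng.random((64, 32, 32))   # hypothetical output of an encoder layer i
deep = rng.random((128, 16, 16))   # hypothetical lower-resolution decoder input

late = upsample2x(deep)            # shape (128, 32, 32)
merged = skip_concat(early, late)  # shape (192, 32, 32)
print(late.shape, merged.shape)
print(skip_pairs(6))               # [(1, 5), (2, 4)]
```

Concatenation rather than addition keeps the early, high-resolution features separate from the upsampled ones, at the cost of a wider input to the next convolution.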

Table A.1: Network architecture details; G, D and E are the generator, discriminator and encoder networks respectively.

| Network | G | D | E |
|---|---|---|---|
| Number of feature maps | 64, 128, 256, 512, 512, 512, 256, 128 | 256, 512, 512 | 64, 128, 256 |
| Kernel sizes | 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5 | 3, 5, 5 | 5, 3, 3 |
| Fully connected | N/A | 1024, 512 | N/A |

B Qualitative Results of TempSeg-GAN

More qualitative results of our TempSeg-GAN base network on three benchmark real-world datasets, viz. DAVIS 2016, SegTrack-v2 and YouTube-Objects, are shown in figure B.1.

Additional Illustrations

Apart from figures B.1-B.2, we also provide video outputs in .gif format. We perform qualitative evaluation on two video clips, one each from DAVIS and SegTrack-v2. The video outputs on DAVIS 2016 (DAVIS.gif) and SegTrack-v2 (Segtrackv2.gif) contain the comparison of TempSeg-GAN with the existing state-of-the-art techniques OnAVOS (Voigtlaender and Leibe, 2017) and MaskTrack (Perazzi et al., 2017) respectively. Both video clips show the superiority of TempSeg-GAN, as its result is quite close to the ground truth.


Figure B.1: Qualitative results on three benchmark real-world datasets exhibit that our proposed method gives impressive

results in challenging situations like change of appearance, occlusions, camera view change, background clutter and motion

blur (best viewed in color).

Figure B.2: Comparative study of predicted segmentation results on DAVIS 2016 obtained from our TempSeg-GAN models using only the L loss and the combined (L + ITSDL + IFTL) loss (refer eqn. 10) respectively. ITSDL and IFTL denote the proposed Inter-frame Temporal Symmetric Difference Loss (section 3) and Intra-Frame Temporal Loss (section 4) respectively. Figures in insets show zoomed-in patches for better visibility of the estimated segmented masks in areas with background clutter, occlusion and significant motion blur. Using only the L loss produces holes in the segmented mask, and some of the masks contain small blobs generated outside the region of interest, causing inaccurate segmentation. On the other hand, ITSDL + L + IFTL produces impressive results, where ITSDL helps in removing the blob-like artifacts, thereby improving the contours of the output masks, while IFTL preserves the temporal consistency between the generated masks (best viewed in color).
