A Full Reference Video Quality Measure based on Motion Differences
and Saliency Maps Evaluation
B. Ortiz-Jaramillo, A. Kumcu, L. Platisa and W. Philips
UGent-iMinds-IPI, St-Pietersnieuwstraat 41, Gent, Belgium
Keywords:
Dense Optical Flow, Psychophysics, Saliency Maps, Temporal Distortions, Video Quality Assessment.
Abstract:
While subjective assessment is recognized as the most reliable means of quantifying video quality, objective
assessment has proven to be a desirable alternative. Existing video quality indices achieve reasonable prediction of human quality scores: they predict quality degradation due to spatial distortions well, but are less successful with degradation due to temporal distortions.
In this paper, we propose a perception-based quality index in which the novelty is the direct use of motion
information to extract temporal distortions and to model the human visual attention. Temporal distortions are
computed from optical flow and common vector metrics. Results of psychovisual experiments are used for
modeling the human visual attention. Results show that the proposed index is competitive with current quality
indices presented in the state of the art. Additionally, the proposed index is much faster than other indices also
including a temporal distortion measure.
1 INTRODUCTION
Video-based applications such as video coding, digi-
tal television, surveillance and tele-medicine, are be-
coming more common in everyday life. Quality con-
trol is important in those applications for increasing
the quality of service, which can be degraded due to
compression, network packet delay, and packet loss,
among others. In this field, video quality assessment
(VQA) has an important role in evaluating and im-
proving the performance of video-based applications.
VQA can be grouped into two main categories: sub-
jective and objective methods. Subjective VQA is
performed by a group of persons, who evaluate (pro-
cessed or corrupted) video sequences according to
certain well-defined criteria such as ITU standards
(ITU-R-Recommendation-BT.500-11, 1998). The re-
sults are typically explained in terms of Differen-
tial Mean Opinion Scores (DMOS). This methodol-
ogy represents the most realistic system of measur-
ing the opinion of humans toward video or image
quality (ITU-R-Recommendation-BT.500-11, 1998),
(VQEG, 2003). However, such tests are complex, ex-
pensive, and time consuming. Hence, objective (au-
tomatic) assessment is a more desirable alternative.
Objective VQA uses computer algorithms for
computing quality scores that should correlate well
with the subjective assessment provided by human
evaluators. In general, those algorithms are classified
into natural visual characteristic methods (NVC) or
perceptual oriented methods (POM) (Chikkerur et al.,
2011). NVC methods use statistical measures (mean,
variance, histograms) and/or visual features (blurring,
blocking, texture, visual impairments) for comput-
ing quality scores, for example the Video Quality
Model (VQM) (Pinson and Wolf, 2004), the motion
compensated structural similarity index (MC-SSIM)
(Moorthy and Bovik, 2010) or video quality assess-
ment by decoupling additive impairments and detail losses (VQAD) (Li et al., 2011). VQM is computed
by using local spatio-temporal statistics which are ex-
tracted and combined to obtain a quality score. MC-
SSIM uses a popular still image quality metric and
block-based motion compensation for computing er-
rors along motion trajectories. VQAD separates de-
tail losses and additive impairments by subtracting a
restored version of each frame from the reference im-
age. Then, both distortions are pooled and combined
for computing a single quality score.
POM methods have been designed based on re-
sults of physiological and/or psychovisual experi-
ments, e.g., weighted structural similarity index (wS-
SIM) (Wang and Li, 2007), and motion-based video
integrity evaluation index (MOVIE) (Seshadrinathan
and Bovik, 2010). wSSIM uses the structural sim-
ilarity index for measuring local image differences,
termed quality maps. For computing a unique qual-
ity score from those quality maps, a spatiotemporal
weighted mean is used. Those weighting factors are
computed based on a Bayesian optimal observer hy-
pothesis. MOVIE uses a Gabor filter bank specif-
ically designed based on physiological findings for
mimicking the visual system response. The video
quality evaluation is carried out from two compo-
nents (spatial and temporal distortions). Spatial dis-
tortions are computed as squared differences between
Gabor coefficients of the reference and processed se-
quences. Temporal distortions are obtained from the
mean square error between reference and processed
sequences along motion trajectories computed over
the reference video. Although many video quality
assessment algorithms have been proposed, many of
them do not explicitly account for temporal artifacts
which occur in video sequences. For instance, in
VQM, MC-SSIM, VQDM and wSSIM, motion infor-
mation is only used to design weights to pool qual-
ity maps into a single quality score for the video.
However, weights based on temporal information do
not necessarily account for temporal distortions (Se-
shadrinathan and Bovik, 2010). Despite the direct
use of motion in MOVIE, the computational complex-
ity of the algorithm makes practical implementation
difficult as it relies on 3-D optical flow and filter bank
computation (Moorthy and Bovik, 2010).
Since motion is critical for measuring quality,
video quality metrics have to take into account tem-
poral (motion compensation mismatch, jitter, ghost-
ing, and mosquito noise, ...) and spatial (blocking,
blurring and edge distortion, ...) distortions (Yuen
and Wu, 1998). Considering that the area of still
image quality assessment has attained maturity, spa-
tial distortions are usually well captured by current
quality metrics (image quality metrics achieve cor-
relations above 0.9 between subjective and objective
scores, in most of the databases) (Seshadrinathan and
Bovik, 2010). However, temporal distortions are not
well estimated by existent methods (Seshadrinathan
and Bovik, 2010). In addition, motion information is
only used to compute weights and/or extracted from a
filter bank response in current methods, which is gen-
erally, as stated previously, inaccurate for capturing
temporal distortions (Wang and Li, 2007) (Seshadri-
nathan and Bovik, 2010) (Moorthy and Bovik, 2010).
Considering that most of the existing VQA algorithms
compute motion information indirectly, it is necessary
to fully investigate the contribution of motion to hu-
man perception of quality. Therefore, we believe that
temporal distortions or errors due to motion should be
computed directly from the local motion field instead
of using the methods listed above.
In this paper, we propose a POM-based quality in-
dex in which the main contribution comes from the
direct use of motion information to extract temporal
distortions and to model the human visual attention
(HVA). On the one hand, since the human visual sys-
tem (HVS) infers motion from the changing pattern
of light in the retinal image (Watson and Ahumada,
1985), we compute motion errors from optical flow
differences, since optical flow assumes the same changing pattern of light from one frame to the next. On the
other hand, we performed psychovisual experiments
for directly modeling the HVA instead of relying on assumptions as in (Wang and Li, 2007). We design
saliency maps based on the results of the psychovi-
sual experiments which are later used in the pooling
strategy. Considering that the proposed quality in-
dex is specifically designed for measuring temporal
distortions, we combine it with the well-known spa-
tial structural similarity index (SSIM) in order to ac-
count for both types of distortions. Results show that
the proposed quality index is competitive with current methods presented in the state of the art. Additionally, the
proposed index is much faster than other indices also
including a temporal distortion measure.
The rest of the paper is organized as follows. Sec-
tion 2 introduces background information and Section
3 describes the proposed quality index. Results are in
Section 4 and conclusions in Section 5.
2 BACKGROUND
2.1 Dense Optical Flow
Dense optical flow is one of the most popular mech-
anisms for estimating motion in video analysis with
great accuracy in several tasks such as tracking, video
quality assessment, and human speed perception,
among others (Barron et al., 1992), (Wang and Li,
2007), (Daly, 1998). Optical flow, in video analy-
sis, refers to the changes in grey values caused by
the relative motion between a video camera and the
content of the scene, for example motion of objects,
surfaces, and edges. Mathematically, optical flow as-
sumes that pixel intensities are translated from one
frame to the next (this is called brightness constancy),
i.e., I(x, y, t-1) = I(x+u, y+v, t), where I(x,y,t)
is the intensity of the pixel located at position (x,y)
and time t. Here, (u,v) is the velocity vector or
optical flow which can depend on x and y. In this
field, the Lucas-Kanade algorithm is one of the most
well-known and widely used optical flow estimation
methods because it accounts for the most desirable
features in optical flow computation (accuracy, low
AFullReferenceVideoQualityMeasurebasedonMotionDifferencesandSaliencyMapsEvaluation
715
computational cost, and good noise tolerance) (Bar-
ron et al., 1992). Therefore, the Lucas-Kanade algo-
rithm is used as optical flow estimator in the proposed
methodology as explained in Section 3.
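As a concrete illustration of the brightness constancy model above, the following minimal sketch computes a dense per-pixel flow field between two consecutive frames. It assumes Python with OpenCV; since the paper uses a pyramidal Lucas-Kanade estimator, Farneback's dense estimator is used here purely as a readily available stand-in, with the pyramid levels and iterations chosen to mirror the settings of Section 3.1.

```python
# Sketch: dense optical flow between two consecutive frames (assumes OpenCV).
# Farneback's method stands in for the pyramidal Lucas-Kanade estimator used in the paper.
import cv2

def dense_flow(frame_prev, frame_next):
    """Return per-pixel (U, V) velocity components as float32 arrays."""
    g0 = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    # Arguments: prev, next, flow, pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags; 3 levels and 3 iterations mirror Section 3.1.
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.1, 0)
    return flow[..., 0], flow[..., 1]  # U (x-component), V (y-component)
```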
2.2 Saliency Maps
Spatial regions which attract the HVS are called
salient regions. Usually, humans do not pay atten-
tion to every detail in the visual field but concentrate
on some salient regions (Marat et al., 2009). By com-
bining the salient regions of the visual field, a map of the attentional field, called a saliency map, is obtained.
Particularly, saliency maps are highly related to the
motion and contrast of objects (Wang and Li, 2007).
Usually, salient regions for videos are computed by
using spatial and temporal saliency maps. Both maps
are often inspired by recent results of either physi-
ological or psychovisual experiments (Max-Planck-
institute, 2013), (Marat et al., 2009). In this work,
saliency maps are computed based on local image en-
tropies, motion and the nonlinear mapping of these
features. Here, the nonlinear mapping is designed
based on results of psychovisual experiments. The
psychovisual experiments use alternative forced choice
tasks in which one or more stimuli are presented to
several human subjects. In such tasks, human sub-
jects are prompted to indicate whether or not they can
see the stimulus. Thereafter, the behavior of the given
psychovisual task (e.g., proportion of trials the stim-
ulus was seen) is related to some physical character-
istic of the stimulus (e.g., contrast, entropy, motion)
by using a function, termed the psychometric func-
tion. In this work we design a one-alternative forced choice (1AFC) task for designing the proposed non-
linear mapping used in the saliency map computation,
see Sections 3.2 and 3.4 for further details.
2.3 Performance of Video Quality
Indices
The performance of video quality indices is evalu-
ated by comparing the predicted quality to the hu-
man scores. The human scores, termed DMOS, are
given by subjective quality assessment of a set of
previously prepared video sequences. The predicted
quality, termed pDMOS, is the output of the objective
quality index after a nonlinear transformation. The
nonlinear transformation is performed with the pur-
pose of facilitating comparisons between models in
a common space of analysis (VQEG, 2003). Such
nonlinear transformation is defined as pDMOS = b_0 / (1 + exp(-(b_1 + b_2 q))), where b_0, ..., b_2 are constants with the best fit to the DMOS and q is the output of the objective video quality index, for instance, the output of the MOVIE algorithm.
Once the nonlinear transformation is applied, the
performance of the objective models is evaluated
by computing various metrics between DMOS and
pDMOS. Such metrics evaluate the following three
aspects: prediction accuracy, prediction monotonic-
ity and prediction consistency (VQEG, 2003). Prediction accuracy refers to the ability to predict the subjective quality score with low error; this aspect is measured by using the Pearson correlation coefficient (PCC). Prediction monotonicity is the degree to which predictions of the model agree with the magnitudes of subjective quality scores; this aspect is measured with the Spearman rank-order correlation coefficient (SROCC). Prediction consistency is the degree to which the model maintains prediction accuracy over a range of different video test sequences; this can be measured by using the root mean-squared error (RMSE).
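For illustration, a minimal sketch of this evaluation procedure is given below, assuming Python with NumPy and SciPy; the function names and the initial guess for the fit are illustrative and not part of the original protocol.

```python
# Sketch: fit the logistic DMOS mapping of Section 2.3 and compute the three
# performance metrics (PCC, SROCC, RMSE). Names are illustrative.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(q, b0, b1, b2):
    # pDMOS = b0 / (1 + exp(-(b1 + b2*q)))
    return b0 / (1.0 + np.exp(-(b1 + b2 * q)))

def evaluate_index(q, dmos):
    """q: objective scores, dmos: subjective scores (1-D NumPy arrays)."""
    b, _ = curve_fit(logistic, q, dmos, p0=[dmos.max(), 0.0, 1.0], maxfev=10000)
    pdmos = logistic(q, *b)
    pcc, _ = pearsonr(dmos, pdmos)                  # prediction accuracy
    srocc, _ = spearmanr(dmos, q)                   # prediction monotonicity (rank-based)
    rmse = np.sqrt(np.mean((dmos - pdmos) ** 2))    # prediction consistency
    return pcc, srocc, rmse
```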
3 PROPOSED QUALITY INDEX
It is well known that temporal distortions in video al-
ter motion trajectories of pixels, for example mo-
tion compensation mismatch, jitter and ghosting,
and/or introduce a false perception of motion such
as mosquito noise (Yuen and Wu, 1998), (Seshadri-
nathan and Bovik, 2010). Noteworthy is that tem-
poral distortions affecting video sequences are per-
ceived by humans as motion. By taking advantage of
the convenient relationship between motion percep-
tion and optical flow (Watson and Ahumada, 1985),
one can assume that there is available a good registra-
tion of motion, i.e., there is translation of pixel inten-
sities from one frame to the next. In such case, given
two video sequences I_r and I_p such that I_p is a processed or corrupted version of I_r, we say that both sequences exactly match when their optical flows are equal. That is, U_r = U_p and V_r = V_p, where U_r, V_r and U_p, V_p are the x and y components of the velocity vectors of the reference and processed video sequences, respectively. However, when temporal distortions appear in I_p, optical flows in both video sequences do not match and the optical flow in I_p may change. Thus, a common vector metric can help to measure optical flow differences for detecting temporal distortions. Particularly, we use the root squared error, sqrt((U_r - U_p)^2 + (V_r - V_p)^2), because it is the most common and widely used metric for comparing optical flows (Barron et al., 1992). Thus, our temporal quality index is developed by assuming optical flow equality in the absence of temporal distortions.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
716
Figure 1: Flow chart of the proposed quality indices (optical flow computation, egomotion extraction, local entropy computation, nonlinear mappings, root squared error and SSIM computation, followed by spatial and temporal pooling into q_t and q_s).
In Figure 1 the flow chart of the proposed quality index is shown.
3.1 Spatial and Temporal Quality Maps
The optical flows of the reference and processed videos, denoted by (U_r, V_r) and (U_p, V_p) and shown in Figures 2(c) and 2(d), are computed from the video sequences I_r and I_p, shown in Figures 2(a) and 2(b). For achieving a compromise between simplicity and accuracy, optical flows are computed by using the pyramidal refined implementation of the Lucas-Kanade algorithm as discussed in (Lucas and Kanade, 1981). In this paper, optical flows are computed by using 3 scales and 3 iterations. Thereafter, temporal distortions are computed pixel by pixel using the root squared error, which results in a temporal quality map, termed Q_t (see Figure 2(h)). Also, from I_r and I_p we compute the spatial quality map (Q_s) by using the SSIM as discussed in (Wang et al., 2004). An example of the spatial quality map is found in Figure 2(i).
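A minimal sketch of these two per-frame quality maps is given below, assuming Python with NumPy and scikit-image; the SSIM map comes from skimage rather than from the original SSIM implementation, and the 8-bit data range is an assumption made for illustration.

```python
# Sketch: per-pixel temporal (Q_t) and spatial (Q_s) quality maps for one
# frame pair, following Section 3.1. Inputs: flows from Section 2.1 and two
# grey-level frames. Assumes scikit-image for the SSIM map.
import numpy as np
from skimage.metrics import structural_similarity

def temporal_quality_map(Ur, Vr, Up, Vp):
    # Root squared error between reference and processed flow vectors (Q_t).
    return np.sqrt((Ur - Up) ** 2 + (Vr - Vp) ** 2)

def spatial_quality_map(frame_r, frame_p):
    # Per-pixel SSIM map (Q_s); the global mean SSIM score is discarded here.
    _, ssim_map = structural_similarity(frame_r, frame_p,
                                        data_range=255, full=True)
    return ssim_map
```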
3.2 Spatial-temporal Saliency Map
The saliency maps are computed frame by frame
by transforming visual characteristics of the video
sequences with nonlinear functions. The nonlinear
functions are thoroughly explained in Section 3.4. We
compute the saliency maps from the processed se-
quence because the artifacts may change the HVA.
For computing the spatial saliency map we replace each pixel in the processed frames with the normalized entropy of its surrounding neighborhood, termed X_S. The normalized entropy is used under the assumption that regions with high information content are more likely to attract the HVA than regions with low information content. Then, each normalized entropy is mapped with a psychometric function to obtain the spatial saliency map I_S = P_S(X_S), where P_S(X_S) is the proportion of times that the stimulus is perceived with motion given X_S (see Section 3.4). We use such a nonlinear mapping because it is an indication of the probability of seeing a moving object given its entropy. An example of the spatial saliency map is shown in Figure 2(e).
The temporal saliency map is computed by mapping the magnitude of the optical flow with a nonlinear function, i.e., I_T = P_T(Mg), where Mg = 2 f tan^-1(||(U,V)|| dx / (2d)) (see Section 3.4). Here, dx, d, f and ||(U,V)|| are the size in centimeters of each pixel, the distance from the viewer to the display, the frame rate and the magnitude of the optical flow in the processed sequence after camera motion extraction, respectively. Camera motion is extracted with the purpose of approximating the eye fixation of the HVA (Wang and Li, 2007). The camera motion or egomotion (U_g, V_g) is estimated from (U_p, V_p). Here, egomotion is estimated by using a six-parameter affine model and least squares estimation as discussed in (Yao et al., 2001). The local motion field (U,V) is obtained as (U,V) = (U_r, V_r) - (U_g, V_g). This nonlinear mapping is performed with the purpose of approximating optical flow to the results obtained by Daly. Figure 2(f) shows an example of the temporal saliency map. The final proposed spatio-temporal saliency map is obtained by multiplying both maps pixel by pixel, as in Pinson and Wolf (Pinson and Wolf, 2004): I_w = I_S I_T (see Figure 2(g)).
Figure 2: Illustrations of the proposed quality metric. (a) Reference sequence, (b) processed sequence, (c) optical flow of (a), (d) optical flow of (b), (e) spatial saliency map, (f) temporal saliency map, (g) S-T (spatio-temporal) saliency map, (h) root squared error, (i) SSIM. Optical flows of the reference and processed sequences are color coded (direction is coded by hue, length by saturation). The root squared error, SSIM and saliency maps are color coded from blue to red.
3.3 Pooling Strategy
The proposed pooling strategy is based on the use of the saliency maps explained in Section 3.2, i.e., we use weighting functions for obtaining the spatial and temporal quality indices. The spatio-temporal saliency map (I_w) is used in the pooling strategy as follows:

e_t(t) = sqrt( Σ_{x,y} I_w(x,y,t) Q_t^2(x,y,t) / Σ_{x,y} I_w(x,y,t) )   (1a)

e_s(t) = Σ_{x,y} I_w(x,y,t) Q_s(x,y,t) / Σ_{x,y} I_w(x,y,t),   (1b)

where e_s(t) and e_t(t) are the evolution across time of spatial and temporal distortions, respectively. Finally, e_t(t) and e_s(t) are pooled over time to obtain spatial and temporal quality indices describing the quality of the video sequence I_p. Here, the temporal pooling is performed by using a weighted mean of the temporal evolution of distortions (see Equations (2a) and (2b)):

q_t = sqrt( Σ_t I_w(t) e_t^2(t) / Σ_t I_w(t) )   (2a)

q_s = Σ_t I_w(t) e_s(t) / Σ_t I_w(t),   (2b)

where I_w(t) = Σ_{x,y} I_w(x,y,t). Thus, q_t and q_s give more importance to frames with highly salient regions than to frames with low salient regions, i.e., frames with objects and/or regions that are most likely to be seen. For combining the spatial and temporal indices, we use the following nonlinear mapping: pDMOS = b_0 / (1 + exp(-(b_1 + b_2 q_t + b_3 q_s))). Here, b_0, ..., b_3 are also constants with the best fit to the DMOS.
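A compact sketch of Equations (1)-(2) and the final mapping is given below, assuming Python with NumPy and that the per-frame maps Q_t, Q_s and I_w have been stacked into arrays of shape (T, H, W); the epsilon guard and the parameter vector b are illustrative.

```python
# Sketch: spatial and temporal pooling of Section 3.3.
import numpy as np

def pool(Q_t, Q_s, I_w, eps=1e-12):
    w_xy = I_w.sum(axis=(1, 2)) + eps                          # per-frame sum of I_w
    e_t = np.sqrt((I_w * Q_t ** 2).sum(axis=(1, 2)) / w_xy)    # Eq. (1a)
    e_s = (I_w * Q_s).sum(axis=(1, 2)) / w_xy                  # Eq. (1b)
    q_t = np.sqrt((w_xy * e_t ** 2).sum() / w_xy.sum())        # Eq. (2a)
    q_s = (w_xy * e_s).sum() / w_xy.sum()                      # Eq. (2b)
    return q_t, q_s

def pdmos(q_t, q_s, b):
    # pDMOS = b0 / (1 + exp(-(b1 + b2*q_t + b3*q_s))), with b fitted to the DMOS.
    return b[0] / (1.0 + np.exp(-(b[1] + b[2] * q_t + b[3] * q_s)))
```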
3.4 Psychophysical Experiments for
Nonlinear Transformations
Here, we built the nonlinear transformations by assuming that most of the objects are static or nearly static. Objects with a significant amount of motion provide high information content for the HVA and therefore an object fixation is obtained. Also, it is well known that the HVA pays more attention to regions with high contrast because it is more likely to see motion in high contrast regions than in low contrast regions (Hammett and Larsson, 2012). To build the function P_S, we ran a 1AFC task with two human subjects: one subject is naive and the other possesses knowledge in the field of VQA. We independently presented several stimuli to each subject, and the subject was prompted to indicate whether or not they could see motion in the stimulus.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
718
Figure 3: (a) Spatial and (b) temporal mapping functions, P_S and P_T, respectively. Asterisks are the proportion of times the stimulus is perceived with motion for each normalized entropy X_S. The blue line is the best fit of the functions.
The stimuli are generated by taking random blocks of 15 × 15 pixels per 200 milliseconds from the LIVE database (Seshadrinathan et al., 2010), which is enough for a human subject to discriminate whether or not they can see motion (Watson and Ahumada, 1985). For each block, the normalized entropy is computed as X_S = h_B / h_U, where h_B is the entropy of the block and h_U ≈ 7.8 bits. The assumption is that the HVA is attracted more to regions with high information content than to regions with low information content. By using the normalized entropy and the collected 1AFC data, a psychometric function can be estimated. A psychometric function is typically a sigmoid function, such as the Weibull, logistic, cumulative Gaussian, or Gumbel distribution. We chose the Weibull function because it is the function that most easily and accurately fits the obtained data. Thus, the psychometric function is defined as P_S(X_S) = 1 - exp(-(X_S/α_S)^β_S), where α_S = 0.58 and β_S = 3.16 are constants giving the best fit to the obtained data. Here, we set the parameters using the maximum likelihood method as discussed in (Kingdom and Prins, 2010). Figure 3(a) shows the obtained psychometric function with its respective real data.
To build the temporal saliency map, we use the results obtained by Daly (Daly, 1998). In that work the author found that the human eye cannot follow objects with velocities higher than 80 deg/sec; in such cases the saliency is null. Also, he demonstrated that the saliency reaches its maximum for motion values between 6 deg/sec and 30 deg/sec. We fit a nonlinear mapping function to the values obtained by Daly. We found that the following function gives good results: P_T(Mg) = (1 - exp(-(Mg/α_1)^β_1)) exp(-(Mg/α_2)^β_2), where α_1 = 3.5, α_2 = 66.5, β_1 = 3.1 and β_2 = 7. The parameters of the P_T function were empirically determined in order to fit the findings of Daly. Figure 3(b) shows the obtained function.
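For reference, the two fitted mappings can be written down directly; a minimal sketch follows, assuming Python with NumPy and using only the parameter values reported above.

```python
# Sketch: psychometric mappings of Section 3.4 with the fitted parameters
# reported in the text (alpha_S = 0.58, beta_S = 3.16; alpha_1 = 3.5,
# alpha_2 = 66.5, beta_1 = 3.1, beta_2 = 7).
import numpy as np

def P_S(x_s, alpha_s=0.58, beta_s=3.16):
    # Weibull psychometric function of the normalized entropy X_S.
    return 1.0 - np.exp(-(x_s / alpha_s) ** beta_s)

def P_T(mg, a1=3.5, a2=66.5, b1=3.1, b2=7.0):
    # Band-pass shaped mapping of motion magnitude Mg (deg/sec), fitted to Daly's data.
    return (1.0 - np.exp(-(mg / a1) ** b1)) * np.exp(-(mg / a2) ** b2)
```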
4 RESULTS AND DISCUSSION
In this paper two video quality databases are used:
the LIVE Video Quality Database (Seshadrinathan
et al., 2010) and the IVP Subjective Quality Video
Database (Zhang et al., 2011). Each database contains 10 source video sequences: the LIVE database contains 10 standard television videos (resolution of 768 × 432) and the IVP database contains 10 high definition videos (resolution of 1920 × 1088). Those source sequences are processed to obtain 150 and 128 test videos in the LIVE and IVP databases, respectively. Each test video is generated by using one of the following methods: wireless distortion, IP distortion, H.264 compression or MPEG-2 compression. Each database provides a DMOS value per sequence, obtained through subjective evaluations conducted by the respective authors.
Here, the proposed quality index is compared to
the following popular video quality indices: i) Peak
signal to noise ratio (PSNR) as a benchmark, ii) video
quality model (VQM) (Pinson and Wolf, 2004),
iii) weighted structural similarity index (wSSIM)
(Wang and Li, 2007), iv) motion-based video in-
tegrity evaluation index (MOVIE) (Seshadrinathan
and Bovik, 2010) and v) video quality assessment
by decoupling detail losses and additive impairments
(VQAD) (Li et al., 2011). Performances are com-
puted as explained in Section 2.3.
Table 1 shows the performance of the considered
video quality indices appraised for the LIVE and IVP
video quality databases. The results show that the
proposed quality index performs better than PSNR,
VQM and wSSIM and is competitive with MOVIE
and VQAD in both databases. For instance, the pro-
posed quality index performs better than VQAD in
the IVP database and slightly worse in the LIVE
database. Compared with MOVIE, the proposed quality index performs slightly worse in the LIVE database; however, MOVIE is computationally intensive, taking around 6-8 hours for 250-frame sequences at 768 × 432 and 20-24 hours at 1920 × 1088 (which is impractical for any application). Therefore, results that cannot be obtained within a practical time span, i.e., that require one hour or more, are not reported in Table 1. The rest of the methods are computationally simpler than the proposed method. However, those methods do not include a temporal distortion measure, which is crucial in video quality assessment.
The proposed quality index performs better in one database than in the other. In order to understand the difference in performance, we study the residuals between the nonlinear model and the DMOS given in the LIVE database. Figure 4 shows the residuals of the proposed quality index for the LIVE database.
AFullReferenceVideoQualityMeasurebasedonMotionDifferencesandSaliencyMapsEvaluation
719
Figure 4: Residuals of the proposed quality index in the LIVE database. µ and σ are the mean and standard error, respectively. Video sequences marked in red are those for which the proposed quality index achieves the worst performance.
Table 1: Performance of considered video quality metrics appraised for the LIVE and IVP Video Quality Databases. The computational time is computed for algorithms running in Matlab (laptop with an Intel Core i3 CPU at 2.27 GHz and 4 GB RAM) for 250 frames of size 768 × 432. + C++ implementation.

                    LIVE                       IVP                  Time
Method      RMSE    PCC     SROCC      RMSE    PCC     SROCC       [min]
PSNR        9.109   0.562   0.539      0.759   0.687   0.694        0.15
VQM         7.932   0.723   0.702      0.767   0.672   0.685        2.5
wSSIM       8.843   0.596   0.584      0.803   0.640   0.640        37
MOVIE+      6.435   0.811   0.789      -       -       -            360
VQAD        6.216   0.824   0.818      0.659   0.782   0.787        1.2
Proposed    6.723   0.791   0.783      0.576   0.839   0.843        18
The plot is divided into 10 horizontal sections (blue lines), one for each source video in the LIVE database. The videos are named according to (Seshadrinathan et al., 2010): river bed (rb), mobile and calendar (mc), rush hour (rh), blue sky (bs), park running (pr), station (st), tractor (tr), sunflower (sf), pedestrian area (pa), shields (sh). In Figure 4, an error is significant, or is considered an outlier, when it is greater than or equal to ±2σ, where σ is the standard error. The worst performance was achieved on pa, rb and sf; that is, most of the residuals of those sequences are close to the red line in Figure 4. This poor performance is mainly due to three reasons: large occluded regions (pa), large regions with similar texture (sf) and large regions without texture (rb). On the one hand, pa is a sequence in which a considerable number of people walk in the street. Some of those people are located or walk close to the camera, hiding other people. Thus, when a person or object is hidden or occluded, the motion computed by the Lucas-Kanade algorithm is not reliable. On the other hand, sf is a sequence in which highly similar textured regions are displaced due to camera motion. Since the motion estimation algorithm is based on local region searching, the optical flow can be unreliable when similar pixel values are found. The same concept applies to rb, with the difference that this sequence is homogeneous. Since the proposed quality index was designed assuming a good motion registration and the three cases present unreliable optical flow computation, the expected root squared error is also unreliable and this affects the quality index. Hence, the inconsistencies in performance are mainly due to the number of occluded areas and large regions with similar texture present in each video sequence. That means it is necessary to include a mechanism for measuring the reliability of the optical flow in order to improve the results presented in this paper. Despite this drawback, the proposed quality index performs well compared with current methods in the state of the art. Also, the proposed index has the advantage of computing temporal distortions directly, which can be used for further improvements in designing objective video quality indices.
5 CONCLUSIONS
We proposed a video quality index based on visual
attention models and optical flow errors which suc-
cessfully captures temporal distortions when a good
quality optical flow registration is available. The vi-
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
720
sual attention model was built using spatial and tem-
poral saliency maps based on psychovisual studies.
The spatial saliency map is based on novel psycho-
saliency map was built using recent findings in the
area of motion analysis. Temporal errors were com-
puted directly from optical flows by using standard
vector metrics. In general, the proposed quality index
is competitive with current methods presented in the
state of art. Additionally, the proposed index is much
faster than other indices also including a temporal dis-
tortion measure. The main drawback of the proposed
quality index is that it is necessary to assume perfect
motion registration which leads to inconsistencies of
performance within the databases tested in this work.
We showed that this behavior is mostly due to three reasons: large occluded regions, large regions with similar texture and/or large regions without texture. Despite this drawback
the proposed quality index performs well and the re-
sults presented in this work could offer a good starting
point for further development. Also, the proposed index has the advantage of using temporal information
directly instead of using weights to pool quality maps
which in general do not necessarily account for tem-
poral distortions.
Since the proposed methodology is highly depen-
dent on the quality of the optical flow registration,
the study and incorporation of a proper mechanism
for measuring optical flow reliability is proposed as
future work. Also, the comparison with other meth-
ods and the evaluation of different databases remains
as future work. We proposed a video quality index
based on visual attention models and optical flow er-
rors which successfully captures temporal distortions
when a good quality optical flow registration is avail-
able. The visual attention model was built using spa-
tial and temporal saliency maps based on psychovi-
sual studies. The spatial saliency map is based in
novel psychovisual experiments run by the authors.
The temporal saliency map was built using recent
findings in the area of motion analysis. Temporal er-
rors were computed directly from optical flows by us-
ing standard vector metrics. In general, the proposed
quality index is competitive with current methods pre-
sented in the state of art. Additionally, the proposed
index is much faster than other indices also including
a temporal distortion measure. The main drawback
of the proposed quality index is that it is necessary
to assume perfect motion registration which leads to
inconsistencies of performance within the databases
tested in this work. We prove that this behavior is
mostly due to three reasons: large occluded regions,
large regions with same texture and/or without tex-
ture. Despite this drawback the proposed quality in-
dex performs well and the results presented in this
work could offer a good starting point for further de-
velopment. Also, the proposed index have the advan-
tage of using temporal information directly instead of
using weights to pool quality maps which in general
do not necessarily account for temporal distortions.
Since the proposed methodology is highly depen-
dent on the quality of the optical flow registration, the
study and incorporation of a proper mechanism for
measuring optical flow reliability is proposed as fu-
ture work. Also, the comparison with other methods
and the evaluation of different databases remains as
future work.
ACKNOWLEDGEMENTS
This work was financially supported by the iMinds
ICON Telesurgery project, a project co-funded by
iMinds, a research institute founded by the Flem-
ish Government. Companies and organizations in-
volved in the project are BARCO, Unilabs Teleradi-
ology BVBA, and CandiT-Media BVBA, with project
support of IWT. Part of the work has been performed
in the project PANORAMA, co-funded by grants
from Belgium, Italy, France, the Netherlands, and the
United Kingdom, and the ENIAC Joint Undertaking.
REFERENCES
Barron, J., Fleet, D., Beauchemin, S., and Burkitt, T. A.
(1992). Performance of optical flow techniques. In
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition, pages 236–242.
Chikkerur, S., Sundaram, V., Reisslein, M., and Karam, L.
(2011). Objective video quality assessment methods:
A classification, review, and performance comparison.
IEEE Transactions on broadcasting, 57(2):165–182.
Daly, S. (1998). Engineering observations from spatiove-
locity and spatiotemporal visual models. In Human
Vision and Electronic Imaging III.
Hammett, S. and Larsson, J. (2012). The effect of con-
trast on perceived speed and flicker. Journal of Vision,
12(12):1–8.
ITU-R-Recommendation-BT.500-11 (1998). Methodology
for the subjective assessment of the quality of televi-
sion pictures. ITU, Geneva, Switzerland.
Kingdom, F. and Prins, N. (2010). Psychophysics A practi-
cal introduction. Elsevier, London, 1st edition.
Li, S., Ma, L., and Ngan, K. (2011). Video quality assess-
ment by decoupling additive impairments and detail
losses. In Third International Workshop on Quality of
Multimedia Experience (QoMEX).
AFullReferenceVideoQualityMeasurebasedonMotionDifferencesandSaliencyMapsEvaluation
721
Lucas, B. and Kanade, T. (1981). An iterative image regis-
tration technique with an application to stereo vision.
In Imaging Understanding Workshop.
Marat, S., Phuoc, T., Granjon, L., Guyader, N., Pellerin, D., and Guérin-Dugué (2009). Modelling spatio-temporal saliency to predict gaze direction for short videos. International Journal of Computer Vision, 82(3):231–243.
Max-Planck-Institute for Biological Cybernetics (2013). Brain research on non-human primates. http://hirnforschung.kyb.mpg.de/en/homepage.html.
Moorthy, A. and Bovik, A. (2010). Efficient video quality
assessment along temporal trajectories. IEEE trans-
actions on circuits and systems for video technology,
20(3):1653–1658.
Pinson, M. and Wolf, S. (2004). A new standardized method
for objectively measuring video quality. IEEE trans-
actions on broadcasting, 50(3):312–322.
Seshadrinathan, K. and Bovik, A. (2010). Motion tuned
spatio-temporal quality assessment of natural videos.
IEEE transactions on image processing, 19(11):335–
350.
Seshadrinathan, K., Soundararajan, R., Bovik, A., and Cor-
mack, L. (2010). Study of subjective and objective
quality assessment of video. IEEE transactions on im-
age processing, 19(6):1427–1441.
VQEG (Video Quality Experts Group) (2003). The validation of objective models of video quality assessment, Phase II. http://www.its.bldrdoc.gov/vqeg/vqeg-home.aspx.
Wang, Z., Bovik, A., Sheikh, H., and Simoncelli, E. (2004).
Image quality assessment: From error visibility to
structural similarity. IEEE transactions on image pro-
cessing, 13(4):600–612.
Wang, Z. and Li, Q. (2007). Video quality assessment using
a statistical model of human visual speed perception.
Journal of optical society of America, 24(12):B61–
B69.
Watson, A. and Ahumada, A. (1985). Model of human
visual-motion sensing. Journal of optical society of
America, 2(2):322–342.
Yao, P., Evans, G., and Calway, A. (2001). Face tracking
and pose estimation using affine motion parameters.
In Proceedings of the 12th Scandinavian Conference
on Image Analysis.
Yuen, M. and Wu, H. (1998). A survey of hybrid MC/DPCM/DCT video coding distortions. Signal Processing, 70:247–278.
Zhang, F., Li, S., Ma, L., Wong, Y., and Ngan, K. (2011). IVP subjective quality video database. http://ivp.ee.cuhk.edu.hk/research/database/subjective/index.shtml.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
722