termed quality maps. To compute a single quality score from these quality maps, a spatiotemporal weighted mean is used, with weighting factors derived from a Bayesian optimal observer hypothesis. MOVIE uses a Gabor filter bank designed according to physiological findings to mimic the response of the visual system. Video quality is evaluated from two components: spatial and temporal distortions. Spatial distortions are computed as squared differences between Gabor coefficients of the reference and processed sequences. Temporal distortions are obtained from the mean squared error between the reference and processed sequences along motion trajectories computed over the reference video.
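As a concrete illustration of the spatial component, the following sketch computes a distortion map as the sum of squared Gabor-coefficient differences over a small filter bank. It is a minimal approximation under assumed parameters (kernel size, frequencies, orientations), not MOVIE's actual filter bank:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, sigma=4.0, size=21):
    # Complex Gabor kernel at spatial frequency `freq` (cycles/pixel)
    # and orientation `theta`; all parameters are illustrative.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.exp(2j * np.pi * freq * xr)

def spatial_distortion_map(ref_frame, dist_frame, freqs=(0.1, 0.2), n_orient=4):
    # Sum of squared Gabor-coefficient differences between the reference
    # and processed frames over a small (toy) filter bank.
    dmap = np.zeros(ref_frame.shape, dtype=float)
    for f in freqs:
        for k in range(n_orient):
            g = gabor_kernel(f, k * np.pi / n_orient)
            c_ref = fftconvolve(ref_frame, g, mode="same")
            c_dist = fftconvolve(dist_frame, g, mode="same")
            dmap += np.abs(c_ref - c_dist) ** 2
    return dmap
```

The resulting per-pixel map can then be pooled, e.g., by a weighted mean, into a single spatial quality score.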
Although many video quality assessment algorithms have been proposed, many of them do not explicitly account for the temporal artifacts that occur in video sequences. For instance, in VQM, MC-SSIM, VQDM, and wSSIM, motion information is used only to design weights that pool quality maps into a single quality score for the video. However, weights based on temporal information do not necessarily account for temporal distortions (Seshadrinathan and Bovik, 2010). Despite the direct use of motion in MOVIE, the computational complexity of the algorithm makes practical implementation difficult, as it relies on 3-D optical flow and filter bank computation (Moorthy and Bovik, 2010).
Since motion is critical for measuring quality, video quality metrics must account for both temporal distortions (motion compensation mismatch, jitter, ghosting, mosquito noise, etc.) and spatial distortions (blocking, blurring, edge distortion, etc.) (Yuen and Wu, 1998). Since the field of still image quality assessment has matured, spatial distortions are usually well captured by current quality metrics: image quality metrics achieve correlations above 0.9 between subjective and objective scores on most databases (Seshadrinathan and Bovik, 2010). However, temporal distortions are not well estimated by existing methods (Seshadrinathan and Bovik, 2010). In addition, current methods use motion information only to compute weights and/or extract it from filter bank responses, which, as stated previously, is generally inaccurate for capturing temporal distortions (Wang and Li, 2007; Seshadrinathan and Bovik, 2010; Moorthy and Bovik, 2010).
Since most existing VQA algorithms compute motion information indirectly, it is necessary to fully investigate the contribution of motion to human perception of quality. Therefore, we believe that temporal distortions, or errors due to motion, should be computed directly from the local motion field rather than with the methods listed above.
In this paper, we propose a POM-based quality index whose main contribution is the direct use of motion information both to extract temporal distortions and to model human visual attention (HVA). On the one hand, since the human visual system (HVS) infers motion from the changing pattern of light in the retinal image (Watson and Ahumada, 1985), we compute motion errors from optical flow differences, as optical flow models this same changing pattern of light from one frame to the next. On the other hand, we performed psychovisual experiments to model the HVA directly instead of relying on assumptions as in (Wang and Li, 2007). From the results of these experiments we design saliency maps, which are later used in the pooling strategy. Since the proposed quality index is specifically designed to measure temporal distortions, we combine it with the well-known spatial structural similarity index (SSIM) to account for both types of distortions. Results show that the proposed quality index is competitive with current state-of-the-art methods. Additionally, the proposed index is much faster than other indices that also include a temporal distortion measure.
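To make the combination concrete, the sketch below pools a flow-difference temporal error with SSIM. The helper names, the weighting parameter alpha, and the 1/(1 + error) mapping are illustrative assumptions, not the exact index developed in Section 3:

```python
import numpy as np
from skimage.metrics import structural_similarity

def temporal_error_map(flow_ref, flow_dist):
    # Per-pixel magnitude of the optical-flow difference between the
    # reference and the processed sequence (each flow is an H x W x 2 array).
    return np.linalg.norm(flow_ref - flow_dist, axis=-1)

def combined_quality(ref_frame, dist_frame, flow_ref, flow_dist,
                     saliency=None, alpha=0.5):
    # Spatial term: standard SSIM between the two frames.
    s = structural_similarity(
        ref_frame, dist_frame,
        data_range=float(ref_frame.max() - ref_frame.min()))
    # Temporal term: saliency-weighted mean of the flow-difference magnitude.
    t = temporal_error_map(flow_ref, flow_dist)
    w = saliency if saliency is not None else np.ones_like(t)
    t_pooled = float((w * t).sum() / w.sum())
    # The convex combination and the 1/(1 + error) mapping below are
    # illustrative choices, not the paper's pooling strategy.
    return alpha * s + (1.0 - alpha) / (1.0 + t_pooled)
```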
The rest of the paper is organized as follows. Section 2 introduces background information and Section 3 describes the proposed quality index. Results are in Section 4 and conclusions in Section 5.
2 BACKGROUND
2.1 Dense Optical Flow
Dense optical flow is one of the most popular mechanisms for estimating motion in video analysis, achieving high accuracy in tasks such as tracking, video quality assessment, and human speed perception, among others (Barron et al., 1992; Wang and Li, 2007; Daly, 1998). In video analysis, optical flow refers to the changes in grey values caused by the relative motion between a video camera and the content of the scene, for example the motion of objects, surfaces, and edges. Mathematically, optical flow assumes that pixel intensities are translated from one frame to the next (the brightness constancy assumption), i.e., I(x,y,t − 1) = I(x + u,y + v,t), where I(x,y,t) is the intensity of the pixel located at position (x,y) and time t. Here, (u,v) is the velocity vector, or optical flow, which can depend on x and y.
In this field, the Lucas-Kanade algorithm is one of the most well-known and widely used optical flow estimation methods because it offers the most desirable properties for optical flow computation (accuracy and low computational cost).
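For illustration, a minimal dense Lucas-Kanade estimator can be written directly from the brightness constancy constraint: linearizing I(x,y,t − 1) = I(x + u,y + v,t) gives Ix·u + Iy·v + It ≈ 0, and the algorithm solves this equation in the least-squares sense over a local window at every pixel. The sketch below omits refinements such as pyramidal processing and reliability (eigenvalue) checks; the window size and the singularity guard are assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lucas_kanade_dense(I0, I1, win=7, eps=1e-6):
    """Minimal dense Lucas-Kanade flow between two grayscale frames.

    At every pixel, solves the 2x2 normal equations of the linearized
    brightness constancy constraint Ix*u + Iy*v + It = 0, accumulated
    over a win x win neighborhood. Didactic sketch only.
    """
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Iy, Ix = np.gradient(I0)      # spatial derivatives (rows = y, cols = x)
    It = I1 - I0                  # temporal derivative

    # Windowed means of the structure-tensor entries; scaling both sides
    # of the normal equations by 1/win^2 leaves (u, v) unchanged.
    def w(a):
        return uniform_filter(a, size=win, mode="nearest")

    Sxx, Syy, Sxy = w(Ix * Ix), w(Iy * Iy), w(Ix * Iy)
    Sxt, Syt = w(Ix * It), w(Iy * It)

    det = Sxx * Syy - Sxy * Sxy
    det = np.where(np.abs(det) < eps, eps, det)  # guard near-singular windows
    u = (-Syy * Sxt + Sxy * Syt) / det
    v = (Sxy * Sxt - Sxx * Syt) / det
    return u, v                   # per-pixel flow components
```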