CGS layer. Inter-layer prediction may be employed
to increase compression efficiency of CGS.
However, the number of available bit rates is
restricted to the number of selected QPs (CGS layers)
and more layers generally imply worse coding
efficiency. To increase the flexibility of bit stream
adaptation and to improve the coding efficiency,
MGS additionally provides the capability to
distribute the CGS enhancement layer transform
coefficients into more layers. Grouping information
of the transform coefficients is signaled in the slice
headers, and thus, a CGS layer that corresponds to a
certain QP can be partitioned into several MGS
layers and separately packetized. Pulipaka et al.
(2010) conducted statistical analyses of SVC,
including its rate-distortion and rate
variability-distortion performance. Görkemli et al. (2010)
compared MGS fragmentation configurations of
SVC, including the slice mode and extraction
methods, for their rate-distortion performance.
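The MGS coefficient partitioning described above can be illustrated with a minimal sketch. It assumes that the transform coefficients of one block are held in zigzag scan order and that each MGS sub-layer carries one contiguous range of scan indices. The function names and the group sizes (3, 3, 10) are hypothetical and are not taken from JSVM.

```python
from typing import List


def split_coefficients(zigzag_coeffs: List[int], group_sizes: List[int]) -> List[List[int]]:
    """Split one block's zigzag-ordered coefficients into MGS sub-layers.

    Each sub-layer keeps the coefficients inside its own scan-index range
    and zeros elsewhere, so summing all sub-layers restores the block.
    """
    if sum(group_sizes) != len(zigzag_coeffs):
        raise ValueError("group sizes must cover every coefficient exactly once")
    layers = []
    start = 0
    for size in group_sizes:
        end = start + size
        layer = [c if start <= i < end else 0 for i, c in enumerate(zigzag_coeffs)]
        layers.append(layer)
        start = end
    return layers


def reconstruct(layers: List[List[int]], received: int) -> List[int]:
    """Sum the first `received` sub-layers, as a decoder that got only those would."""
    n = len(layers[0])
    return [sum(layer[i] for layer in layers[:received]) for i in range(n)]


# Hypothetical quantized 4x4 block (16 coefficients) in zigzag order.
coeffs = [9, -4, 3, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
mgs_layers = split_coefficients(coeffs, [3, 3, 10])
```

Receiving all three sub-layers restores the full CGS enhancement-layer coefficients, while receiving only the first keeps the three lowest-frequency coefficients.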
In this paper, we test various CGS/MGS options
for H.264 SVC using the official reference software
JSVM (Joint Scalable Video Model) (JSVM
Software Manual, 2010/2011). In the course of these
experiments, we discovered unusual rate-distortion
behavior for some configurations of SVC options.
It is generally believed that an
additional quality layer (more received bits) should
always improve the quality for SVC. However, we
find that adding an MGS sub-layer can, in some
cases, decrease the PSNR instead. We therefore
conduct further tests to explore this anomaly. The rest
of this paper is organized as follows. In Section 2,
we briefly review the H.264 SVC techniques,
with particular attention to CGS and MGS.
Experiments on H.264 quality scalability with
various JSVM CGS/MGS configurations are given
in Section 3, which also demonstrates the
aforementioned oddity. Some discussion and future
work are given in Section 4.
2 H.264 SCALABLE VIDEO CODING
H.264 is structured in two layers: the video coding
layer (VCL) and the network abstraction layer (NAL).
Based on the core coding tools of the non-scalable
H.264 specification, the SVC extension adds new
syntax for scalability (ITU-T Rec. H.264, 2009). The
representation of the video source with a particular
spatio-temporal resolution and fidelity is referred to
as an SVC layer. Each scalable layer is identified by
a layer identifier. In JSVM, three classes of
identifiers, T, D, and Q, are used to indicate the
layers of temporal scalability, spatial scalability, and
quality scalability, respectively. A constrained
decoder can retrieve the necessary NAL units from
an H.264 scalable bit stream to obtain a video of
reduced frame rate, resolution, or fidelity. The first
coding layer with identifier equal to 0 is called the
base layer, which is coded in the same way as non-
scalable H.264 image sequences. To increase coding
efficiency, an enhancement layer may be coded using
data from another layer with a smaller
layer identifier.
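As a rough illustration of how a constrained decoder might select NAL units by their (D, T, Q) identifiers, the following sketch keeps every unit whose identifiers do not exceed a target operating point. The `NalUnit` class and the `extract_substream` function are hypothetical, and the filter is a simplification: it ignores the detailed dependency rules of the standard and of the JSVM extractor.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class NalUnit:
    d: int  # spatial layer identifier (D)
    t: int  # temporal layer identifier (T)
    q: int  # quality layer identifier (Q)
    payload: bytes = b""


def extract_substream(units: Iterable[NalUnit], max_d: int, max_t: int, max_q: int) -> List[NalUnit]:
    """Keep the NAL units at or below the requested (D, T, Q) operating point."""
    return [u for u in units if u.d <= max_d and u.t <= max_t and u.q <= max_q]


# Hypothetical stream: one spatial layer, two temporal layers, three quality layers.
stream = [NalUnit(0, t, q) for t in range(2) for q in range(3)]
base_only = extract_substream(stream, max_d=0, max_t=0, max_q=0)
reduced_fidelity = extract_substream(stream, max_d=0, max_t=1, max_q=1)
```

Here `base_only` holds the single base-layer unit, and `reduced_fidelity` keeps the full frame rate while dropping the highest quality layer.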
Temporal scalability provides coded bit streams
of different frame rates. The temporal scalability of
H.264 SVC is typically structured in hierarchical B-
pictures. In this case, each added temporal
enhancement layer doubles the frame rate. These
dyadic enhancement layer pictures are coded as B-
pictures that use the nearest temporally available
pictures as reference pictures. The set of pictures
from one temporal base layer to the next is referred
to as a group of pictures (GOP). Experiments show
that a GOP size of 8 or 16 usually
achieves the best rate-distortion performance
(Schwarz and Marpe, 2007). Note that the GOP size
also determines the total number of temporal layers
(number of temporal layers = $\log_2(\text{GOP size}) + 1$).
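The relation between GOP size and temporal layers can be checked with a short sketch. It assumes a dyadic hierarchical B-picture structure with a power-of-two GOP size, in which the pictures at GOP boundaries form temporal layer 0. The function names are illustrative only.

```python
def num_temporal_layers(gop_size: int) -> int:
    """Return log2(gop_size) + 1 for a power-of-two GOP size."""
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("GOP size must be a power of two")
    # For a power of two, bit_length() equals log2(gop_size) + 1.
    return gop_size.bit_length()


def temporal_id(picture_index: int, gop_size: int) -> int:
    """Temporal layer of a picture in display order for a dyadic hierarchy."""
    top_layer = num_temporal_layers(gop_size) - 1
    pos = picture_index % gop_size
    if pos == 0:
        return 0  # picture at a GOP boundary (temporal base layer)
    trailing_zeros = (pos & -pos).bit_length() - 1
    return top_layer - trailing_zeros


# Temporal identifiers for pictures 0..8 with a GOP size of 8.
ids = [temporal_id(i, 8) for i in range(9)]
```

Counting the pictures per GOP with identifier at most k gives 1, 2, 4, and 8 for k = 0 to 3, which matches the statement that each added temporal enhancement layer doubles the frame rate.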
Each layer of H.264 spatial scalability
corresponds to a specific spatial resolution. In
addition to the basic coding tools of non-scalable
H.264, each spatial enhancement layer may employ
so-called inter-layer prediction, which exploits
the correlation with the lower-resolution layer.
There are three prediction modes of inter-layer
coding: inter-layer intra prediction, inter-layer
motion prediction, and inter-layer residual prediction.
Accordingly, the up-sampled reconstructed intra
signal, the macroblock partitioning and the
associated motion vectors, or the up-sampled
residual derived from the co-located blocks in the
reference layer are used as prediction signals.
Inter-layer prediction competes with intra-layer
temporal prediction in determining the best
prediction mode.
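The competition between prediction modes can be sketched as a cost comparison. The text does not state the criterion, so this example assumes a Lagrangian rate-distortion cost J = D + lambda * R, which is common in H.264 encoders. The mode names, distortion values, and bit counts are hypothetical.

```python
from typing import Dict, Tuple


def best_mode(candidates: Dict[str, Tuple[float, float]], lam: float) -> str:
    """Return the mode with the smallest cost J = D + lam * R.

    `candidates` maps a mode name to (distortion, rate_in_bits).
    """
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])


# Hypothetical per-macroblock measurements: (distortion, bits).
modes = {
    "temporal": (100.0, 40),
    "inter_layer_intra": (80.0, 70),
    "inter_layer_residual": (90.0, 45),
}
choice_balanced = best_mode(modes, lam=1.0)
choice_low_lambda = best_mode(modes, lam=0.1)
choice_high_lambda = best_mode(modes, lam=5.0)
```

With these numbers, a small lambda favors the mode with the lowest distortion and a large lambda favors the mode that costs the fewest bits.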
Quality scalable layers, which are the main
concern of this paper, have identical spatio-temporal
resolution but different fidelity levels. H.264 offers
two options for quality scalability: CGS (coarse-grain
quality scalability) and MGS (medium-grain
quality scalability). An enhancement layer of
CGS is obtained by requantizing the (residual)
texture signal with a smaller quantization step size
(quantization parameter, QP). CGS incorporates the
WHAT ARE GOOD CGS/MGS CONFIGURATIONS FOR H.264 QUALITY SCALABLE CODING?