
 
 
CGS layer. Inter-layer prediction may be employed 
to increase compression efficiency of CGS. 
However, the number of available bit rates is 
limited to the number of selected QPs (CGS layers), 
and more layers generally imply worse coding 
efficiency. To increase the flexibility of bit stream 
adaptation and to improve the coding efficiency, 
MGS additionally provides the capability to 
distribute the CGS enhancement layer transform 
coefficients into more layers. Grouping information 
of the transform coefficients is signaled in the slice 
headers, and thus, a CGS layer that corresponds to a 
certain QP can be partitioned into several MGS 
layers and separately packetized. Pulipaka et al. 
(2010) conducted statistical analyses of SVC, 
including its rate-distortion and 
rate variability-distortion performance. Görkemli et al. (2010) 
compared MGS fragmentation configurations of 
SVC, including the slice mode and extraction 
methods, in terms of rate-distortion performance. 
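As an illustration of the MGS coefficient grouping described above, the following minimal sketch splits the scan-ordered transform coefficients of one block into several MGS layers and merges the received ones. The function names and the group sizes (3, 3, 10) are hypothetical and chosen only for illustration; they do not reproduce JSVM syntax or any configuration used in this paper.

```python
def split_mgs(coeffs, group_sizes):
    """Distribute scan-ordered coefficients of one block into MGS layers.

    `group_sizes` is a hypothetical grouping (one size per MGS layer) and
    must sum to the number of coefficients.
    """
    if sum(group_sizes) != len(coeffs):
        raise ValueError("group sizes must cover all coefficients")
    layers, start = [], 0
    for size in group_sizes:
        layer = [0] * len(coeffs)
        # This layer carries only its own range of coefficient positions.
        layer[start:start + size] = coeffs[start:start + size]
        layers.append(layer)
        start += size
    return layers


def merge_mgs(layers):
    """Rebuild the coefficients from whichever MGS layers were received."""
    return [sum(values) for values in zip(*layers)]


coeffs = list(range(1, 17))            # 16 coefficients of a 4x4 block
layers = split_mgs(coeffs, [3, 3, 10])
partial = merge_mgs(layers[:1])        # only the first MGS layer received
```

Dropping trailing MGS layers zeroes the corresponding coefficients, which is how one CGS layer can yield several intermediate bit rates.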
In this paper, we test various CGS/MGS options 
for H.264 SVC using the official reference software 
JSVM (Joint Scalable Video Model) (JSVM 
Software Manual, 2010/2011). In the course of 
these experiments, we observed unusual rate-distortion 
behavior for some configurations of SVC options. 
It is generally believed that an 
additional quality layer (more received bits) should 
always improve the quality for SVC. However, we 
find that adding an MGS sub-layer can in some cases 
decrease the PSNR instead. We therefore 
conduct further tests to explore this anomaly. The rest 
of this paper is organized as follows. In Section 2, 
we briefly review the H.264 SVC techniques, 
with particular attention to CGS and MGS. 
Section 3 presents experiments on H.264 quality 
scalability with various JSVM CGS/MGS 
configurations, which also demonstrate the 
aforementioned anomaly. Section 4 gives some 
discussion and future work. 
2 H.264 SCALABLE VIDEO CODING
H.264 is structured in two layers: the video coding 
layer (VCL) and the network abstraction layer (NAL). 
Based on the core coding tools of the non-scalable 
H.264 specification, the SVC extension adds new 
syntax for scalability (ITU-T Rec. H.264, 2009). The 
representation of the video source with a particular 
spatio-temporal resolution and fidelity is referred to 
as an SVC layer. Each scalable layer is identified by 
a layer identifier. In JSVM, three classes of 
identifiers, T, D, and Q, are used to indicate the 
layers of temporal scalability, spatial scalability, and 
quality scalability, respectively. A constrained 
decoder can retrieve the necessary NAL units from 
an H.264 scalable bit stream to obtain a video of 
reduced frame rate, resolution, or fidelity. The 
coding layer with identifier equal to 0 is called the 
base layer and is coded in the same way as a 
non-scalable H.264 video sequence. To increase coding 
efficiency, an enhancement layer may be encoded 
using data from another layer with a smaller 
layer identifier.  
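The retrieval of NAL units by layer identifier can be sketched as a simple filter. This is a simplified illustration, not the JSVM extractor: the `NalUnit` class and `extract_substream` function are hypothetical, and the sketch assumes that every NAL unit whose D, T, and Q identifiers do not exceed the requested operating point is needed, ignoring finer inter-layer dependency rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    d: int                 # spatial layer identifier (D)
    t: int                 # temporal layer identifier (T)
    q: int                 # quality layer identifier (Q)
    payload: bytes = b""


def extract_substream(nal_units, max_d, max_t, max_q):
    """Keep the NAL units at or below the requested (D, T, Q) operating point."""
    return [n for n in nal_units
            if n.d <= max_d and n.t <= max_t and n.q <= max_q]


stream = [NalUnit(0, 0, 0), NalUnit(0, 1, 0), NalUnit(0, 0, 1),
          NalUnit(1, 0, 0), NalUnit(0, 2, 1)]
# Base spatial layer, up to temporal layer 1, base quality only.
reduced = extract_substream(stream, max_d=0, max_t=1, max_q=0)
```

Requesting (0, 0, 0) leaves only the base layer, which a non-scalable H.264 decoder can handle.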
Temporal scalability provides coded bit streams 
of different frame rates. The temporal scalability of 
H.264 SVC is typically structured in hierarchical B-
pictures. In this case, each added temporal 
enhancement layer doubles the frame rate. These 
dyadic enhancement layer pictures are coded as B-
pictures that use the nearest temporally available 
pictures as reference pictures. The set of pictures 
from one temporal base layer to the next is referred 
to as a group of pictures (GOP). Experiments have 
shown that a GOP size of 8 or 16 usually 
achieves the best rate-distortion performance 
(Schwarz and Marpe, 2007). Note that the GOP size 
also determines the total number of temporal layers 
(number of temporal layers = $\log_2(\mathrm{GOPsize}) + 1$). 
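The relation between GOP size and temporal layers can be checked with a short sketch. It assumes a dyadic hierarchy with a power-of-two GOP size; the helper names `num_temporal_layers` and `temporal_id` are hypothetical and do not come from JSVM.

```python
def num_temporal_layers(gop_size: int) -> int:
    """Number of temporal layers in a dyadic hierarchy: log2(GOP size) + 1."""
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("GOP size must be a positive power of two")
    return gop_size.bit_length()  # equals log2(gop_size) + 1 for powers of two


def temporal_id(index: int, gop_size: int) -> int:
    """Temporal layer T of the picture at display position `index`."""
    top = num_temporal_layers(gop_size) - 1
    pos = index % gop_size
    if pos == 0:
        return 0  # picture of the temporal base layer
    # Each factor of two dividing `pos` places the picture one layer lower.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return top - trailing_zeros


layers = [temporal_id(i, 8) for i in range(9)]
```

For a GOP size of 8 this gives four temporal layers, and keeping only pictures with T up to 0, 1, 2, or 3 retains 1, 2, 4, or 8 pictures per GOP, so each added layer doubles the frame rate.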
Each layer of H.264 spatial scalability 
corresponds to a specific spatial resolution. In 
addition to the basic coding tools of non-scalable 
H.264, each spatial enhancement layer may employ 
so-called inter-layer prediction, which exploits 
the correlation with the lower layer (resolution). 
There are three prediction modes of inter-layer 
coding: inter-layer intra prediction, inter-layer 
motion prediction, and inter-layer residual prediction. 
Accordingly, the up-sampled reconstructed intra 
signal, the macroblock partitioning and the 
associated motion vectors, or the up-sampled 
residual derived from the co-located blocks in the 
reference layer, are used as prediction signals. 
Inter-layer prediction competes with intra-layer 
temporal prediction in determining the best 
prediction mode. 
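The competition among prediction modes can be pictured as choosing the candidate with the lowest cost. The sketch below assumes a Lagrangian rate-distortion cost J = D + λR, a common encoder criterion that is not stated in the text above; the function name, the mode labels, and the distortion and rate values are all hypothetical.

```python
def choose_prediction_mode(candidates, lam):
    """Return the mode with the lowest assumed cost J = D + lam * R.

    `candidates` maps a mode name to a (distortion, rate_in_bits) pair.
    """
    return min(candidates,
               key=lambda mode: candidates[mode][0] + lam * candidates[mode][1])


# Made-up (distortion, rate) values for one macroblock, for illustration only.
candidates = {
    "intra_layer_temporal": (100.0, 40),
    "inter_layer_intra": (90.0, 70),
    "inter_layer_motion": (95.0, 30),
    "inter_layer_residual": (98.0, 35),
}
best = choose_prediction_mode(candidates, lam=0.5)
```

With these values the motion-based inter-layer mode wins at λ = 0.5, while at λ = 0 the choice falls to the mode with the lowest distortion.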
Quality scalable layers, which are the main 
concern of this paper, have identical spatio-temporal 
resolution but different fidelity levels. H.264 offers 
two options for quality scalability: CGS 
(coarse-grain quality scalable coding) and MGS 
(medium-grain quality scalability). An enhancement layer of 
CGS is obtained by requantizing the (residual) 
texture signal with a smaller quantization step size 
(quantization parameter, QP). CGS incorporates the 