IMPACT OF H.264 ADVANCED VIDEO CODING INTER-FRAME

BLOCK SIZES ON VIDEO QUALITY

Harilaos Koumaras

, Michail-Alexandros Kourtis

, Drakoulis Martakos

and Christian Timmerer

NCSR Demokritos, Institute of Informatics and Telecommunications, Aghia Paraskevi, Greece

Department of Informatics, University of Athens, Athens, Greece

Multimedia Communication, Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria

Keywords: Video Quality, AVC, SSIM, Inter-frame Prediction, Block Size.

Abstract: In this paper, we present a perceptual-based encoding benchmarking of the H.264 Advanced Video Coding

(AVC) inter-frame prediction variable block sizes for various spatial and temporal contents. This paper in

order to quantify the impact on the video quality of the AVC inter-frame variable block sizes and the

responsible prediction algorithm has disabled the motion estimation mechanism of the encoder and

manually each block size is selected. Thus each time only one available block size out of the total seven is

available to be searched for each MB and it is possible to examine the video quality impact of each block

size independently to the remaining ones. The scope of this paper is to study if the use of sophisticated

predictions algorithms and variable block sizes enhance the perceived quality of the encoded video signal or

if there is not any significant quality degradation when the option of variable size is disabled.

1 INTRODUCTION

Among the various video encoding standards,

Advanced Video Coding (AVC)/H.264 has become

the most popular and widely used in many

multimedia applications and services, ranging from

digital TV, mobile video, video streaming to high-

definition media (ISO/IEC, 2006). AVC has been

jointly developed by the ITU-T Video Coding

Experts Group (VCEG) and the ISO/IEC MPEG

and is currently considered as the state-of-the-art

video coding standard since it is able to save up to

50% in bandwidth consumption while maintaining

similar quality levels as compared to existing

standards.

This enhancement of AVC in the encoding

performance is the result of many new features.

Although as a block-based motion-compensated

predictive coder, AVC is similar to its prior

standards in the general framework, however it

contains significant improvements, such as variable

block-size motion estimation, ¼-pel motion

accuracy, allows the use of multiple reference

frames, intra-frame prediction, in-loop deblocking

filter and context-adaptive arithmetic coding.

In the inter-frame prediction mechanism, the

most important improvement is the use of variable

block sizes (shown in Fig. 1) that can be chosen

dynamically for each motion-compensated

macroblock in the AVC inter-frame prediction

mechanisms in comparison to previous video

encoding standards. Partitions with luminance block

sizes of 16×16, 16×8, 8×16, or 8×8 samples, called

the macroblock types or M types (Sullivan and

Wiegand, 2005) are mainly used for low dynamic

video and homogeneous contents, while smaller

blocks intend to better characterize the motion

behavior of high dynamic and heterogeneous

contents. The 8×8 block size in AVC consists of

additional syntax elements for its further division

into smaller blocks of 8×4, 4×8, or 4×4 luma

samples, called the sub-macroblock types or 8×8

types. There is also a special case of the 16×16

block, which is called SKIP mode, where the best

reference frame, motion vector and transform

coefficients are identical to the predicted values.

The innovative use of variable block size in the

inter-frame prediction mechanisms of AVC reduces

the prediction error and enhances the encoding

efficiency due to the higher precision of the motion

vector representation.

118

Koumaras H., Kourtis M., Martakos D. and Timmerer C..

IMPACT OF H.264 ADVANCED VIDEO CODING INTER-FRAME BLOCK SIZES ON VIDEO QUALITY.

DOI: 10.5220/0003846401180122

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 118-122

ISBN: 978-989-8565-03-7

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 1: AVC variable block sizes.

In this paper, we benchmark the video quality

impact of each AVC/H.264 inter-frame prediction

block size for various spatial and temporal contents.

This study follows a discrete methodology, meaning

that the testing and evaluation of each block size is

performed independently and not in parallel with the

remaining block sizes. For the purpose of this paper,

in the motion estimation for AVC, there is each time

only one available block size out of the total seven

to be searched for each MB. By this way this paper

researches if the sophisticated inter-frame

predictions algorithms and variable in size blocks

have an impact on the deduced video quality or if

the expected perceptual enhancement is negligible.

If the perceptual performance of the variable block

sizes and motion compensation algorithms is very

low then it means that the required high processing

power for the execution of such algorithms in the

encoding process is practically wasted without

significant perceptual outcome.

The remainder of this paper is organized as

follows. Section 2 performs a brief description of

the inter-frame prediction, Section 3 describes the

video quality metric that used in this paper, Section

4 presents the test signals of the experimental

section, Section 5 discusses the evaluation results,

and, finally, Section 6 concludes this paper.

2 INTER-FRAME PREDICTION

For completeness of the paper, this section provides

some basic background information for the inter-

frame and motion compensation algorithm that is

implemented in the AVC reference encoder.

Motion compensation is an algorithmic

technique employed in the encoding of video data

for video compression, which describes a frame in

terms of the transformation of a reference frame to

the current picture. Variable block-size motion

compensation is the use of block motion

compensation with the ability for the encoder to

dynamically select the size of the used blocks. Thus,

the use of larger blocks can reduce the number of

bits needed to represent the motion vectors (better

compression), while the use of smaller blocks can

result in a smaller amount of prediction residual

information to encode (better prediction). Older

designs such as H.261 and MPEG-1 video typically

use a fixed block size while newer ones such as

AVC give the encoder the ability to dynamically

choose which block size fits better to a specific

region. The overall procedure, which practically

exploits the temporal redundancy of a video

sequence for compression purposes, is called inter-

frame prediction. In the motion estimation for AVC,

there are in total seven possible block sizes to be

searched for each MB (modes 1–7 denote block

sizes of 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8,

and 4 × 4, respectively).

The simplest algorithmic implementation for the

inter-frame prediction is the Full Search (FS)

algorithm, which checks every displacement inside

the designated search window in order to specify the

best block size out of the seven available. The FS

algorithm, which evaluates Mean Absolute

Difference (MAD) at all possible regions of a frame,

has very high computational requirements, making

necessary the development of most sophisticated

algorithms providing a better trade-off between

computational complexity and prediction efficiency.

In this paper, we have modified the inter-frame

prediction mechanism in the reference encoder of

AVC/H.264 in order to perform the search and

match as it normally does, but each time looking for

the better match of only one specific block size (out

of the total seven available). Thus, the inter-frame

prediction continues to run but modified for a fixed

block size each time. So practically with this

modification the processing power consumption and

requirements have been reduced to the minimum.

The scope is to examine the perceptual impact of

the inter-frame prediction algorithms in conjunction

with variable in size blocks in relevance to the

spatiotemporal dynamics of the content.

3 VIDEO QUALITY

ASSESSMENT

The evaluation of the video quality is a matter of

objective and subjective procedures, which take

place after the encoding process. Subjective quality

evaluation processes of video streams require large

amount of human resources, establishing it as a

time-consuming process (Pereira and Alpert, 1997),

IMPACT OF H.264 ADVANCED VIDEO CODING INTER-FRAME BLOCK SIZES ON VIDEO QUALITY

119

(ITU, 2000). Objective methods, on the other hand,

can provide perceived QoS evaluation results faster,

but require sophisticated apparatus configurations.

The majority of the existing objective methods

requires the undistorted source video sequence as a

reference entity in the quality evaluation process,

and due to this, these methods are characterized as

Full Reference (FR) (Tan and Ghanbari, 2000),

providing a good benchmarking tool.

For this reason, in order to quantify the

perceptual difference between the different block

sizes, the use of objective instead of subjective

procedures was preferred. Since the perceptual

difference between the seven available block sizes is

expected to be rather small, the subjective

assessments could

not be able to provide a reliable

result due to the relative high statistical error

(Pinson and Wolf, 2003), which will encompass the

corresponding evaluation. Keeping this restriction in

mind, we decided to use the FR SSIM metric as an

objective metric which will benchmark the encoding

efficiency of different block sizes in relevance to the

spatiotemporal activity level of the video content.

SSIM is a FR metric for measuring the structural

similarity between two image sequences, exploiting

the general principle that the main function of the

human visual system is the extraction of structural

information from the viewing field. If x and y are

two video frames, then the SSIM is defined in

Equation (1):

))((

)2)(2(

),(

yxSSIM

yxyx

xyyx

++++

σσμμ

(1)

Where μ

, μ

are the mean of x and y, σ

, σ

are

the variances of x, y and the covariance of x and y,

respectively. The constants C

and C

are defined in

Equation (2):

= (K

(2)

Where L is the dynamic pixel range and K

= 0.01

and K

= 0.03, respectively (Wang and Lu, 2004),

(Wang and Bovik, 2004b).

4 TEST VIDEO SIGNALS

In order to examine the perceptual efficiency of

each block size, considering identical rest encodings

settings, we use 11 reference test signals of various

spatiotemporal activity levels.

The test signals used in this paper are of QCIF

resolution and their data and spatiotemporal activity

level are presented in the Table 1.

Table 1: Test signals.

Signal Frames

Akiyo 300

Suzzie 150

Claire 494

Carphone 382

Coastguard 300

Container 300

Foreman 300

Hall 300

Mobile 300

News 300

Silent 300

As it can be observed, the selected test signals

cover different range of the spatiotemporal plane,

are of QCIF spatial resolution and 25 fps of

temporal resolution. The next section presents the

results of the experimental section.

5 EVALUATION RESULTS

In order to examine the perceptual efficiency of

each block size, the aforementioned 11 reference

test signals were used as an input in the reference

AVC encoder, which was properly modified to use

each time only one specific block size in the inter-

frame prediction mechanism.

The encoding parameters were set to be similar

to the original source file in terms of spatial and

temporal resolution, while the encoding bit rate was

selected to be variable with the highest possible

quantization parameter in order to produce the best

possible encoding result, without considering the

encoding efficiency. Afterwards each encoded

signal was used along with the original source file in

the SSIM algorithm in order evaluate the deduced

video quality level for each block size as an average

for the whole in duration video signal. The results of

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

120

the procedure are depicted in Figure 2.

Figure 2: Average SSIM curves per block size and signal.

According to the depicted results of Figure 2, it

can be deduced two important observations: i) for a

specific video content (i.e., test signal) the variation

of the deduced video quality is not significant for

each block size, and ii) for specific video contents

(mainly low dynamic ones) it is noticed that the

limitation of the block sizes in the inter-frame

prediction process causes significant perceptual

degradation.

Considering the first observation and in order to

examine thoroughly the perceptual difference

among the different block sizes, we provide the

arithmetic results of the evaluation procedure in

Table 2

Table 2: Average SSIM values per block size and signal.

AverageSSIM 4_4 4_8 8_4 8_8 8_16 16_8 16_16

akiyo_qcif .yuv 0,215950 0,215780 0,215540 0,215850 0,215990 0,215570 0,215180

suzi e_qcif.yuv 0,942910 0,942910 0,944110 0,945040 0,945730 0,945590 0,946240

claire_qcif.yuv 0,975250 0,975870 0,975660 0,976130 0,975820 0,975630 0,975210

carphone_qcif.yuv 0,167450 0,167240 0,167560 0,167310 0,167360 0,167510 0,167000

coastguard_qcif.yuv 0,152820 0,152970 0,152910 0,153090 0,152850 0,153440 0,152810

container_qcif.yuv 0,937450 0,937310 0,937520 0,937200 0,936420 0,937070 0,936380

fore man_qcif.yuv 0,950940 0,951800 0,952180 0,952730 0,952730 0,951970 0,952040

hall_qci f.yuv 0,971720 0,971480 0,971830 0,971650 0,971200 0,970980 0,971130

mobile_qcif.yuv 0,970550 0,970050 0,970410 0,969880 0,969340 0,969170 0,969110

news_qcif .yuv 0,968150 0,968290 0,968150 0,951880 0,951140 0,950830 0,950850

sil ent_qcif.yuv 0,952300 0,951860 0,952090 0,951880 0,951140 0,950830 0,950850

AverageSSIM 0,745954 0,745960 0,746178 0,744785 0,744520 0,744417 0,744255

Interblocksearchmode

Table 2 demonstrates the slight difference in the

perceptual performance due to different block size.

Figure 3 provides a graphical representation of

the arithmetic data listed in Table 2. Based on the

depicted results, it can be observed that the

perceptual impact of the variable block size

selection is content dependent, while the fluctuation

of the quality among the block sizes in significantly

low in all cases. Although it was expected that the

use of smaller block sizes would generate better

perceptual results, but inefficient compression,

however for specific video signals (i.e., Suzie,

Coastguard, Foreman), a reverse analogous behavior

is noticed. More specifically, it is observed that for

these signals the video quality is enhanced when

bigger blocks are used and not small ones.

Based on these results, the suggestion that the

spatiotemporal activity of the content should be

considered as an input in the inter-frame prediction

and motion compensation algorithms is further

supported towards next generation power efficiency

encoders with low carbon footprint.

Moreover, it is important to be also noted that

the average quality level remains at satisfactory

levels for specific type of video contents even if

during the encoding process is used one block size.

However, for specific contents severe degradation

has been observed in the one block size mode.

0,21

0,22

4_4 4_8 8_4 8_8 8_16 16_8 16_16

Interblocksearch mode

akiyo_qcif.yuv

Figure 3: Detailed curves per block size and signal.

IMPACT OF H.264 ADVANCED VIDEO CODING INTER-FRAME BLOCK SIZES ON VIDEO QUALITY

121

Figure 3: Detailed curves per block size and signal (cont.).

6 CONCLUSIONS

This paper has presented a perceptual-based

encoding benchmarking of the AVC inter-frame

prediction variable block sizes for various spatial

and temporal contents. Preliminary results have

been provided showing that the perceptual

efficiency of each block size is dependent on the

content dynamics. Future work includes the

expansion of the experimental section using more

objective metrics in order to extent the sensitivity of

the measured perceptual efficiency. Also detailed

measurement of the spatiotemporal dynamics will be

performed in order to provide a mapping between

the content dynamics and the efficiency of each

block type.

ACKNOWLEDGEMENTS

The authors would like to thank A. Papaioannou and

Ch. Strologa for their contribution at the

experimental part of this paper. This work was

performed in the framework of the ALICANTE

project (FP7-ICT-248652).

REFERENCES

ISO/IEC, 2006. International Standard 14496-10,

Information Technology – Coding of Audio-Visual

Objects – Part 10: Advanced Video Coding, third ed.

Sullivan G., Wiegand, T., 2005. “Video compression –

from concepts to the H.264/AVC standard”. In

Proceedings of the IEEE, vol. 93, no. 1, pp. 18-31.

Pereira, F., Alpert, T., 1997. “MPEG-4 Video Subjective

Test Procedures Results”. In IEEE Trans. on Circuits

and Systems for Video Tech. Vol.7(1), pp.32-51.

ITU 2000. “Methodology for the subjective assessment of

the quality of television pictures”. In Recommendation

ITU-R BT.500-10.

Tan, K., Ghanbari, M., 2000. “A Multi-Metric Objective

Picture Quality Measurements Model for MPEG

Video”. In IEEE Transactions on Circuits and

Systems for Video Technology, Vol.10(7), pp. 1208-

1213.

Pinson M., Wolf, S., 2003. “Comparing subjective video

quality testing methodologies”. In Proceedings of the

SPIE Visual Communications and Image Processing,

Vol 5150, pp. 573-582.

Wang, Z., Lu, L., Bovik, A., 2004. “Video Quality

Assessment Based on Structural Distortion

Measurement”. In Image Communication, Vol. 19(2),

pp. 121-132.

Wang, Z., Bovik, A. C., Sheikh H. R., Simoncelli, E. P.,

2004. “Image quality assessment: From error visibility

to structural similarity”. In IEEE Transactions on

Image Processing, Vol. 13(4), pp. 1-14.

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

122