A VIDEO TRANSCODING SCHEME FOR E-LEARNING MULTIMEDIA APPLICATIONS

Nuno Santos, Pedro A. Amado Assunção

Polytechnic Institute of Leiria / ESTG – Institute of Telecommunications

Morro do Lena – Alto Vieiro, 2401-951 Leiria, Portugal

Keywords: Transcoding, multimedia, MPEG, video objects.

Abstract: In this paper, we propose a segmentation based transcoding scheme for adapting MPEG-2 e-learning visual

contents to heterogeneous environments. This is achieved by converting MPEG-2 video into MPEG-4 video

objects with arbitrary shape and different semantic value in an e-learning context. The transcoding scheme is

based on a hybrid segmentation method, which employs both compressed and pixel domain techniques, for

extraction of two video objects from MPEG-2 streams. The objective is two-fold: i) to enable individual

object coding and manipulation; ii) to increase the scene coding efficiency. The results show that our hybrid

segmentation method is capable of identifying the video objects of interest with good accuracy. Moreover,

the transcoding efficiency of the proposed scheme is better than straightforward conversion from MPEG-2

to MPEG-4.

1 INTRODUCTION

Digital multimedia is part of an ever increasing field

of applications where visual information plays the

major role. In this context, the MPEG family of coding standards has undoubtedly contributed to

the widespread use of compressed multimedia in

many different domains. This is particularly true for

the case of MPEG-2 (ISO/IEC, 1995) and MPEG-4 (ISO/IEC, 1999), since these two are among the most

used in current multimedia services and

applications. However, multimedia content delivery to different user contexts through diverse access networks requires specific adaptation tools for

providing Universal Multimedia Access (UMA)

(Pereira and Burnet, 2003). For this purpose, several

authors have proposed different types of transcoding

as the best solution for dealing with adaptation

problems in heterogeneous communication

scenarios (Assuncao and Ghanbari, 1998; Reyes et al., 2000; Shanableh and Ghanbari, 2000; Xin et al., 2002).

In the case of transcoding from MPEG-2 to

MPEG-4, several aspects have been addressed in

previous work. For example, a solution to the problem of motion vector reuse was proposed in (Takahashi et al., 2001), and efficient methods for transcoding from MPEG-2 to MPEG-4 are described in (Guo et al., 2001; Xie et al., 2003). A common aspect of these transcoding methods is that they deal with video frames or, using MPEG-4 terminology, rectangular video objects. We address a different type of transcoding where MPEG-2 video is

converted into MPEG-4 video objects with arbitrary

shape.

Distance education and e-learning are

increasingly important application domains where

multimedia technology plays a relevant role. The

particular characteristics of such environments give

rise to new types of heterogeneous transcoding and

media adaptation schemes for efficient

representation and manipulation (Dorai et al., 2003).

In this paper we propose a transcoding scheme

for adapting MPEG-2 coded video into MPEG-4

video objects of arbitrary shape for e-learning

applications. The video scenes to be transcoded are

constrained by the application specific context. They

comprise two objects of interest: the whiteboard and

the lecturer. In the proposed scheme, we exploit the

fact that both the whiteboard and the lecturer might

be encoded as independent video objects. Then, by taking advantage of the MPEG-4 coding tools, we propose a transcoding mechanism for mapping MPEG-2 coded signals into separate MPEG-4 video objects. The proposed transcoding scheme relies on

a hybrid domain spatio-temporal segmentation

algorithm. The two objects referred to above are

extracted from the coded video frames and then independently encoded by using different coding parameters, according to their inherent characteristics. The results show a good performance taking into account both objective and subjective quality as well as coding efficiency.

Assuncao P. and Santos N. (2004). A VIDEO TRANSCODING SCHEME FOR E-LEARNING MULTIMEDIA APPLICATIONS. In Proceedings of the First International Conference on E-Business and Telecommunication Networks, pages 278-283. DOI: 10.5220/0001397702780283. SciTePress.

This paper is organised as follows. In the next

section we address the specific characteristics of the

visual contents that we are dealing with in this work.

In section 3 we describe the hybrid segmentation

algorithm used for fast extraction of video objects.

The experimental results are presented in section 4

and finally section 5 concludes the paper.

2 THE VISUAL CONTENT

The type of multimedia contents that we deal with in

this work involves recorded MPEG-2 video

originally captured from a typical classroom scene.

This consists of a teacher speaking, writing on a

whiteboard and moving in front of the whiteboard

area. In the MPEG-4 context this scene contains two

video objects with different semantic value for the

human observer: the lecturer and the whiteboard.

Figure 1 shows one picture of the visual content

used in our experiments.

In regard to subjective quality it should be

stressed that motion smoothness is more important

than texture accuracy in the case of the lecturer,

while texture is much more important than motion in

the case of the whiteboard. This means that

different requirements should be defined for the

temporal and spatial quality of each video object in

order to achieve better transcoding efficiency. Note

that the most active periods correspond to different

types of motion, such as walking, writing on the whiteboard, gesturing and speaking.

Figure 1: A typical image from a video sequence used in an e-learning environment

The whiteboard is where the lecturer writes

down pedagogical contents for supporting and

complementing the oral explanations. The main

characteristic of this video object consists of its

relatively slow motion and high texture detail. The slow motion results from the human writing speed on this type of board, whereas the high spatial detail is a consequence of its specific visual contents, i.e., characters and diagrams written with a marker.

Therefore, this video object can be efficiently

encoded at reduced temporal rates such that more

bits are allocated to encode the texture information,

i.e., higher spatial quality.

3 SEGMENTATION BASED TRANSCODING

The segmentation based transcoding architecture proposed in this paper is depicted in Figure 2. It comprises a modified MPEG-2 decoder, which includes a hybrid video segmentation algorithm, and an MPEG-4 visual encoder. The hybrid segmentation process operates in both the DCT and pixel domains (Kim et al., 1999; Yu et al., 2003). As can be seen in the figure, the input MPEG-2 video stream is transcoded into two MPEG-4 video objects by using segmentation before the MPEG-4 encoding stage.

[Block diagram. Block labels: MPEG-2 stream; VLD; Q^-1; IDCT; Reference Frame Buffer; Identification of the Region of Interest; MB Based Spatial Segmentation; Pixel Domain Mask Refinement Algorithm; Hybrid Segmentation; Visual Object Generation; Object 1; Object 2; MPEG-4 encoding; MPEG-4 stream]

Figure 2: MPEG-2 to MPEG-4 segmentation based transcoding

3.1 DCT domain coarse segmentation

The DCT domain algorithm operates on a predefined temporal window in which a coarse spatial region is identified as the location of the lecturer. Since this is a low motion sequence, the boundaries of the temporal window were set at the I pictures of the MPEG-2 stream. This dynamic region is found by a fast algorithm that evaluates the DC distance between two macroblocks (MB) located in the same spatial position of consecutive I pictures. The DC distance $d_{i,n,t}$ between two MB in pictures $n$ and $n+t$, $t \in \mathbb{N}$, both with MB address $i$, is determined as follows:

$$d_{i,n,t} = \left| a_{i,n} - a_{i,n+t} \right|$$

and $a_{i,n}$ is given by

$$a_{i,n} = \left( \sum_{k} Y_{i,n}(k) \right) / 4 + Cb_{i,n} + Cr_{i,n}$$

where $Y_{i,n}(k)$ is the DC coefficient of the luminance block $k$ in MB $i$ of picture $n$, and $Cb_{i,n}$, $Cr_{i,n}$ are the DC coefficients of the corresponding chrominance blocks. Part of the dynamic region is comprised of those MB whose DC distance is greater than a threshold $Th$. The complete region of interest is then found after a further processing step that fills in the “holes” left by the previous one, i.e., those MB that belong to the inside of the moving region but were not identified because they have similar texture to that of their neighbours. This DCT domain processing allows identification of the slow moving region in a temporal segment of the sequence. Note that the spatial region found through this process contains more MB than those that actually belong to the object of interest. However, this is an extremely fast and efficient method for identifying the coarse region within which the lecturer can be found.
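The coarse segmentation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the DC measure of a MB is the mean of four luminance DC coefficients plus the two chrominance DC coefficients, that the distance is taken as an absolute difference, and that a "hole" is any unflagged MB that cannot be reached from the picture border through unflagged MB. The function names (`dc_measure`, `coarse_mask`, `fill_holes`) and the hole-filling rule are hypothetical; the paper does not specify how holes are filled.

```python
from collections import deque


def dc_measure(y_dc, cb_dc, cr_dc):
    """DC measure of one MB: mean luminance DC plus both chrominance DCs.

    Assumes four luminance blocks per MB (4:2:0 sampling).
    """
    return sum(y_dc) / 4.0 + cb_dc + cr_dc


def coarse_mask(a_prev, a_next, th):
    """Flag each MB whose DC distance between two I pictures exceeds th."""
    return [[abs(p - q) > th for p, q in zip(row_p, row_q)]
            for row_p, row_q in zip(a_prev, a_next)]


def fill_holes(mask):
    """Fill unflagged MB that are enclosed by flagged MB (assumed rule).

    Unflagged MB that are 4-connected to the picture border are treated as
    background; everything else ends up inside the region of interest.
    """
    rows, cols = len(mask), len(mask[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and not mask[r][c]:
                outside[r][c] = True
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= rr < rows and 0 <= cc < cols
                    and not mask[rr][cc] and not outside[rr][cc]):
                outside[rr][cc] = True
                queue.append((rr, cc))
    return [[not outside[r][c] for c in range(cols)] for r in range(rows)]
```

In use, `dc_measure` would be evaluated for every MB address of two consecutive I pictures, and `fill_holes(coarse_mask(a_prev, a_next, th))` would give the set of MB addresses passed on to the pixel domain stage.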

The result is a set of MB addresses that define a

much smaller region containing the video object.

From the experimental results we found that, in the

type of sequence we are using, the size of this region

is about 25% - 43% of the total image size.

Therefore the pixel domain mask refinement that

follows this stage operates on a much smaller set of

data.

3.2 Pixel domain mask refinement

The actual size of the moving video object is smaller than that of the DCT domain mask obtained in the previous step because the spatial region is identified on a GOP basis. Thus, the resulting mask includes not only the object of interest but also the surrounding area where it moves between two I pictures.

One of the tasks of the pixel domain refinement algorithm is to shrink the DCT mask down to the actual boundaries of the object. As a consequence, the object shape is also refined to the pixel level. This also improves the quality of the segmentation because the first mask obtained in the DCT domain coincides with the MB structure of the original picture and hence has a stepwise boundary.

The pixel domain algorithm implemented for obtaining the refined masks involves four steps: (1) median filtering the region of interest to eliminate noise (Yin et al., 1996); (2) histogram analysis of the spatial region of interest to determine the best threshold for splitting the pixels into two sets; (3) obtaining a segmentation mask based on the histogram; and (4) post-processing to eliminate isolated small groups of pixels which do not belong to the main area (characters, lines, etc.).
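The four refinement steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the paper does not name the threshold selection rule, so Otsu's between-class variance criterion is swapped in here; the lecturer is assumed to be darker than the whiteboard; a plain 3x3 median is used instead of a weighted median; and the minimum component size `min_size` is a hypothetical parameter. All function names are illustrative.

```python
from collections import deque
from statistics import median


def median3x3(img):
    """Step 1: 3x3 median filter; border pixels use the neighbours that exist."""
    rows, cols = len(img), len(img[0])
    return [[int(median(img[rr][cc]
                        for rr in range(max(r - 1, 0), min(r + 2, rows))
                        for cc in range(max(c - 1, 0), min(c + 2, cols))))
             for c in range(cols)]
            for r in range(rows)]


def otsu_threshold(img):
    """Step 2: threshold maximising the between-class variance (assumed rule)."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        var = w0 * w1 * (sum0 / w0 - (total_sum - sum0) / w1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def remove_small(mask, min_size):
    """Step 4: drop 4-connected groups with fewer than min_size pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            comp, queue = [], deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for yy, xx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= yy < rows and 0 <= xx < cols
                            and mask[yy][xx] and not seen[yy][xx]):
                        seen[yy][xx] = True
                        queue.append((yy, xx))
            if len(comp) >= min_size:
                for y, x in comp:
                    out[y][x] = True
    return out


def refine_mask(region, min_size=20):
    """Run steps 1 to 4 on an 8-bit luminance region given as a list of rows."""
    smooth = median3x3(region)
    t = otsu_threshold(smooth)
    # Step 3: pixels at or below the threshold are taken as the (darker) object.
    mask = [[v <= t for v in row] for row in smooth]
    return remove_small(mask, min_size)
```

The input `region` would be the luminance of the coarse region found in the DCT domain, so the refinement only touches the part of the picture that may contain the lecturer.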

4 EXPERIMENTAL RESULTS

In order to evaluate the performance of the

segmentation based transcoding scheme, we have

used MPEG-2 video streams, which are currently

being used for e-learning purposes within the

intranet of our campus. These are of the type shown

in Figure 1. We have carried out several subjective tests in order to find the minimum bandwidth that achieves good subjective quality for this type of application, since poor picture quality is not acceptable because it may lead to additional learning difficulties. From these tests we have found that 2 Mbps provides an acceptable visual quality. Then the MPEG-2 stream was transcoded into an MPEG-4 visual stream combining two video objects. The

objective is two-fold, i.e., to enable individual object

coding and manipulation as well as to increase the

scene coding efficiency. While the former is

achieved by proper segmentation, the latter greatly

depends on both the efficiency of the coding

algorithm and the set of coding parameters.

ICETE 2004 - WIRELESS COMMUNICATION SYSTEMS AND NETWORKS

4.1 Segmentation

The hybrid segmentation algorithm described in

Section 3 was used to produce the segmentation

mask for extracting the two objects of interest from

the MPEG-2 video stream. To ease the comparison, we show the results for the picture shown in Figure 1. Figure 3 shows the coarse mask obtained from the DCT domain algorithm and Figure 4 the corresponding spatial region. As pointed out before, this region is larger than the actual moving object (the lecturer), and the mask precision is limited to MB level because this is the processing data unit in the DCT domain. Figure 5 shows the region identified by the histogram based algorithm and post-processing, and Figure 6 shows the video object identified through this process.

Figure 3: Coarse mask obtained in the DCT domain.

Figure 4: Coarse spatial region obtained from the

corresponding mask

Figure 5: Refined mask

Figure 6: Video object

4.2 Transcoding efficiency

In order to evaluate the picture quality under a

significant transcoding ratio, we have set the output

bit rate to 500 kbps and we compare three different

transcoding schemes. In all cases we have used the

same input video sequence, which was available in

the server at 2 Mbps.

For reference and comparison with the proposed

scheme we have used straightforward transcoding

from MPEG-2 to MPEG-2 and from MPEG-2 to

MPEG-4, using a single rectangular object. In the

case of the proposed scheme, we have taken into

account the inherent characteristics of each visual

object, as pointed out before. Then for each video

object we have set the same output bit rate of 250

kbps but different temporal rates. The lecturer was encoded at 25 Hz whereas the whiteboard was encoded at 6.25 Hz. For comparison with the video frames of the reference schemes, the two objects were decoded and then combined to form frames again.
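The effect of the two temporal rates on the bit budget, and the recombination of the decoded objects, can be illustrated with a short Python sketch. It assumes 1 kbps = 1000 bit/s, a constant average budget per coded picture, a simple "hold the last whiteboard picture" rule for output frames where no new whiteboard picture exists, and a binary lecturer mask. The function names are hypothetical and the paper does not describe its composition procedure in this detail.

```python
def bits_per_vop(bit_rate_bps, vop_rate_hz):
    """Average number of bits available for each coded picture of an object."""
    return bit_rate_bps / vop_rate_hz


def whiteboard_index(frame_idx, lecturer_hz=25.0, whiteboard_hz=6.25):
    """Index of the most recent whiteboard picture for a given output frame."""
    return int(frame_idx * whiteboard_hz / lecturer_hz)


def compose(whiteboard, lecturer, mask):
    """Paste lecturer pixels over the whiteboard wherever the mask is set."""
    return [[l if m else w for w, l, m in zip(w_row, l_row, m_row)]
            for w_row, l_row, m_row in zip(whiteboard, lecturer, mask)]


# Both objects get 250 kbps, but the whiteboard is coded four times less often,
# so each whiteboard picture can use four times as many bits on average.
lecturer_budget = bits_per_vop(250_000, 25.0)      # 10000.0 bits per picture
whiteboard_budget = bits_per_vop(250_000, 6.25)    # 40000.0 bits per picture
```

With these rates, output frames 0 to 3 reuse whiteboard picture 0, frames 4 to 7 reuse picture 1, and so on.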

As we can observe in Figure 7, the proposed scheme achieves good performance compared with both references. By using different coding

parameters for each video object, the transcoded

pictures have better spatial quality in the whiteboard

area, mainly because of its reduced temporal rate

which allows more bits to encode the texture. The

composition problem that arises when different

video objects are displayed at different frame rates

may be overcome by filling in the missing areas

with pixels from the surrounding area. However,

this issue is not addressed in this paper. The same

behaviour as shown in Figure 7 is obtained for other

transcoding ratios.
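Figure 7 reports the PSNR of the transcoded sequence. For reference, the usual PSNR definition for 8-bit pictures can be sketched in Python as below. Which pictures serve as the reference (for example, the decoded 2 Mbps input) is not stated in the text, so the choice of `reference` here is an assumption, and the function name is illustrative.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized pictures given as lists of rows."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for ref_px, test_px in zip(ref_row, test_row):
            squared_error += (ref_px - test_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical pictures
    return 10.0 * math.log10(peak ** 2 / mse)
```

A sequence-level curve such as the one in Figure 7 would be obtained by evaluating this function for every recombined frame.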

5 CONCLUSION

The transcoding scheme proposed in this paper is

suitable for e-learning applications where MPEG-2

to MPEG-4 conversion might be useful. The

experimental results show that a good performance

is achieved by choosing different temporal rates for

video objects according to their specific

characteristics.

A possible application of this type of transcoding is

in wireless access where the user may receive the

audio and only the whiteboard visual information at

much lower bit rates but still with an acceptable

quality of service.

REFERENCES

Assuncao P. and Ghanbari M., 1998. A frequency domain transcoder for dynamic bit rate reduction of MPEG-2 bit streams, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 8, pp. 953-967.

Dorai C., Oria V. and Neelavalli V., 2003. Structuralizing Educational Videos Based on Presentation Content, IEEE International Conference on Image Processing, Barcelona, Spain.

Guo W., Lin L., Zheng W. and Zheng W., 2001. Mismatched MB Retrieval from MPEG-2 to MPEG-4 Transcoding, IEEE Pacific Rim Conference on Multimedia, Beijing, China.

ISO/IEC 13818-2, 1995. Generic Coding of Moving Pictures and Associated Audio - Part 2: Video.

ISO/IEC 14496-2, 1999. Information Technology - Generic Coding of Audio-Visual Objects - Part 2: Visual, Vancouver.

Kim M., Choi J. G., Kim D., Lee H., Lee M. H., Ahn C. and Ho Y-S., 1999. A VOP Generation Tool: Automatic Segmentation of Moving Objects in Image Sequences Based on Spatio-Temporal Information, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8, pp. 1216-1226.

Pereira F. and Burnet I., 2003. Universal Multimedia Experiences for Tomorrow, IEEE Signal Processing Magazine, Vol. 20, No. 2.

Reyes G., Reibman A., Chang S-F. and Chuang J., 2000. Error Resilient Transcoding for Video over Wireless Channels, IEEE Journal on Selected Areas in Communications, Vol. 18, No. 6.

Shanableh T. and Ghanbari M., 2000. Heterogeneous video transcoding to lower spatio-temporal resolutions and different encoding formats, IEEE Transactions on Multimedia, Vol. 2, No. 2, pp. 101-110.

Takahashi K., Satoh K., Suzuki T. and Yagasaki Y., 2001. Motion Vector Synthesis Algorithm for MPEG2-to-MPEG4 transcoder, Visual Communications and Image Processing, Proceedings of SPIE, Vol. 4310, pp. 872-882.

Xie R., Liu J. and Wang X., 2003. Efficient MPEG-2 to MPEG-4 Compressed Video Transcoding, Visual Communications and Image Processing, Proceedings of SPIE, Vol. 4671, pp. 192-201, Lugano, Switzerland.

Xin J., Sun M-T., Choi B-S. and Chun K-W., 2002. An HDTV to SDTV Spatial Transcoder, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 11.

Yin L., Yang R., Gabbouj M. and Neuvo Y., 1996. Weighted Median Filters: A Tutorial, IEEE Transactions on Circuits and Systems, Vol. 43, No. 3, pp. 157-192.

Yu X-D., Duan L-Y. and Tian Q., 2003. Robust Moving Video Object Segmentation in the MPEG Compressed Domain, IEEE International Conference on Image Processing, Barcelona, Spain.

Figure 7: PSNR of the transcoded sequence
