MPEG-2 VIDEO COMPRESSION

MPEG-2 is an extension of the MPEG-1 international standard for digital compression of audio and video signals. MPEG-1 was designed to code progressively scanned video at bit rates up to about 1.5 Mbit/s for applications such as CD-i (compact disc interactive). MPEG-2 is directed at broadcast formats at higher data rates; it provides extra algorithmic 'tools' for efficiently coding interlaced video, supports a wide range of bit rates and provides for multichannel surround sound coding. This tutorial paper introduces the principles used for compressing video according to the MPEG-2 standard, outlines the general structure of a video coder and decoder, and describes the subsets ('profiles') of the toolkit and the sets of constraints on parameter values ('levels') defined to date.

1. INTRODUCTION

Recent progress in digital technology has made the widespread use of compressed digital video signals practical. Standardisation has been very important in the development of common compression methods to be used in the new services and products that are now possible. This allows the new services to interoperate with each other and encourages the investment needed in integrated circuits to make the technology cheap.

MPEG (Moving Picture Experts Group) was started in 1988 as a working group within ISO/IEC with the aim of defining standards for digital compression of audio-visual signals. MPEG's first project, MPEG-1, was published in 1993 as ISO/IEC 11172 [1]. It is a three-part standard defining audio and video compression coding methods and a multiplexing system for interleaving audio and video data so that they can be played back together. MPEG-1 principally supports video coding up to about 1.5 Mbit/s giving quality similar to VHS and stereo audio at 192 kbit/s. It is used in the CD-i and Video-CD systems for storing video and audio on CD-ROM.

During 1990, MPEG recognised the need for a second, related standard for coding video for broadcast formats at higher data rates. The MPEG-2 standard [2] is capable of coding standard-definition television at bit rates of about 3-15 Mbit/s and high-definition television at about 15-30 Mbit/s. MPEG-2 extends the stereo audio capabilities of MPEG-1 to multi-channel surround sound coding. MPEG-2 decoders will also decode MPEG-1 bitstreams.

Drafts of the audio, video and systems specifications were completed in November 1993 and the ISO/IEC approval process was completed in November 1994. The final text was published in 1995.

MPEG-2 aims to be a generic video coding system supporting a diverse range of applications. Different algorithmic 'tools', developed for many applications, have been integrated into the full standard. To implement all the features of the standard in all decoders is unnecessarily complex and a waste of bandwidth, so a small number of subsets of the full standard, known as profiles and levels, have been defined. A profile is a subset of algorithmic tools and a level identifies a set of constraints on parameter values (such as picture size and bit rate). A decoder which supports a particular profile and level is only required to support the corresponding subset of the full standard and set of parameter constraints.

This paper introduces the principles used in MPEG-2 video compression systems, outlines the general structure of a coder and decoder, and describes the profiles and levels defined to date.

2. VIDEO FUNDAMENTALS

Television services in Europe currently broadcast video at a frame rate of 25 Hz. Each frame consists of two interlaced fields, giving a field rate of 50 Hz. The first field of each frame contains only the odd numbered lines of the frame (numbering the top frame line as line 1). The second field contains only the even numbered lines of the frame and is sampled in the video camera 20 ms after the first field. It is important to note that one interlaced frame contains fields from two instants in time. American television is similarly interlaced but with a frame rate of just under 30 Hz.

In video systems other than television, non-interlaced video is commonplace (for example, most computers output non-interlaced video). In non-interlaced video, all the lines of a frame are sampled at the same instant in time. Non-interlaced video is also termed 'progressively scanned' or 'sequentially scanned' video.

The red, green and blue (RGB) signals coming from a colour television camera can be equivalently expressed as luminance (Y) and chrominance (UV) components. The chrominance bandwidth may be reduced relative to the luminance without significantly affecting the picture quality. For standard definition video, CCIR recommendation 601 [3] defines how the component (YUV) video signals can be sampled and digitised to form discrete pixels. The terms 4:2:2 and 4:2:0 are often used to describe the sampling structure of the digital picture. 4:2:2 means the chrominance is horizontally subsampled by a factor of two relative to the luminance; 4:2:0 means the chrominance is horizontally and vertically subsampled by a factor of two relative to the luminance.

The active region of a digital television frame, sampled according to CCIR recommendation 601, is 720 pixels by 576 lines for a frame rate of 25 Hz. Using 8 bits for each Y, U or V pixel, the uncompressed bit rates for 4:2:2 and 4:2:0 signals are therefore:

 4:2:2: 720 x 576 x 25 x 8 + 360 x 576 x 25 x (8 + 8) = 166 Mbit/s
 4:2:0: 720 x 576 x 25 x 8 + 360 x 288 x 25 x (8 + 8) = 124 Mbit/s
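
These figures can be reproduced directly. A minimal Python sketch, with a function name and argument defaults of our own choosing:

    # Uncompressed bit rate for 8-bit component video sampled as above.
    def uncompressed_bit_rate(chroma_lines, luma_pixels=720, luma_lines=576,
                              frame_rate=25, bits=8, chroma_pixels=360):
        luma = luma_pixels * luma_lines * frame_rate * bits
        chroma = chroma_pixels * chroma_lines * frame_rate * bits * 2  # U and V
        return luma + chroma

    print(uncompressed_bit_rate(576) / 1e6)  # 4:2:2 -> 165.888, i.e. ~166 Mbit/s
    print(uncompressed_bit_rate(288) / 1e6)  # 4:2:0 -> 124.416, i.e. ~124 Mbit/s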

MPEG-2 is capable of compressing the bit rate of standard-definition 4:2:0 video down to about 3-15 Mbit/s. At the lower bit rates in this range, the impairments introduced by the MPEG-2 coding and decoding process become increasingly objectionable. For digital terrestrial television broadcasting of standard-definition video, a bit rate of around 6 Mbit/s is thought to be a good compromise between picture quality and transmission bandwidth efficiency.

3. BIT RATE REDUCTION PRINCIPLES

A bit rate reduction system operates by removing redundant information from the signal at the coder prior to transmission and re-inserting it at the decoder. A coder and decoder pair are referred to as a 'codec'. In video signals, two distinct kinds of redundancy can be identified.

Spatial and temporal redundancy: Pixel values are not independent, but are correlated with their neighbours both within the same frame and across frames. So, to some extent, the value of a pixel is predictable given the values of neighbouring pixels.

Psychovisual redundancy: The human eye has a limited response to fine spatial detail [4], and is less sensitive to detail near object edges or around shot-changes. Consequently, controlled impairments introduced into the decoded picture by the bit rate reduction process should not be visible to a human observer.

Two key techniques employed in an MPEG codec are intra-frame Discrete Cosine Transform (DCT) coding and motion-compensated inter-frame prediction. These techniques have been successfully applied to video bit rate reduction prior to MPEG, notably for 625-line video contribution standards at 34 Mbit/s [5] and video conference systems at bit rates below 2 Mbit/s [6].

Intra-frame DCT coding

DCT [7]: A two-dimensional DCT is performed on small blocks (8 pixels by 8 lines) of each component of the picture to produce blocks of DCT coefficients (Fig. 1). The magnitude of each DCT coefficient indicates the contribution of a particular combination of horizontal and vertical spatial frequencies to the original picture block. The coefficient corresponding to zero horizontal and vertical frequency is called the DC coefficient.

 F(u,v) = \frac{C(u)\,C(v)}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}

 where f(x,y) is the pixel value, F(u,v) is the DCT coefficient, and C(k) = 1/\sqrt{2} for k = 0, C(k) = 1 otherwise.

[Fig. 1]

Fig. 1 - The discrete cosine transform (DCT).
Pixel value and DCT coefficient magnitude are represented by dot size.

The DCT does not directly reduce the number of bits required to represent the block. In fact, for an 8x8 block of 8-bit pixels, the DCT produces an 8x8 block of 11-bit coefficients (the range of coefficient values is larger than the range of pixel values). The reduction in the number of bits follows from the observation that, for typical blocks from natural images, the distribution of coefficients is non-uniform. The transform tends to concentrate the energy into the low-frequency coefficients and many of the other coefficients are near-zero. The bit rate reduction is achieved by not transmitting the near-zero coefficients and by quantising and coding the remaining coefficients as described below. The non-uniform coefficient distribution is a result of the spatial redundancy present in the original image block.
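
This energy compaction is easy to observe with a direct implementation. The following Python sketch builds the standard orthonormal DCT-II basis matrix with numpy (the scaling convention is ours, chosen for illustration; the standard defines the transform analytically):

    import numpy as np

    def dct_matrix(n=8):
        # Row u of the matrix is the u-th DCT-II basis function.
        c = np.where(np.arange(n) == 0, np.sqrt(1.0 / n), np.sqrt(2.0 / n))
        u = np.arange(n)[:, None]  # frequency index
        x = np.arange(n)[None, :]  # pixel index
        return c[:, None] * np.cos((2 * x + 1) * u * np.pi / (2 * n))

    D = dct_matrix()

    def dct2(block):    # forward 2-D DCT (transform rows, then columns)
        return D @ block @ D.T

    def idct2(coeffs):  # inverse 2-D DCT
        return D.T @ coeffs @ D

    # A smooth horizontal ramp: the energy gathers in the top-left
    # (low-frequency) corner of the coefficient block.
    block = np.tile(np.arange(0, 256, 32, dtype=float), (8, 1))
    coeffs = dct2(block)
    print(np.round(coeffs[0, :3]))                # large low-frequency terms
    print(np.round(np.abs(coeffs[1:, :]).max()))  # all higher rows are ~0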

Quantisation: The function of the coder is to transmit the DCT block to the decoder, in a bit rate efficient manner, so that it can perform the inverse transform to reconstruct the image. It has been observed that the numerical precision of the DCT coefficients may be reduced while still maintaining good image quality at the decoder. Quantisation is used to reduce the number of possible values to be transmitted, reducing the required number of bits.

The degree of quantisation applied to each coefficient is weighted according to the visibility of the resulting quantisation noise to a human observer. In practice, this results in the high-frequency coefficients being more coarsely quantised than the low-frequency coefficients. Note that the quantisation noise introduced by the coder is not reversible in the decoder, making the coding and decoding process 'lossy'.
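
As a rough sketch of frequency-weighted quantisation in Python, with an invented weighting matrix (MPEG-2 defines its own default matrices, which are not reproduced here):

    import numpy as np

    # Illustrative weights: quantisation grows coarser with frequency.
    f = np.arange(8)
    WEIGHT = 16.0 + 2.0 * (f[:, None] + f[None, :])  # 16 at DC, 44 at (7,7)

    def quantise(coeffs, quantiser_scale=2.0):
        step = WEIGHT * quantiser_scale / 16.0
        return np.round(coeffs / step).astype(int)

    def dequantise(levels, quantiser_scale=2.0):
        return levels * WEIGHT * quantiser_scale / 16.0

    # coeffs - dequantise(quantise(coeffs)) is the irreversible
    # quantisation noise; raising quantiser_scale lowers the bit rate
    # at the cost of more noise.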

Coding: The serialisation and coding of the quantised DCT coefficients exploits the likely clustering of energy into the low-frequency coefficients and the frequent occurrence of zero-value coefficients. The block is scanned in a diagonal zigzag pattern starting at the DC coefficient to produce a list of quantised coefficient values, ordered according to the scan pattern.

The list of values produced by scanning is entropy coded using a variable-length code (VLC). Each VLC code word denotes a run of zeros followed by a non-zero coefficient of a particular level. VLC coding recognises that short runs of zeros are more likely than long ones and small coefficients are more likely than large ones. The VLC allocates code words which have different lengths depending upon the probability with which they are expected to occur. To enable the decoder to distinguish where one code ends and the next begins, the VLC has the property that no complete code is a prefix of any other.

Fig. 1 shows the zigzag scanning process, using the scan pattern common to both MPEG-1 and MPEG-2. MPEG-2 has an additional 'alternate' scan pattern intended for scanning the quantised coefficients resulting from interlaced source pictures.
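
The zigzag order can be generated rather than tabulated. A small Python sketch producing the common MPEG-1/MPEG-2 scan of Fig. 1 (the alternate scan differs and is not shown):

    def zigzag_order(n=8):
        # Walk the anti-diagonals u + v = s, alternating direction,
        # starting from the DC coefficient at (0, 0).
        order = []
        for s in range(2 * n - 1):
            diag = [(u, s - u) for u in range(n) if 0 <= s - u < n]
            order.extend(diag if s % 2 else reversed(diag))
        return order

    def zigzag_scan(block):
        return [block[u][v] for u, v in zigzag_order()]

    print(zigzag_order()[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]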

To illustrate the variable-length coding process, consider the following example list of values produced by scanning the quantised coefficients from a transformed block:

 12, 6, 6, 0, 4, 3, 0, 0, 0...0

The first step is to group the values into runs of (zero or more) zeros followed by a non-zero value. Additionally, the final run of zeros is replaced with an end of block (EOB) marker. Using parentheses to show the groups, this gives:

 (12), (6), (6), (0, 4), (3) EOB

The second step is to generate the variable-length code words corresponding to each group (a run of zeros followed by a non-zero value) and the EOB marker. Table 1 shows an extract of the DCT coefficient VLC table common to both MPEG-1 and MPEG-2. MPEG-2 has an additional 'intra' VLC optimised for coding intra blocks (see Section 4). Using the variable-length code from Table 1 and adding spaces and commas for readability, the final coded representation of the example block is:

 0000 0000 1101 00, 0010 0001 0, 0010 0001 0, 0000 0011 000, 0010 10, 10

Table 1: Extract from the MPEG-2 DCT coefficient VLC table.

Length of run of zeros | Value of non-zero coefficient | Variable-length codeword
-----------------------|-------------------------------|-------------------------
0                      | 12                            | 0000 0000 1101 00
0                      | 6                             | 0010 0001 0
1                      | 4                             | 0000 0011 000
0                      | 3                             | 0010 10
EOB                    | -                             | 10
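
The grouping and code-word look-up in this example are small enough to sketch in Python; the VLC dictionary holds only the Table 1 entries needed for the worked example:

    def run_level_pairs(scanned):
        # Group into (run of zeros, non-zero level) pairs; the final run
        # of zeros becomes an end-of-block (EOB) marker.
        pairs, run = [], 0
        for value in scanned:
            if value == 0:
                run += 1
            else:
                pairs.append((run, value))
                run = 0
        return pairs + ['EOB']

    VLC = {(0, 12): '0000 0000 1101 00', (0, 6): '0010 0001 0',
           (1, 4): '0000 0011 000', (0, 3): '0010 10', 'EOB': '10'}

    scanned = [12, 6, 6, 0, 4, 3] + [0] * 58
    print(', '.join(VLC[p] for p in run_level_pairs(scanned)))
    # -> 0000 0000 1101 00, 0010 0001 0, 0010 0001 0, 0000 0011 000, 0010 10, 10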

Motion-compensated inter-frame prediction

This technique exploits temporal redundancy by attempting to predict the frame to be coded from a previous 'reference' frame. The prediction cannot be based on a source picture because the prediction has to be repeatable in the decoder, where the source pictures are not available (the decoded pictures are not identical to the source pictures because the bit rate reduction process introduces small distortions into the decoded picture). Consequently, the coder contains a local decoder which reconstructs pictures exactly as they would be in the decoder, from which predictions can be formed.

The simplest inter-frame prediction of the block being coded is that which takes the co-sited (i.e. the same spatial position) block from the reference picture. Naturally this makes a good prediction for stationary regions of the image, but is poor in moving areas. A more sophisticated method, known as motion-compensated inter-frame prediction, is to offset any translational motion which has occurred between the block being coded and the reference frame and to use a shifted block from the reference frame as the prediction.

One method of determining the motion that has occurred between the block being coded and the reference frame is a 'block-matching' search in which a large number of trial offsets are tested by the coder using the luminance component of the picture. The 'best' offset is selected on the basis of minimum error between the block being coded and the prediction.
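
The standard does not mandate any particular search method. A brute-force Python sketch, with an illustrative 16x16 block size and a +/-7 pixel search range:

    import numpy as np

    def best_offset(current, reference, top, left, size=16, search=7):
        # Exhaustive block-matching search over all offsets within
        # +/- `search` pixels, minimising the sum of absolute differences
        # (SAD). `current` and `reference` are luminance arrays of
        # signed integers or floats.
        block = current[top:top + size, left:left + size]
        best, best_sad = (0, 0), float('inf')
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + size > reference.shape[0] \
                        or x + size > reference.shape[1]:
                    continue  # candidate block falls outside the picture
                sad = np.abs(block - reference[y:y + size, x:x + size]).sum()
                if sad < best_sad:
                    best, best_sad = (dy, dx), sad
        return best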

Motion-compensated prediction incurs a bit rate overhead: the motion vectors needed to form each block's prediction must be conveyed to the decoder. For example, using MPEG-2 to compress standard-definition video to 6 Mbit/s, the motion vector overhead could account for about 2 Mbit/s in a picture making heavy use of motion-compensated prediction.

4. MPEG-2 DETAILS

Codec structure

In an MPEG-2 system, the DCT and motion-compensated inter-frame prediction are combined, as shown in Fig. 2. The coder subtracts the motion-compensated prediction from the source picture to form a 'prediction error' picture. The prediction error is transformed with the DCT, the coefficients are quantised and these quantised values coded using a VLC. The coded luminance and chrominance prediction error is combined with 'side information' required by the decoder, such as motion vectors and synchronising information, and formed into a bitstream for transmission. Fig. 3 shows an outline of the MPEG-2 video bitstream structure.

[Fig. 2]

[key]

Fig. 2 - (a) Motion-compensated DCT coder; (b) motion compensated DCT decoder.

[Fig. 3]

Fig. 3 - Outline of MPEG-2 video bitstream structure (shown bottom up).

In the decoder, the quantised DCT coefficients are reconstructed and inverse transformed to produce the prediction error. This is added to the motion-compensated prediction generated from previously decoded pictures to produce the decoded output.

In an MPEG-2 codec, the motion-compensated predictor shown in Fig. 2 supports many methods for generating a prediction. For example, the block may be 'forward predicted' from a previous picture, 'backward predicted' from a future picture, or 'bidirectionally predicted' by averaging a forward and backward prediction. The method used to predict the block may change from one block to the next. Additionally, the two fields within a block may be predicted separately with their own motion vector, or together using a common motion vector. Another option is to make a zero-value prediction, such that the source image block rather than the prediction error block is DCT coded. For each block to be coded, the coder chooses between these prediction modes, trying to maximise the decoded picture quality within the constraints of the bit rate. The choice of prediction mode is transmitted to the decoder, with the prediction error, so that it may regenerate the correct prediction.
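
As a much-simplified Python sketch of this decision (a real coder would weigh prediction error against the bit cost of each mode; plain minimum error is used here for brevity):

    import numpy as np

    def choose_prediction(block, forward, backward):
        # Candidate predictions for one block; the zero prediction means
        # the source block itself, rather than an error block, is DCT coded.
        candidates = {
            'forward': forward,
            'backward': backward,
            'bidirectional': (forward + backward) / 2.0,
            'intra (zero prediction)': np.zeros_like(block),
        }
        errors = {mode: np.abs(block - pred).sum()
                  for mode, pred in candidates.items()}
        return min(errors, key=errors.get)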

Picture types

In MPEG-2, three 'picture types' are defined. The picture type defines which prediction modes may be used to code each block.

'Intra' pictures (I-pictures) are coded without reference to other pictures. Moderate compression is achieved by reducing spatial redundancy, but not temporal redundancy. They can be used periodically to provide access points in the bitstream where decoding can begin.

'Predictive' pictures (P-pictures) can use the previous I- or P-picture for motion compensation and may be used as a reference for further prediction. Each block in a P-picture can either be predicted or intra-coded. By reducing spatial and temporal redundancy, P-pictures offer increased compression compared to I-pictures.

'Bidirectionally-predictive' pictures (B-pictures) can use the previous and next I- or P-pictures for motion-compensation, and offer the highest degree of compression. Each block in a B-picture can be forward, backward or bidirectionally predicted or intra-coded. To enable backward prediction from a future frame, the coder reorders the pictures from natural 'display' order to 'bitstream' order so that the B-picture is transmitted after the previous and next pictures it references. This introduces a reordering delay dependent on the number of consecutive B-pictures.

The different picture types typically occur in a repeating sequence, termed a 'Group of Pictures' or GOP. A typical GOP in display order is:

 B1 B2 I3 B4 B5 P6 B7 B8 P9 B10 B11 P12

The corresponding bitstream order is:

 I3 B1 B2 P6 B4 B5 P9 B7 B8 P12 B10 B11

A regular GOP structure can be described with two parameters: N, which is the number of pictures in the GOP, and M, which is the spacing of P-pictures. The GOP given here is described as N=12 and M=3. MPEG-2 does not insist on a regular GOP structure. For example, a P-picture following a shot-change may be badly predicted since the reference picture for prediction is completely different from the picture being predicted. Thus, it may be beneficial to code it as an I-picture instead.
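
The display-to-bitstream reordering can be expressed in a few lines of Python; the sketch assumes, as in the example above, that the sequence ends with an I- or P-picture:

    def bitstream_order(display_order):
        # Hold each B-picture back until the following I- or P-picture
        # (its backward reference) has been transmitted.
        out, held_b = [], []
        for picture in display_order:
            if picture.startswith('B'):
                held_b.append(picture)
            else:
                out.append(picture)
                out += held_b
                held_b = []
        return out

    gop = 'B1 B2 I3 B4 B5 P6 B7 B8 P9 B10 B11 P12'.split()
    print(' '.join(bitstream_order(gop)))
    # -> I3 B1 B2 P6 B4 B5 P9 B7 B8 P12 B10 B11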

For a given decoded picture quality, coding using each picture type produces a different number of bits. In a typical example sequence, a coded I-picture was three times larger than a coded P-picture, which was itself 50% larger than a coded B-picture.

Buffer control

By removing much of the redundancy from the source images, the coder outputs a variable bit rate. The bit rate depends on the complexity and predictability of the source picture and the effectiveness of the motion-compensated prediction.

For many applications, the bitstream must be carried in a fixed bit rate channel. In these cases, a buffer store is placed between the coder and the channel. The buffer is filled at a variable rate by the coder, and emptied at a constant rate by the channel. To prevent the buffer from under- or overflowing, a feedback mechanism acts to adjust the average coded bit rate as a function of the buffer fullness. For example, the average coded bit rate may be lowered by increasing the degree of quantisation applied to the DCT coefficients. This reduces the number of bits generated by the variable-length coding, but increases distortion in the decoded image. The decoder must also have a buffer between the channel and the variable rate input to the decoding process. The size of the buffers in the coder and decoder must be the same.

MPEG-2 defines the maximum decoder (and hence coder) buffer size, although the coder may choose to use only part of this. The delay through the coder and decoder buffer is equal to the buffer size divided by the channel bit rate. For example, an MPEG-2 coder operating at 6 Mbit/s with a buffer size of 1.8 Mbits would have a total delay through the coder and decoder buffers of around 300 ms. Reducing the buffer size will reduce the delay, but may affect picture quality if the buffer becomes too small to accommodate the variation in bit rate from the coder VLC.
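
The delay figure follows directly from the buffer model, and the feedback can be caricatured in Python (the control law below is invented for illustration; practical rate-control algorithms are considerably more elaborate):

    def buffer_delay_ms(buffer_bits, channel_bit_rate):
        # Total delay through the coder and decoder buffers.
        return 1000.0 * buffer_bits / channel_bit_rate

    print(buffer_delay_ms(1.8e6, 6e6))  # 300.0 ms, as in the example above

    def adjust_quantiser(scale, fullness, buffer_size, gain=0.5):
        # Quantise more coarsely as the buffer fills and more finely as
        # it empties, steering fullness towards the half-full point.
        target = 0.5 * buffer_size
        return max(1.0, scale * (1.0 + gain * (fullness - target) / target))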

Profiles and levels

MPEG-2 video is an extension of MPEG-1 video. MPEG-1 was targeted at coding progressively scanned video at bit rates up to about 1.5 Mbit/s. MPEG-2 provides extra algorithmic 'tools' for efficiently coding interlaced video and supports a wide range of bit rates. MPEG-2 also provides tools for 'scalable' coding where useful video can be reconstructed from pieces of the total bitstream. The total bitstream may be structured in layers, starting with a base layer (that can be decoded by itself) and adding refinement layers to reduce quantisation distortion or improve resolution.

A small number of subsets of the complete MPEG-2 tool kit have been defined, known as profiles and levels. A profile is a subset of algorithmic tools and a level identifies a set of constraints on parameter values (such as picture size or bit rate). The profiles and levels defined to date fit together such that a higher profile or level is a superset of a lower one. A decoder which supports a particular profile and level is only required to support the corresponding subset of algorithmic tools and set of parameter constraints.

Details of non-scalable profiles: Two non-scalable profiles are defined by the MPEG-2 specification.

The simple profile uses no B-pictures, and hence no backward or interpolated prediction. Consequently, no picture reordering is required (picture reordering would add about 120 ms to the coding delay). With a small coder buffer, this profile is suitable for low-delay applications such as video conferencing where the overall delay is around 100 ms. Coding is performed on a 4:2:0 video signal.

The main profile adds support for B-pictures and is the most widely used profile. Using B-pictures increases the picture quality, but adds about 120 ms to the coding delay to allow for the picture reordering. Main profile decoders will also decode MPEG-1 video. Currently, most MPEG-2 video decoder chip-sets support the main profile at main level.

Details of scalable profiles: The SNR profile adds support for enhancement layers of DCT coefficient refinement, using the 'signal-to-noise ratio (SNR) scalability' tool. Fig. 4 shows an example SNR-scalable coder and decoder.

[Fig. 4]

[key]

Fig. 4 - (a) SNR-scalable video coder; (b) SNR-scalable video decoder.

The codec operates in a similar manner to the non-scalable codec shown in Fig. 2, with the addition of an extra quantisation stage. The coder quantises the DCT coefficients to a given accuracy, variable-length codes them and transmits them as the lower-level or 'base-layer' bitstream. The quantisation error introduced by the first quantiser is itself quantised, variable-length coded and transmitted as the upper-level or 'enhancement-layer' bitstream. Side information required by the decoder, such as motion vectors, is transmitted only in the base layer.

The base-layer bitstream can be decoded in the same way as the non-scalable case shown in Fig. 2(b). To decode the combined base and enhancement layers, both layers must be received, as shown in Fig. 4(b). The enhancement-layer coefficient refinements are added to the base-layer coefficient values following inverse quantisation. The resulting coefficients are then decoded in the same way as the non-scalable case.
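
The two quantisation stages and their recombination can be sketched in Python (the step sizes below are illustrative):

    import numpy as np

    def snr_encode(coeffs, q_base=8.0, q_enh=2.0):
        base = np.round(coeffs / q_base)        # base-layer levels
        error = coeffs - base * q_base          # base quantisation error
        enhancement = np.round(error / q_enh)   # refinement levels
        return base, enhancement

    def snr_decode(base, enhancement=None, q_base=8.0, q_enh=2.0):
        coeffs = base * q_base                  # base layer alone
        if enhancement is not None:
            # Refinements are added after inverse quantisation.
            coeffs = coeffs + enhancement * q_enh
        return coeffs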

The SNR profile is suggested for digital terrestrial television as a way of providing graceful degradation.

The spatial profile adds support for enhancement layers carrying the coded image at different resolutions, using the 'spatial scalability' tool. Fig. 5 shows an example spatial-scalable coder and decoder.

[Fig. 5]

[key]

Fig. 5 - (a) Spatial-scalable video coder; (b) spatial-scalable video decoder.

Spatial scalability is characterised by the use of decoded pictures from a lower layer as a prediction in a higher layer. If the higher layer is carrying the image at a higher resolution, then the decoded pictures from the lower layer must be sample rate converted to the higher resolution by means of an 'up-converter'.

In the coder shown in Fig. 5(a), two coder loops operate at different picture resolutions to produce the base and enhancement layers. The base-layer coder produces a bitstream which may be decoded in the same way as the non-scalable case. The enhancement-layer coder is offered the 'up-converted' locally-decoded pictures from the base layer as a prediction for the upper-layer block. This prediction is in addition to the prediction from the upper-layer's motion-compensated predictor. The adaptive weighting function, W in Fig. 5(a), selects between the prediction from the upper and lower layers.

As with SNR scalability, the lower-layer bitstream can be decoded in the same way as the non-scalable case. To decode the combined lower and upper layers, both layers must be received, as shown in Fig. 5(b). The lower layer is decoded first and the 'up-converted' decoded pictures offered to the upper-layer decoder for possible use as a prediction. The upper-layer decoder selects between its own motion-compensated prediction and the 'up-converted' prediction from the lower layer, using a value for the weighting function, W, transmitted in the upper-layer bitstream.
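
In outline, and with a crude nearest-neighbour up-converter standing in for a proper interpolation filter, the prediction path might look like this in Python:

    import numpy as np

    def up_convert(picture):
        # 2x up-conversion by sample repetition; a real up-converter
        # would use an interpolating filter.
        return picture.repeat(2, axis=0).repeat(2, axis=1)

    def upper_layer_prediction(own_mc_prediction, base_decoded, w):
        # Weighted selection, W, between the upper layer's own
        # motion-compensated prediction and the up-converted picture
        # decoded from the lower layer; w is carried in the upper-layer
        # bitstream.
        return w * up_convert(base_decoded) + (1.0 - w) * own_mc_prediction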

The spatial profile is suggested as a way to broadcast a high-definition TV service together with a main-profile-compatible standard-definition service.

The high profile adds support for coding a 4:2:2 video signal and includes the scalability tools of the SNR and spatial profiles.

Details of levels: MPEG-2 defines four levels of coding parameter constraints. Table 2 shows the constraints on picture size, frame rate, bit rate and buffer size for each of the defined levels. Note that the constraints are upper limits and that the codecs may be operated below these limits (e.g. a high-1440 decoder will decode a 720 pixels by 576 lines picture).

Table 2: MPEG-2 levels: picture size, frame rate, bit rate and buffer size constraints.

Level     | Max. frame width (pixels) | Max. frame height (lines) | Max. frame rate (Hz) | Max. bit rate (Mbit/s) | Buffer size (bits)
----------|---------------------------|---------------------------|----------------------|------------------------|-------------------
Low       | 352                       | 288                       | 30                   | 4                      | 475136
Main      | 720                       | 576                       | 30                   | 15                     | 1835008
High-1440 | 1440                      | 1152                      | 60                   | 60                     | 7340032
High      | 1920                      | 1152                      | 60                   | 80                     | 9781248

In broadcasting terms, standard-definition TV requires main level and high-definition TV requires high-1440 level. The bit rate required to achieve a particular level of picture quality approximately scales with resolution.

5. CONCLUDING COMMENTS

MPEG-2 has been very successful in defining a specification to serve a range of applications, bit rates, qualities and services.

Currently, the major interest is in the main profile at main level (MP@ML) for applications such as digital television broadcasting (terrestrial, satellite and cable), video-on-demand services and desktop video systems. Several manufacturers have announced MP@ML single-chip decoders and multichip encoders. Prototype equipment supporting the SNR and spatial profiles has also been constructed for use in broadcasting field trials.

The specification only defines the bitstream syntax and decoding process. Generally, this means that any decoders which conform to the specification should produce near-identical output pictures. However, decoders may differ in how they respond to errors introduced in the transmission channel. For example, an advanced decoder might attempt to conceal faults in the decoded picture if it detects errors in the bitstream.

For a coder to conform to the specification, it only has to produce a valid bitstream. This condition alone has no bearing on the picture quality through the codec, and there is likely to be a variation in coding performance between different coder designs. For example, the coding performance may vary depending on the quality of the motion-vector measurement, the techniques for controlling the bit rate, the methods used to choose between the different prediction modes, the degree of picture preprocessing and the way in which the quantiser is adapted according to the picture content.

The picture quality through an MPEG-2 codec depends on the complexity and predictability of the source pictures. Real-time coders and decoders have demonstrated generally good quality standard-definition pictures at bit rates around 6 Mbit/s. As experience of MPEG-2 coding increases, the same picture quality may be achievable at lower bit rates.

6. ACKNOWLEDGEMENTS

The author would like to thank the BBC for permission to publish this paper.

7. REFERENCES

  1. ISO/IEC 11172: 'Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s'.
  2. ISO/IEC 13818: 'Generic coding of moving pictures and associated audio (MPEG-2)'.
  3. 'Encoding parameters of digital television for studios', CCIR Recommendation 601-1, XVIth Plenary Assembly, Dubrovnik, 1986, Vol. XI, Part 1, pp. 319-328.
  4. JAIN, A.K.: 'Fundamentals of digital image processing' (Prentice Hall, 1989).
  5. WELLS, N.D.: 'Component codec standard for high-quality digital television', Electronics & Communication Engineering Journal, August 1992, 4, (4), pp. 195-202.
  6. CARR, M.D.: 'New video coding standard for the 1990s', Electronics & Communication Engineering Journal, June 1990, 2, (3), pp. 119-124.
  7. RAO, K.R. and YIP, P.: 'Discrete cosine transform: algorithms, advantages, applications' (Academic Press, 1990).