ITU - DTTB Tutorial - Video & Audio Source Coding Pt4

3.8 ISO/MPEG-2 Audio System

3.8.1 ISO/MPEG Audio: Generic Audio Coding for High Quality Stereo

ISO/MPEG Phase 1

The International Organization for Standardization (ISO), a world-wide federation of national standards bodies, developed and prepared a standard on information technology: Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s. The "Audio Subgroup" of MPEG was responsible for developing a standard for coding PCM audio signals with sampling rates of 32, 44.1 and 48 kHz at bit rates in the range of 32 to 192 kbit/s per mono channel and 64 to 384 kbit/s per stereo signal.

Different layers of the coding system, with increasing encoder and decoder complexity and performance, are described in the audio part of the ISO Standard 11172. The idea of this three-layer concept was to have a universal coding scheme for many applications with totally different requirements, such as consumer recording, professional recording, combined recording and processing of audio and video, telecommunication and broadcasting.

Two mechanisms can be used to reduce the bit-rate of audio signals. The first removes the redundancy of the audio signal by exploiting statistical correlation. The second, used in more recent codecs in addition to redundancy removal, reduces the irrelevancy of the audio signal by taking psycho-acoustic phenomena such as spectral and temporal masking into account. Only by combining both techniques, i.e. exploiting the statistical correlation and the masking effects of the human ear, can the bit-rate be reduced significantly, down to 200 kbit/s per stereophonic signal and below.
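As a rough illustration of the scale of this reduction, the sketch below compares the bit-rate of uncompressed stereo PCM with a coded bit-rate. The 16-bit word length is an assumption taken from the Compact Disc/DAT comparison made later in this section, and the function names are illustrative only.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 2) -> float:
    """Bit-rate of uncompressed linear PCM in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


def compression_ratio(sample_rate_hz: int, coded_kbps: float) -> float:
    """Ratio of 16-bit stereo PCM bit-rate (assumed word length) to a coded bit-rate."""
    return pcm_bitrate_kbps(sample_rate_hz) / coded_kbps


if __name__ == "__main__":
    # Sampling rates named in the text; 200 kbit/s is the stereo figure quoted above.
    for fs in (32000, 44100, 48000):
        print(fs, pcm_bitrate_kbps(fs), round(compression_ratio(fs, 200.0), 2))
```

Under these assumptions, 16-bit stereo PCM at 48 kHz runs at 1536 kbit/s, so a coded rate of 192 kbit/s corresponds to a ratio of 8:1.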

The basic structure of a perceptual audio coding scheme is characterized by the following modules:

A time/frequency mapping (filter bank) is used to decompose the input signal into sub-sampled spectral components.
The output of this filter bank, or the output of a parallel transform, is used to calculate an estimate of the actual (time-dependent) masking threshold using rules known from psycho-acoustics.
The sub-band samples are quantised and coded with the aim of keeping the noise, which is introduced by quantising, below the masking threshold. Depending on the algorithm, this step is done in very different ways. The complexity varies from block companding to analysis-by-synthesis systems using additional noiseless compression.
A frame packing is used to assemble the bit stream, which typically consists of the quantised and coded mapped samples and some side information, e.g. bit allocation information.
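The quantisation step in the list above can be sketched as follows. This is a simplified illustration, not the bit allocation procedure of the standard: it assumes a uniform quantiser whose signal-to-noise ratio improves by about 6.02 dB per bit, and the per-band levels in the example are made-up values.

```python
import math

DB_PER_BIT = 6.02  # assumed SNR gain per bit of a uniform quantiser


def bits_needed(signal_db: float, mask_db: float, max_bits: int = 15) -> int:
    """Smallest word length that keeps quantising noise at or below the mask."""
    signal_to_mask_db = signal_db - mask_db
    if signal_to_mask_db <= 0:
        return 0  # the band is fully masked, so nothing is transmitted
    return min(max_bits, math.ceil(signal_to_mask_db / DB_PER_BIT))


def allocate_bits(signal_db, mask_db):
    """Per-band word lengths for parallel lists of signal and mask levels (dB)."""
    return [bits_needed(s, m) for s, m in zip(signal_db, mask_db)]


if __name__ == "__main__":
    # Hypothetical levels for three sub-bands.
    print(allocate_bits([60.0, 40.0, 20.0], [30.0, 35.0, 25.0]))
```

A band whose signal lies below the masking threshold receives no bits at all, which is where the irrelevancy reduction comes from.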

Depending on whether the emphasis is on low frequency resolution together with high time resolution, or on high frequency resolution, which allows only limited time resolution, the systems are usually called sub-band coders or transform coders respectively.

The calculation of the masking threshold in the encoder is based on the results of masking threshold measurements for narrow-band signals, covering tone masking noise and vice versa. Very specific masker/test-tone relations are described in the literature, and the worst-case results for the upper and lower slopes of the masking curves have been adopted on the assumption that the same masking thresholds can be used for both simple and complex audio situations.

Techniques for reducing the bit-rate of audio signals have long been known, such as sub-band and transform coding, often combined with ADPCM, vector quantising, variable length coding, and pre- and post-processing. The basic problem was the design of an optimal analysis/synthesis strategy that satisfies two sometimes conflicting requirements, high frequency resolution and high time resolution, combined with a low-complexity implementation, which is needed above all for consumer applications. It is plausible that there is no simple way of modelling the auditory system. Two conventional coding methods, namely sub-band and transform coding, have been combined in the ISO/MPEG-Audio Standard 11172-3. Each of the three layers uses both a polyphase filter bank of 32 sub-bands, which gives very good time resolution for coding the audio signal, and, in parallel with this filter bank, a transform, which gives the frequency resolution required for calculating the spectral masking thresholds. Layer III uses a hybrid structure consisting of a polyphase filter bank followed in series by an MDCT (Modified Discrete Cosine Transform).

Generic Coding Concept

In view of the many totally different applications, a generic coding system was envisaged. Depending on the application, three layers of the coding system with increasing complexity and performance can be used. Thanks to the scaling technique used, the ISO/MPEG-Audio coding technique can handle a much higher dynamic range than Compact Disc or DAT, i.e. conventional 16 bit PCM.
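For comparison, the dynamic range of linear PCM can be estimated from its word length. The sketch below uses the common approximation of 20·log10(2^N) dB for an N-bit quantiser. The scale factor figures (63 values spaced about 2 dB apart) are an assumption about how Layers I and II are usually described and are not stated in this tutorial.

```python
import math


def pcm_dynamic_range_db(bits: int) -> float:
    """Approximate dynamic range of N-bit linear PCM: 20*log10(2**N) dB."""
    return 20.0 * math.log10(2 ** bits)


def scale_factor_range_db(num_values: int = 63, step_db: float = 2.0) -> float:
    """Range spanned by a scale factor table (assumed size and step, see lead-in)."""
    return (num_values - 1) * step_db


if __name__ == "__main__":
    print(round(pcm_dynamic_range_db(16), 2))  # conventional 16 bit PCM
    print(scale_factor_range_db())             # assumed scale factor table
```

Under these assumptions, the scale factors alone span a wider range than 16 bit PCM, which is consistent with the claim above.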

In all three layers the input PCM audio signal is converted from the time domain into the frequency domain. This is done by a polyphase filter bank consisting of 32 sub-bands.
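The resulting sub-band layout can be sketched as below, assuming 32 equal-width bands spanning 0 Hz to half the sampling rate and critical sampling (each band decimated by 32). Both assumptions are common descriptions of a filter bank of this kind rather than statements from this tutorial.

```python
NUM_SUB_BANDS = 32


def sub_band_width_hz(sample_rate_hz: float) -> float:
    """Width of each band when 0..fs/2 is split into equal bands (assumed)."""
    return sample_rate_hz / 2 / NUM_SUB_BANDS


def sub_band_edges_hz(sample_rate_hz: float, index: int) -> tuple:
    """Lower and upper edge of sub-band `index` (0 = lowest band)."""
    if not 0 <= index < NUM_SUB_BANDS:
        raise ValueError("sub-band index out of range")
    width = sub_band_width_hz(sample_rate_hz)
    return (index * width, (index + 1) * width)


def sub_band_sample_rate_hz(sample_rate_hz: float) -> float:
    """Sample rate inside each band under critical sampling (assumed)."""
    return sample_rate_hz / NUM_SUB_BANDS


if __name__ == "__main__":
    for fs in (32000, 44100, 48000):
        print(fs, sub_band_width_hz(fs), sub_band_sample_rate_hz(fs))
```

At 48 kHz this gives bands 750 Hz wide, each carrying 1500 samples per second.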

In Layers I and II a filter bank creates 32 sub-band representations of the input audio stream, which are then quantised and coded under the control of a psycho-acoustic model from which a blockwise adaptive bit allocation is derived. The encoder and decoder of both layers are shown in Fig. 25.

Layer I is a simplified version of the MPEG-1 coding scheme, most appropriate for consumer applications, such as digital home recording on tapes, Winchester Discs or on Magneto-Optical Discs, i.e. for those applications for which very low data-rates are not mandatory.

Layer II introduces further compression with respect to Layer I by removing redundancy and irrelevancy from the scale factors, and it uses more precise quantisation. Layer II has numerous applications in both consumer and professional audio, such as audio broadcasting, television, recording, telecommunication and multimedia.

figure 20

ISO/IEC 11172-3 Layer I and Layer II encoder and decoder

3.8.2 ISO/MPEG-2 Audio: Generic 5 + 1 Multichannel Audio Coding

The standardization phase of ISO/IEC MPEG-2 Audio is characterized by the consistent extension from two to five audio channels while providing backward compatibility with MPEG-1. The main aspects are high quality of five (+1) audio channels, low bit rate and backward compatibility, which is the key to ensuring that existing 2-channel decoders will still be able to decode compatible stereo information from five (+1) multichannel signals.

Backward compatibility with two-channel stereo was a very strong requirement for the MPEG-2 multichannel surround system. Because the proposed MPEG multichannel audio coding standard is backward compatible, a two-channel decoder will deliver a correct basic stereo signal consisting of a down-mix of the original five-channel source.

3.8.3 Backward/Forward Compatibility

For several applications, the intention is to improve the existing 2/0-stereo sound system step by step by transmitting additional sound channels (centre, surround) without resorting to simulcast operation. The multichannel sound decoder therefore has to be backward/forward compatible with the existing sound format.

Backward compatibility means that the existing two-channel (low-price) decoder should properly decode the basic 2/0-stereo information from the multichannel bit stream. This implies the provision of compatibility matrices using adequate down-mix coefficients. The principles of backward compatibility of ISO/IEC 13818-3 with ISO/IEC 11172-3 are illustrated in Fig. 27.
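A compatibility matrix of this kind can be sketched as follows. The down-mix coefficients used here (1/sqrt(2) for both the centre and the surround channels) are an illustrative assumption. This tutorial does not state the values, and other coefficient sets are possible.

```python
import math

# Assumed down-mix coefficients, for illustration only.
CENTRE_GAIN = 1 / math.sqrt(2)
SURROUND_GAIN = 1 / math.sqrt(2)


def compatible_stereo(left, right, centre, left_surround, right_surround):
    """Down-mix one five-channel sample to a basic 2/0-stereo pair (Lo, Ro)."""
    lo = left + CENTRE_GAIN * centre + SURROUND_GAIN * left_surround
    ro = right + CENTRE_GAIN * centre + SURROUND_GAIN * right_surround
    return lo, ro


if __name__ == "__main__":
    # A centre-only source appears equally in both compatible channels.
    print(compatible_stereo(0.0, 0.0, 1.0, 0.0, 0.0))
```

A two-channel decoder that plays only Lo and Ro therefore still reproduces the centre and surround content, at reduced level, in the basic stereo signal.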

Forward compatibility means that a future multichannel decoder should be able to decode properly the basic 2/0-stereo bit stream.

figure 21

Principle of backwards compatibility of MPEG-2 Audio with ISO/IEC 11172-3

On the other hand, there will be other applications which do not require backward/forward compatibility with existing 2/0-stereo sound formats. In these cases the compatibility requirement may not be appropriate, because the coding constraints that compatibility matrixing can impose could then be avoided. To ensure maximum coding efficiency and minimum complexity for the different application areas, it seemed advantageous to realise both strategies in a universal codec. This is possible by switching the compatibility matrix on or off. In other words, the multichannel sound codec can be used either in a mode where the basic stereo information consists of a left and right channel that constitute an appropriate down-mix of the audio information from all source channels, or optionally in a mode where the basic stereo information consists only of the left and right channels of the multichannel sound configuration.
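The two modes can be sketched as a switchable matrix with a matching inverse in the multichannel decoder. The assignment of the centre and surround channels to the transmission channels T2, T3 and T4, and the coefficient value, are assumptions made for illustration only.

```python
import math

GAIN = 1 / math.sqrt(2)  # assumed down-mix coefficient, illustration only


def encode(l, r, c, ls, rs, matrix_on):
    """Return transmission channels (T0, T1, T2, T3, T4) for one sample."""
    if matrix_on:
        t0 = l + GAIN * c + GAIN * ls  # compatible left (down-mix)
        t1 = r + GAIN * c + GAIN * rs  # compatible right (down-mix)
    else:
        t0, t1 = l, r                  # plain left/right of the multichannel set
    return (t0, t1, c, ls, rs)


def decode_stereo(t):
    """A two-channel decoder sees only the basic stereo pair."""
    return t[0], t[1]


def decode_multichannel(t, matrix_on):
    """Recover (L, R, C, Ls, Rs); undo the matrix only if it was applied."""
    t0, t1, c, ls, rs = t
    if matrix_on:
        l = t0 - GAIN * c - GAIN * ls
        r = t1 - GAIN * c - GAIN * rs
    else:
        l, r = t0, t1
    return (l, r, c, ls, rs)


if __name__ == "__main__":
    source = (0.3, -0.2, 0.5, 0.1, -0.4)
    for on in (True, False):
        print(on, decode_stereo(encode(*source, on)))
```

With the matrix switched off, the encoder and decoder skip the matrixing arithmetic entirely, at the cost of a basic stereo pair that no longer contains the centre and surround content.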

3.8.4 Second Stereo Programme

Alternatively, the multichannel extension part of the bit-stream can be configured to provide two stereo programmes, the first of which can be decoded by a decoder according to ISO/IEC 11172-3. The two stereo programmes are coded independently of each other, but the joint stereo coding technique can be applied to each of them. In this case too the compatibility matrix is not used, and the auxiliary signals T3 and T4 form the second stereo programme L2, R2.

figure 22

Ancillary data field of the ISO/IEC 11172-3 Layer II frame carrying multichannel extension information

3.8.5 Compatibility with ISO 11172-3

The MPEG-2 multichannel coding standard provides full backward/forward compatibility with the ISO Audio Coding Standard 11172-3. This is realised by exploiting the ancillary data field of the ISO 11172-3 audio frame for the provision of additional channels (see Fig. 28). The "variable length" of the ancillary data field makes it possible to carry the complete multichannel extension information. A standard two-channel MPEG-1 Audio decoder according to ISO 11172-3 simply ignores this part of the ancillary data field.

Configurability with respect to the sound channel allocation and to the bit-rate offers useful combinations of various levels of multichannel stereo performance and various numbers of channels in the composite and independent coding mode.

3.8.6 Perceptual Coding Strategies for Multichannel Audio

If composite coding methods are used for an audio programme consisting of more than one channel, the bit-rate required does not increase proportionally with the number of channels. For multichannel audio, the composite coding technique is very efficient because there is a lot of correlation, both in the signal itself and in the binaural perception of such a signal. The following effects may be used:

A certain portion of the stereophonic signals does not contribute to the localization of sound sources. This portion may be reproduced via any loudspeaker (reduction of channel separation, called dynamic crosstalk).
Certain stereophonic signals contain inter-channel coherent portions, which in principle could be transmitted via one channel instead of two (reduction of redundancy).
The processing capacity of the auditory system is limited to a certain degree. It is not able to perceive certain details of individual sound channels in a multichannel presentation (exploitation of "inter-channel masking" by the common masking threshold).
The bit-rate per channel required for perceptual coding depends on the signal. It varies dynamically in the range of about 100 kbit/s /15/. Since the individual dynamic bit-rates of the centre and surround signals may not be fully correlated (or may even be uncorrelated), the peaks of the overall bit-rate may be smoothed. This common bit pool ("bit exchange") is particularly efficient in the independent coding mode.
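The smoothing effect of a common bit pool, the last item in the list above, can be illustrated with a small calculation. The per-frame bit-rate demands below are invented for the example. The point is only that the peak of the summed demand is never greater than the sum of the individual peaks, and is smaller when the peaks do not coincide.

```python
def separate_budget(demands):
    """Capacity needed if every channel reserves its own peak demand."""
    return sum(max(channel) for channel in demands)


def shared_budget(demands):
    """Capacity needed if all channels draw on one common bit pool."""
    return max(sum(frame) for frame in zip(*demands))


# Hypothetical demand in kbit/s over four consecutive frames.
demands = [
    [120, 60, 50, 70],   # centre
    [40, 110, 45, 50],   # left surround
    [45, 50, 100, 55],   # right surround
]

if __name__ == "__main__":
    print(separate_budget(demands), shared_budget(demands))
```

In this made-up case the separate reservations add up to 330 kbit/s, whereas the common pool never needs more than 220 kbit/s, because the three channels peak in different frames.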

In the composite coding mode the irrelevant and redundant portions of the stereophonic signals are eliminated as consistently as possible.
