3.7 Audio Compression and Coding
3.7.1 Introduction
The Compact Disc has made digital audio popular. Its 16-bit PCM format is an accepted audio representation standard, although its bit rate of 706 kbit/s per monophonic channel (16 bits at 44.1 kHz) is rather high. In audio production, resolutions of up to 24-bit PCM are in use. EBU/AES interface specifications allow for a resolution of 16 to 24 bits and a sampling frequency of 32, 44.1 or 48 kHz. Lower bit rates are mandatory if audio signals are to be transmitted over channels of limited capacity or stored on media of limited capacity. Earlier proposals to reduce the PCM rates followed those for speech coding. However, audio and speech signals differ in many respects: audio coding implies higher sampling rates, amplitude resolution and dynamic range, larger variations in power density spectra, differences in human perception, and higher listener expectations of quality. Unlike speech, audio also has to be handled in stereo and multichannel presentations.
New coding techniques for high-quality audio signals use the properties of human sound perception by exploiting the spectral and temporal masking effects of the ear. The quality of the reproduced sound must be as good as that obtained by 16-bit PCM at a 44.1 or 48 kHz sampling rate. The optimum is achieved when, at a minimum bit rate and with reasonable codec complexity, there is no perceptible difference between the original sound and the decoded reproduction. Source coding systems have been shown to allow a bit-rate reduction from 768 kbit/s (16 bits at 48 kHz) down to about 100 kbit/s per monophonic channel, while preserving the subjective quality of the digital studio signal for critical signals. This high coding gain is possible because the quantising noise is adapted to the masking thresholds and only those details of the signal that the listener will perceive are transmitted.
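The bit rates quoted above follow directly from the product of resolution and sampling frequency. The short sketch below reproduces that arithmetic; the function name `pcm_bit_rate` is purely illustrative, and the 100 kbit/s coded rate is the approximate figure from the text, not a measured value.

```python
def pcm_bit_rate(bits_per_sample: int, sampling_rate_hz: int, channels: int = 1) -> int:
    """Return the linear PCM bit rate in bit/s."""
    return bits_per_sample * sampling_rate_hz * channels


cd_mono = pcm_bit_rate(16, 44_100)       # Compact Disc, one channel
studio_mono = pcm_bit_rate(16, 48_000)   # 48 kHz studio signal, one channel
coded_mono = 100_000                     # approximate coded rate quoted in the text

print(f"CD mono:      {cd_mono / 1000:.1f} kbit/s")
print(f"Studio mono:  {studio_mono / 1000:.1f} kbit/s")
print(f"Reduction:    about {studio_mono / coded_mono:.1f} : 1")
```

With these figures the reduction is a little under 8:1 per channel.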
ITU-R Recommendation BS.1115 addresses two-channel low bit-rate audio coding for digital sound broadcasting applications. For emission applications, ISO/IEC 11172-3 (MPEG-1) Layer II is recommended at 128 kbit/s for a single channel and at 256 kbit/s for a two-channel configuration. For contribution and distribution links, ITU-R recommends the use of MPEG-1 Layer II at data rates of 180 kbit/s per channel, or 120 kbit/s per channel if no further cascading is used.
Multichannel audio is of interest in DTTB. At present, multichannel audio is known primarily from the cinema. But even in consumer applications, multichannel has been used for the last few years, e.g. Dolby-Surround with home-TV and VCRs. With the introduction of Advanced or High Definition Television (ADTV, HDTV) with its improved resolution and increased picture size, giving an impression similar to a cinema, an improved audio performance is desired. A way to achieve an improved realism is to use more than two audio channels. Subjective assessments [1] indicate that the switch from mono (1/0) to stereo (2/0) is equivalent to one grade improvement on the ITU-R 5-point quality grading scale; from stereo (2/0) to three channel (3/0) an additional one grade of improvement; from three channel (3/0) to surround sound (3/2) an additional one half grade improvement.
ITU-R Recommendation BS.775-1, "Multichannel stereophonic sound system with and without accompanying picture", specifies the use of the 3/2 multichannel audio system (left, centre, right; left surround, right surround). The advantage of this system is a large listening area; the disadvantage is the need for a higher transmission bit rate. ITU-R Recommendation BS.1196 recommends that DTTB systems use for audio coding either the International Standard specified in ISO/IEC IS 13818-3 or the North American Standard specified in ATSC A/52. The coding systems recommended by the ITU-R make available an economical way of storing or transmitting multichannel audio. Besides ADTV and HDTV, many multimedia applications, which are becoming increasingly popular with consumers, will introduce multichannel audio if the data rates can be handled economically.
3.7.2 Characteristics of a DTTB Audio System
A suitable sound system for television broadcasting should meet several basic requirements and provide a number of technical/operational features.
3/2-Stereo Presentation
As regards stereophonic presentation, ITU-R Rec. BS.775 identifies a centre channel C and two surround channels Ls, Rs, in addition to the basic left and right stereo channels L, R, as the reference sound format. This format is referred to as "3/2-stereo" (3 front/2 surround channels) and is shown in Fig. 17. It requires handling of five channels in the studio, storage media, contribution, distribution, emission links, and in the home.
Figure 17
3/2-stereo reference loudspeaker arrangement
For sound applications with picture accompanying the sound, the three front channels ensure sufficient directional stability and clarity of the picture related frontal images, according to the common practice in the cinema. The 3/2-stereo format has also been found to be the optimum compromise for audio-only applications and an improvement of two-channel stereophony. The addition of one pair of surround channels to the three front channels allows improved realism of auditory ambiance.
Low Frequency Enhancement Channel
According to ITU-R Rec. BS.775, the 3/2-stereo sound format should provide one optional low frequency enhancement (LFE) channel in addition to the full-range main channels, with the LFE channel being capable of carrying signals in the frequency range 20 Hz to 120 Hz. The purpose of this channel is to enable listeners who so choose to extend the low frequency content of the programme in terms of both frequency and level. In this respect it is the same as the subwoofer channel used in the digital film sound format, which ensures optimum compatibility with film sound material.
Downward Compatibility
A hierarchy of sound formats providing a lower number of channels and reduced stereophonic presentation performance (down to 2/0-stereo or even mono) and a corresponding set of downward mixing equations are recommended by ITU-R Rec. BS.775-1 to provide downward compatibility. The hierarchy and recommended coefficients for the 3/2 configuration are shown in Fig. 18. Useful alternative lower level sound formats are 3/1, 3/0, 2/2, 2/0 and 1/0. These may be used in circumstances where economic or channel capacity constraints apply in the transmission link or where only a lower number of reproduction channels is desired.
Figure 18
Down mix from 3/2 down to 1/0 for a future multichannel audio system
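The downward mixing equations can be sketched in code as a weighted sum of the five main channels. The coefficients below (0.7071 for the centre and surround contributions to stereo, and 0.7071/1.0/0.5 for mono) are the values commonly quoted for BS.775 and are an assumption here; Fig. 18 remains the authoritative source. The function names are illustrative only, and the LFE channel is left out of the downmix.

```python
import math

A = 1 / math.sqrt(2)  # about 0.7071 (-3 dB); assumed coefficient, see Fig. 18


def downmix_to_stereo(l, r, c, ls, rs):
    """3/2 -> 2/0: fold the centre and surround channels into left and right."""
    lo = l + A * c + A * ls
    ro = r + A * c + A * rs
    return lo, ro


def downmix_to_mono(l, r, c, ls, rs):
    """3/2 -> 1/0: single channel carrying all five main channels (assumed weights)."""
    return A * l + A * r + c + 0.5 * ls + 0.5 * rs


# One sample per channel, in the order L, R, C, Ls, Rs.
sample = (0.5, -0.25, 0.1, 0.05, 0.0)
print("2/0:", downmix_to_stereo(*sample))
print("1/0:", downmix_to_mono(*sample))
```

In practice the downmixed signals may need to be scaled to avoid overload, since the weighted sum can exceed the range of a single input channel.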
Backward Compatibility
Where an existing two-channel DTTB service is extended to multichannel and compatibility with existing two-channel receivers is required, ITU-R Rec. BS.775 identifies two ways in which this backward compatibility could be realised. The multichannel service may be provided simultaneously with the two-channel service (simulcasting operation). Alternatively, the transmitted left and right channels convey compatible signals, downmixed (matrixed) from the multichannel signals. In addition to these stereo channels, further channels can be transmitted that carry appropriate signals, allowing the original multichannel set of signals to be retrieved by dematrixing. The advantage of the latter method is that less additional data capacity is required to add the multichannel service.
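The matrixing approach can be illustrated with a minimal round-trip sketch. It assumes a downmix coefficient of 0.7071 and assumes that C, Ls and Rs are the additional transmitted channels; both choices are illustrative, and a real system may select different coefficients or additional channels. The functions `matrix` and `dematrix` are hypothetical names.

```python
import math

A = 1 / math.sqrt(2)  # assumed matrixing coefficient (about 0.7071)


def matrix(l, r, c, ls, rs):
    """Encoder side: compatible stereo pair (Lo, Ro) plus three additional channels."""
    lo = l + A * c + A * ls
    ro = r + A * c + A * rs
    return lo, ro, c, ls, rs


def dematrix(lo, ro, c, ls, rs):
    """Multichannel decoder side: recover L and R from the compatible pair."""
    l = lo - A * c - A * ls
    r = ro - A * c - A * rs
    return l, r, c, ls, rs


original = (0.5, -0.25, 0.1, 0.05, -0.02)   # L, R, C, Ls, Rs
transmitted = matrix(*original)

# A two-channel receiver simply plays the first two transmitted channels.
stereo_for_legacy_receiver = transmitted[:2]

# A multichannel receiver dematrixes all five.
recovered = dematrix(*transmitted)
print(stereo_for_legacy_receiver)
print(recovered)
```

This sketch ignores coding: once the transmitted channels have been quantised, the dematrixed L and R contain quantising noise from several channels, which a practical codec has to take into account.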
Associated Services and Configurability
In addition to the main multichannel service, associated services may be required.
In some areas multilingual services may be of benefit. This can be accomplished in various ways. For example complete multichannel mixes can be transmitted for each language. Alternatively, an individual dialogue channel for each language may be transmitted in addition to a common multichannel music and effects mix.
Additional sound services may include those for the hearing impaired and for the visually impaired. For the hearing impaired, a clean dialogue channel (i.e. no music/effects) is advantageous. For the visually impaired, a descriptive channel would be needed.
Optimum exploitation of the available bit rate for multichannel stereo performance and sound quality on the one hand, and for bilingual programmes or associated services on the other, depends on the application, the type of programme, etc. For this reason, a number of alternative sound channel/service/quality level configurations are beneficial.
3.7.3 Overview of the DTTB Audio System
As illustrated in Fig. 19, the DTTB audio subsystem comprises the audio encoding/decoding function and resides between the audio inputs/outputs and the transport subsystem. The audio encoder(s) is (are) responsible for generating the audio elementary stream(s), which are encoded representations of the baseband audio input signals. The flexibility of the transport system allows multiple audio elementary streams to be delivered to the receiver. At the receiver, the transport subsystem is responsible for selecting which audio stream(s) to deliver to the audio subsystem. The audio subsystem is responsible for decoding the audio elementary stream(s) back into baseband audio.
Figure 19
Audio subsystem within the digital television system
An audio program source is encoded by a digital television audio encoder. The output of the audio encoder is a string of bits that represent the audio source, and is referred to as an audio elementary stream. The transport subsystem packetises the audio data into PES packets which are then further packetised into a transport stream. The transmission subsystem converts the transport packets into a modulated RF signal for transmission to the receiver. At the receiver, the received signal is demodulated by the receiver transmission subsystem. The receiver transport subsystem converts the received transport packets back into an audio elementary stream which is decoded by the digital television audio decoder. The partitioning shown is conceptual, and practical implementations may differ. For example, the transport processing may be broken into two blocks; one to perform PES packetisation, and the second to perform transport packetisation. Or, some of the transport functionality may be included in either the audio coder or the transmission subsystem.
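The two packetisation stages described above can be sketched as a toy round trip. This is a deliberately simplified illustration, not a conformant MPEG-2 Systems implementation: the PES-like header carries only a start-code prefix, a stream identifier and a length; the final transport packet is padded with 0xFF bytes rather than with adaptation-field stuffing; and the PID, stream identifier and frame contents are hypothetical. Only the 188-byte packet size, the 4-byte header and the 0x47 sync byte are taken from the transport stream format.

```python
import struct

TS_PACKET_SIZE = 188                      # transport packet length in bytes
TS_HEADER_SIZE = 4
TS_PAYLOAD_SIZE = TS_PACKET_SIZE - TS_HEADER_SIZE
SYNC_BYTE = 0x47
AUDIO_STREAM_ID = 0xC0                    # assumed stream identifier
AUDIO_PID = 0x101                         # hypothetical packet identifier


def make_pes(frame: bytes) -> bytes:
    """Wrap one audio frame in a simplified PES-like packet."""
    return b"\x00\x00\x01" + struct.pack(">BH", AUDIO_STREAM_ID, len(frame)) + frame


def make_ts_packets(pes: bytes, pid: int, first_cc: int = 0) -> list:
    """Split one PES-like packet into fixed-size transport packets."""
    packets = []
    for i, offset in enumerate(range(0, len(pes), TS_PAYLOAD_SIZE)):
        chunk = pes[offset:offset + TS_PAYLOAD_SIZE]
        start = 1 if offset == 0 else 0            # marks the first packet of a PES
        cc = (first_cc + i) & 0x0F                 # 4-bit continuity counter
        header = bytes([SYNC_BYTE, (start << 6) | ((pid >> 8) & 0x1F),
                        pid & 0xFF, 0x10 | cc])    # 0x10: payload only
        # Simplification: pad the last packet with 0xFF instead of real stuffing.
        packets.append(header + chunk.ljust(TS_PAYLOAD_SIZE, b"\xff"))
    return packets


def recover_frames(packets: list, pid: int) -> list:
    """Receiver side: reassemble PES-like packets and return the audio frames."""
    buffers = []
    for pkt in packets:
        if ((pkt[1] & 0x1F) << 8 | pkt[2]) != pid:
            continue                               # some other elementary stream
        if pkt[1] & 0x40:                          # start indicator: new PES packet
            buffers.append(bytearray())
        buffers[-1] += pkt[TS_HEADER_SIZE:]
    frames = []
    for buf in buffers:
        _stream_id, length = struct.unpack(">BH", buf[3:6])
        frames.append(bytes(buf[6:6 + length]))    # the length field drops the padding
    return frames


frames = [bytes([n]) * 300 for n in range(3)]      # three dummy "audio frames"
packets, cc = [], 0
for frame in frames:
    new = make_ts_packets(make_pes(frame), AUDIO_PID, cc)
    cc += len(new)
    packets.extend(new)

print(len(packets), "transport packets")
print("round trip ok:", recover_frames(packets, AUDIO_PID) == frames)
```

Keeping the PES stage separate from the transport stage mirrors the two-block partitioning mentioned in the text; merging them into one function would correspond to the single-block view.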
Additional audio sources, such as multilingual channels, are incorporated in the main audio elementary stream in ISO/MPEG-2 coding; in AC-3 coding they are conveyed by additional elementary streams.
Audio encoder interface
The audio system accepts baseband audio inputs with channelisation consistent with ITU-R Recommendation BS.775, "Multichannel stereophonic sound system with and without accompanying picture".
Sampling frequency
The system conveys digital audio sampled at a frequency of 48 kHz, locked to the 27 MHz system clock. Sampling frequencies of 44.1 and 32 kHz may also be supported. Auxiliary services at one half of these frequencies may also be supported in the MPEG-2 system.
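Because the audio sampling clock is locked to the 27 MHz system clock, the number of system clock periods per audio sample is an exact rational number for each supported sampling frequency. The sketch below computes these ratios with exact fractions; the function name is illustrative, and the half-rate frequencies are simply one half of the three main rates, as the text describes.

```python
from fractions import Fraction

SYSTEM_CLOCK_HZ = 27_000_000


def clock_periods_per_sample(sampling_rate_hz: int) -> Fraction:
    """Exact number of 27 MHz clock periods in one audio sample period."""
    return Fraction(SYSTEM_CLOCK_HZ, sampling_rate_hz)


main_rates = (48_000, 44_100, 32_000)
half_rates = tuple(fs // 2 for fs in main_rates)   # 24, 22.05 and 16 kHz

for fs in main_rates + half_rates:
    ratio = clock_periods_per_sample(fs)
    print(f"{fs:>6} Hz: {ratio} = {float(ratio):.4f} clock periods per sample")
```

For 48 kHz the ratio is 1125/2, so two audio samples span exactly 1125 periods of the system clock.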
Resolution
In general, input signals should be quantised to at least 16-bit resolution. The audio compression system can convey audio signals with a resolution of more than 16 bits.
3.7.4 Overview and basics of audio compression
A major objective of audio compression is to represent an audio source with as few bits as possible, while preserving the level of quality required for the given application. The challenge in providing a bit-rate reduced sound service is to code the signal in a manner in which the errors that are introduced are inaudible to humans. The ISO/IEC MPEG-2 Layer II and AC-3 systems both use a sub-band representation of the audio signal in order to take advantage of the frequency masking properties of the human hearing system. The frequency spectrum of the audio signal is separated into sub-bands by the use of a sub-band or transform filter bank. This results in a representation of the audio signal by sub-band samples (MPEG-2) and by frequency coefficients (AC-3).
The sub-band signals may be quantised coarsely because the resulting quantising noise falls in the same frequency range as the signal, and relatively low signal-to-noise ratios (SNRs) are acceptable due to the psychoacoustic phenomenon of masking. A psychoacoustic model of human hearing determines what actual SNR is acceptable in each sub-band. A bit allocation operation distributes the available bits among the sub-bands in accordance with the required SNR. The sub-band values are quantised to the precision indicated by the bit allocation operation and formatted into the audio elementary stream. The basic unit of encoded audio is the audio access unit (or frame), which consists of a fixed number of sub-band samples. Each frame of audio is an independently decodable entity. Knowledge of the bit allocation allows the decoder to unpack and de-quantise the sub-band signals. The synthesis filterbank is the inverse of the analysis filterbank, and converts the reconstructed sub-band signals back into a linear PCM signal.
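The bit allocation and quantisation steps can be sketched as follows. This is a generic greedy scheme under stated assumptions, not the procedure of either standard: each added bit is assumed to improve the SNR by about 6.02 dB, the required SNR per sub-band is supplied as a list of hypothetical values standing in for the output of a psychoacoustic model, and the figures of 12 samples per sub-band and a 15-bit maximum are illustrative. The quantiser is a simple uniform mid-rise quantiser for samples normalised to the range -1 to 1.

```python
import math

DB_PER_BIT = 6.02      # assumed SNR gain per quantiser bit
MAX_BITS = 15          # illustrative upper limit per sub-band sample


def allocate_bits(required_snr_db, bit_pool, samples_per_band=12):
    """Greedily give one more bit to the sub-band whose noise is most audible."""
    bits = [0] * len(required_snr_db)
    remaining = bit_pool
    while remaining >= samples_per_band:
        candidates = [b for b in range(len(bits)) if bits[b] < MAX_BITS]
        if not candidates:
            break
        # Margin = achieved SNR minus required SNR; the smallest margin is served first.
        worst = min(candidates, key=lambda b: bits[b] * DB_PER_BIT - required_snr_db[b])
        bits[worst] += 1
        remaining -= samples_per_band
    return bits


def quantise(x, n_bits):
    """Uniform mid-rise quantiser; returns an integer code (0 if no bits)."""
    if n_bits == 0:
        return 0
    levels = 2 ** n_bits
    q = math.floor(x * levels / 2)
    return max(-levels // 2, min(levels // 2 - 1, q))


def dequantise(q, n_bits):
    """Decoder side: map the integer code back to the centre of its interval."""
    if n_bits == 0:
        return 0.0
    return (q + 0.5) * 2 / 2 ** n_bits


required_snr_db = [20.0, 12.0, 3.0, -5.0]   # hypothetical psychoacoustic model output
bits = allocate_bits(required_snr_db, bit_pool=96)
print("bits per sub-band sample:", bits)     # [4, 3, 1, 0]

x = 0.3
code = quantise(x, bits[0])
print("code:", code, "reconstructed:", dequantise(code, bits[0]))
```

The sub-band with a negative required SNR receives no bits at all, which corresponds to a signal component that is fully masked and therefore not transmitted. The decoder needs only the `bits` list to interpret the codes, which is why the bit allocation is carried in the frame.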
References