The Human “Earscape” model is a tool that can be used to optimise the digitisation of audio using the range and varying sensitivity of human hearing. Humans can only hear sounds from approximately 20 Hz to 20 kHz, allowing the omission of frequencies outside of this range.
Additionally, sensitivity over this range is non-uniform, dropping off towards both ends. The construction of human ears makes them especially sensitive to frequencies around 2-5 kHz, covering a large portion of human speech. This correspondence between frequency and sensitivity, together with the bounds of human hearing, forms the Human “Earscape” model. We can use this model to choose our quantisation levels for different areas of the frequency spectrum to match human hearing, using more data to represent areas that humans are more sensitive to and less where sensitivity is lower.
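As a rough illustration of this non-uniform sensitivity, the sketch below uses Terhardt's widely quoted approximation of the threshold of hearing in quiet. That formula is not taken from the text above and is only one possible stand-in for the sensitivity curve. The bit-allocation rule (`bits_for_band`, with its 16-bit ceiling and 6 dB step) is a hypothetical choice made purely for this example.

```python
import math


def threshold_in_quiet_db(freq_hz):
    """Approximate threshold of hearing in quiet, in dB SPL (Terhardt's formula)."""
    f = freq_hz / 1000.0  # the formula is expressed in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


def bits_for_band(freq_hz, max_bits=16, db_per_bit=6.0):
    """Hypothetical rule: drop one bit for every db_per_bit of lost sensitivity."""
    # Sensitivity loss relative to the most sensitive region (around 3.3 kHz).
    loss_db = threshold_in_quiet_db(freq_hz) - threshold_in_quiet_db(3300.0)
    bits = max_bits - int(max(loss_db, 0.0) // db_per_bit)
    return max(bits, 0)


if __name__ == "__main__":
    for f in (100, 1000, 3300, 10000, 15000):
        print(f, "Hz:", round(threshold_in_quiet_db(f), 1), "dB SPL,",
              bits_for_band(f), "bits")
```

Under these assumptions, frequencies near the 2-5 kHz region keep the full word length, while the extremes of the audible range receive fewer bits.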
This technique of using unequally spaced quantisation levels is used in the A-law and μ-law companding algorithms (which space the levels unevenly over sample amplitude) and dramatically reduces bandwidth, and hence bitrate, without a significantly perceptible degradation in the quality of the audio (usually speech in this case).

Instead of looking at the frequency range of human hearing and the sensitivity to differing frequencies, the Audio Noise Masking model focuses on the interaction between neighbouring frequencies and the human perception of them. The construction of the human ear results in ranges of frequencies stimulating the same nerves. Each band of frequencies that stimulates the same region of nerves is called a critical band, and the ability to distinguish multiple frequencies is diminished within each critical band. In particular, if there is a sufficiently strong signal at a specific frequency within a critical band, weaker signals at nearby frequencies within the same critical band are imperceptible. This process is called auditory masking and is a perceptual weakness that forms the basis for the Audio Noise Masking model in compression. We can exploit this weakness by observing the various frequencies within each critical band and removing those which are imperceptible due to masking.
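The masking idea described above can be sketched as a toy filter. The critical bands are approximated here with Zwicker's Bark-scale formula, and treating each integer Bark value as a band edge is a simplification. The function `remove_masked` and its 20 dB `mask_offset_db` are hypothetical, chosen only to illustrate the principle rather than to reproduce any real psychoacoustic model.

```python
import math


def bark(freq_hz):
    """Zwicker's approximation of the critical-band rate (Bark scale)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def remove_masked(components, mask_offset_db=20.0):
    """Drop components far below the strongest one in the same critical band.

    components: list of (frequency_hz, level_db) pairs.
    A component is kept only if its level is within mask_offset_db of the
    strongest component sharing its (integer) Bark band. This is a toy rule.
    """
    strongest = {}
    for freq, level in components:
        band = int(bark(freq))
        strongest[band] = max(level, strongest.get(band, -math.inf))
    return [(freq, level) for freq, level in components
            if level >= strongest[int(bark(freq))] - mask_offset_db]


if __name__ == "__main__":
    tones = [(1000, 70), (1050, 40), (1040, 65), (4000, 30)]
    print(remove_masked(tones))
```

In this example the quiet 1050 Hz tone shares a band with the much louder 1000 Hz tone and is discarded, whereas the quiet 4000 Hz tone sits alone in its band and survives.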
Hence we can significantly reduce the bitrate without a perceptible degradation of quality by removing inaudible information and assigning our quantisation noise to these areas.

Audio Layer I of the MPEG audio standard begins with the decomposition of the input audio signal into 32 frequency sub-bands, each around 700 Hz wide, together covering the full audible spectrum. This is achieved by passing the input signal through a filter bank consisting of multiple band-pass filters, each of which only allows through frequencies within a certain fixed range. The sub-bands are designed to mimic the critical bands of the human auditory system, allowing us to exploit the Audio Noise Masking model. Each sub-band is quantised and encoded separately according to this model, with the signal-to-mask ratio being calculated for each. This is the ratio of signal energy to the masking threshold (the minimum signal strength required for a sound to be audible in the presence of a masking signal). This ratio determines the number of quantisation levels used.

One issue with this process is that the band-pass filters used to split the input signal are not perfect, introducing some noise and reducing quality.
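The sub-band split and the signal-to-mask ratio described above can be made concrete with a small calculation. This is a minimal sketch, assuming the 32 sub-bands divide the range from 0 Hz to half the sampling rate equally. The rule of roughly 6 dB of signal-to-noise ratio per bit is a common approximation, and `bits_for_smr` is a hypothetical simplification rather than the allocation procedure defined in the standard.

```python
import math

NUM_SUBBANDS = 32


def subband_width_hz(sample_rate_hz):
    """Width of each of the 32 equal sub-bands spanning 0 Hz to Nyquist."""
    return (sample_rate_hz / 2.0) / NUM_SUBBANDS


def subband_index(freq_hz, sample_rate_hz):
    """Index (0-31) of the sub-band that contains freq_hz."""
    idx = int(freq_hz // subband_width_hz(sample_rate_hz))
    return min(max(idx, 0), NUM_SUBBANDS - 1)


def signal_to_mask_ratio_db(signal_energy, mask_threshold_energy):
    """Ratio of signal energy to masking threshold, in decibels."""
    return 10.0 * math.log10(signal_energy / mask_threshold_energy)


def bits_for_smr(smr_db, max_bits=15):
    """Hypothetical rule: about one bit per 6.02 dB of signal above the mask."""
    if smr_db <= 0:
        return 0  # the signal is below the masking threshold: spend no bits
    return min(math.ceil(smr_db / 6.02), max_bits)


if __name__ == "__main__":
    print(subband_width_hz(44100))  # about 689 Hz at 44.1 kHz sampling
    smr = signal_to_mask_ratio_db(100.0, 1.0)
    print(smr, bits_for_smr(smr))
```

At a 44.1 kHz sampling rate each sub-band is about 689 Hz wide, which matches the figure of around 700 Hz quoted above.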
Additionally, when the audio is decoded, it must pass through an inverse filter bank that recombines the signal, which again is a lossy transformation. The sub-bands used here also have constant width across the audible spectrum, and so do not accurately represent the critical bands of the human ear, which grow progressively wider at higher frequencies. Hence these sub-bands do not exactly mirror human auditory behaviour and degrade the audio quantisation. Finally, the band-pass filters are not sharply defined and so neighbouring sub-bands overlap. This means that a signal at certain frequencies can affect two bands at once, introducing aliasing artefacts and degrading quality.

On the other hand, whilst the compression is ultimately lossy, the difference in quality between the input and the output is imperceptible to human ears.
This is because the degradation in quality resulting from imperfect splitting and combining transformations is very subtle, and the quantisation noise introduced is allocated to imperceptible parts of the audio signal. Hence, compared to less sophisticated quantisation schemes (e.g. PCM), an MPEG-1 Layer I output will be of significantly higher quality for the same bitrate.
For this reason, on balance, the division of the input signal into 32 sub-bands in the MPEG-1 Audio Layer I standard improves the quantisation quality.

The Discrete Cosine Transform (DCT) used in MPEG Audio Layer III (in a modified form, the MDCT) is a mathematical function applied to the output of the filter bank used in Layers I & II. This transformation decomposes the input signal into a sum of component frequencies, representing each with a cosine wave and assigning each a coefficient specifying its amplitude.
This transformation is invertible, and both the DCT and its inverse are lossless. This process maps the input signal from the time domain to the frequency domain, which is the main purpose of involving DCT in MPEG audio compression. Unlike the time domain, where a signal's variation is described over time, the frequency domain allows us to express a signal purely with respect to frequency. This transformation hence gives greater insight into the components that make up the signal, and allows us to apply more advanced compression algorithms to the signal based on human perception. These use the human psychoacoustic model to filter out perceptually inaudible parts of the input signal, and correspond more closely to human hearing than those used in the other audio layers.
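To illustrate the decomposition and its invertibility, here is a minimal sketch of a plain orthonormal DCT-II and its inverse written directly from the defining sums. It is not the modified, overlapping variant used in Layer III, and the function names `dct` and `idct` are chosen only for this example.

```python
import math


def _scale(k, n):
    """Orthonormal scaling factor for coefficient k of an n-point DCT-II."""
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x):
    """Orthonormal DCT-II: time-domain samples -> cosine coefficients."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k)
                               for i in range(n))
            for k in range(n)]


def idct(coeffs):
    """Inverse of dct(): cosine coefficients -> time-domain samples."""
    n = len(coeffs)
    return [sum(_scale(k, n) * coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                for k in range(n))
            for i in range(n)]


if __name__ == "__main__":
    n = 8
    # A signal made of a single cosine component (basis function k = 3).
    signal = [math.cos(math.pi / n * (i + 0.5) * 3) for i in range(n)]
    coeffs = dct(signal)
    print([round(c, 6) for c in coeffs])            # energy only at k = 3
    print([round(s, 6) for s in idct(coeffs)])      # recovers the signal
```

A signal consisting of one cosine component maps to a single non-zero coefficient, and applying the inverse recovers the original samples up to floating-point rounding.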
One drawback of this compression technique is that the DCT and its inverse are more computationally intensive than the transformations used in the other layers, potentially making encoding, decoding and manipulating audio in real time slower. However, with modern technology this is not a concern: the algorithm runs quickly and smoothly on even low-powered systems, and the processing penalty is more than justified by the improvement in audio quality and the great reduction in file size.

Similar to the Discrete Cosine Transform, a wavelet transform decomposes and represents audio signals as a sum of wave functions. However, instead of being limited to cosine functions as with DCT, the wavelet transform can be performed with almost any family of functions, including sinusoidal ones.
These wavelet functions can be scaled to different frequencies and also shifted in time, unlike in DCT, and in general this process produces a more detailed picture of the signal being analysed. An optimal wavelet representation can even be chosen for each individual frame, matching the characteristics of the signal better than a cosine series. As a result, fewer terms in the sum are required to accurately encode the frame, reducing the number of non-zero coefficients required.
These coefficients can hence be encoded using fewer bits, reducing the bitrate and increasing the compression ratio. As a result, a wavelet transform could dramatically reduce file sizes and increase quality.

However, in order to achieve this we need to analyse each frame to determine the optimal wavelet, and this is computationally intensive. Encoding the audio can therefore take a long time, as this optimisation must be performed, although decoding is of comparable speed to that with DCT.
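To make the sparsity argument above concrete, the sketch below applies one level of the Haar transform, the simplest wavelet. The choice of Haar is an assumption made for this example, since the text does not name a particular wavelet family, and the helper names are hypothetical. The input length is assumed to be even.

```python
import math


def haar_step(x):
    """One level of the orthonormal Haar transform (len(x) must be even).

    Returns (approximation, detail) coefficient lists, each half as long as x.
    """
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Rebuild the original samples from one level of Haar coefficients."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x


def count_nonzero(values, tol=1e-12):
    """Number of coefficients whose magnitude exceeds tol."""
    return sum(1 for v in values if abs(v) > tol)


if __name__ == "__main__":
    frame = [4.0, 4.0, 2.0, 2.0, 7.0, 7.0, 1.0, 1.0]  # piecewise-constant frame
    approx, detail = haar_step(frame)
    print(count_nonzero(approx), count_nonzero(detail))
    print(inverse_haar_step(approx, detail))
```

For a frame whose shape suits the wavelet, as with this piecewise-constant one, every detail coefficient is zero, so only half of the coefficients need to be stored.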
We would be able to make use of human perception in our compression in a similar fashion to that using DCT; however, we would need to adapt the psychoacoustic model to fit the wavelet representation to achieve this.

In 2000 the JPEG-2000 standard launched, incorporating the wavelet transform and, as with audio compression, promising better compression and improved image quality. However, the standard has been a commercial failure: even today few consumers and manufacturers have adopted it, and many applications do not support it. This is widely attributed to developers being reluctant to incorporate the new standard while also supporting the original, the dominance of basic JPEG, and slower processing.
This case study suggests some caution. However, there are many audio standards in use, so no single format is as dominant, which leaves room for a new one. Processing time is less of an issue with today's computers, and compression is becoming very important for audio, e.g. for streaming, so there is potential for more support. Overall, the benefits outweigh the drawbacks, making the replacement of DCT with a wavelet transform feasible.