1.1 Objectives of Speech Coding
From a grand perspective, the objective of speech transmission technologies is to improve communication. We want to enable people to talk to each other even when geographically separated. One may argue that many of the big problems in the world, such as wars and other conflicts, are due to a lack of understanding and empathy between people. The best cure for such problems is more communication at every level of society. Technologies that enable efficient communication between people can therefore potentially play a large role in such societal improvements.
The main objective of speech coding is to allow efficient spoken communication at a distance. That is, the objective is to provide a technology for natural communication, where the quality is high enough that the transmitted signal allows an intelligible and pleasant dialogue, such that the effort of communication is minimised. Simultaneously, the technology has to be efficient in the sense that the smallest possible amount of resources is used. Resource usage matters for two reasons: first, some technical solutions might not be physically realisable, for example because they would require more bandwidth than is available; second, resource efficiency reduces the price of the technology, such that devices and infrastructure become affordable for consumers.
Secondary objectives of speech coding are efficient storage and broadcasting of speech signals, such that perceptual quality is optimised while a minimum of resources is used. Since the goals of storage and broadcast applications are largely covered by the requirements of telecommunication, we will not discuss them in depth.
In more technical terms, the objective of speech coding for transmission is to extract the information of a speech signal in a form that can be efficiently and rapidly transmitted or stored, such that essentially the same signal can later be reconstructed or resynthesised. In this context, the essential information of a speech signal is that which is required for natural human communication. Simultaneously, the practical context of the speech codec imposes limitations on which technologies are realisable in terms of, for example, available transmission bandwidth and computational capacity. Note that the essential quality criterion here is of a perceptual nature.
The most important specific design goals of speech codecs are:
Distortions should be minimised, such that the intelligibility and pleasantness of the original signal are retained and the required listening effort does not increase.
Bitrate should be minimised, such that the desired quality is achieved with the lowest possible number of transmitted bits (see the worked example after this list).
Delay should be minimised, such that the time from when a sound is captured at the encoder to when it is synthesised at the decoder is as small as possible. This includes both the algorithmic delay of the codec itself and delays due to data transport and buffering.
Hardware requirements, such as computational complexity and memory footprint, should be minimised, such that hardware costs (manufacturing and raw materials) and energy consumption are low.
Robustness to transmission errors should be maximised, such that the detrimental impact of errors such as packet loss is minimised.
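To make the bitrate and delay criteria concrete, the following minimal sketch computes the bits available per frame and the algorithmic delay of a hypothetical narrowband codec. The parameter values are illustrative assumptions only, chosen in the style of a codec such as AMR-NB (20 ms frames, 5 ms lookahead, a 12.2 kbit/s mode); they are not prescribed by this text.

```python
# A minimal sketch relating the bitrate and delay design goals to
# concrete numbers. All parameter values are illustrative assumptions.

SAMPLE_RATE_HZ = 8000    # narrowband telephone speech
FRAME_MS = 20            # typical speech-codec frame length
LOOKAHEAD_MS = 5         # typical encoder lookahead
BITRATE_BPS = 12200      # e.g. the 12.2 kbit/s mode of AMR-NB

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples
bits_per_frame = BITRATE_BPS * FRAME_MS // 1000         # 244 bits
bits_per_sample = bits_per_frame / samples_per_frame    # ~1.53 bits

# Algorithmic delay: one full frame must be buffered before encoding
# can start, plus the lookahead required by the analysis.
algorithmic_delay_ms = FRAME_MS + LOOKAHEAD_MS          # 25 ms

# Compare against uncompressed 16-bit PCM at the same sample rate.
pcm_bitrate_bps = SAMPLE_RATE_HZ * 16                   # 128 kbit/s
compression_factor = pcm_bitrate_bps / BITRATE_BPS      # ~10.5x

print(f"{bits_per_frame} bits per {samples_per_frame}-sample frame "
      f"({bits_per_sample:.2f} bits/sample)")
print(f"algorithmic delay: {algorithmic_delay_ms} ms, "
      f"compression vs. 16-bit PCM: {compression_factor:.1f}x")
```

The sketch also hints at why the design goals conflict: shortening the frame reduces the algorithmic delay, but leaves fewer samples per frame over which redundancy can be exploited, so maintaining the same quality then tends to require a higher bitrate.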
It is clear that the task of coding speech is very similar to audio coding, although traditionally they have been treated as separate fields. The original reason for the division between speech and audio is that the contexts in which they were used placed very different requirements on the system. Most importantly, speech coding was typically used for interactive dialogue, whereas audio coding was used for storage and broadcast. The most obvious consequence is that encoders for audio can operate offline, whereby computational complexity and algorithmic delay are of much smaller importance. In addition, since storage is much cheaper than transmission, we can easily afford higher bitrates for audio than for speech. Further, while high-fidelity enthusiasts abound within the field of audio, I have yet to encounter a high-fidelity speech communication enthusiast. Quality requirements in speech and audio are thus very different.
On the other hand, in the last decade or two, technologies and media have converged onto mobile smartphones. It is now assumed that a mobile phone can handle, in addition to speech, all other media such as audio and video. Keeping speech and audio codecs separate in such a device seems rather artificial and superfluous. Moreover, content such as movies and radio shows frequently exhibits a mixture of speech and audio, which would benefit from a unified codec. Consequently, in the last two decades, a wave of unification has swept through the fields, especially in the form of coding standards that cover both speech and audio applications, such as [].
While it can be expected that this process of unification will continue, that does not make speech codecs redundant. Speech coding still plays a prominent part in unified speech and audio codecs, and currently there is little evidence that this will change. The principles of speech coding will be merged further and further into such unified codecs, but the techniques themselves will not disappear. In particular, I would predict that methods of audio and speech coding will in the near future be generalised such that the traditional methods of both areas become special cases of a unified theory.
1.2 Perceptual Quality
The classical ultimate goal of perceptual coding, especially within audio coding, has always been perceptual transparency, a loosely defined quality level at which the coded signal is perceptually indistinguishable from the original. For short segments of speech or audio, perceptual transparency is fairly unambiguous, and if a signal is perceptually transparent when analysed in short sections, it will almost certainly be transparent when evaluated in longer segments as well. The reverse is, however, not true: for longer speech and audio signals, a much lower quality level is sufficient for perceptual transparency.
Specifically, when considering a long segment of speech or audio, our memory tends to store only abstract-level information, such as the linguistic content, speaking style and speaker identity of a speech signal, while ignoring acoustic details that do not carry important information. Consider, for example, the famous phrase of John F. Kennedy, Ich bin ein Berliner; you have probably heard the recording at some point. If you heard a recording of the same phrase today, could you determine whether it is the same recording, another recording made by Kennedy on the same day, or an imitation by a skilled orator? However, if you then heard the phrase Ich bin ein Hamburger, you would surely notice the difference immediately, even if it were a recording spoken by Kennedy himself. The reason the latter phrase is so easily distinguished from the former is that the meaning of the signal has changed, even if the distance along some acoustic dimension is small.
The ultimate goal should therefore be something like cognitive transparency, where the content of the coded speech or audio signal cannot be distinguished from that of the original. Unfortunately, science is nowhere near being able to create objective measures of cognitive transparency, and we must therefore use perceptual transparency as a fall-back solution.