Sampling Theory

Music, like any audio, is a continuous time-varying signal which, in this case, takes the form of vibrations in a medium like the air. In this analogue representation of sound, the behaviour of the air, speaker, or eardrum is well-defined at any arbitrary point in time. There is, theoretically, infinite precision in a continuous signal. This would be ideal for the perfect reproduction of music from a sound file or CD, but no device or medium can hold the infinite amount of information it would take to encode an infinitely precise signal. Moreover, the digital systems we use today cannot store analogue signals directly. Instead, we can record a discretised, digital representation of the original sound that allows us to store and transmit the data and later re-create the original. This digital representation can be created such that a perfect-fidelity reconstruction is possible, in which case the recording is said to be "lossless". It can also sacrifice perfect fidelity and introduce some degree of error, i.e., become "lossy", in exchange for better performance in another metric such as file size. The following discussion presumes a lossless representation is desired.

On digital systems like computers, MP3 players, and CDs, one common method for storing audio is called "(Linear) Pulse-code Modulation", or PCM for short. In PCM, the amplitude of the signal is sampled at regular intervals and each value is quantised to the nearest integer level. The result that a discretised version of a band-limited continuous signal can capture all of its information is known as the Nyquist-Shannon sampling theorem. Two important parameters of PCM are the sampling frequency and the bit depth. The first is simply the rate at which the original signal is sampled and quantised; the most important point here is that the highest frequency component which can later be perfectly reconstructed is limited to one-half of the sampling frequency. Stated precisely, the bandwidth of the signal ($B$) must be strictly less than half of the sampling frequency ($f_s$). This is known as the Nyquist criterion.

$$f_s \gt 2B$$

Since the human ear is sensitive up to about 20 kHz in most healthy individuals, the most common music sampling frequencies are 44.1 kHz and 48 kHz. The second parameter, bit depth, is how many bits are used per sample to encode the amplitude. The more bits per sample, the more levels are available for quantisation. A 16-bit signed integer PCM is common in WAV files, for example; there are 65 536 levels of amplitude, from 32 767 down to -32 768. Quantisation error and file size trade off against each other: each additional bit per sample halves the quantisation step, while the file size grows in proportion to the bit depth. In the real world, digital media files are reconstructed through a process known as "zero-order hold", not into their perfect analogue ancestors but into stepped approximations, as illustrated later.
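As a rough illustration of the quantisation step described above, here is a minimal Python sketch. It assumes amplitudes normalised to the range -1.0 to 1.0 and scales them by 32 767, which is one common convention rather than the only one. The function name `quantise_pcm` is made up for this example.

```python
import math

def quantise_pcm(x: float, bit_depth: int = 16) -> int:
    """Map an amplitude in [-1.0, 1.0] to a signed integer PCM code.

    Assumption: full scale is mapped to the largest positive code.
    """
    max_code = 2 ** (bit_depth - 1) - 1   # 32767 for 16 bits
    min_code = -(2 ** (bit_depth - 1))    # -32768 for 16 bits
    code = round(x * max_code)
    # Clip anything outside the representable range.
    return max(min_code, min(max_code, code))

LEVELS = 2 ** 16  # number of distinct codes at 16 bits

# One period of a 100 Hz sine sampled at 1000 Hz, then quantised.
xs = [math.sin(2 * math.pi * 100.0 * n / 1000.0) for n in range(10)]
pcm = [quantise_pcm(x) for x in xs]

# The rounding error never exceeds half a quantisation step.
step = 1.0 / 32767
worst_error = max(abs(code * step - x) for code, x in zip(pcm, xs))

print(LEVELS, pcm, worst_error)
```

Calling `quantise_pcm(x, bit_depth=8)` instead shows the trade-off directly: fewer bits per sample, a coarser step, and a larger worst-case error.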

Above you can see a 100 Hz continuous-time signal in blue being sampled ten times per period, or at 1000 Hz. That is to say, the sampling frequency is 10 times the signal frequency, more than satisfying the Nyquist criterion. Because of this, given only the samples and the assumption that the original was a sinusoid, you can arrive at only one solution: the original signal. If we really wanted to be frugal with our valuable bits and bytes, we could reduce the sampling frequency to produce fewer samples while still capturing the original perfectly. The limiting sampling frequency, twice the signal frequency, is called the "critical sampling frequency."
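To make the "only one solution" claim concrete, the sketch below rebuilds the 100 Hz signal between its sample points using Whittaker-Shannon (sinc) interpolation. This is the textbook reconstruction associated with the sampling theorem, not something the text above prescribes. The window of 1000 samples on either side is an arbitrary choice for illustration, and because the window is finite the result is only approximate.

```python
import math

def sinc(u: float) -> float:
    """Normalised sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    if u == 0.0:
        return 1.0
    return math.sin(math.pi * u) / (math.pi * u)

def reconstruct(samples, first_index, sample_rate_hz, t):
    """Estimate x(t) from samples x[first_index], x[first_index + 1], ..."""
    total = 0.0
    for k, value in enumerate(samples):
        n = first_index + k
        total += value * sinc(t * sample_rate_hz - n)
    return total

SIGNAL_HZ = 100.0
SAMPLE_RATE_HZ = 1000.0
HALF_WINDOW = 1000  # samples kept on each side of t = 0 (illustrative)

samples = [
    math.sin(2 * math.pi * SIGNAL_HZ * n / SAMPLE_RATE_HZ)
    for n in range(-HALF_WINDOW, HALF_WINDOW + 1)
]

t = 0.0025  # halfway between two sample instants
estimate = reconstruct(samples, -HALF_WINDOW, SAMPLE_RATE_HZ, t)
exact = math.sin(2 * math.pi * SIGNAL_HZ * t)

print(estimate, exact)
```

The estimate lands close to the true value even though no sample was taken at that instant. Widening the window shrinks the remaining truncation error.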

You can see the newly discretised representation takes only about one-fifth the number of samples (two per period instead of ten), but still produces the same result. Reducing the sampling rate any further would result in a scenario where the same samples could represent more than one signal. When multiple different signals are represented by the same samples, the result is ambiguous; information that constrained the solution space has been lost and cannot be recovered. These multiple signals are aliases of each other. Aliasing is unavoidable below the Nyquist rate, and in the real world you would typically oversample at some multiple of this rate to ensure you have enough data to later recreate the original, even in the case of minor imperfections during recording. Even sampling exactly at the critical frequency (rather than strictly above it, as the Nyquist criterion requires) can fail. For example, in this degenerate case, where every sample falls on a zero crossing, the samples yield two signal aliases: one is the original signal, the other is the zero signal, a zero-amplitude sinusoid. Both fit the data; both are potential solutions. In fact, in this specific case, you'll notice that ANY sinusoid with the same phase and frequency would fit the data, whatever its amplitude. Clearly, this is not a desirable outcome.
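The degenerate case can be checked numerically. This sketch assumes a sine with zero phase, so that at exactly twice the signal frequency every sample lands on a zero crossing. The helper name `sample_sinusoid` is made up for this example.

```python
import math

def sample_sinusoid(amplitude, freq_hz, sample_rate_hz, n_samples):
    """Sample amplitude * sin(2*pi*f*t) at t = n / sample_rate."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

SIGNAL_HZ = 100.0
CRITICAL_RATE_HZ = 2 * SIGNAL_HZ  # exactly two samples per period

# Every amplitude produces (numerically) the same all-zero samples.
for amplitude in (0.0, 1.0, 5.0):
    samples = sample_sinusoid(amplitude, SIGNAL_HZ, CRITICAL_RATE_HZ, 8)
    print(amplitude, max(abs(s) for s in samples))

# At 1000 Hz the amplitudes are clearly distinguishable again.
oversampled_small = sample_sinusoid(1.0, SIGNAL_HZ, 1000.0, 8)
oversampled_large = sample_sinusoid(5.0, SIGNAL_HZ, 1000.0, 8)
```

The printed maxima differ from zero only by floating-point rounding, so the samples alone cannot tell these signals apart.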

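And finally, much worse, sampling below the Nyquist rate guarantees that we will misrepresent the data and draw the wrong conclusion from it. In this case, with fewer than two samples per period, the data is reconstructed into a signal with half the frequency of the original. In general, a sinusoid sampled at a rate between one and two times its own frequency reappears at the sampling rate minus its frequency. The original has been completely lost.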
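The frequency that an undersampled sinusoid appears to have can be computed by folding it into the range from zero to half the sampling rate. The sketch below assumes a single real sinusoid. The function name `apparent_frequency` and the 150 Hz sampling rate are illustrative choices, not values taken from the figure.

```python
import math

def apparent_frequency(freq_hz: float, sample_rate_hz: float) -> float:
    """Frequency a real sinusoid appears to have after sampling."""
    folded = freq_hz % sample_rate_hz
    # Anything above half the sampling rate mirrors back down.
    return min(folded, sample_rate_hz - folded)

SIGNAL_HZ = 100.0

print(apparent_frequency(SIGNAL_HZ, 1000.0))  # well sampled: unchanged
print(apparent_frequency(SIGNAL_HZ, 150.0))   # undersampled: an alias

# The 100 Hz cosine and its alias give the same samples at 150 Hz.
rate = 150.0
alias_hz = apparent_frequency(SIGNAL_HZ, rate)
original = [math.cos(2 * math.pi * SIGNAL_HZ * n / rate) for n in range(12)]
alias = [math.cos(2 * math.pi * alias_hz * n / rate) for n in range(12)]
max_difference = max(abs(a - b) for a, b in zip(original, alias))
```

Here `max_difference` is zero up to rounding error: once the samples are taken, nothing distinguishes the original from its lower-frequency alias.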

What these ideas tell us about our final system is simply that we need to sample the music at a rate more than twice as high as the highest frequency component of the sound we wish to capture or recreate. Whether the sampling is implemented in hardware with an ADC sampling an audio line or the samples are read directly from digital media, the Nyquist criterion must be respected.

One last point is that in real-world systems, the sampled data isn't reconstructed back into a perfect analogue representation. Instead, the most common technique is to simply hold the amplitude of the signal at the previous sample value until the next sample is reached. This is called "zero-order hold" (ZOH) and works quite well if the sampling rate and bit depth are both sufficiently high. The following graphic displays the case where bit depth is indeed sufficiently high; since all samples lie on the original signal, there is minimal quantisation error. However, the sampling rate is low enough that the reconstruction no longer closely resembles the original in shape. In the limit as both sampling frequency and bit depth approach infinity, the ZOH representation approaches perfect fidelity. If the values are sufficiently high, we aren't able to perceive the small errors and the signal is perfect for all (human) intents and purposes.

$$x_{ZOH}(t) = \sum_{n=-\infty}^{\infty} x[n] \cdot \textrm{rect}( \frac{t - T/2 - nT}{T} )$$
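In the formula above, $T$ is the sampling period and the rect term equals one only for times between $nT$ and $(n+1)T$, so evaluating the sum amounts to looking up the most recent sample. A minimal Python sketch of that lookup follows. It assumes a finite list of samples starting at $n = 0$ and returns zero outside the recorded range. The function name `zero_order_hold` is made up for this example.

```python
import math

def zero_order_hold(samples, sample_period, t):
    """Return the ZOH output at time t: the most recent sample value."""
    n = math.floor(t / sample_period)  # index of the latest sample
    if n < 0 or n >= len(samples):
        return 0.0                     # assumption: silence outside the data
    return samples[n]

SAMPLE_PERIOD = 0.001  # T = 1 ms, i.e. a 1000 Hz sampling rate
samples = [0.0, 0.5, 1.0, 0.5]

# The output stays flat between sample instants and jumps at each one.
for t in (0.0005, 0.0015, 0.0025, 0.0035):
    print(t, zero_order_hold(samples, SAMPLE_PERIOD, t))
```

Shrinking `SAMPLE_PERIOD` makes each flat step shorter, which is the sense in which the staircase approaches the original signal as the sampling rate grows.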