DSD is often promoted as a superior audio format over PCM, yet much confusion persists regarding what it actually is and how it works. We shall approach the matter in a way that will hopefully clear up some of this.
An audio signal captured by a microphone is a continuously varying waveform. In digitising such a signal, our goal is to encode everything a human can hear without introducing any audible noise or distortion. The loss of inaudible parts of the signal is acceptable, as is the addition of noise too faint or too high in frequency to be perceived. With generally agreed limitations on human hearing, this means frequencies below 20 kHz must be preserved with a dynamic range of at least 90 dB.
Simple sampling
The sampling theorem tells us that in order to meet the 20 kHz bandwidth requirement, the sample rate must be at least 40 kHz. A little safety margin and various technical reasons give us the commonly used sample rates of 44.1 kHz and 48 kHz. The desired dynamic range is then reached with a 16-bit sample resolution along with suitable dithering. Figure 1 below shows the spectrum of a 1 kHz sine wave with an amplitude of -6 dBFS sampled at 48 kHz with 16-bit resolution and TPDF dither. Integrating the noise up to 20 kHz gives a dynamic range of approximately 94 dB.
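The 94 dB figure can be checked with a short simulation. The following is a minimal sketch, not the method used to produce figure 1: it quantises one second of the -6 dBFS sine with TPDF dither, measures the total error power in the time domain, and assumes that this noise is white, so that the share falling below 20 kHz is simply 20/24 of the total. The names `tpdf_quantise` and `dynamic_range_db`, and the seed, are illustrative choices.

```python
import math
import random

SAMPLE_RATE = 48_000
BITS = 16
BANDWIDTH = 20_000

def tpdf_quantise(x, step, rng):
    """Round x to a multiple of step after adding TPDF dither of +/-1 step."""
    dither = (rng.random() - rng.random()) * step   # triangular distribution
    return step * math.floor((x + dither) / step + 0.5)

rng = random.Random(1)                 # fixed seed for repeatability
step = 2.0 / 2**BITS                   # full scale spans -1.0 to +1.0
amplitude = 10 ** (-6 / 20)            # -6 dBFS
n_samples = SAMPLE_RATE                # one second of signal

noise_power = 0.0
for n in range(n_samples):
    x = amplitude * math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
    err = tpdf_quantise(x, step, rng) - x
    noise_power += err * err
noise_power /= n_samples

# Assumption: the noise is white, so the in-band share is proportional
# to the fraction of the spectrum below 20 kHz.
in_band_noise = noise_power * BANDWIDTH / (SAMPLE_RATE / 2)
full_scale_power = 0.5                 # power of a sine with amplitude 1.0
dynamic_range_db = 10 * math.log10(full_scale_power / in_band_noise)
print(f"in-band dynamic range: {dynamic_range_db:.1f} dB")
```

Under these assumptions the result should land close to the 94 dB quoted above.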
As we can see, the combination of 48 kHz sample rate and 16-bit resolution readily provides adequate performance according to the requirements set out in the beginning. Suppose, however, that 16-bit resolution isn’t practical for some reason. Can we still get the dynamic range we want? The answer is yes, we can. The trick is to use a higher sample rate. For a given sample resolution (bit depth), the total quantisation noise is constant. When the sample rate is increased, this noise is thus spread over a wider frequency range, resulting in a lower level within the fixed 20 kHz bandwidth of interest. To illustrate this, we sample the same 1 kHz sine wave again, this time using only 12-bit resolution but with the sample rate increased 256-fold to 12.288 MHz. The resulting spectrum plot, figure 2, looks exactly like the previous one. Not shown in the figure is the dither noise continuing at the same level up to the Nyquist frequency of 6.144 MHz.
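The arithmetic behind this trade can be written down directly. The sketch below uses the textbook signal-to-noise ratio of an ideal quantiser for a full-scale sine, subtracts the penalty for TPDF dither (which triples the noise power), and adds the gain from spreading white noise over a band wider than 20 kHz. The function name `in_band_dynamic_range_db` is hypothetical, and the white-noise model is an assumption rather than a measurement.

```python
import math

def in_band_dynamic_range_db(bits, sample_rate, bandwidth=20_000.0):
    """Theoretical dynamic range within `bandwidth` for a TPDF-dithered quantiser."""
    # Ideal quantiser, full-scale sine: 6.02 dB per bit plus 1.76 dB.
    sine_snr = 20 * math.log10(2) * bits + 10 * math.log10(1.5)
    # TPDF dither raises the noise power from step^2/12 to step^2/4.
    dither_penalty = 10 * math.log10(3)
    # White noise spread up to Nyquist: only the part below `bandwidth` counts.
    oversampling_gain = 10 * math.log10((sample_rate / 2) / bandwidth)
    return sine_snr - dither_penalty + oversampling_gain

baseline = in_band_dynamic_range_db(16, 48_000)
oversampled = in_band_dynamic_range_db(12, 12_288_000)
print(f"16-bit at 48 kHz:     {baseline:.1f} dB")
print(f"12-bit at 12.288 MHz: {oversampled:.1f} dB")
```

Each quadrupling of the sample rate buys about 6 dB, the equivalent of one bit, which is why giving up 4 bits calls for a factor of 4^4 = 256.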
Noise shaping
Although this exchange of bit depth for sample rate works, it is quite wasteful. To save a mere 4 bits of sample resolution, we had to increase the total data rate by a factor of 192. The reason for this is that the full dynamic range is available all the way up to the Nyquist frequency. Since frequencies above 20 kHz are inaudible, there is no need to maintain the high dynamic range there. Might there be a way to trade a loss of dynamic range at some frequencies for an increase at others? Again, the answer is yes. The method is called noise shaping, and it works by feedback of the quantisation error such that it can be corrected for in subsequent samples. With a suitable filter in the feedback loop, the shape of the noise spectrum can be tailored to the specific needs. In our case, we want to lower the noise level for frequencies below 20 kHz. To do so, we must allow an increase in noise at higher frequencies. The noise can only be moved, never eliminated. An example of a noise-shaped spectrum is shown in figure 3 below.
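To make the feedback idea concrete, here is a minimal sketch of the simplest possible noise shaper: a first-order error-feedback loop in which the error made on one sample is subtracted from the next input. This is an assumption chosen for brevity. The filter behind figure 3 is not specified here and is likely of higher order. The low-frequency noise is estimated crudely by averaging the error over blocks of 64 samples, and all function names are illustrative.

```python
import math
import random

def tpdf_quantise(x, step, rng):
    """Round x to a multiple of step after adding TPDF dither of +/-1 step."""
    dither = (rng.random() - rng.random()) * step
    return step * math.floor((x + dither) / step + 0.5)

def quantise_plain(signal, step, rng):
    return [tpdf_quantise(x, step, rng) for x in signal]

def quantise_noise_shaped(signal, step, rng):
    """First-order error feedback: correct each sample for the previous error."""
    out = []
    prev_error = 0.0
    for x in signal:
        target = x - prev_error
        y = tpdf_quantise(target, step, rng)
        prev_error = y - target
        out.append(y)
    return out

def total_power(error):
    return sum(e * e for e in error) / len(error)

def low_frequency_power(error, block=64):
    """Mean square of block averages, a crude low-pass measurement."""
    means = [sum(error[i:i + block]) / block
             for i in range(0, len(error) - block + 1, block)]
    return sum(m * m for m in means) / len(means)

rng = random.Random(1)
fs = 192_000
step = 2.0 / 2**10                      # 10-bit resolution
signal = [0.5 * math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]

err_plain = [y - x for x, y in zip(signal, quantise_plain(signal, step, rng))]
err_shaped = [y - x for x, y in zip(signal, quantise_noise_shaped(signal, step, rng))]

print("total noise, shaped / plain:        ",
      total_power(err_shaped) / total_power(err_plain))
print("low-frequency noise, shaped / plain:",
      low_frequency_power(err_shaped) / low_frequency_power(err_plain))
```

The shaped version should show more noise in total but considerably less at low frequencies, which is the trade described above: the noise is moved, not removed.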
With noise shaping, we are able to maintain the desired dynamic range below 20 kHz while sampling at 192 kHz with 10-bit resolution. While this still amounts to a higher data rate than the baseline, the increase is now only 2.5 times, a much more manageable number.
Going to extremes
What if 10 bits per sample is still too much? Increasing the sample rate allows more aggressive noise shaping, in turn enabling lower bit depths without loss of dynamic range below 20 kHz. At the extreme, we would have only one bit per sample. Here the usual interpretation of binary numbers no longer works. Instead, a “1” bit denotes the maximum positive swing of the signal while a “0” bit represents its most negative value. A signal level of zero cannot be encoded, which already warns us that there may be trouble ahead. Another consequence of having only two possible sample values is that the usual TPDF dither cannot be used. A different method is clearly required.
The commonly used method of producing a 1-bit noise-shaped encoding of a signal is known as sigma-delta (or delta-sigma) modulation, sometimes abbreviated SDM, and like the noise shaping used above, it is based on error feedback. The difference lies, briefly put, in a more complex feedback loop structure.
When only two sample values are possible, there will be a lot of quantisation noise. Even with noise shaping, a high sample rate is necessary in order to provide space in the spectrum where the unwanted noise from the low frequencies can be relocated. For hi-fi applications, the lowest oversampling factor commonly used with 1-bit data is 64. The graph below (figure 4) shows the spectrum up to 100 kHz of a sine wave sampled at 3.072 MHz (64 × 48 kHz) and converted to 1-bit format using sigma-delta modulation.
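As an illustration of the principle, the sketch below implements a first-order sigma-delta modulator: an integrator accumulates the difference between the input and the previous output, and a 1-bit quantiser takes the sign of the result. This is a deliberately simplified assumption. Modulators used for real 1-bit audio have a more complex loop, and figure 4 was not produced with this code. The input amplitude of 0.5 and the names used are illustrative.

```python
import math

def sigma_delta_1bit(signal):
    """First-order sigma-delta modulator returning a list of +1/-1 samples."""
    out = []
    integrator = 0.0
    feedback = 0.0                      # previous output, zero before the start
    for x in signal:
        integrator += x - feedback      # difference (delta), then sum (sigma)
        y = 1 if integrator >= 0 else -1
        feedback = y
        out.append(y)
    return out

def block_means(values, block=64):
    return [sum(values[i:i + block]) / block
            for i in range(0, len(values) - block + 1, block)]

fs = 3_072_000                          # 64 x 48 kHz
period = fs // 1000                     # samples per cycle of a 1 kHz sine
signal = [0.5 * math.sin(2 * math.pi * n / period) for n in range(4 * period)]
bits = sigma_delta_1bit(signal)

# Averaged over short blocks, the 1-bit stream follows the input.
worst_error = max(abs(a - b)
                  for a, b in zip(block_means(signal), block_means(bits)))
print("distinct output values:", sorted(set(bits)))
print("worst block-average error:", worst_error)
```

Although every output sample is at one of the two extremes, the average over as few as 64 samples stays close to the input, because the integrator keeps the accumulated error bounded.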
In the frequency range shown, the spectrum is similar to the previous example. The noise level remains low, as desired, a bit beyond 20 kHz, then rises to reach quite high levels. Expanding the range of the graph up to the Nyquist frequency (figure 5) reveals more noise. A lot more. With the audible frequencies occupying only a thin sliver at the left-hand side, this spectrum consists almost entirely of noise.
In terms of efficiency, this 1-bit format requires four times the data rate of the 48 kHz, 16-bit baseline. To be completely fair, the 1-bit variant has a somewhat higher dynamic range (106 dB) in the audible band. On the other hand, the same increase can be had by extending the sample resolution to 18 bits, a data rate increase of only 12.5%. If the goal is higher dynamic range, it is far more efficient to increase the bit depth than to raise the sample rate.
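The data rate comparisons made so far are simple products of sample rate and bit depth. The following sketch collects them in one place, using only the figures quoted in the text. The labels and the helper `data_rate` are illustrative.

```python
def data_rate(sample_rate_hz, bits_per_sample):
    """Raw data rate of one channel in bits per second."""
    return sample_rate_hz * bits_per_sample

baseline = data_rate(48_000, 16)        # 768 kbit/s

formats = {
    "12-bit, 12.288 MHz, no noise shaping": (12_288_000, 12),
    "10-bit, 192 kHz, noise shaped": (192_000, 10),
    "1-bit, 3.072 MHz, sigma-delta": (3_072_000, 1),
    "18-bit, 48 kHz": (48_000, 18),
}

ratios = {name: data_rate(fs, bits) / baseline
          for name, (fs, bits) in formats.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g} x baseline")
```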
Misconceptions
In the world of hi-fi, it is common to encounter misleading statements about DSD, the marketing name for 1-bit audio. A few examples (found on dsd-guide.com and blog.nativedsd.com):
- “There are no samples, there are no words, there is no code.”
- “DSD is a lot closer to analog than PCM ever thought to be.”
- “[PCM] has really nothing to do with audio. You have minimal resolution at zero crossing, whereby with DSD you have maximum resolution and on and on.”
- “DSD (or the more correct non marketing term, Pulse Density Modulation – PDM) is an analog signal itself.”
These quotes are all nonsense. A DSD signal is discrete in time and amplitude, ergo digital. That is all there is to it. A signal cannot be a little digital. The last quote, about DSD being pulse density modulation, is particularly devious. In figure 6 we see a single period of a sine wave and its DSD representation. Where the sine wave is positive, the DSD signal has an excess of positive-going pulses. If the negative-going pulses are disregarded, there is indeed a resemblance to a PDM signal. This interpretation is, however, not helpful. Any similarity to PDM is a coincidence, not a design intent, and viewing it as such does not aid analysis or understanding of 1-bit audio signals.
A 1-bit signal is best regarded just like any other sampled signal: as a sequence of pulses at fixed intervals, the height of each matching the sample value, in this case either +1 or -1. Moreover, the sigma-delta modulation process by which the signal is created operates on discrete-time samples. It is a method for quantising a sampled signal with noise shaping, nothing more, and it can be used with any bit depth. There is certainly not anything analogue about it. As it happens, actual PDM is in fact analogue; it just has nothing to do with DSD.
Compare this with a regular dithered (PCM) signal. Since the continuous input value does not (generally) match one of the available quantisation levels, the output fluctuates between a few close values such that the average over a small interval is close to the true signal value. This is illustrated in figure 7 with one period of a sine wave sampled and quantised to 6-bit resolution.
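The averaging behaviour is easy to demonstrate with a constant input. In the sketch below, a value of 10.3 quantisation steps (an arbitrary choice for illustration, not taken from figure 7) is quantised repeatedly with TPDF dither. Each individual output is an integer, yet the mean converges on the true value.

```python
import math
import random

rng = random.Random(1)                  # fixed seed for repeatability
true_value = 10.3                       # in units of one quantisation step

outputs = []
for _ in range(100_000):
    dither = rng.random() - rng.random()        # TPDF dither, +/-1 step
    outputs.append(math.floor(true_value + dither + 0.5))

levels = sorted(set(outputs))
average = sum(outputs) / len(outputs)
print("levels used:", levels)
print("average:", average)
```

Only the three levels nearest the input are ever produced, and their relative frequencies are such that the average lands on the input value.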
The 1-bit case is conceptually no different. However, since only the two extreme levels are possible, the sample values cannot visually track the curve the way they do with multi-bit quantisation. Also, because the error in each sample is larger, a higher sample rate is needed in order to obtain a good average value over a reasonable length of time.
In summary, DSD is a degenerate form of digital audio (PCM) with only two possible sample values, +1 and -1. Although somewhat tricky to produce with acceptable noise levels, it is still very much a digital format and is readily analysed as such. For instance, the spectrum plots above were created by applying an FFT (Fast Fourier Transform) to the 1-bit sample values. No magic, just maths.
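As a small demonstration that ordinary spectral analysis applies to 1-bit data, the sketch below generates a 1-bit stream and evaluates a single DFT bin at 1 kHz directly from the +1/-1 samples. Two things are assumptions made for brevity: the stream comes from a simple first-order sigma-delta modulator rather than a realistic one, and a single bin is computed by direct summation instead of a full FFT. The names are illustrative.

```python
import cmath
import math

def sigma_delta_1bit(signal):
    """First-order sigma-delta modulator returning a list of +1/-1 samples."""
    out = []
    integrator = 0.0
    feedback = 0.0
    for x in signal:
        integrator += x - feedback
        y = 1 if integrator >= 0 else -1
        feedback = y
        out.append(y)
    return out

def dft_amplitude(samples, freq, fs):
    """Amplitude of the sinusoidal component at `freq`, from one DFT bin."""
    acc = sum(s * cmath.exp(-2j * math.pi * freq * k / fs)
              for k, s in enumerate(samples))
    return 2 * abs(acc) / len(samples)

fs = 3_072_000
tone = 1000
n_samples = 8 * fs // tone              # eight whole periods of the tone
signal = [0.5 * math.sin(2 * math.pi * tone * n / fs) for n in range(n_samples)]
bits = sigma_delta_1bit(signal)

amplitude_1k = dft_amplitude(bits, tone, fs)
print("amplitude at 1 kHz:", amplitude_1k)
```

The amplitude recovered from the 1-bit samples should come out close to the 0.5 of the original sine, with no special treatment of the data beyond reading each sample as +1 or -1.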
The many people who exploded 20+ years ago when I said “DSD is just a form of highly oversampled, noise-shaped PCM” need to read this.
It didn’t help that Sony tried to equate it to PWM, which of course it is not.
Interesting exploration. Not complete though. All digital systems are followed by some sort of analogue low-pass filter, active or passive, in hardware such as a speaker system or in the human ear. Taking that out of the calculation is also a misconception. It would be helpful to go a bit further and include low-pass filtering in the walk-through.
The analogue filter is not part of the digital representation of the signal, which is what this article discusses. The reconstruction process, which does include an analogue filter, is discussed elsewhere.