A frequent claim by detractors of digital audio is that the time resolution is equal to the sampling interval, 22.7 μs for the CD format. This is incorrect. Although there is a limit, it is much smaller, and it does not depend on the sample rate.

The sampling theorem shows that a band-limited signal can be sampled and reconstructed exactly if the sample rate is higher than twice the bandwidth. The catch is that it assumes infinite precision in the samples, which of course is not the case with 16-bit or even 24-bit audio. A finite precision does introduce a small error. However, and this is crucial, this error is unrelated to the sample interval.

Consider a sine wave sampled at regular intervals. Now shift the sine wave sideways a little as shown by the dashed curve in figure 1. Notice that the values where the curve crosses the vertical grid lines have all changed even though the shift is smaller than one interval.

Figure 1: Sine waves. The dashed curve is the solid curve shifted sideways by less than one sample interval.

The time resolution of the sampled signal is equivalent to the minimum shift required to produce a change in the sampled values. The amount by which a sample value changes with a given time shift depends on the slope of the signal around the sample point. Higher frequencies have steeper slopes and thus smaller displacements can be recorded.
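
To make this concrete, here is a small numerical sketch in Python (the frequency, shift, and word length are arbitrary choices for illustration): shifting a 1 kHz sine by 100 ns, less than 1/200 of the CD sample interval, still changes most of the quantised 16-bit sample values.

```python
import numpy as np

fs = 44100          # sample rate, Hz
f = 1000.0          # signal frequency, Hz
shift = 100e-9      # time shift of 100 ns, about 1/227 of a sample interval
n = np.arange(fs)   # one second of sample instants

def quantise16(x):
    """Round to the nearest 16-bit step (full scale -1..1)."""
    return np.round(x * 32767).astype(np.int64)

x0 = quantise16(np.sin(2 * np.pi * f * n / fs))
x1 = quantise16(np.sin(2 * np.pi * f * (n / fs + shift)))

print(np.count_nonzero(x0 != x1), "of", n.size, "samples differ")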

Going back to the sine wave, we observe that it has the steepest slopes at the zero crossings. This maximum slope for a sine wave of frequency \(f\) and amplitude \(a\) is \(2πfa\). Let \(d_{min}\) be the smallest recordable difference in sample values. In the immediate vicinity of the zero crossing, the sine wave can be approximated by a straight line, so we are looking for the time \(t_{min}\) at which \(2πfat_{min} = d_{min}\). Solving for \(t_{min}\) yields the following equation: \[t_{min} = \frac{d_{min}}{2πfa}\]

For a sample precision of \(b\) bits, the size of one step (the value of the LSB) is \(2 / (2^b – 1)\). To trigger a change from 0 to 1 in the LSB, the value of the sine wave must reach half this amount. In other words, \(d_{min} = 1 / (2^b – 1)\). Substituting this into the previous equation, we arrive at the final formula: \[t_{min} = \frac{1}{2πfa(2^b – 1)}\]

With CD quality audio, 16 bits at 44.1 kHz, the best-case time resolution is obtained with a full-scale signal at 22.05 kHz. The above formula then yields \[t_{min} = \frac{1}{2π \times 22050\ \mathrm{Hz} \times 1 \times (2^{16} – 1)} = 110\ \mathrm{ps}\]

For a more typical 1 kHz signal at -20 dB, i.e. with an amplitude of 0.1, the same calculation produces a value of 24 ns. Although not nearly as good as the best case, it is still nearly one thousand times better than the erroneously claimed limit of one sample interval.
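
The formula is easy to check numerically. Here is a minimal Python sketch (the function name is my own) reproducing both figures, plus the best case at a doubled sample rate:

```python
import math

def t_min(f, a, b):
    """Smallest detectable time shift, in seconds, for a sine of
    frequency f (Hz) and amplitude a (relative to full scale = 1)
    quantised to b bits."""
    return 1 / (2 * math.pi * f * a * (2 ** b - 1))

print(t_min(22050, 1.0, 16))   # best case at CD quality: ~1.1e-10 s (110 ps)
print(t_min(1000, 0.1, 16))    # 1 kHz at -20 dB: ~2.4e-8 s (24 ns)
print(t_min(48000, 1.0, 16))   # best case when sampling at 96 kHz: ~51 ps
```

The last line anticipates the next point: a higher sample rate only helps by permitting a higher signal frequency \(f\).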

Notice that the calculation of the time resolution does not involve the sample rate. Nevertheless, the sample rate can make an indirect difference, since the signal frequency appears in the formula: a higher sample rate permits higher-frequency signals, which means smaller time shifts can be recorded at a given sample resolution. Of course, this only matters if the signal in question actually contains such high-frequency components, and acoustically recorded music generally does not. It is also doubtful that such small time differences are in any way audible.

15 Replies to “Time resolution of digital audio”

  1. Even though the result is correct, there is a typing mistake: 20050 Hz instead of 22050 Hz in the timing resolution formula.
    Rgds.

      1. Please help me, I’ve been trying to understand this…

        How can digital music with a sample rate of 44.1 kHz or 48 kHz accurately convey a waveform at half that frequency? I know I’m missing something, but I don’t know what it is. You state here that as the wave moves in the pico- or nanosecond range, the amplitude changes enough to be registered (based on the bit depth) at the next vertical interval… but how are those time periods smaller than the sampling interval itself?

        A simplistic understanding of sample rate would have me believe that a 24 kHz tone could only have TWO samples of the wave for a 48 kHz sampling frequency. Depending on the time those samples were taken, I might have the highest and lowest points in that wave, or something in between, or two states of zero where it crosses from + to -.

        What use is that? Assuming best case, that I had samples that were always positive or negative, I could have 2 samples of a wave and know its frequency, and assume (and generate) a sine wave… but what if the wave is of a different shape, say a sawtooth or triangle wave? I’d need a whole lot more samples.

        Or am I just misunderstanding waves/frequency and the math… is any wave that deviates from a pure sine wave simply a combination of other frequencies, such that any deviation from a pure sine wave at 24 kHz would by definition be including higher frequency content? I guess that could sort of make sense, but it’s not satisfying. Synthesizers bend and shape waves in all sorts of ways to generate different sounds. Pure sine wave bass sounds are boring for example, so if you’re watching an oscilloscope of some electronic bass you’ll notice the waveform shift and change (in ways that often no amplifier or speaker could perfectly track in the first place, nor even our ears if we were to get super precise about it, which raises its own philosophical questions about what such a sound should or would sound like, or if it is simply an unrealistic impossible sound, or just a physical limitation but a theoretical sound). Anyway I don’t think of those modified low frequency sine waves that bend or turn into right-angled triangles in either orientation as having high frequency content, I think of them as being a (let’s say) 40 Hz tone that has been bent or misshapen to have a different character to the sound, but always retaining the same period.

        Math might say that such a waveform could be generated or approximated or expressed as a combination of many frequencies at many amplitudes… but wtf. That sounds like digitizing an analog thing… just as a picture can be a representation of what we see, it’s not what we see. A combination of LF and HF content might reconstruct a waveform that looks like a bent-over 30 Hz wave, but in real life it’s NOT that. It’s a damn mangled 30 Hz wave, nothing HF involved whatsoever.

        I still don’t get how you have any detail or nuance in signals approaching the nyquist frequency with so few digital samples. How can it PERFECTLY reconstruct anything within the band limited signal or half the sampling rate? If I snap a picture of a moving object 100 times a second, I can’t perfectly reconstruct JACK SH*T about what happened BETWEEN those photos, I can only assume and interpolate based on an algorithm. Is digital audio like taking pictures or not? If it is, then there’s no way we’re getting enough data at high frequencies… we’re getting a couple shots and then guessing, when really anything could have happened between the sample points. I’m guessing I’m wrong but I’m hoping I’ve said enough for you to understand the nature of my misunderstanding and offer a relatively simple (little math) explanation or analogy that I can grasp. Thanks.

        1. The sampling theorem explains how the reconstruction works for frequencies strictly less than half the sample rate. As you note, at exactly half the sample rate the samples become useless: depending on where they land on the wave, they merely alternate between two values (or are all zero), so amplitude and phase can no longer be told apart.

          Regarding the waveforms produced by synthesisers, maths (specifically, the Fourier transform) says any signal is representable as a sum of sine waves. Both ways of looking at a signal are equally valid; neither is more true or correct than the other. We simply choose whichever is more convenient in each situation. If instead of an oscilloscope you connected a spectrum analyser to your synthesiser, the plain sine wave would yield a single spike on the display while the triangle wave would produce additional spikes of varying height at higher frequencies. In fact, some synthesisers work by mixing simple sine waves of different frequencies, all using analogue electronics.
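
          If you would like to see this without a spectrum analyser, here is a rough Python sketch (all parameters invented for illustration) that lists the frequencies present in a sine and in a triangle wave:

          ```python
          import numpy as np
          from scipy import signal

          fs = 48000
          t = np.arange(fs) / fs     # one second of samples
          f0 = 440.0                 # fundamental frequency, arbitrary

          sine = np.sin(2 * np.pi * f0 * t)
          tri = signal.sawtooth(2 * np.pi * f0 * t, width=0.5)  # triangle

          for name, x in (("sine", sine), ("triangle", tri)):
              spectrum = np.abs(np.fft.rfft(x)) / len(x)
              peaks = np.nonzero(spectrum > 1e-3)[0]  # bins are 1 Hz apart here
              print(name, "->", peaks[:6], "Hz")
          ```

          The sine yields a single spike at 440 Hz, while the triangle additionally shows the odd harmonics (1320 Hz, 2200 Hz, and so on) at decreasing levels.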

          Finally, consider your example of photographing a moving object 100 times per second. If we know nothing about the object, it is indeed possible that it has travelled to the moon and back between two of our pictures. However, if we know the maximum speed and acceleration it is capable of, there are limits to how far it can have moved between the observations. This is analogous to the bandwidth limitation in the sampling theorem. The faster something can change, the more frequently we must measure it in order to capture its behaviour in full.

          A lot of these concepts can appear unintuitive, so it’s understandable to be a little confused.

          1. Thank you for the straightforward response, these things are indeed unintuitive. After I posted the question, I continued to do a LOT more reading. (I wish there were more articles on your page, or that they were more in depth, because I really enjoy your content.)

            After doing some more reading, particularly this entire pdf which was extremely helpful (http://www.dspguide.com/CH3.PDF), I understood what was happening and what I was missing, in a very detailed manner.

            I feel kind of stupid seeing as how I answered the primary focus of my question IN my question – namely that there are NO subtleties (in a visual time-domain graph of a signal, i.e. what an oscilloscope would produce) in the highest frequency content (theoretically 22050 Hz, although due to the aliasing/reconstruction filter(s) there won’t be much if any content there) like I was imagining there could be. Obviously we can reconstruct the waves based on any number of points greater than 2, if we can make certain assumptions (the signal band), because our “assumption” that these waves are pure sine waves is not actually an assumption, but a hard fact due to the constraints of the signal’s frequency content.

            It seems so obvious now, but I had been thinking about this for a while (because I was shopping for a good DAC, and I like to understand everything) and was frustrated. I was assuming digital audio was giving me sawtooth or square waves in the highest frequencies, that were only smoothed out by the limitations of my amp and speaker not being able to perfectly track them. I also wondered how a DAC chip or op amp could accurately change its voltage so fast, but that PDF gets into slew rates a bit too. You should check it out and see what you think! Lots of good info in there.

            After doing so much reading about digital audio, and how it actually CAN perfectly reconstruct (in theory) any appropriately band-limited analog signal, it actually exposes a lot of total BS that some audiophiles talk about regarding ‘analog’ vs digital audio reproduction. The primary factor limiting the realization of that perfection is the low-pass filters at the recording and playback stages, and then to a lesser degree obviously the imperfect nature of the analog components – op amps, resistors/capacitors, power supplies etc. A lot of the same things apply to class D/T amps as well… it seems like they’re kind of like “1-bit” DACs (as I understand it virtually all DACs work like that… a stream of single bits, not chunks of 16/24/32 bits at a given sample rate – the advertised bit depth is simply the end result in terms of effective precision), where the slew rate is in the MHz range and the output voltage tracks the input voltage with some multiplication.

            If you know much about class D amps – or especially if you know anything about the spin on class D amps known as class T (Tripath chip-based amps – I have a TA3020-based amp that is rather epic), I would love to understand some of the intricacies of how those work, or what the differences are. Perhaps a future article idea, seeing as how common class D is these days, and how misunderstood it is. It was loathed/mistrusted for a while by audiophiles but for some time now even they have realized that good implementations are absolutely top notch and every bit as good as A/B, with the potential advantage of higher damping factors (is that the same as or directly related to feedback?). But even today, people say they sound “drier” and less warm or full in the midrange in particular. Is this because they’re actually MORE accurate, and people prefer the inaccuracies of A/B amps much like some prefer the (much greater) inaccuracies of tube amps, or is something else going on?

            I ended up buying a Khadas Tone Board by the way, which was to upgrade from my HRT Music Streamer II. I didn’t know I wanted an upgrade until I heard my friend’s Schiit Modi 2 and the difference was actually significant, which surprised me. The Khadas is very impressive, the increase in fidelity is not subtle. What aspect(s) in particular are responsible for that change I do not know.

        2. Hi Lee. I find the “100 pictures a second” viewpoint (which is time domain) a poor way to think about what frequencies can be captured (frequency domain, obviously).

          I don’t want to write a book here in the comment, so I’ll leave a link—I hope that’s acceptable here, my site isn’t monetized, I’ve never made a cent off my writing or videos, just trying to help. But I’ll give a thumbnail explanation here:

          Sampling audio is equivalent to multiplying the audio signal by a unit impulse train running at the sampling frequency. At each sampling instant, you capture the signal at that moment in time, and at all other times you get zero. This is amplitude modulation by a pulse train, encoded as a digital value: Pulse Code Modulation—PCM.

          Amplitude modulation creates sidebands. So the digitized signal captures the audio signal, and it also creates a sideband of it mirrored around the sampling frequency (and higher, but for this discussion we only care about the sideband nearest the audio). In other words, with a SR of 48 kHz, everything up to 24 kHz will have an inverted copy extending downward from 48 kHz. When we play back, the DAC filters out frequencies above 24 kHz and we are left with the intended audio. But if we had tried to sample audio that extends to 30 kHz, the 30 kHz component would have a sideband at 48 kHz – 30 kHz = 18 kHz (an “alias” of 30 kHz). The DAC will remove everything above 24 kHz, and we are left with this alias pollution in the audio.
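
          A quick numerical sketch of that alias (my own toy numbers): the samples of a 30 kHz sine taken at 48 kHz are exactly those of an 18 kHz sine with inverted phase.

          ```python
          import numpy as np

          fs = 48000
          n = np.arange(100)                  # a few sample instants

          x30 = np.sin(2 * np.pi * 30000 * n / fs)   # "illegal" 30 kHz input
          x18 = np.sin(2 * np.pi * 18000 * n / fs)   # its alias

          print(np.allclose(x30, -x18))       # True: identical samples, inverted
          ```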

          https://www.earlevel.com/main/tag/sampling-theory-series/?orderby=date&order=ASC

          Nigel

  2. Is this relevant to the claims made about MQA?

    Could you do an article about MQA either debunking or proving if it has any merit?

  3. Nice articles! Much appreciated…

    So…my thought is that dither removes this level-dependent time quantization—at the expense of noise, of course. Would you agree?

    I’ve just run it through my head at this point, but I don’t see why that wouldn’t be true.

    1. The primary purpose of dither is, simply put, to convert quantisation distortion to uncorrelated noise. It also has the effect of allowing changes smaller than one bit level to be detected through shifts in the average noise level. This, of course, implies that phase shifts smaller than in the above calculation can also be detected. The amplitude of the signal still matters, though. The difference with dither is that there is no longer a hard cut-off.
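
      As a rough illustration (a Python sketch with invented numbers), a constant level of a quarter LSB, which plain rounding erases entirely, survives in the average of the dithered output:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      lsb = 2 / (2 ** 16 - 1)        # one 16-bit step, full scale -1..1
      x = 0.25 * lsb                 # constant signal, a quarter of one step
      N = 10 ** 6

      # TPDF dither: sum of two uniform randoms, 2 LSB peak to peak
      d = rng.uniform(-0.5, 0.5, N) + rng.uniform(-0.5, 0.5, N)
      q = np.round(x / lsb + d) * lsb    # dither, then quantise to 16 bits

      print(np.round(x / lsb) * lsb)     # plain rounding: exactly 0
      print(q.mean() / lsb)              # dithered average: ~0.25 LSB
      ```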

      1. Unfortunately, at the moment I don’t have time except to think about it intuitively, but I think dither helps you at all signal amplitudes. It seems it would help a relatively small amount at large amplitudes, where the time quantization issue is small, and a relatively large amount at small amplitudes, where the issue is large, resulting in the same thing regardless of amplitude. Like you, I’m referring to a continuous tone here. A singular event would be a different discussion, as dither essentially adds a random error compared to non-dithered quantization.

        Just a curiosity for thought. None of this changes the reason that brought me here, which was the claim of timing limitations for musical events (~23 µs) in 44.1 kHz sampled audio (in this case, used to justify MQA). You give a great explanation of why this is bunk, and most importantly that it’s not sample rate dependent at all. That alone is a sufficient counter to the claim. I’m just adding that I think dither further drives home the point that there is no effective timing limitation at 44.1 kHz.

        1. Dither makes it possible to detect smaller phase differences at any amplitude. A phase shift by a given amount is still easier to detect if the signal amplitude is higher, which I think is what you’re also saying.

          1. Well…the amplitude doesn’t matter either. Put another way, if you TPDF dither and truncate a 24-bit audio file to 16-bit, then null the two, a white noise floor remains. In other words, the 16-bit audio is exactly the audio of the 24-bit file with a very low amplitude, fixed-level white noise floor mixed in. Since the only difference is that noise floor, there can’t be a measurable or perceivable difference in timing resolution. (Or, you could say there is, but it’s so small the noise floor is hiding it.) So, the only difference between 24-bit and 16-bit is the noise floor.

          2. That is not quite correct. After the dither noise is added, the samples are rounded to the nearest 16-bit value. This means that if the dither is subtracted again, what you get is not quite the same as the original. The point is that this residual error is uncorrelated with the signal. The result is that the measured amplitude or phase of a periodic signal becomes more accurate as the window of observation is made longer. The higher the level of the dither noise, the longer the observation needs to be to achieve the same accuracy.
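
            To sketch that effect (all parameters here are assumed for illustration), one can estimate the phase of a dithered, 16-bit quantised sine by correlating with quadrature references; the error shrinks as the observation window grows:

            ```python
            import numpy as np

            rng = np.random.default_rng(1)
            fs, f, phase = 48000, 1000.0, 0.1   # true phase offset in radians
            lsb = 2 / (2 ** 16 - 1)

            for seconds in (0.01, 0.1, 1.0):
                n = np.arange(int(fs * seconds))
                x = 0.1 * np.sin(2 * np.pi * f * n / fs + phase)
                d = rng.uniform(-0.5, 0.5, n.size) + rng.uniform(-0.5, 0.5, n.size)
                q = np.round(x / lsb + d) * lsb    # TPDF dither, round to 16 bits
                # correlate with quadrature references to recover the phase
                s = np.sin(2 * np.pi * f * n / fs)
                c = np.cos(2 * np.pi * f * n / fs)
                err = abs(np.arctan2(q @ c, q @ s) - phase)
                print(seconds, "s ->", err, "radians of phase error")
            ```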

  4. I disagree. First, I describe the reduction as truncation, and in hardware this is typically what happens—most easily seen in hardware that uses a shorter DAC word than the available data: the lower bits are simply not connected. However, rounding is only trivially different—it’s the same as truncation with a half-bit DC offset added. This is meaningless in audio since we can’t hear a constant half-bit offset at any sample size.

    The bottom line is quantization error. If you want to split some of its effects off to attribute to timing resolution errors, that’s fine, but the same methods that address quantization error (such as more bits, or dither) address the timing issue.

    Honestly, I see this as a non-problem. If you’re not dithering smaller word sizes, and can hear a problem, you’ve got a problem. If you can’t, you don’t have a problem. 🙂 But dither ensures you can’t hear a problem (quantization issues are obscured by the random noise). Again, I appreciate you showing such error is not attributable to sample rate (although doubling the sample rate does improve s/n by 3 dB…). I’m just extending your conclusion.
