The cache coherency bug
This is the story of one of the strangest bugs I’ve ever seen, and also one of the first I worked on in 3D graphics. I have an open disagreement with one of our driver developers about whether I got it right. What we can say is that it hasn’t come back, and I’m satisfied that my explanation makes sense.
The setup
Sometimes, when the user resizes the window or changes the colors of the background gradient, the background is displayed corrupted.
I don’t have a screenshot, but the corruption looked very similar to this photograph:
The exact appearance of the corruption was random, but it always manifested as horizontal mis-coloured lines. It didn’t take terribly long to determine that the bug only occurred on systems with dedicated GPUs (at the time, the only ones we had to test were Nvidia). It was also interesting that people with faster CPUs tended to see it more often.
The investigation
Every time the window was resized, or the background gradient was modified, we would generate a new image and upload it to the GPU. I confirmed that the image we were generating was always correct in memory, and when exported as a PNG. Yet, when I used RenderDoc, I could see that the background image on the GPU was corrupted.
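For context, the upload path looked roughly like this (a simplified sketch assuming an OpenGL-style texture upload; the function and the gradient itself are illustrative, not the real renderer’s code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <GL/gl.h>

// Regenerate the background gradient on the CPU and hand it to the driver.
// Called whenever the window is resized or the gradient colours change.
void upload_background(GLuint texture, int width, int height)
{
    std::vector<std::uint8_t> pixels(std::size_t(width) * height * 4);
    for (int y = 0; y < height; ++y) {
        // Simple vertical blend as a stand-in for the real gradient.
        auto shade = std::uint8_t(255 * y / std::max(height - 1, 1));
        for (int x = 0; x < width; ++x) {
            std::uint8_t* px = &pixels[4 * (std::size_t(y) * width + x)];
            px[0] = px[1] = px[2] = shade;
            px[3] = 255;
        }
    }

    // The driver reads `pixels` (ultimately via DMA) to copy it to the GPU.
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
}
```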
This was a pickle, since everything I had control over looked right. We were definitely generating a valid gradient image, that image was being stored in memory, and the contents of that memory were being DMA’d over the PCIe bus to the GPU. The only thing I had found was that introducing other calls or sleeps in between generating the image and uploading it could fix the issue some or all of the time, depending on the exact calls.
But what could calling other random functions possibly achieve? It’s not like sleeping could change what was uploaded to the GPU.
Yes it could
Here is my mental model of what was happening: the CPU writes the gradient image into main memory, and the DMA engine then copies it from main memory over PCIe to the GPU.
By all the metrics I could use to directly inspect things, the problem was happening at the DMA step. If you read the title of this page, you will realize that a better mental model is this: the CPU’s writes don’t go straight to main memory; they land in the CPU caches and are only written back to main memory some time later.
And what was actually happening was this: the DMA transfer was kicking off before all of those cache lines had been written back.
The portion of the image that had not yet been flushed out of the caches never made it across; the GPU received whatever happened to already be stored in that memory. This explained everything: it only happened with dedicated GPUs, because an integrated GPU reads through the same memory hierarchy the CPU writes into; people with faster CPUs saw it more often, because they reached the upload sooner and left less time for the cache to write itself back; and extra calls or sleeps between generating and uploading helped, because they gave the cache time to flush on its own.
The solution, unpleasant as it may be, was to loop over the entire image buffer and use the _mm_clflush() intrinsic to force it out of the CPU caches and into main memory before uploading it to the GPU.
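If you’re curious what that looks like, here is a minimal sketch of the flush loop (the function name and the 64-byte line size are my own assumptions, not the real codebase’s): walk the buffer one cache line at a time, flush each line, then fence so the write-backs complete before the upload call.

```cpp
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2)
#include <cstddef>
#include <cstdint>

// Flush every cache line covering [data, data + size_bytes) back to main
// memory, so a non-snooping DMA read sees the freshly written pixels rather
// than whatever stale contents were there before.
void flush_to_main_memory(const void* data, std::size_t size_bytes)
{
    constexpr std::size_t kCacheLine = 64;  // cache line size on x86
    const auto* p = static_cast<const std::uint8_t*>(data);
    for (std::size_t offset = 0; offset < size_bytes; offset += kCacheLine) {
        _mm_clflush(p + offset);  // flush the line containing this address
    }
    _mm_mfence();  // ensure the flushes complete before the DMA starts
}
```

We ran the equivalent loop over the gradient buffer immediately before the texture-upload call.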
How is this possible?
The team was just happy to see this bug resolved, even if the fix was described as "a little bit magic". However, one of the driver devs who I respected very much was not satisfied with this explanation.
"PCIe is cache coherent; this shouldn’t be possible."
Not wanting to reject the wisdom of my elders, I did a little digging and found that there is a flag that can be set on PCIe requests called NOSNOOP. This flag tells the system that the request does not need to snoop the CPU caches: the data is served straight from main memory, without waiting for any cache flushing.
One can imagine why Nvidia would implement this feature, as it could provide a nice performance boost; in fact, you can find references directly from Nvidia recommending its use (such as this one). I was also able to find a few references online to other graphics developers encountering the same problem and using the same technique.
This left me satisfied that we had our answer, particularly considering the bug still had not resurfaced many months later. I still don’t think he’s entirely convinced though 😆.