The cache coherency bug
This is the story of one of the strangest bugs I’ve ever seen, and also one of the first I worked on in 3D graphics. I have an open disagreement with one of our driver developers about whether I got it right. What we can say is that it hasn’t come back, and I’m satisfied that my explanation makes sense.
The setup
Sometimes, when the user resizes the window or changes the colors of the background gradient, the background is displayed corrupted.
I don’t have a screenshot, but the corruption looked very similar to this photograph:
The exact appearance of the corruption was random, but it always manifested as horizontal mis-coloured lines. It didn’t take terribly long to determine that the bug only occurred on systems with dedicated GPUs (at the time, the only ones we had to test were Nvidia). It was also interesting that people with faster CPUs tended to see it more often.
The investigation
Every time the window was resized, or the background gradient was modified, we would generate a new image and upload it to the GPU. I confirmed that the image we were generating was always correct in memory, and when exported as a PNG. Yet, when I used RenderDoc, I could see that the background image on the GPU was corrupted.
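For context, the upload path looked roughly like this (a simplified sketch assuming an OpenGL-style texture upload; the function and the gradient itself are illustrative, not the real renderer’s code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <GL/gl.h>

// Regenerate the background gradient on the CPU and hand it to the driver.
// Called whenever the window is resized or the gradient colours change.
void upload_background(GLuint texture, int width, int height)
{
    std::vector<std::uint8_t> pixels(std::size_t(width) * height * 4);
    for (int y = 0; y < height; ++y) {
        // Simple vertical blend as a stand-in for the real gradient.
        auto shade = std::uint8_t(255 * y / std::max(height - 1, 1));
        for (int x = 0; x < width; ++x) {
            std::uint8_t* px = &pixels[4 * (std::size_t(y) * width + x)];
            px[0] = px[1] = px[2] = shade;
            px[3] = 255;
        }
    }

    // The driver reads `pixels` (ultimately via DMA) to copy it to the GPU.
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
}
```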
This was a pickle, since everything I had control over looked right. We were definitely generating a valid gradient image, that image was being stored in memory, and the contents of that memory were being DMA’d over the PCIe bus to the GPU. The only thing I had found was that introducing other calls or sleeps in between generating the image and uploading it could fix the issue some or all of the time, depending on the exact calls.
But what could calling other random functions possibly achieve? It’s not like sleeping could change what was uploaded to the GPU.
Yes it could
Here is my mental model of what was happening: the CPU writes the gradient image into main memory, and the DMA engine then copies it from main memory over PCIe to the GPU.
By all the metrics I could use to directly inspect things, the problem was happening at the DMA step. If you read the title of this page, you will realize that a better mental model is this: the CPU’s writes don’t go straight to main memory; they land in the CPU caches and are only written back to main memory some time later.
And what was actually happening was this: the DMA transfer was kicking off before all of those cache lines had been written back.
The portion of the image that had not yet been flushed out of the caches never made it across; the GPU received whatever happened to already be stored in that memory. This explained everything: it only happened with dedicated GPUs, because an integrated GPU reads through the same memory hierarchy the CPU writes into; people with faster CPUs saw it more often, because they reached the upload sooner and left less time for the cache to write itself back; and extra calls or sleeps between generating and uploading helped, because they gave the cache time to flush on its own.
The solution, unpleasant as it may be, was to loop over the entire image buffer and use the _mm_clflush() intrinsic to force it out of the CPU caches and into main memory before uploading it to the GPU.
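If you’re curious what that looks like, here is a minimal sketch of the flush loop (the function name and the 64-byte line size are my own assumptions, not the real codebase’s): walk the buffer one cache line at a time, flush each line, then fence so the write-backs complete before the upload call.

```cpp
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2)
#include <cstddef>
#include <cstdint>

// Flush every cache line covering [data, data + size_bytes) back to main
// memory, so a non-snooping DMA read sees the freshly written pixels rather
// than whatever stale contents were there before.
void flush_to_main_memory(const void* data, std::size_t size_bytes)
{
    constexpr std::size_t kCacheLine = 64;  // cache line size on x86
    const auto* p = static_cast<const std::uint8_t*>(data);
    for (std::size_t offset = 0; offset < size_bytes; offset += kCacheLine) {
        _mm_clflush(p + offset);  // flush the line containing this address
    }
    _mm_mfence();  // ensure the flushes complete before the DMA starts
}
```

We ran the equivalent loop over the gradient buffer immediately before the texture-upload call.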
How is this possible?
The team was just happy to see this bug resolved, even if the fix was described as "a little bit magic". However, one of the driver devs who I respected very much was not satisfied with this explanation.
"PCIe is cache coherent; this shouldn’t be possible."
Not wanting to reject the wisdom of my elders, I did a little digging and found that there is a flag that can be set on PCIe requests called NOSNOOP. This flag tells the system that the request does not need to snoop the CPU caches: the data is served straight from main memory, without waiting for any cache flushing.
One can imagine why Nvidia would implement this feature, as it could provide a nice performance boost; in fact, you can find references directly from Nvidia recommending its use (such as this one). I was also able to find a few references online to other graphics developers encountering the same problem and using the same technique.
This left me satisfied that we had our answer, particularly considering the bug still had not resurfaced many months later. I still don’t think he’s entirely convinced though 😆.