The Intel driver bug

This is the story of one of the worst bugs I have ever worked on.

Setting the scene

I had spent months working on a revised architecture for an existing module. Product release was imminent, and I had wound down my work on that branch. The tests were passing and my coworkers were using my code every day. I had moved on to adding new features for a future release, basking in the joy of working in my new rational codebase.

One of the unfortunate things in the code I had inherited was that it solved many problems by unbinding the OpenGL context. It seemed like any time there was a bug involving threads, the go-to solution was to call the context unbind wrapper function. Since we now had a carefully designed threading model where each thread was exclusively paired with an OpenGL context, I had removed all ~30 of these calls and deleted the wrapper function from the codebase.

Keep that in the back of your mind.
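
For readers who have not met one, a context unbind wrapper of the kind described above might look roughly like the sketch below. This is not the original code. The names `ContextHandle`, `make_current`, `unbind_context`, `display_thread_body` and `run_display_threads` are all hypothetical, and the driver is replaced by a thread-local variable so the example runs anywhere. On Windows, the real bind call would be `wglMakeCurrent`, and unbinding amounts to calling it with null handles. Unbinding just before each thread exits is an assumption made for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a platform context handle (e.g. HGLRC on Windows).
using ContextHandle = int;
constexpr ContextHandle kNoContext = 0;

// Simulates the driver's per-thread "current context" slot.
thread_local ContextHandle g_current_context = kNoContext;

// Stand-in for the platform bind call (wglMakeCurrent on Windows).
bool make_current(ContextHandle ctx) {
    g_current_context = ctx;
    return true;
}

// The "context unbind wrapper": leaves the calling thread with no current
// context. A real Windows version would call wglMakeCurrent(NULL, NULL).
bool unbind_context() {
    return make_current(kNoContext);
}

// One display thread: bind its own context, do its work, unbind before exit.
// Returns true if the thread finished with nothing bound.
bool display_thread_body(ContextHandle ctx) {
    make_current(ctx);
    assert(g_current_context == ctx);  // this thread is paired with ctx only
    // ... rendering would happen here ...
    unbind_context();
    return g_current_context == kNoContext;
}

// Runs one thread per display, each with its own context, and returns how
// many of them exited with no context bound.
int run_display_threads(int displays) {
    std::atomic<int> clean_exits{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < displays; ++i) {
        ContextHandle ctx = i + 1;  // any nonzero value works as a fake handle
        threads.emplace_back([ctx, &clean_exits] {
            if (display_thread_body(ctx)) ++clean_exits;
        });
    }
    for (auto& t : threads) t.join();
    return clean_exits.load();
}
```

With a strict one-thread-per-context model the unbind should be redundant, which is exactly why removing those calls looked safe.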

Disaster sans mitigation

One day I got a report from the "last mile" QA department:

"Some of our example programs segfault on program exit, about a third of the time."

I quickly determined the following facts:

  • It only affected modern Intel GPUs.

  • It only affected programs with multiple displays, and therefore multiple threads and OpenGL contexts. This was only three out of several dozen 3D example programs we had.

  • It only happened at the end of a program that terminated normally, when the GPU driver DLL was unloaded.

  • The segfault was a read-access violation into memory which the driver had allocated and managed internally.

  • This didn’t happen in the previous release.

It didn’t take us long to decide that this was an Intel driver bug. Unfortunately, with so many layers of code involved, we felt we were unlikely to produce a minimal reproducible example in a sane amount of time, which would make it difficult to file a bug report. We also needed to take care of this ASAP, preferably without requiring users to change their drivers, which meant we really wanted the solution to live in our code.

We were going to have to figure out a way around this segfault.

At this point I need to stress that this bug was of no practical consequence, since it only manifested once the user’s code had finished running. If there was no debugger attached, all they would see is an unusual delay before the program finally terminated. However, we felt we had to fix the problem since users would see the segfaults when playing with the example programs in Visual Studio, and probably in their own applications too. We also didn’t want to take the risk that there really was something subtle going on that we did not understand, which could affect the long-term stability of the library.

Grinding away

Over the next month (and at least two weeks’ worth of dedicated man-hours) I chased myriad trains of thought.

Maybe it was to do with device contexts in our windowing layer? Maybe it was two operations being performed simultaneously on different OpenGL contexts in a way the driver would not tolerate? I was already aware of a few such cases, which we had a mutex for.

At one point I was following the lifetime of the specific allocations involved in the read-access violation, without debugging symbols, just hoping I could glean some relevant information about them.

The fact that the problem was intermittent made this all a thousand times worse.

Despair

The frustration was palpable. "I believe in you," my manager said. Except in more words, and with generally more Frenchness.

Unlike him, my faith wavered.

The moment of clarity

One day, I suppose, I thought, "What would my predecessor do?"

We had threads.

We had OpenGL contexts.

It doesn’t take a rocket surgeon to figure out what his solution would have been. And so I went back to the previous release, copy-pasted his context unbind wrapper function, and put it in some semi-random locations that seemed relevant.

The problem went away.

Another couple of hours of narrowing it down by experimentation, and I had two calls in exactly the right locations to prevent the bug (either location on its own had no effect). I added a comment saying rather pointedly "this is an Intel driver bug workaround", committed the fix, and most likely knocked off for a nice cold one.

Reflection

All that pain.

All that frustration.

All those man-hours.

All that re-introducing an unnecessary context unbind.

Just so that some devs with Intel GPUs wouldn’t see an error message.

And I’d do it again today, because hell if I’m going to knowingly ship a segfault to a customer.