Psychic Debugging: Random Crashes
January 24th, 2006
Raymond Chen has written an entry or two about “Psychic Debugging.” The skill resembles everyday psychic abilities in that it is really just intuition and guesswork, but differs in that it produces results that benefit society.
Today one of the developers was working on a weird problem he found in his development environment. He could test his functionality once (it’s an ASPX page), but the second time he tried to use it there was usually a “random” exception on the page that didn’t make any sense. Sometimes, instead of an exception, his worker process just crashed. He also told me that he was trying to use a new API that I knew mapped to a third-party COM object.
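This is where the mystical powers came in handy. I tried running his page once, then running a different page: same weird exceptions and crashes. Since a completely unrelated page was now failing too, the damage had to be in something the whole worker process shared rather than in his page's own logic. It was here that I started to suspect that the heap was being corrupted by the COM library.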
(Let me quickly explain what heap corruption is: suppose I have two objects, X and Y, next to one another in memory. Now suppose I make a mistake writing my program and accidentally write beyond the end of object X. I’ve now screwed up object Y, but I won’t know about it until something tries to use object Y. That might not happen for quite a while, which makes the real cause of the bug awfully difficult to find. To make matters worse, the next time the program runs, the same bad write to object X might scribble all over a completely different object and crash the program somewhere else.)
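To make the X-and-Y picture concrete, here is a toy sketch in Python. It is not how the COM library or the real Windows heap works: a single `bytearray` stands in for the heap, and the names `HEAP`, `write_object`, and `read_object` are made up for illustration. The point is only that the bad write succeeds silently and the damage shows up later, on a different object.

```python
# Toy model of heap corruption. One bytearray stands in for the heap,
# and two 8-byte "objects", X and Y, sit side by side inside it.
# All names here are hypothetical and exist only for this illustration.
HEAP = bytearray(16)
X_OFFSET, X_SIZE = 0, 8
Y_OFFSET, Y_SIZE = 8, 8


def write_object(offset, data):
    """Copy data into the heap at offset without checking the object's size."""
    HEAP[offset:offset + len(data)] = data


def read_object(offset, size):
    """Return the bytes currently stored for an object."""
    return bytes(HEAP[offset:offset + size])


# Y is initialised correctly.
write_object(Y_OFFSET, b"YYYYYYYY")

# The mistake: 12 bytes written into an object that only has room for 8.
write_object(X_OFFSET, b"X" * 12)

# Nothing fails at the moment of the overrun, and X itself looks fine.
x_now = read_object(X_OFFSET, X_SIZE)

# The damage only becomes visible later, when Y is finally used.
y_now = read_object(Y_OFFSET, Y_SIZE)

print(x_now)
print(y_now)
```

Whoever reads Y finds half of it overwritten, and nothing at that point leads back to the write to X that caused it.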
On a hunch I attached a debugger after the first load of the page. When the exception occurred, I ran the !verifyheap command in the SOS extension. This reported that there was a corrupt object on the heap. I ran the same test a few times to confirm the theory.
If you’re lucky, you might see a debug break like this in one of the Windows Debuggers as soon as the heap corruption occurs:
HEAP[heapcorrupt.exe]: Heap block at 001AD700 modified at 001AD780 past requested size of 78
...
ntdll!DbgBreakPoint:
This is just a debug break, so it won’t tear down the process immediately. Depending on the application you’re working on, the real problems could start much later, long after the offending code has finished executing.
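The numbers in that heap message are worth a second look. The rough sketch below pulls them out of the line and does the subtraction. It assumes all three values are hexadecimal, which is how these messages are normally printed; the `decode` helper and its field names are my own invention, not part of any debugger.

```python
import re

# Message text copied from the debugger output quoted earlier.
MESSAGE = (
    "HEAP[heapcorrupt.exe]: Heap block at 001AD700 "
    "modified at 001AD780 past requested size of 78"
)

# Assumption: all three numbers in the message are hexadecimal.
PATTERN = re.compile(
    r"Heap block at ([0-9A-Fa-f]+) modified at ([0-9A-Fa-f]+) "
    r"past requested size of ([0-9A-Fa-f]+)"
)


def decode(message):
    """Extract the addresses and size, plus how far into the block the write landed."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not a heap corruption message")
    block, modified, size = (int(group, 16) for group in match.groups())
    return {
        "block": block,
        "modified": modified,
        "requested_size": size,
        "offset_into_block": modified - block,
    }


info = decode(MESSAGE)
print(hex(info["offset_into_block"]), hex(info["requested_size"]))
```

The stray write landed 0x80 (128) bytes from the start of the block, while the caller only asked for 0x78 (120) bytes. The 8-byte difference would be consistent with a small per-block header sitting in front of the caller's data, which would put the write on the very first byte past the requested size. That is my reading of the layout, not something the message itself states.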
All I can say is that it’s a goddamned miracle that the developer was doing a good job testing his functionality and noticed this. If this had made it to QA, the ensuing mayhem would have made the source of the problem far more difficult to track down. The happy ending here is that the vendor has a fully managed version of their component that we can switch to; hopefully that will be more difficult for them to screw up.