One of several applications I work on is a COM add-in for Excel. To put it mildly, this thing can be a real bear to support sometimes.
I was looking into a very peculiar access violation (AV) today. It seemed to only ever happen if—
- My add-in was installed.
- The Analysis Toolpak was installed.
- I tried to load a very large and very particular workbook from a client.
I took a look at a crash dump and—luckily—was able to reproduce this in my office. This is a very good thing, because when I can’t reproduce it often means that I will have to dress up.
The AV was occurring very deep in some Excel code. It didn’t look like my add-in was doing anything special, or in fact, anything at all.
My first instinct might have been, “Why can’t the Excel team fix their stupid bugs!!?!” But, I have more empathy than that. The AV happened in a function that appeared to put the EBP register through a blender set to “pulverize.”
In other words, it required effort to get an accurate stack trace within this function, and I was far too lazy to pick up the pieces. Any reasonable person would be too—and maybe the bug only occurs with the optimizer turned on. So I will give the Excel team a pass on this one.
So what’s a clever debugger to do? Well, I (begrudgingly) know enough about Excel to know that the Analysis Toolpak is a bit of a dinosaur—apparently just like many Office users. It is actually ANALYS32.XLL, nothing more than a standard DLL with special exports. It is a relic of add-in architectures gone by.
So I used GFLAGS.EXE to set WinDbg as the startup debugger for Excel, and fired it up.
Now, I had absolutely no idea which worksheet functions were actually in the toolpak. Frankly, I had no particular desire to do the research. So, I set a breakpoint on all of them:
0:00> bm ANALYS32!*
The bm
command is the set symbol breakpoint command. Since I only have export symbols for the Analysis Toolpak, this effectively puts a break on every exported function from that XLL.
After that I hit “go” and opened up the workbook. The first break was on the XIRR formula:
ANALYS32!xirr:
3d00fbb1 55 push ebp
At this point I had second thoughts about breaking on all of the formulas, so I cleared those breakpoints.
0:00> bc *
And I figured that it was pretty likely that Excel would be calling any user-defined-function from the same parent procedure, so I stepped one frame down on the stack.
0:00> .frame 1
The .frame
command changes the local context. It’s basically the same as clicking on a higher function in the stack trace in Visual Studio. At that point I saw this in my disassembly window:
3046aaac ff5004 call dword ptr [eax+0x4] ; call to UDF
3046aaaf 8b4d14 mov ecx, [ebp+0x14] ; return address
The first instruction is the call to the user-defined function. The second is the return address, or the next instruction to be executed when the user-defined function returns.
I was still in the information-gathering stage here, so I decided to just dump out the functions being called here rather than try to really dig into any disassembly where the AV was occurring.
I decided to print out messages on both of these instructions—in other words, before and after every call to a UDF. I did that with two breakpoints:
0:000> bp 3046aaac "dds @eax+4 L1; g"
0:000> bp 3046aaaf ".echo return; g"
Let me point out a few things about these. First, I specified the breakpoint location directly by instruction address, and not using any symbols. Second, let me explain what I mean by dds @eax+4 L1.
dds is the display data with symbols command. And as you recall, the instruction we are breaking on is:
call dword ptr [eax+0x4]
So what I am trying to do here is print out the name of the function being called, if we indeed have the name. If there’s a symbol for (@eax+0x4) when we get here, dds will print it out for us.
After this I hit “go,” and this is what I saw:
00137974 3d00fbb1 ANALYS32!xirr
return
00137974 3d00fbb1 ANALYS32!xirr
return
00137780 3d00fbb1 ANALYS32!xirr
00133d58 3d00fbb1 ANALYS32!xirr
return
00133d58 3d00fbb1 ANALYS32!xirr
return
return
<KABOOM>
Hm, that’s interesting. The XIRR’s by themselves seem to work fine, but we get an AV as soon as the nested XIRR function returns.
At this point I disabled the Analysis Toolpak and opened the worksheet up. Searching through the formulas, I found the XIRR functions that my client was using. I couldn’t figure out how any of them were dependent upon one another, but I did notice that they were all feeding on a TODAY()
function. TODAY
is what you call a volatile function, and can cause a lot of crazy BS to go down.
On a hunch, I plugged in a nonvolatile alternative that I had handy in an XLA module. After that, the workbook loaded fine.
So there you have it. Apparently, Excel fails miserably if my add-in loads, there are nested XIRR formulas on a sheet, and they are using the TODAY
formula as an input.
I never figured out what the “real” conflict was. It’s not worth the effort and Excel shouldn’t crash, regardless, I told myself.
It’s ironic that I never would have been able to figure this much out if the conflict had been with an XLA add-in or an Automation add-in. Well ok, not never, but I probably would have given up first.