One of several applications I work on is a COM add-in for Excel. To put it mildly, this thing can be a real bear to support sometimes.
I was looking into a very peculiar access violation (AV) today. It seemed to only ever happen if—
- My add-in was installed.
- The Analysis Toolpak was installed.
- I tried to load a very large and very particular workbook from a client.
I took a look at a crash dump and—luckily—was able to reproduce this in my office. This is a very good thing, because when I can’t reproduce it often means that I will have to dress up.
The AV was occurring very deep in some Excel code. It didn’t look like my add-in was doing anything special, or in fact, anything at all.
My first instinct might have been, “Why can’t the Excel team fix their stupid bugs!!?!” But, I have more empathy than that. The AV happened in a function that appeared to put the EBP register through a blender set to “pulverize.”
In other words, it required effort to get an accurate stack trace within this function, and I was far too lazy to pick up the pieces. Any reasonable person would be too—and maybe the bug only occurs with the optimizer turned on. So I will give the Excel team a pass on this one.
So what’s a clever debugger to do? Well, I (begrudgingly) know enough about Excel to know that the Analysis Toolpak is a bit of a dinosaur—apparently just like many Office users. It is actually ANALYS32.XLL, nothing more than a standard DLL with special exports. It is a relic of add-in architectures gone by.
So I used GFLAGS.EXE to set WinDbg as the startup debugger for Excel, and fired it up.
Now, I had absolutely no idea which worksheet functions were actually in the toolpak. Frankly, I had no particular desire to do the research. So, I set a breakpoint on all of them:
The bm
command is the set symbol breakpoint command. Since I only have export symbols for the Analysis Toolpak, this effectively puts a break on every exported function from that XLL.
After that I hit “go” and opened up the workbook. The first break was on the XIRR formula:
ANALYS32!xirr:
3d00fbb1 55 push ebp
At this point I had second thoughts about breaking on all of the formulas, so I cleared those breakpoints.
And I figured that it was pretty likely that Excel would be calling any user-defined-function from the same parent procedure, so I stepped one frame down on the stack.
The .frame
command changes the local context. It’s basically the same as clicking on a higher function in the stack trace in Visual Studio. At that point I saw this in my disassembly window:
3046aaac ff5004 call dword ptr [eax+0x4] ; call to UDF
3046aaaf 8b4d14 mov ecx, [ebp+0x14] ; return address
The first instruction is the call to the user-defined function. The second is the return address, or the next instruction to be executed when the user-defined function returns.
I was still in the information-gathering stage here, so I decided to just dump out the functions being called here rather than try to really dig into any disassembly where the AV was occurring.
I decided to print out messages on both of these instructions—in other words, before and after every call to a UDF. I did that with two breakpoints:
0:000> bp 3046aaac "dds @eax+4 L1; g"
0:000> bp 3046aaaf ".echo return; g"
Let me point out a few things about these. First, I specified the breakpoint location directly by instruction address, and not using any symbols. Second, let me explain what I mean by dds @eax+4 L1.
dds is the display data with symbols command. And as you recall, the instruction we are breaking on is:
So what I am trying to do here is print out the name of the function being called, if we indeed have the name. If there’s a symbol for (@eax+0x4) when we get here, dds will print it out for us.
After this I hit “go,” and this is what I saw:
00137974 3d00fbb1 ANALYS32!xirr
return
00137974 3d00fbb1 ANALYS32!xirr
return
00137780 3d00fbb1 ANALYS32!xirr
00133d58 3d00fbb1 ANALYS32!xirr
return
00133d58 3d00fbb1 ANALYS32!xirr
return
return
<KABOOM>
Hm, that’s interesting. The XIRR’s by themselves seem to work fine, but we get an AV as soon as the nested XIRR function returns.
At this point I disabled the Analysis Toolpak and opened the worksheet up. Searching through the formulas, I found the XIRR functions that my client was using. I couldn’t figure out how any of them were dependent upon one another, but I did notice that they were all feeding on a TODAY()
function. TODAY
is what you call a volatile function, and can cause a lot of crazy BS to go down.
On a hunch, I plugged in a nonvolatile alternative that I had handy in an XLA module. After that, the workbook loaded fine.
So there you have it. Apparently, Excel fails miserably if my add-in loads, there are nested XIRR formulas on a sheet, and they are using the TODAY
formula as an input.
I never figured out what the “real” conflict was. It’s not worth the effort and Excel shouldn’t crash, regardless, I told myself.
It’s ironic that I never would have been able to figure this much out if the conflict had been with an XLA add-in or an Automation add-in. Well ok, not never, but I probably would have given up first.
Part I
Part II
In my original post, I left you with the assignment of modifying this command to create a series of dump files instead of overwriting the same one repeatedly.
0:003> bp kernel32!RaiseException ".dump /ma /o c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined
Well, this is going to be a short post because shortly after writing that I realized there was a much easier way to do this than what I had in mind.
The /u
switch for the .dump
command appends a timestamp and the PID to the filename.
0:003> bp kernel32!RaiseException ".dump /ma /u c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined
An effective way to fill up the hard drive on a production machine, if I do say so myself.
In my last post, I asked you to explain how to derive this command:
0:000> bp mscorwks!JIT_Throw "du poi(@ecx+10)+c; !clrstack!clrstack; g"
The purpose of which is to print out the exception text when a CLR exception occurs.
Finding the Function
The first question here might be, how did I know to break on this function? This is the easiest part to get. If we break on kernel32!RaiseException and examine the unmanaged stack, we can see that this is the first function called from managed code when an exception is thrown.
0:000> k
ChildEBP RetAddr
0012f5a8 79238b7d KERNEL32!RaiseException
0012f600 792f0d05 mscorwks!RaiseTheException+0xa0
0012f66c 02e000a0 mscorwks!JIT_Throw+0x4d
WARNING: Frame IP not in any known module. Following frames may be wrong.
0012f69c 791d94bc 0x2e000a0
0012f6a4 791ed194 mscorwks!CallDescrWorker+0x30
…
I have one important note about the JIT_Throw function. I’ve shown it here in the mscorwks
module. This is the workstation CLR. If you are working on a server application, you should change all references to that to mscosvr
, which is the server version of the CLR.
The more interesting part, of course, is how I figured out how to use du poi(@ecx+10)+c
to print out the exception text.
Finding the Exception
If we actually break on JIT_Throw
, we can locate the address of the current managed exception using the DumpStackObjects
command in the SOS extension.
0:000> !dso!dso
Thread 0
ESP/REG Object Name
ecx 0x00aaabfc System.Exception
0x0012f680 0x00aaabfc System.Exception
0x0012f688 0x00aaab6c System.Object[]
…
Notice that !dso
gives us an additional piece of information: at the start of this call to the JIT_Throw
function, the address of the current exception is in the ECX
register. Can we rely on that to be true in every case? It turns out that we can, and I think it is instructive to explain why this is so.
Some of you will already know why ECX
is special here, and if we take a look at the prologue of this function in assembly it might confirm your suspicions.
mscorwks!JIT_Throw:
792f0cb8 55 push ebp
792f0cb9 8bec mov ebp,esp
792f0cbb 83ec58 sub esp,0x58
792f0cbe 56 push esi
792f0cbf 8bf1 mov esi,ecx
The last two instructions are a common sight when working with functions declared with the [_fastcall](http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/_core__fastcall.asp) calling convention. This convention places the first two arguments to the function in the ECX
and EDX
registers.
For the sake of completeness I should say that you will also see the same kind of prologue if a C++ member function is being called (__thiscall). However, we know that this is not the case here.
Taking a quick look at the assembly here would be enough proof, but we can confirm that __fastcall
is actually being used by reading the comments in fcall.h in the SSCLI, and then looking up the definition of JIT_Throw
in jitinterface.cpp.
To summarize, as long as we are running on an x86 machine, we can always grab a pointer to the current exception at this breakpoint from the ECX
register.
The Home Stretch (Finding the Exception Text)
Now that we can locate the managed Exception, we can take a stab at finding the text to go along with it. Take a look at the output when we dump the Exception object (using the DumpObject command in SOS):
0:000> !do @ecx
Name: System.Exception
MethodTable 0x79b947ac
EEClass 0x79b94850
Size 64(0x40) bytes
GC Generation: 0
mdToken: 0x02000012 (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll)
FieldDesc*: 0x79b948b4
MT Field Offset Type Attr Value Name
0x79b947ac 0x400001d 0x4 CLASS instance 0x00000000 _className
0x79b947ac 0x400001e 0x8 CLASS instance 0x00000000 _exceptionMethod
0x79b947ac 0x400001f 0xc CLASS instance 0x00000000 _exceptionMethodString
0x79b947ac 0x4000020 0x10 CLASS instance 0x00aaab7c _message
0x79b947ac 0x4000021 0x14 CLASS instance 0x00000000 _innerException
0x79b947ac 0x4000022 0x18 CLASS instance 0x00000000 _helpURL
0x79b947ac 0x4000023 0x1c CLASS instance 0x00000000 _stackTrace
0x79b947ac 0x4000024 0x20 CLASS instance 0x00000000 _stackTraceString
0x79b947ac 0x4000025 0x24 CLASS instance 0x00000000 _remoteStackTraceString
0x79b947ac 0x4000026 0x2c System.Int32 instance 0 _remoteStackIndex
0x79b947ac 0x4000027 0x30 System.Int32 instance -2146233088 _HResult
0x79b947ac 0x4000028 0x28 CLASS instance 0x00000000 _source
0x79b947ac 0x4000029 0x34 System.Int32 instance 0 _xptrs
0x79b947ac 0x400002a 0x38 System.Int32 instance -532459699 _xcode
-----------------
Exception 00aaabfc in MT 79b947ac: System.Exception
_message: You stink!
The third column, Offset
, is what is important to us right now. This tells us the address of each field relative to the base address of the object. If we want the System.String
object instance for the _message
field, we can refer to that like this:
0:000> ? poi(@ecx+10)
Evaluate expression: 11185020 = 00aaab7c
(The poi
keyword in this expression returns a pointer-sized variable at the specified address. It is the same as *(pException+0x10)
would be in C or C++).
We can use the same kind of procedure on the String instance to get the actual characters.
0:000> !do poi(@ecx+10)
Name: System.String
…
FieldDesc*: 0x79b92978
MT Field Offset Type Attr Value Name
0x79b925c8 0x4000013 0x4 System.Int32 instance 11 m_arrayLength
0x79b925c8 0x4000014 0x8 System.Int32 instance 10 m_stringLength
0x79b925c8 0x4000015 0xc System.Char instance 0x59 m_firstChar
…
The first character in the string is at 0xC relative to the base of the object. We know that the CLR uses Unicode to represent strings internally, so we can use the du
command (display memory, Unicode) to print it out.
0:000> du poi(@ecx+10)+c
00aaab88 "You stink!"
Conclusion
Well, there you have it. With a little elbow grease, WinDbg is a tremendously powerful tool. I hope you’ve found this particular odyssey entertaining.