Math, Programming, and Minority Reports

Following Excel User-Defined Functions With WinDbg - A Debugging Odyssey
September 22nd, 2005

One of several applications I work on is a COM add-in for Excel. To put it mildly, this thing can be a real bear to support sometimes.

I was looking into a very peculiar access violation (AV) today. It seemed to only ever happen if—

My add-in was installed.
The Analysis Toolpak was installed.
I tried to load a very large and very particular workbook from a client.

I took a look at a crash dump and—luckily—was able to reproduce this in my office. This is a very good thing, because when I can’t reproduce it often means that I will have to dress up.

The AV was occurring very deep in some Excel code. It didn’t look like my add-in was doing anything special, or in fact, anything at all.

My first instinct might have been, “Why can’t the Excel team fix their stupid bugs!!?!” But, I have more empathy than that. The AV happened in a function that appeared to put the EBP register through a blender set to “pulverize.”

In other words, it required effort to get an accurate stack trace within this function, and I was far too lazy to pick up the pieces. Any reasonable person would be too—and maybe the bug only occurs with the optimizer turned on. So I will give the Excel team a pass on this one.

So what’s a clever debugger to do? Well, I (begrudgingly) know enough about Excel to know that the Analysis Toolpak is a bit of a dinosaur—apparently just like many Office users. It is actually ANALYS32.XLL, nothing more than a standard DLL with special exports. It is a relic of add-in architectures gone by.

So I used GFLAGS.EXE to set WinDbg as the startup debugger for Excel, and fired it up.

Now, I had absolutely no idea which worksheet functions were actually in the toolpak. Frankly, I had no particular desire to do the research. So, I set a breakpoint on all of them:

0:00> bm ANALYS32!*

The bm command is the set symbol breakpoint command. Since I only have export symbols for the Analysis Toolpak, this effectively puts a break on every exported function from that XLL.

After that I hit “go” and opened up the workbook. The first break was on the XIRR formula:

ANALYS32!xirr:
3d00fbb1 55               push    ebp

At this point I had second thoughts about breaking on all of the formulas, so I cleared those breakpoints.

0:00> bc *

And I figured that it was pretty likely that Excel would be calling any user-defined-function from the same parent procedure, so I stepped one frame down on the stack.

0:00> .frame 1

The .frame command changes the local context. It’s basically the same as clicking on a higher function in the stack trace in Visual Studio. At that point I saw this in my disassembly window:

3046aaac ff5004      call    dword ptr [eax+0x4]  ; call to UDF
3046aaaf 8b4d14      mov     ecx, [ebp+0x14]      ; return address

The first instruction is the call to the user-defined function. The second is the return address, or the next instruction to be executed when the user-defined function returns.

I was still in the information-gathering stage here, so I decided to just dump out the functions being called here rather than try to really dig into any disassembly where the AV was occurring.

I decided to print out messages on both of these instructions—in other words, before and after every call to a UDF. I did that with two breakpoints:

0:000> bp 3046aaac "dds @eax+4 L1; g"
0:000> bp 3046aaaf ".echo return; g"

Let me point out a few things about these. First, I specified the breakpoint location directly by instruction address, and not using any symbols. Second, let me explain what I mean by dds @eax+4 L1. dds is the display data with symbols command. And as you recall, the instruction we are breaking on is:

call    dword ptr [eax+0x4]

So what I am trying to do here is print out the name of the function being called, if we indeed have the name. If there’s a symbol for (@eax+0x4) when we get here, dds will print it out for us.

After this I hit “go,” and this is what I saw:

00137974  3d00fbb1 ANALYS32!xirr
return
00137974  3d00fbb1 ANALYS32!xirr
return
00137780  3d00fbb1 ANALYS32!xirr
00133d58  3d00fbb1 ANALYS32!xirr
return
00133d58  3d00fbb1 ANALYS32!xirr
return
return

Hm, that’s interesting. The XIRR’s by themselves seem to work fine, but we get an AV as soon as the nested XIRR function returns.

At this point I disabled the Analysis Toolpak and opened the worksheet up. Searching through the formulas, I found the XIRR functions that my client was using. I couldn’t figure out how any of them were dependent upon one another, but I did notice that they were all feeding on a TODAY() function. TODAY is what you call a volatile function, and can cause a lot of crazy BS to go down.

On a hunch, I plugged in a nonvolatile alternative that I had handy in an XLA module. After that, the workbook loaded fine.

So there you have it. Apparently, Excel fails miserably if my add-in loads, there are nested XIRR formulas on a sheet, and they are using the TODAY formula as an input.

I never figured out what the “real” conflict was. It’s not worth the effort and Excel shouldn’t crash, regardless, I told myself.

It’s ironic that I never would have been able to figure this much out if the conflict had been with an XLA add-in or an Automation add-in. Well ok, not never, but I probably would have given up first.

Using WinDbg to Log Exceptions, Part 3
September 18th, 2005

Part I

Part II

In my original post, I left you with the assignment of modifying this command to create a series of dump files instead of overwriting the same one repeatedly.

0:003> bp kernel32!RaiseException ".dump /ma /o c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined

Well, this is going to be a short post because shortly after writing that I realized there was a much easier way to do this than what I had in mind.

The /u switch for the .dump command appends a timestamp and the PID to the filename.

0:003> bp kernel32!RaiseException ".dump /ma /u c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined

An effective way to fill up the hard drive on a production machine, if I do say so myself.

Using WinDbg to Log Exceptions, Part 2
September 18th, 2005

In my last post, I asked you to explain how to derive this command:

0:000> bp mscorwks!JIT_Throw "du poi(@ecx+10)+c; !clrstack!clrstack; g"

The purpose of which is to print out the exception text when a CLR exception occurs.

Finding the Function

The first question here might be, how did I know to break on this function? This is the easiest part to get. If we break on kernel32!RaiseException and examine the unmanaged stack, we can see that this is the first function called from managed code when an exception is thrown.

0:000> k
ChildEBP RetAddr
0012f5a8 79238b7d KERNEL32!RaiseException
0012f600 792f0d05 mscorwks!RaiseTheException+0xa0
0012f66c 02e000a0 mscorwks!JIT_Throw+0x4d
WARNING: Frame IP not in any known module. Following frames may be wrong.
0012f69c 791d94bc 0x2e000a0
0012f6a4 791ed194 mscorwks!CallDescrWorker+0x30
&hellip;

I have one important note about the JIT_Throw function. I’ve shown it here in the mscorwks module. This is the workstation CLR. If you are working on a server application, you should change all references to that to mscosvr, which is the server version of the CLR.

The more interesting part, of course, is how I figured out how to use du poi(@ecx+10)+c to print out the exception text.

Finding the Exception

If we actually break on JIT_Throw, we can locate the address of the current managed exception using the DumpStackObjects command in the SOS extension.

0:000> !dso!dso
Thread 0
ESP/REG    Object     Name
ecx        0x00aaabfc System.Exception
0x0012f680 0x00aaabfc System.Exception
0x0012f688 0x00aaab6c System.Object[]
&hellip;

Notice that !dso gives us an additional piece of information: at the start of this call to the JIT_Throw function, the address of the current exception is in the ECX register. Can we rely on that to be true in every case? It turns out that we can, and I think it is instructive to explain why this is so.

Some of you will already know why ECX is special here, and if we take a look at the prologue of this function in assembly it might confirm your suspicions.

mscorwks!JIT_Throw:
792f0cb8 55               push    ebp
792f0cb9 8bec             mov     ebp,esp
792f0cbb 83ec58           sub     esp,0x58
792f0cbe 56               push    esi
792f0cbf 8bf1             mov     esi,ecx

The last two instructions are a common sight when working with functions declared with the [_fastcall](http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/_core__fastcall.asp) calling convention. This convention places the first two arguments to the function in the ECX and EDX registers.

For the sake of completeness I should say that you will also see the same kind of prologue if a C++ member function is being called (__thiscall). However, we know that this is not the case here.

Taking a quick look at the assembly here would be enough proof, but we can confirm that __fastcall is actually being used by reading the comments in fcall.h in the SSCLI, and then looking up the definition of JIT_Throw in jitinterface.cpp.

To summarize, as long as we are running on an x86 machine, we can always grab a pointer to the current exception at this breakpoint from the ECX register.

The Home Stretch (Finding the Exception Text)

Now that we can locate the managed Exception, we can take a stab at finding the text to go along with it. Take a look at the output when we dump the Exception object (using the DumpObject command in SOS):

0:000> !do @ecx
Name: System.Exception
MethodTable 0x79b947ac
EEClass 0x79b94850
Size 64(0x40) bytes
GC Generation: 0
mdToken: 0x02000012  (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll)
FieldDesc*: 0x79b948b4
MT      Field     Offset          Type   Attr     Value       Name
0x79b947ac 0x400001d     0x4         CLASS   instance 0x00000000  _className
0x79b947ac 0x400001e     0x8         CLASS   instance 0x00000000  _exceptionMethod
0x79b947ac 0x400001f     0xc         CLASS   instance 0x00000000  _exceptionMethodString
0x79b947ac 0x4000020    0x10         CLASS   instance 0x00aaab7c  _message
0x79b947ac 0x4000021    0x14         CLASS   instance 0x00000000  _innerException
0x79b947ac 0x4000022    0x18         CLASS   instance 0x00000000  _helpURL
0x79b947ac 0x4000023    0x1c         CLASS   instance 0x00000000  _stackTrace
0x79b947ac 0x4000024    0x20         CLASS   instance 0x00000000  _stackTraceString
0x79b947ac 0x4000025    0x24         CLASS   instance 0x00000000  _remoteStackTraceString
0x79b947ac 0x4000026    0x2c  System.Int32   instance 0           _remoteStackIndex
0x79b947ac 0x4000027    0x30  System.Int32   instance -2146233088 _HResult
0x79b947ac 0x4000028    0x28         CLASS   instance 0x00000000  _source
0x79b947ac 0x4000029    0x34  System.Int32   instance 0           _xptrs
0x79b947ac 0x400002a    0x38  System.Int32   instance -532459699  _xcode
-----------------
Exception 00aaabfc in MT 79b947ac: System.Exception
_message: You stink!

The third column, Offset, is what is important to us right now. This tells us the address of each field relative to the base address of the object. If we want the System.String object instance for the _message field, we can refer to that like this:

0:000> ? poi(@ecx+10)
Evaluate expression: 11185020 = 00aaab7c

(The poi keyword in this expression returns a pointer-sized variable at the specified address. It is the same as *(pException+0x10) would be in C or C++).

We can use the same kind of procedure on the String instance to get the actual characters.

0:000> !do poi(@ecx+10)
Name: System.String
&hellip;
FieldDesc*: 0x79b92978
MT      Field     Offset                 Type       Attr  Value  Name
0x79b925c8 0x4000013      0x4         System.Int32   instance     11  m_arrayLength
0x79b925c8 0x4000014      0x8         System.Int32   instance     10  m_stringLength
0x79b925c8 0x4000015      0xc          System.Char   instance   0x59  m_firstChar
&hellip;

The first character in the string is at 0xC relative to the base of the object. We know that the CLR uses Unicode to represent strings internally, so we can use the du command (display memory, Unicode) to print it out.

0:000> du poi(@ecx+10)+c
00aaab88  "You stink!"

Conclusion

Well, there you have it. With a little elbow grease, WinDbg is a tremendously powerful tool. I hope you’ve found this particular odyssey entertaining.

Following Excel User-Defined Functions With WinDbg - A Debugging Odyssey September 22nd, 2005

Using WinDbg to Log Exceptions, Part 3 September 18th, 2005