Archive for September, 2005

Finally, This Site Produces Some Results

In my last post, I complained about an obscure Excel crash. Well, this morning the client got back to me saying that he opened up the original workbook (with no workarounds in place), and obviously, got the crash.

However, something was different this time. Dr. Watson showed him a link for more information, which he followed and found a download link for Office 2003 SP2! Released on September 27th! It fixed the issue! Less than a week after I lost a whole day debugging and working around it.

Thanks to Microsoft for monitoring the blogosphere closely and for being so responsive to my issue. Hopefully, the legion of call center workers on The Sub-Continent who refuse to escalate my other premier support case to anyone even remotely familiar with the Excel code base helped in churning out this incredibly timed hotfix.

Following Excel User-Defined Functions with WinDbg: A Debugging Odyssey

One of several applications I work on is a COM add-in for Excel. To put it mildly, this thing can be a real bear to support sometimes.

I was looking into a very peculiar access violation (AV) today. It seemed to only ever happen if—

  • My add-in was installed.
  • The Analysis Toolpak was installed.
  • I tried to load a very large and very particular workbook from a client.

I took a look at a crash dump and—luckily—was able to reproduce this in my office. This is a very good thing, because when I can’t reproduce it often means that I will have to dress up.

The AV was occurring very deep in some Excel code. It didn’t look like my add-in was doing anything special, or in fact, anything at all.

My first instinct might have been, “Why can’t the Excel team fix their stupid bugs!!?!” But, I have more empathy than that. The AV happened in a function that appeared to put the EBP register through a blender set to “pulverize.”

In other words, it required effort to get an accurate stack trace within this function, and I was far too lazy to pick up the pieces. Any reasonable person would be too—and maybe the bug only occurs with the optimizer turned on. So I will give the Excel team a pass on this one.

So what’s a clever debugger to do? Well, I (begrudgingly) know enough about Excel to know that the Analysis Toolpak is a bit of a dinosaur—apparently just like many Office users. It is actually ANALYS32.XLL, nothing more than a standard DLL with special exports. It is a relic of add-in architectures gone by.

So I used GFLAGS.EXE to set WinDbg as the startup debugger for Excel, and fired it up.

Now, I had absolutely no idea which worksheet functions were actually in the toolpak. Frankly, I had no particular desire to do the research. So, I set a breakpoint on all of them:

0:00> bm ANALYS32!*

The bm command is the set symbol breakpoint command. Since I only have export symbols for the Analysis Toolpak, this effectively puts a break on every exported function from that XLL.

After that I hit “go” and opened up the workbook. The first break was on the XIRR formula:

ANALYS32!xirr:
3d00fbb1 55               push    ebp

At this point I had second thoughts about breaking on all of the formulas, so I cleared those breakpoints.

0:00> bc *

And I figured that it was pretty likely that Excel would be calling any user-defined-function from the same parent procedure, so I stepped one frame down on the stack.

0:00> .frame 1

The .frame command changes the local context. It’s basically the same as clicking on a higher function in the stack trace in Visual Studio. At that point I saw this in my disassembly window:

3046aaac ff5004      call    dword ptr [eax+0x4]  ; call to UDF
3046aaaf 8b4d14      mov     ecx, [ebp+0x14]      ; return address

The first instruction is the call to the user-defined function. The second is the return address, or the next instruction to be executed when the user-defined function returns.

I was still in the information-gathering stage here, so I decided to just dump out the functions being called here rather than try to really dig into any disassembly where the AV was occurring.

I decided to print out messages on both of these instructions—in other words, before and after every call to a UDF. I did that with two breakpoints:

0:000> bp 3046aaac "dds @eax+4 L1; g"
0:000> bp 3046aaaf ".echo return; g"

Let me point out a few things about these. First, I specified the breakpoint location directly by instruction address, and not using any symbols. Second, let me explain what I mean by “dds @eax+4 L1.” dds is the display data with symbols command. And as you recall, the instruction we are breaking on is:

call    dword ptr [eax+0x4]

So what I am trying to do here is print out the name of the function being called, if we indeed have the name. If there’s a symbol for (@eax+0x4) when we get here, dds will print it out for us.

After this I hit “go,” and this is what I saw:

00137974  3d00fbb1 ANALYS32!xirr
return
00137974  3d00fbb1 ANALYS32!xirr
return
00137780  3d00fbb1 ANALYS32!xirr
00133d58  3d00fbb1 ANALYS32!xirr
return
00133d58  3d00fbb1 ANALYS32!xirr
return
return

<KABOOM>

Hm, that’s interesting. The XIRR’s by themselves seem to work fine, but we get an AV as soon as the nested XIRR function returns.

At this point I disabled the Analysis Toolpak and opened the worksheet up. Searching through the formulas, I found the XIRR functions that my client was using. I couldn’t figure out how any of them were dependent upon one another, but I did notice that they were all feeding on a TODAY() function. TODAY is what you call a volatile function, and can cause a lot of crazy BS to go down.

On a hunch, I plugged in a nonvolatile alternative that I had handy in an XLA module. After that, the workbook loaded fine.

So there you have it. Apparently, Excel fails miserably if my add-in loads, there are nested XIRR formulas on a sheet, and they are using the TODAY formula as an input.

I never figured out what the “real” conflict was. It’s not worth the effort and Excel shouldn’t crash, regardless, I told myself.

It’s ironic that I never would have been able to figure this much out if the conflict had been with an XLA add-in or an Automation add-in. Well ok, not never, but I probably would have given up first.

Using WinDbg to Log Exceptions, Part III

Part I

Part II

In my original post, I left you with the assignment of modifying this command to create a series of dump files instead of overwriting the same one repeatedly.

0:003> bp kernel32!RaiseException ".dump /ma /o c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined

Well, this is going to be a short post because shortly after writing that I realized there was a much easier way to do this than what I had in mind.

The /u switch for the .dump command appends a timestamp and the PID to the filename.

0:003> bp kernel32!RaiseException ".dump /ma /u c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined

An effective way to fill up the hard drive on a production machine, if I do say so myself.

Using WinDbg to Log Exceptions, Part II

In my last post, I asked you to explain how to derive this command:

0:000> bp mscorwks!JIT_Throw "du poi(@ecx+10)+c; !clrstack; g"

The purpose of which is to print out the exception text when a CLR exception occurs.

Finding the Function

The first question here might be, how did I know to break on this function? This is the easiest part to get. If we break on kernel32!RaiseException and examine the unmanaged stack, we can see that this is the first function called from managed code when an exception is thrown.

0:000> k
ChildEBP RetAddr
0012f5a8 79238b7d KERNEL32!RaiseException
0012f600 792f0d05 mscorwks!RaiseTheException+0xa0
0012f66c 02e000a0 mscorwks!JIT_Throw+0x4d
WARNING: Frame IP not in any known module. Following frames may be wrong.
0012f69c 791d94bc 0x2e000a0
0012f6a4 791ed194 mscorwks!CallDescrWorker+0x30
…

I have one important note about the JIT_Throw function. I’ve shown it here in the mscorwks module. This is the workstation CLR. If you are working on a server application, you should change all references to that to mscosvr, which is the server version of the CLR.

The more interesting part, of course, is how I figured out how to use du poi(@ecx+10)+c to print out the exception text.

Finding the Exception

If we actually break on JIT_Throw, we can locate the address of the current managed exception using the DumpStackObjects command in the SOS extension.

0:000> !dso
Thread 0
ESP/REG    Object     Name
ecx        0x00aaabfc System.Exception
0x0012f680 0x00aaabfc System.Exception
0x0012f688 0x00aaab6c System.Object[]
…

Notice that !dso gives us an additional piece of information: at the start of this call to the JIT_Throw function, the address of the current exception is in the ECX register. Can we rely on that to be true in every case? It turns out that we can, and I think it is instructive to explain why this is so.

Some of you will already know why ECX is special here, and if we take a look at the prologue of this function in assembly it might confirm your suspicions.

mscorwks!JIT_Throw:
792f0cb8 55               push    ebp
792f0cb9 8bec             mov     ebp,esp
792f0cbb 83ec58           sub     esp,0x58
792f0cbe 56               push    esi
792f0cbf 8bf1             mov     esi,ecx

The last two instructions are a common sight when working with functions declared with the __fastcall calling convention. This convention places the first two arguments to the function in the ECX and EDX registers.

For the sake of completeness I should say that you will also see the same kind of prologue if a C++ member function is being called (__thiscall). However, we know that this is not the case here.

Taking a quick look at the assembly here would be enough proof, but we can confirm that __fastcall is actually being used by reading the comments in fcall.h in the SSCLI, and then looking up the definition of JIT_Throw in jitinterface.cpp.

To summarize, as long as we are running on an x86 machine, we can always grab a pointer to the current exception at this breakpoint from the ECX register.

The Home Stretch (Finding the Exception Text)

Now that we can locate the managed Exception, we can take a stab at finding the text to go along with it. Take a look at the output when we dump the Exception object (using the DumpObject command in SOS):

0:000> !do @ecx
Name: System.Exception
MethodTable 0x79b947ac
EEClass 0x79b94850
Size 64(0x40) bytes
GC Generation: 0
mdToken: 0x02000012  (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll)
FieldDesc*: 0x79b948b4
    MT      Field     Offset          Type   Attr     Value       Name
0x79b947ac 0x400001d     0x4         CLASS   instance 0x00000000  _className
0x79b947ac 0x400001e     0x8         CLASS   instance 0x00000000  _exceptionMethod
0x79b947ac 0x400001f     0xc         CLASS   instance 0x00000000  _exceptionMethodString
0x79b947ac 0x4000020    0x10         CLASS   instance 0x00aaab7c  _message
0x79b947ac 0x4000021    0x14         CLASS   instance 0x00000000  _innerException
0x79b947ac 0x4000022    0x18         CLASS   instance 0x00000000  _helpURL
0x79b947ac 0x4000023    0x1c         CLASS   instance 0x00000000  _stackTrace
0x79b947ac 0x4000024    0x20         CLASS   instance 0x00000000  _stackTraceString
0x79b947ac 0x4000025    0x24         CLASS   instance 0x00000000  _remoteStackTraceString
0x79b947ac 0x4000026    0x2c  System.Int32   instance 0           _remoteStackIndex
0x79b947ac 0x4000027    0x30  System.Int32   instance -2146233088 _HResult
0x79b947ac 0x4000028    0x28         CLASS   instance 0x00000000  _source
0x79b947ac 0x4000029    0x34  System.Int32   instance 0           _xptrs
0x79b947ac 0x400002a    0x38  System.Int32   instance -532459699  _xcode
-----------------
Exception 00aaabfc in MT 79b947ac: System.Exception
_message: You stink!

The third column, Offset, is what is important to us right now. This tells us the address of each field relative to the base address of the object. If we want the System.String object instance for the _message field, we can refer to that like this:

0:000> ? poi(@ecx+10)
Evaluate expression: 11185020 = 00aaab7c

(The poi keyword in this expression returns a pointer-sized variable at the specified address. It is the same as “*(pException+0x10)” would be in C or C++).

We can use the same kind of procedure on the String instance to get the actual characters.

0:000> !do poi(@ecx+10)
Name: System.String
…
FieldDesc*: 0x79b92978
     MT      Field     Offset                 Type       Attr  Value  Name
0x79b925c8 0x4000013      0x4         System.Int32   instance     11  m_arrayLength
0x79b925c8 0x4000014      0x8         System.Int32   instance     10  m_stringLength
0x79b925c8 0x4000015      0xc          System.Char   instance   0x59  m_firstChar
…

The first character in the string is at 0xC relative to the base of the object. We know that the CLR uses Unicode to represent strings internally, so we can use the du command (display memory, Unicode) to print it out.

0:000> du poi(@ecx+10)+c
00aaab88  "You stink!"

Conclusion

Well, there you have it. With a little elbow grease, WinDbg is a tremendously powerful tool. I hope you’ve found this particular odyssey entertaining.

Using WinDbg to Log Exceptions

Since I discovered that I am currently—and quite inexplicably—the #1 Google Blog Search result for the term “WinDbg,” I decided that I would attempt to actually produce some WinDbg content.

If you are not familiar with WinDbg, check it out with the rest of the Debugging Tools for Windows here.

To get things started, I thought I would post some examples that I’ve found useful lately.

Introduction

I use this technique quite a bit when I’m doing QA work on an application. I can think of a bunch of other scenarios where it might be useful, but one chore in particular that is well suited for this is finding ASP.NET page compilation exceptions.

There isn’t really a good way to clean up page compilation exceptions in a massive ASP.NET application. Why would you bother? Well, there are a few reasons.

  1. They encourage you to disable break-on-exceptions in a debugger like VS.NET, since you’ll be inundated with a lot of other people’s exceptions when trying to attach to the worker process. If you do that, you might miss swallowed exceptions or exceptions on background threads that you would have otherwise noticed and fixed.
  2. They might indicate a problem in the HTML being sent back to the user.
  3. They cause your application to take longer to start up.
  4. Some of us are just sticklers for the details, ok?

So, one good way to trim them down to a bearable amount is to log all of the exceptions that happen when the app starts.

Getting Started

The first thing you need to do is open up a log:

0:003> .logopen c:\temp\exceptions.txt
Opened log file 'c:\temp\exceptions.txt'

And if this is managed code you’ll want to load the SOS extension:

.load clr10\sos

We’ll use a breakpoint with a custom command to print out the stack when the exception occurs.

0:003> bp kernel32!RaiseException "!clrstack; g"

The !clrstack command will print out the stack, and g (go) resumes execution. If the application is unmanaged, you will want to replace !clrstack with one of the k (display stack) commands. Both managed and unmanaged code will use the same underlying call to the kernel32!RaiseException function.

Once you have this set up, you’ll see a trickle or a flood of stacks, depending on the relative health of your app. They’ll probably be more interesting than this:

(848.d18): CLR exception - code e0434f4d (first chance)
Thread 0
ESP         EIP
0x0012f648  0x7c81eae1 [FRAME: HelperMethodFrame]
0x0012f674  0x02e000a0 [DEFAULT] Void CsConsoleApp.Program.Main(SZArray String)
  at [+0x48] [+0xf] c:\src\scratch\csconsoleapp\class2.cs:26
0x0012f8b0  0x791d94bc [FRAME: GCFrame]
0x0012f9b0  0x791d94bc [FRAME: GCFrame]
0x0012fa94  0x791d94bc [FRAME: GCFrame]

But you get the idea.

Adding More Information

Now, notice that in the case of a managed exception, WinDbg didn’t give us much information, other than the fact that it occurred:

(848.d18): CLR exception - code e0434f4d (first chance)

It would be nice to get the exception text, right? You can do that by changing the breakpoint to this:

0:000> bp mscorwks!JIT_Throw "du poi(@ecx+10)+c; !clrstack; g"

Homework: explain this command and how I came up with it. I will post the answer soon.

Now we’ll get something more useful:

(848.d18): CLR exception - code e0434f4d (first chance)
00aaab88  "You stink!"

Other Mods

One thing you might want to do in the case of rarer exceptions is create a dump of the process. This is especially useful if your issue only happens in production. The simplest way to do that is very similar to what we just did:

0:003> bp kernel32!RaiseException ".dump /ma /o c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined

The /ma switch produces a full dump. I set this breakpoint to overwrite a single dump file.

More homework: modify the breakpoint to create a series of dumps. The answer will be posted in the near future.

Dvorak at it Again

John Dvorak recently posted this plug for what appears to be a pseudoscience site, holoscience.com. Quote:

This site is devoted to looking at space science in a holistic, interdisciplinary manner. It also has some exceptionally cool images.

The comments on his blog appear to be either disabled or not working, so I will post a reply here.

This site appears to be yet another "theory of everything, the scientific establishment is against me" kind of site. Example quote:

"The Big Bang is already dead! The unheralded "Galileo of the 20th century", Halton Arp, has proven that the universe is not expanding."

Everything written in the News area of the site appears to be self published and not peer-reviewed. There seems to be a war going on at Wikipedia over another apparently self-published article about the "Electric Universe" theory espoused on the site. View the discussion here.

Hi William, please refer to the list of papers on the article page going back to 1930 to show serious study in this field. A Google search will reveal 6,240 pages returned on a search for "'electric universe' + plasma" …. . –Bongani 19:38, 9 July 2005 (UTC)

No, thats a long list of papers which have something to do with electricity in space, and have nothing to do with EUM. One of the characteristics of pet/psuedo science [sic] is appropriating irrelevant papers that have the right-looking words in them. Any article that states That the Sun and stars are powered by an external electric current is (to be frank for just a moment) wacko nonsense. Any article that makes such a controversial statement, and then fails to back it up with a peer-reviewed paper, is unsuitable for wiki as "original research". … William M. Connolley 20:05:57, 2005-07-09 (UTC).

And as you can see the neutrality of the original article is disputed.

I can't believe Dvorak posted this, frankly. But it fits with the horoscope ad displayed prominently on his site. Holoscience does have "purty pictures," though.

Update (11:15) - It appears that Dvorak might have posters on his site other than himself. Not sure why or who they are, but I thought I should mention that the opinion I quoted above may not be directly attributable to the popular PC Magazine columnist.

Don Quixote vs. the Bookstore, Part II

Here is the latest in my correspondence with a bookstore “recommending” Kevin Trudeau’s “Natural Cures ’They’ Don’t Want You to Know About.”

Part I

A Word for OllyDbg

OllyDbg is a very nice tool for debugging other people’s code. While I definitely still prefer WinDbg in most situations, OllyDbg is great for stepping through assembly.

OllyDbg

I had a good reason to use it last week. I have an old VB6 application that needs to interop with new .NET applications. The VB6 code is using an ancient COM library for encryption; this library is fairly opaque in regards to what it is actually doing.

Without giving away too many details, there’s no way I can move the old code to over to anything else. I wish the original author had just imported the advapi32 Crypt* functions, but it’s too late for that now.

So, needing to decrypt data coming from the VB6 side, I was left with a few choices. I could just use COM Interop and reference the old library in my new code. But I wasn’t really happy with that, mostly because of the complexity that it adds to deployment.

I was familiar enough with the Crypto API to know that the COM library couldn’t be doing anything too complicated. This is where OllyDbg comes in.

I stepped through the library call in assembly, stopping when it made Crypt* calls. I got the parameters from the stack, and wrote them all down.

From there, I wrote a quick C++ console app to test out recreating the calls and decrypting some sample data. As it turned out, the specific algorithms that the COM library was using weren’t exposed in System.Security.Cryptography, so the C# version I ended up with had to use P/Invoke with advapi32.

But anyway, I got rid of an annoying dependency. A very satisfying hack. You can read about an even better one here—Lee Holmes uses OllyDbg to crack a program to run as a non-admin.

Jealous Much?

Here are some pictures from the office we just moved into last week.

View from 55 Water Street

Above is the view from my desk, looking west into the financial district.

View from 55 Water Street

This one is from the north side, looking towards midtown.

View from 55 Water Street

Looking northeast over the South Street Seaport, the Brooklyn Bridge, and the Manhattan Bridge.

View from 55 Water Street

Looking south at the Statue of Liberty.

Don Quixote vs. the Bookstore

There’s nothing like eight hours in various airports to make you want to try to change the world. Here’s an email I just sent to Gary McBrayer of the Hudson Group. I will post any replies I receive.