Archive for the ‘WinDbg’ Category

The Debugger Extension, Part 3 - A Crash Course in .NET Object Layout

The Debugger Extension

To write this extension, we need at least a cursory understanding of the way JIT-compiled objects are represented in memory. The basic structure on a 32-bit machine is:

  Offset
         +---------------------+
  +0x0   |  MethodTable*       |
         +---------------------+
  +0x4   |  Field 1            |
         +---------------------+
  +0x8   |  Field 2            |
         +---------------------+
         |  ...                |
         +---------------------+
  +0x4*N |  Field N            |
         +---------------------+

If you are wondering what the method table is, well, it is what it sounds like. It's a list of pointers to functions that the object defines. And some other stuff. If we wanted to dive deeper into the type metadata that supports Reflection and other magical CLR api's, the MethodTable is where we would start. But that is beyond the scope of this series.

The object's fields follow the MethodTable. Types derived from System.ValueType (structures) that are held as fields are inlined into the object instance. So if we have a class that has a DateTime field,

  Offset
         +---------------------+
  +0x0   |  MethodTable*       |
         +---------------------+
  +0x4   |  _dt.dateData       |
         |                     |
         +---------------------+
  +0xc   |  Field 2            |
         +---------------------+
         |  ...                |
         +---------------------+
+0x4*N+4 |  Field N            |
         +---------------------+

The datetime field would occupy two slots since it's a 64-bit value. Reference types (objects) that are held as fields are kept as pointer-sized handles.

The fields may not be in the same order as they are written in the source, but they should be stable from one process to the next. I'm not able to guarantee that since I don't work for Microsoft and I don't have access to the source for the CLR's JIT, but I've observed this consistency quite a bit. To view the field layout of any managed type, you can use the !do (DumpObject) command in the SOS extension, or !DumpClass in the same if you do not have an instance handy. Below is the output for an instance of the type we are using in this example, ArbitraryType. This instance has a _color field of zero (Colors.Red) and a _id field of 22.

0:000> !do 012723e4
Name: SampleApp.ArbitraryType
MethodTable: 009131b0
EEClass: 00911410
Size: 16(0x10) bytes
 (C:\src\samples\dmext\SampleApp\bin\Release\SampleApp.exe)
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
00913104  4000006        4         System.Int32  0 instance        0 _color
790fed1c  4000007        8         System.Int32  0 instance       22 _id

We can use the dd command to display the raw memory for the same instance. I've added comments here for the object's data.

0:000> dd /c1 012723e4 l3
012723e4  009131b0        ; MethodTable*
012723e8  00000000        ; _color = (int)Colors.Red = 0
012723ec  00000016        ; _id = 0x16 = 22

The MethodTable pointer will be the same for each instance of the type, but its value will be different every time you run the program. The constancy of this field enables a nice hack solution to our problem.

Rather than root through the CLR's internal structures to find ArbitraryType instances, we will simply search memory for DWORDs that look like they're pointers to the MethodTable for our type. This may result in some bogus hits, but they'll just be noise.

In the next post in this series, we will actually start coding the extension.

If you are interested in further reading on CLR internals, I recommend this MSDN article and the SSCLI codebase.

The Debugger Extension, Part 2 - A Use Case & the Problem Setup

The Debugger Extension

Now that we have a feel for the difficulty of the task, we should probably stop for a moment and reflect. Do we really want to go through with this?

Why would you want to go to all the trouble of writing your own extension? Especially with the SOS extension around, this may seem like an awful lot of effort if you are working with managed code. I would tend to agree, but I have found a few situations where writing an extension was helpful.

The best example I have revolves around a centralized framework for accessing the ASP.NET Cache in a web application, which I authored and which my company is currently using. Each cached item is accompanied by our own metadata. Writing an extension that understood this structure and was able to aggregate cache statistics made it easier to analyze dumps of high-memory situations when thousands of these objects are present. I can only offer my own experience here, but I'm sure many developers have similar functionality.

I think that the arguments for writing an extension become more compelling once you have developed a certain amount of your own framework for doing so (I will try to share some of mine here). It is also a very instructive activity–you can learn a great deal about the way the CLR works by getting your hands dirty.

Now that I've hopefully convinced you that this will be a worthwhile activity, let's come up with a very stupid piece of C# code to study.

enum Colors : int
{
    Red = 0,
    Green = 1,
    Blue = 2,
    Purple = 3
}

class ArbitraryType
{
    private Colors _color;
    private int _id;

    public ArbitraryType(Colors c, int id)
    {
        _color = c;
        _id = id;
    }
}

Our job will be to print out statistics about the "Colors" of the ArbitraryTypes in memory.

Before we can do that, we will have to do some investigation so that we understand exactly what we're looking for. That will be the subject of the next post.

The Debugger Extension, Part 1 - What is a DbgEng Extension?

The Debugger Extension

A better question to start off might be: "what is DbgEng?" Frequent visitors may have seen me refer to WinDbg and the Debugging Tools for Windows somewhat interchangably. There are actually two other debuggers in the package called CDB and NTSD. CDB, NTSD, and WinDbg are all written on top of the same debugger engine, implemented in dbgeng.dll.

An extension for the debugger engine is a dll with a specific set of exported functions. Commands that are callable from the debugger are implemented as additional exported functions. SOS, SieExtPub, and other modules I may have mentioned before are all debugger extensions in this vein. The full syntax for calling an extension function from within the debugger is:

![module].[function] [arguments]

The extension module name is only really necessary if there is a naming collision. For example, if you have several extensions loaded that all define a !help function, you can call the !help function in Son of Strike by typing !SOS.help.

The basics of Writing a Debugger Extension

The DbgEng API is a set of COM interfaces that allow you to interact with both the debugger and the process or crash dump being debugged. It's a little difficult to find documentation for the interfaces online, but the help files that come with the debugging tools are reasonably complete. The header files in the SDK fill in the remaining gaps. You can get some information from this PowerPoint document.

Understanding the COM interfaces is definitely the simplest aspect of writing an extension, at least from the perspective of developers who have been spending their lives writing code in user mode–not even to mention those accustomed to managed code and Visual Studio. DbgEng requires that your extension be built using the Windows Driver Development Kit (DDK) build environment. One thing I would definitely recommend is DDKBUILD, a freeware batch file that allows you to use the DDK build environment from a Visual Studio makefile project.

As you can see, writing an extension remains something of an arcane activity. Eran Sandler has threatened to create a framework for writing managed debugger extensions, and I hope he does. That would be cool.

It has recently become at least slightly easier to create extensions, in terms of language if not environment. In the most recent debugger update, I found this in the release notes:

New EngExtCpp C++ extension framework … This is a Developer Preview, and APIs are subject to change.

This is the framework I will be using. After setting up the sources, makefile, .def file, and .rc file for my project (I recommend just copying and editing these from one of the debugger SDK samples), all I need to do to create a working–but useless–extension is to define two files.

// ------------------------------------------------------
// dmext.h
//
#pragma once
#include "engextcpp.hpp"

class EXT_CLASS : public ExtExtension
{
public:
    EXT_CLASS();
    EXT_COMMAND_METHOD(foo);
};

// ------------------------------------------------------
// dmext.cpp
//
#include "stdafx.h"
#include "dmext.h"

EXT_DECLARE_GLOBALS();

EXT_CLASS::EXT_CLASS()
{
}

EXT_COMMAND(foo, "Sample extension command", "")
{
    this->Out("Hello World.\n");
}

In the next post, after I plead with you for a while that this is still a good idea, we'll set up a sample problem set to work with.

Exceptions are not a Control Mechanism

This is a partial debugger trace of an extremely popular and widespread application, used in commercial software by thousands of developers at hundreds of companies including Microsoft. This was happening when the program was functioning normally.

(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): C++ EH exception - code e06d7363 (first chance)
ModLoad: 5c060000 5c072000   C:\WINDOWS\system32\SRCLIENT.DLL
ModLoad: 692c0000 692ee000   C:\WINDOWS\System32\Wbem\framedyn.dll
(1230.1234): Unknown exception - code 80010105 (first chance)
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): Unknown exception - code 80010105 (first chance)
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1254): Unknown exception - code 80010108 (first chance)
ModLoad: 76bb0000 76bb4000   C:\WINDOWS\System32\SFC.DLL
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): Unknown exception - code 8001010e (first chance)
ModLoad: 76bb0000 76bb4000   C:\WINDOWS\System32\SFC.DLL
ModLoad: 5c060000 5c072000   C:\WINDOWS\system32\SRCLIENT.DLL
ModLoad: 692c0000 692ee000   C:\WINDOWS\System32\Wbem\framedyn.dll
(1230.1234): C++ EH exception - code e06d7363 (first chance)
(1230.1234): Unknown exception - code 80010012 (first chance)

I had an A/V problem with this utility that I was trying to debug - this was made basically impossible by the fact that the program was continuing after dozens of other access violations. People, you really are not supposed to do this. I won't say what application this is, but I will say that I am not going to be using it in the future.

App Server Autopsy

Earlier this week, we had a production issue with application servers that seized up and stopped serving requests. On one impacted server,

  • The private bytes counter for the w3wp process was very high for this type of server. The process is usually under 400MB, but at the time of the problem it was about 1.2GB.
  • The CPU utilization was at 0%.
  • The server immediately returned an OutOfMemoryException to all Remoting callers.
  • The server was down for fifteen minutes with this condition and didn’t recover without an iisreset. We did of course get a hang dump first.

The problem occurred very early in the morning, seemingly on many servers at once. It happened again later in the day on just one server.

The most obvious thing to look for here is a memory leak in the application. Since this is a managed application, that could mean one of a few things:

  • A lot of memory tied up in objects that are accidentally rooted, and therefore correctly not being collected. For example, you could have a static Hashtable somewhere that just kept accumulating references.
  • Not a lot of objects allocated, but many objects pinned. Since the GC can’t move these objects, they could be causing heap fragmentation. In slightly less technical terms, we’ve got plenty of free memory but none of it in large enough blocks to be useful.
  • Memory being leaked in calls into unmanaged code: COM-Interop, P/Invoke, and so forth.

I know the application well enough to say that both (2) and (3) are pretty unlikely. We’ve tried to keep it as purely managed as possible, so there’s not much COM-Interop and few reasons to ever allocate a GCHandle.

Curiously enough, the answer turned out to be none of these possibilities.

The obvious first step was to load SOS and see just what was insisting on having all of this memory. The way we do that with SOS is with the “!dumpheap –stat” command. This was the tail end of the output.

…
0x020126b0     64,951    41,967,420 System.Int32[]
0x0e5e563c  1,059,033    42,361,320 System.Data.DataRow
0x02012970     63,235    47,628,552 System.Collections.Hashtable/bucket[]
0x0201209c    130,914    54,505,988 System.Object[]
0x79b94638  2,441,756   309,847,632 System.String
0x02012c3c     12,703   411,828,216 System.Byte[]

The 300MB of strings is not necessarily out of the ordinary. This is, after all, a web application. The 400MB of byte arrays is very strange indeed, however. Add to that the fact that there are relatively very few arrays taking up a huge amount of space.

Doing some quick math, I saw that the average byte array size was in the neighborhood of 32KB, and I had absolutely no idea what that could be.

I started what could become a grueling process of looking at an awful lot of byte arrays. The !dumpheap –type System.Byte[] command, aside from taking a very long time, printed out many long runs of arrays that looked like this:

…
0x027c8b68 0x02012c3c   31,756    2 System.Byte[]
0x027d0774 0x02012c3c   31,756    2 System.Byte[]
0x027d8380 0x02012c3c   31,756    2 System.Byte[]
0x027dff8c 0x02012c3c   31,756    2 System.Byte[]
0x027e7b98 0x02012c3c   31,756    2 System.Byte[]
0x027ef7a4 0x02012c3c   31,756    2 System.Byte[]
0x027f73b0 0x02012c3c   31,756    2 System.Byte[]
0x027fefbc 0x02012c3c   31,756    2 System.Byte[]
…

These were interspersed with shorter runs of basically random sizes. At this point I was sure I had at least located the objects causing the problem, but I still needed to find out where they were coming from.

The weird thing was that I couldn’t find roots for any of these arrays. I was starting to think something was wrong with !gcroot, because aside from also taking a very long time it wasn’t finding me any roots.

I noticed that almost all of the arrays were in GC generation 2 (gen2), which indicates that relatively speaking, they’d been around a long time. However, the first run of the arrays was still in gen0. Inspired, I tried to find the !gcroot of the address of one of those arrays. This is what I found:

HANDLE(Strong):23811d8:Root:0xa655000(System.Object[])->
0x26de25c(System.Web.UbyteBufferAllocator)->
0x26de274(System.Collections.Stack)->
0x293b12c(System.Object[])->0x2861fdc(System.Byte[])

This did not look at all familiar. This is saying that the byte array is rooted by an instance of a framework type called System.Web.UbyteBufferAllocator, which is itself rooted in a static field somewhere. UbyteBufferAllocator is an internal type, so none of our code could be creating it.

I turned away from WinDbg for a minute and opened up Reflector. I took a look at where this class is used:

UByteBufferAllocator users

Mercifully, there weren’t too many possibilities. The parameters to the UbyteBufferAllocator constructor determine the size of the byte arrays that it will create and maintain.

Checking the size of thee allocators beinig created in each of these cases, I found the match. The type initializer for System.Web.HttpResponseBufferElement owned the UbyteBufferAllocator that was creating these arrays.

HttpResponseBufferElement initializer

I took a look at the implementation of the allocator class. It maintains a pool of byte arrays, and provides them to the HttpResponseBufferElement type. It only pools a small number of buffers, yielding new buffers if the demand for them is very high.

The HttpResponseBufferElement class itself is only used from a few places:

  • System.Web.HttpWriter.BufferData
  • System.Web.HttpWriter.BufferResource
  • System.Web.HttpWriter.FlushCharBuffer

This revelation made my job quite a bit easier.

This machine serves clients that, for the most part, are talking to it using .NET Remoting. The responses to these calls would not be returned using an HttpWriter. There are, however, a handful of web services hosted on the server. These would write back XML data to their clients using a HttpWriter object.

To confirm this, I dumped out the data in the byte arrays using the dc command. I guess if I had been thinking a little clearer I might have tried that right away. The data was definitely a web service response.

total 4,834 objects
0:000> dc 0x46c5cac8
46c5cac8  02012c3c 00007c00 69766564 30313e64  <,...|..barid>10
46c5cad8  39383335 6b2f3c38 65647965 3e646976  53898</foobarid>
46c5cae8  79656b3c 45766544 746e6576 65707954  <foobarEventType
46c5caf8  656d614e 6f74533e 53206b63 74696c70  Name>Foobar data
46c5cb08  61262073 203b706d 6e676953 63696669  s & Signific
46c5cb18  20746e61 636f7453 6944206b 65646976  ant Foobar Data
46c5cb28  3c73646e 79656b2f 45766544 746e6576  nds</foobarEvent
46c5cb38  65707954 656d614e 656b3c3e 76654479  TypeName><foobar

All of this would indicate that the server tried to push out a VERY big response in the recent past. The 1.1 GC was evidently crippled by this, since it was not collecting the unrooted objects in gen2. It was reproducible enough to occur on several servers at once. This is, without a doubt, a bug in the GC algorithm or the execution engine.

This may be fixed in 2.0. I won’t be able to tell you because I don’t intend to test it.

We did some more investigation to figure out exactly which web service it was and who was using it, with methods that won’t make sense to anyone without an intimate knowledge of the app.

You may recall that at the beginning of the article I said,

The problem occurred very early in the morning, seemingly on many servers at once. It happened again later in the day on just one server.

This turned out to be significant.

In the early morning hours, our India QA team was stress testing the functionality that uses the problem web service, and came across this bug. Evidently thinking that the server going down was unrelated, the tester tried again and again on other servers.

Later in the day, the developer who owns the functionality tried to reproduce the bug that the tester entered, and brought another server down. A few hours after that I had tracked it down in WinDbg.

The fix, on our end anyway, was to fix a single sql query that resulted in truly epic amounts of unnecessary data being returned.

Boom!

Debugging ASP/Visual Basic Applications, Some Assembly Required

This is a very technical post about debugging an unmanaged memory dump. The tool I am using for this is WinDbg. You can get more general information about WinDbg here.

We had our production release a few days ago, which means some random issues any way you slice it. It’s been much more serious in the past, but I looked at an interesting IIS hang dump today.

Although most of the debugging I’ve gone through here has been managed code, the dump in question is from a classic ASP/COM application. In our production environment, we run a legacy application side-by-side with a new ASP.NET application. To the user, it’s all the same app; to us, it’s a bit of an integration nightmare.

In this particular case, the server was unresponsive and not serving any requests. We took a dump of all of the worker processes before being forced to do an iisreset.

The first thing I did after opening the dump was to take a quick look at the stacks on all of the threads to get an idea of the overall activity. I noticed that there were a ton (ok, about twenty) with a stack more or less like this:

0:022> k
ChildEBP RetAddr
030de954 7c822114 ntdll!KiFastSystemCallRet
030de958 77e6711b ntdll!NtWaitForMultipleObjects+0xc
030dea00 7739cd08 kernel32!WaitForMultipleObjectsEx+0x11a
030dea5c 77697483 user32!RealMsgWaitForMultipleObjectsEx+0x141
030dea84 776974f2 ole32!CCliModalLoop::BlockFn+0x80
030deaac 7778866b ole32!ModalLoop+0x5b
030deac8 77788011 ole32!ThreadSendReceive+0xa0
030deae4 77787ed7 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x112
030debc4 776975b8 ole32!CRpcChannelBuffer::SendReceive2+0xc1
030debe0 7769756a ole32!CCliModalLoop::SendReceive+0x1e
030dec4c 776c4eee ole32!CAptRpcChnl::SendReceive+0x6f
030deca0 77ce127e ole32!CCtxComChnl::SendReceive+0x91
030decbc 77ce13ca rpcrt4!NdrProxySendReceive+0x43
030df0a4 77d0c947 rpcrt4!NdrClientCall2+0x206
030df0bc 77d0c911 oleaut32!IDispatch_RemoteInvoke_Proxy+0x1c
030df37c 6b61c892 oleaut32!IDispatch_Invoke_Proxy+0xb6
WARNING: Stack unwind information not available. Following frames may be wrong.
030df3e4 6b61fb5c vbscript!DllRegisterServer+0x8285
00000000 00000000 vbscript!DllRegisterServer+0xb54f

This is an automation call – either to a local or remote com server. Notice that we’re warned that bottom two frames on the stack “may be wrong.” This is because there is optimized code running that does not set up a clean stack frame.

That is, code compiled without the optimizer will usually do something like this at the start of each function:

push   ebp           ; Save the old stack base
mov    ebp, esp      ; Stack base becomes the current top of the stack
sub    esp, 0xc      ; Save space for local variables
mov    eax, [ebp+8]  ; example reference to one of the parameters

Optimized functions can omit that, and refer to parameters and local variables relative to the ESP register instead of the EBP register. All of this matters to us now because this makes it significantly harder for the debugger to construct a clean stack trace.

So anyway, we’re not 100% sure at this point that the call is originating on an .asp page (executing in vbscript.dll), but since that’s pretty much all this process does, it’s more than a safe bet.

How do we start digging into what these calls are? Well, a good starting point is SIEExtPub, which is a debugger extension just for COM stuff. It’s maintained by and for Microsoft’s SIE group, which employs some of the most badass engineers you will ever meet. (You can download a copy of this dll here).

SIEExtPub has a !comcalls function which purports to show the automation calls on all threads. This was the first thing I tried, and in a similar situation it’s what I would recommend.

Unfortunately, this didn’t get me anywhere. The output for all of the threads looked like this:

       Thread 49 - STA
Target Process ID: 13a82170 = 329785712
Target Thread  ID: d48526a2  (STA - Possible junk values)

It’s possible I don’t totally understand the output, but the reason I’m saying this looks bogus is because the Process ID and Thread ID values are outside of the range that I’m used to seeing. Typically (although I’m not sure it’s by rule), PID’s and TID’s are word values (ie, not more than 0xffff).

It was time for a radical re-evaluation of the whole scene. I decided to dissect one of the function calls, as close to “our code” as I could get. That would be this function:

030df37c 6b61c892 oleaut32!IDispatch_Invoke_Proxy+0xb6

Which is the last one on the stack before we are warned that “the following frames might be wrong.” Below that, it’s likely to get dicey.

This isn’t a documented API function, but I’m going to guess that its parameters will closely follow those to IDispatch::Invoke or possibly IDispatchEx::InvokeEx since this is being called from a script. I’ve done enough C++ COM coding to know what to expect for those.

I googled the name of the function too, and found this definition in the ReactOS source:

HRESULT CALLBACK IDispatch_Invoke_Proxy(
       IDispatch* This,
       DISPID dispIdMember,
       REFIID riid,
       LCID lcid,
       WORD wFlags,
       DISPPARAMS* pDispParams,
       VARIANT* pVarResult,
       EXCEPINFO* pExcepInfo,
       UINT* puArgErr)

Which is basically the same as IDispatch::Invoke. I decided to take a stab at the parameters being passed. I needed to dump the stack frame out to see them. To do that, I grabbed the child EBP pushed onto the stack at that call:

ChildEBP RetAddr
030df0bc 77d0c911 oleaut32!IDispatch_RemoteInvoke_Proxy+0x1c

And I used that with the dds (dump dwords with symbols) command:

0:022> dds 030df0bc
030df0bc  030df37c
030df0c0  77d0c911 oleaut32!IDispatch_Invoke_Proxy+0xb6  ; return address
030df0c4  141004b4                                       ; this
030df0c8  60030005                                       ; dispIdMember
030df0cc  6b655340 vbscript!DllRegisterServer+0x40d33    ; &IID_NULL
030df0d0  00000409                                       ; en-US
030df0d4  00020001                                       ; DISPATCH_METHOD
030df0d8  030df438                                       ; pDispParams
030df0dc  030df360                                       ; pVarResult
030df0e0  030df454                                       ; pExcepInfo
030df0e4  030df428                                       ; puArgErr

I’d like to point out a few things here before moving on. First, notice that it gives us some text next to the third parameter to the function. It does this because the address is within the vbscript.dll module (meaning it’s a constant or global variable). The name it gives us isn’t very helpful, though, because the only symbols we have for vbscript are export symbols. Unless you work for Microsoft, in which case this whole thing should be much, much easier for you.

If you read the documentation for Invoke, you’ll notice that they say the riid parameter is reserved and must be set to IID_NULL. If we’re looking for signs that what we’re looking at is making sense, we should remember that IID_NULL would be compiled into a dll as a constant (bing). We can dump that GUID-sized memory chunk out and see that it is indeed NULL (bing):

0:022> dd 6b655340 l4
6b655340  00000000 00000000 00000000 00000000

And finally, the next parameter should be a locale and it is set to 0x409. If you handle locales a good bit you might recognize that as 1033 decimal, or en-US (bing).

Generally speaking, I think it’s a good idea to stop every now and then when debugging and do sanity checks like that.

So, moving on, if we want to look at the parameters being passed to the function we should dig into the pDispParams parameter (number six). This is a pointer to a DISPPARAMS structure, which can be found in oaidl.h.

typedef struct tagDISPPARAMS
{
    VARIANTARG *rgvarg;
    DISPID *rgdispidNamedArgs;
    UINT cArgs;
    UINT cNamedArgs;
} DISPPARAMS;

The first field in the structure is the array of arguments, and the third is the length of that array. If we dump out that parameter,

0:022> dd 030df438 l4
030df438  057a40f0 00000000 00000002 00000000

The dwords are the fields in the DISPPARAMS structure, from right to left. We can see that there are two parameters. The first field in the structure points to the array of VARIANTs.

As a quick review, a VARIANT structure is used heavily in COM/VB and looks something like this:

typedef struct tagVARIANT
{
    VARTYPE vt;
    WORD reserved1;
    WORD reserved2;
    WORD reserved3;
    union
    {
        // DWORD-sized value
    }
} VARIANT;

The logical thing to do now is to dump out the array:

0:022> dd 057a40f0 L8
057a40f0  006f0008 00790064 06f50524 40e2e04f
057a4100  012d0008 6b60dc2d 06f504c8 0134d1a8

In each line, the low word of the first dword (0x0008) is the variant type (vt) and the variant value is the third dword. You can look it up in the headers, but I know from memory that 8 is the variant type for a BSTR.

A BSTR is the type of string used by Visual Basic / ASP internally (and is used widely in COM in general). It is a length-prefixed string, but also null-terminated.

I dumped out the string parameters like so:

0:022> du 06f504c8
06f504c8  "A URL"

0:022> du 06f50524
06f50524  "A query string"

(No, these aren’t the real parameters. I’m obfuscating them for security reasons).

This is where I stopped, because the strings were very familiar to me and I knew immediately the function being called.

If I hadn’t narrowed the issue down at this point I might have tried looking in the dumps of other processes for activity, and/or tried to look for familiar strings further down on the stack. The issue isn’t totally resolved at this point, but I at least have somewhere to go for further instrumentation and other measures.

Hopefully, somebody found this slightly interesting. Until next time, amigos. If you’re out on your bike tonight, please, wear white.

Finally, This Site Produces Some Results

In my last post, I complained about an obscure Excel crash. Well, this morning the client got back to me saying that he opened up the original workbook (with no workarounds in place), and obviously, got the crash.

However, something was different this time. Dr. Watson showed him a link for more information, which he followed and found a download link for Office 2003 SP2! Released on September 27th! It fixed the issue! Less than a week after I lost a whole day debugging and working around it.

Thanks to Microsoft for monitoring the blogosphere closely and for being so responsive to my issue. Hopefully, the legion of call center workers on The Sub-Continent who refuse to escalate my other premier support case to anyone even remotely familiar with the Excel code base helped in churning out this incredibly timed hotfix.

Following Excel User-Defined Functions with WinDbg: A Debugging Odyssey

One of several applications I work on is a COM add-in for Excel. To put it mildly, this thing can be a real bear to support sometimes.

I was looking into a very peculiar access violation (AV) today. It seemed to only ever happen if—

  • My add-in was installed.
  • The Analysis Toolpak was installed.
  • I tried to load a very large and very particular workbook from a client.

I took a look at a crash dump and—luckily—was able to reproduce this in my office. This is a very good thing, because when I can’t reproduce it often means that I will have to dress up.

The AV was occurring very deep in some Excel code. It didn’t look like my add-in was doing anything special, or in fact, anything at all.

My first instinct might have been, “Why can’t the Excel team fix their stupid bugs!!?!” But, I have more empathy than that. The AV happened in a function that appeared to put the EBP register through a blender set to “pulverize.”

In other words, it required effort to get an accurate stack trace within this function, and I was far too lazy to pick up the pieces. Any reasonable person would be too—and maybe the bug only occurs with the optimizer turned on. So I will give the Excel team a pass on this one.

So what’s a clever debugger to do? Well, I (begrudgingly) know enough about Excel to know that the Analysis Toolpak is a bit of a dinosaur—apparently just like many Office users. It is actually ANALYS32.XLL, nothing more than a standard DLL with special exports. It is a relic of add-in architectures gone by.

So I used GFLAGS.EXE to set WinDbg as the startup debugger for Excel, and fired it up.

Now, I had absolutely no idea which worksheet functions were actually in the toolpak. Frankly, I had no particular desire to do the research. So, I set a breakpoint on all of them:

0:00> bm ANALYS32!*

The bm command is the set symbol breakpoint command. Since I only have export symbols for the Analysis Toolpak, this effectively puts a break on every exported function from that XLL.

After that I hit “go” and opened up the workbook. The first break was on the XIRR formula:

ANALYS32!xirr:
3d00fbb1 55               push    ebp

At this point I had second thoughts about breaking on all of the formulas, so I cleared those breakpoints.

0:00> bc *

And I figured that it was pretty likely that Excel would be calling any user-defined-function from the same parent procedure, so I stepped one frame down on the stack.

0:00> .frame 1

The .frame command changes the local context. It’s basically the same as clicking on a higher function in the stack trace in Visual Studio. At that point I saw this in my disassembly window:

3046aaac ff5004      call    dword ptr [eax+0x4]  ; call to UDF
3046aaaf 8b4d14      mov     ecx, [ebp+0x14]      ; return address

The first instruction is the call to the user-defined function. The second is the return address, or the next instruction to be executed when the user-defined function returns.

I was still in the information-gathering stage here, so I decided to just dump out the functions being called here rather than try to really dig into any disassembly where the AV was occurring.

I decided to print out messages on both of these instructions—in other words, before and after every call to a UDF. I did that with two breakpoints:

0:000> bp 3046aaac "dds @eax+4 L1; g"
0:000> bp 3046aaaf ".echo return; g"

Let me point out a few things about these. First, I specified the breakpoint location directly by instruction address, and not using any symbols. Second, let me explain what I mean by “dds @eax+4 L1.” dds is the display data with symbols command. And as you recall, the instruction we are breaking on is:

call    dword ptr [eax+0x4]

So what I am trying to do here is print out the name of the function being called, if we indeed have the name. If there’s a symbol for (@eax+0x4) when we get here, dds will print it out for us.

After this I hit “go,” and this is what I saw:

00137974  3d00fbb1 ANALYS32!xirr
return
00137974  3d00fbb1 ANALYS32!xirr
return
00137780  3d00fbb1 ANALYS32!xirr
00133d58  3d00fbb1 ANALYS32!xirr
return
00133d58  3d00fbb1 ANALYS32!xirr
return
return

<KABOOM>

Hm, that’s interesting. The XIRR’s by themselves seem to work fine, but we get an AV as soon as the nested XIRR function returns.

At this point I disabled the Analysis Toolpak and opened the worksheet up. Searching through the formulas, I found the XIRR functions that my client was using. I couldn’t figure out how any of them were dependent upon one another, but I did notice that they were all feeding on a TODAY() function. TODAY is what you call a volatile function, and can cause a lot of crazy BS to go down.

On a hunch, I plugged in a nonvolatile alternative that I had handy in an XLA module. After that, the workbook loaded fine.

So there you have it. Apparently, Excel fails miserably if my add-in loads, there are nested XIRR formulas on a sheet, and they are using the TODAY formula as an input.

I never figured out what the “real” conflict was. It’s not worth the effort and Excel shouldn’t crash, regardless, I told myself.

It’s ironic that I never would have been able to figure this much out if the conflict had been with an XLA add-in or an Automation add-in. Well ok, not never, but I probably would have given up first.

Using WinDbg to Log Exceptions, Part III

Part I

Part II

In my original post, I left you with the assignment of modifying this command to create a series of dump files instead of overwriting the same one repeatedly.

0:003> bp kernel32!RaiseException ".dump /ma /o c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined

Well, this is going to be a short post because shortly after writing that I realized there was a much easier way to do this than what I had in mind.

The /u switch for the .dump command appends a timestamp and the PID to the filename.

0:003> bp kernel32!RaiseException ".dump /ma /u c:\\temp\\myapp.dmp; g"
breakpoint 0 redefined

An effective way to fill up the hard drive on a production machine, if I do say so myself.

Using WinDbg to Log Exceptions, Part II

In my last post, I asked you to explain how to derive this command:

0:000> bp mscorwks!JIT_Throw "du poi(@ecx+10)+c; !clrstack; g"

The purpose of which is to print out the exception text when a CLR exception occurs.

Finding the Function

The first question here might be, how did I know to break on this function? This is the easiest part to get. If we break on kernel32!RaiseException and examine the unmanaged stack, we can see that this is the first function called from managed code when an exception is thrown.

0:000> k
ChildEBP RetAddr
0012f5a8 79238b7d KERNEL32!RaiseException
0012f600 792f0d05 mscorwks!RaiseTheException+0xa0
0012f66c 02e000a0 mscorwks!JIT_Throw+0x4d
WARNING: Frame IP not in any known module. Following frames may be wrong.
0012f69c 791d94bc 0x2e000a0
0012f6a4 791ed194 mscorwks!CallDescrWorker+0x30
…

I have one important note about the JIT_Throw function. I’ve shown it here in the mscorwks module. This is the workstation CLR. If you are working on a server application, you should change all references to that to mscosvr, which is the server version of the CLR.

The more interesting part, of course, is how I figured out how to use du poi(@ecx+10)+c to print out the exception text.

Finding the Exception

If we actually break on JIT_Throw, we can locate the address of the current managed exception using the DumpStackObjects command in the SOS extension.

0:000> !dso
Thread 0
ESP/REG    Object     Name
ecx        0x00aaabfc System.Exception
0x0012f680 0x00aaabfc System.Exception
0x0012f688 0x00aaab6c System.Object[]
…

Notice that !dso gives us an additional piece of information: at the start of this call to the JIT_Throw function, the address of the current exception is in the ECX register. Can we rely on that to be true in every case? It turns out that we can, and I think it is instructive to explain why this is so.

Some of you will already know why ECX is special here, and if we take a look at the prologue of this function in assembly it might confirm your suspicions.

mscorwks!JIT_Throw:
792f0cb8 55               push    ebp
792f0cb9 8bec             mov     ebp,esp
792f0cbb 83ec58           sub     esp,0x58
792f0cbe 56               push    esi
792f0cbf 8bf1             mov     esi,ecx

The last two instructions are a common sight when working with functions declared with the __fastcall calling convention. This convention places the first two arguments to the function in the ECX and EDX registers.

For the sake of completeness I should say that you will also see the same kind of prologue if a C++ member function is being called (__thiscall). However, we know that this is not the case here.

Taking a quick look at the assembly here would be enough proof, but we can confirm that __fastcall is actually being used by reading the comments in fcall.h in the SSCLI, and then looking up the definition of JIT_Throw in jitinterface.cpp.

To summarize, as long as we are running on an x86 machine, we can always grab a pointer to the current exception at this breakpoint from the ECX register.

The Home Stretch (Finding the Exception Text)

Now that we can locate the managed Exception, we can take a stab at finding the text to go along with it. Take a look at the output when we dump the Exception object (using the DumpObject command in SOS):

0:000> !do @ecx
Name: System.Exception
MethodTable 0x79b947ac
EEClass 0x79b94850
Size 64(0x40) bytes
GC Generation: 0
mdToken: 0x02000012  (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll)
FieldDesc*: 0x79b948b4
    MT      Field     Offset          Type   Attr     Value       Name
0x79b947ac 0x400001d     0x4         CLASS   instance 0x00000000  _className
0x79b947ac 0x400001e     0x8         CLASS   instance 0x00000000  _exceptionMethod
0x79b947ac 0x400001f     0xc         CLASS   instance 0x00000000  _exceptionMethodString
0x79b947ac 0x4000020    0x10         CLASS   instance 0x00aaab7c  _message
0x79b947ac 0x4000021    0x14         CLASS   instance 0x00000000  _innerException
0x79b947ac 0x4000022    0x18         CLASS   instance 0x00000000  _helpURL
0x79b947ac 0x4000023    0x1c         CLASS   instance 0x00000000  _stackTrace
0x79b947ac 0x4000024    0x20         CLASS   instance 0x00000000  _stackTraceString
0x79b947ac 0x4000025    0x24         CLASS   instance 0x00000000  _remoteStackTraceString
0x79b947ac 0x4000026    0x2c  System.Int32   instance 0           _remoteStackIndex
0x79b947ac 0x4000027    0x30  System.Int32   instance -2146233088 _HResult
0x79b947ac 0x4000028    0x28         CLASS   instance 0x00000000  _source
0x79b947ac 0x4000029    0x34  System.Int32   instance 0           _xptrs
0x79b947ac 0x400002a    0x38  System.Int32   instance -532459699  _xcode
-----------------
Exception 00aaabfc in MT 79b947ac: System.Exception
_message: You stink!

The third column, Offset, is what is important to us right now. This tells us the address of each field relative to the base address of the object. If we want the System.String object instance for the _message field, we can refer to that like this:

0:000> ? poi(@ecx+10)
Evaluate expression: 11185020 = 00aaab7c

(The poi keyword in this expression returns a pointer-sized variable at the specified address. It is the same as “*(pException+0x10)” would be in C or C++).

We can use the same kind of procedure on the String instance to get the actual characters.

0:000> !do poi(@ecx+10)
Name: System.String
…
FieldDesc*: 0x79b92978
     MT      Field     Offset                 Type       Attr  Value  Name
0x79b925c8 0x4000013      0x4         System.Int32   instance     11  m_arrayLength
0x79b925c8 0x4000014      0x8         System.Int32   instance     10  m_stringLength
0x79b925c8 0x4000015      0xc          System.Char   instance   0x59  m_firstChar
…

The first character in the string is at 0xC relative to the base of the object. We know that the CLR uses Unicode to represent strings internally, so we can use the du command (display memory, Unicode) to print it out.

0:000> du poi(@ecx+10)+c
00aaab88  "You stink!"

Conclusion

Well, there you have it. With a little elbow grease, WinDbg is a tremendously powerful tool. I hope you’ve found this particular odyssey entertaining.