App Server Autopsy
November 10th, 2005

Earlier this week, we had a production issue with application servers that seized up and stopped serving requests. On one impacted server,

  • The private bytes counter for the w3wp process was very high for this type of server. The process is usually under 400MB, but at the time of the problem it was about 1.2GB.
  • The CPU utilization was at 0%.
  • The server immediately returned an OutOfMemoryException to all Remoting callers.
  • The server was down for fifteen minutes with this condition and didn’t recover without an iisreset. We did of course get a hang dump first.

The problem occurred very early in the morning, seemingly on many servers at once. It happened again later in the day on just one server.

The most obvious thing to look for here is a memory leak in the application. Since this is a managed application, that could mean one of a few things:

  1. A lot of memory tied up in objects that are accidentally rooted, and therefore correctly not being collected. For example, you could have a static Hashtable somewhere that just keeps accumulating references (there’s a quick sketch of this, and of pinning, after the list).
  2. Not a lot of objects allocated, but many objects pinned. Since the GC can’t move pinned objects, they can cause heap fragmentation. In slightly less technical terms, we’ve got plenty of free memory, but none of it is in blocks large enough to be useful.
  3. Memory being leaked in calls into unmanaged code: COM-Interop, P/Invoke, and so forth.
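
For illustration, here’s a minimal C# sketch of what the first two of those failure modes look like (the class and member names are made up for the example, not taken from our code):

using System.Collections;
using System.Runtime.InteropServices;

class LeakExamples
{
    // (1) Accidental rooting: a static collection is a GC root, so everything
    // it references stays reachable, and therefore alive, forever.
    static readonly Hashtable cache = new Hashtable();

    public static void CacheForever(string key, byte[] payload)
    {
        cache[key] = payload;   // added, never removed, never collected
    }

    // (2) Pinning: a pinned handle stops the GC from moving this buffer, so
    // lots of long-lived pins scattered around the heap will fragment it even
    // when total memory use is modest. The caller must eventually call Free().
    public static GCHandle PinForInterop(byte[] buffer)
    {
        return GCHandle.Alloc(buffer, GCHandleType.Pinned);
    }
}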

I know the application well enough to say that both (2) and (3) are pretty unlikely. We’ve tried to keep it as purely managed as possible, so there’s not much COM-Interop and few reasons to ever allocate a GCHandle.

Curiously enough, the answer turned out to be none of these possibilities.

The obvious first step was to load SOS and see just what was insisting on having all of this memory. The way we do that with SOS is with the “!dumpheap -stat” command. This was the tail end of the output.

...
0x020126b0     64,951    41,967,420 System.Int32[]
0x0e5e563c  1,059,033    42,361,320 System.Data.DataRow
0x02012970     63,235    47,628,552 System.Collections.Hashtable/bucket[]
0x0201209c    130,914    54,505,988 System.Object[]
0x79b94638  2,441,756   309,847,632 System.String
0x02012c3c     12,703   411,828,216 System.Byte[]

The 300MB of strings is not necessarily out of the ordinary. This is, after all, a web application. The 400MB of byte arrays is very strange indeed, however. Add to that the fact that a relatively small number of arrays is taking up a huge amount of space.

Doing some quick math (411,828,216 bytes across 12,703 arrays), I saw that the average byte array size was in the neighborhood of 32KB, and I had absolutely no idea what that could be.

I started what could become a grueling process of looking at an awful lot of byte arrays. The !dumpheap -type System.Byte[] command, aside from taking a very long time, printed out many long runs of arrays that looked like this:

...
0x027c8b68 0x02012c3c   31,756    2 System.Byte[]
0x027d0774 0x02012c3c   31,756    2 System.Byte[]
0x027d8380 0x02012c3c   31,756    2 System.Byte[]
0x027dff8c 0x02012c3c   31,756    2 System.Byte[]
0x027e7b98 0x02012c3c   31,756    2 System.Byte[]
0x027ef7a4 0x02012c3c   31,756    2 System.Byte[]
0x027f73b0 0x02012c3c   31,756    2 System.Byte[]
0x027fefbc 0x02012c3c   31,756    2 System.Byte[]
...

These were interspersed with shorter runs of basically random sizes. At this point I was sure I had at least located the objects causing the problem, but I still needed to find out where they were coming from.

The weird thing was that I couldn’t find roots for any of these arrays. I was starting to think something was wrong with !gcroot, because aside from also taking a very long time it wasn’t finding me any roots.

I noticed that almost all of the arrays were in GC generation 2 (gen2), which indicates that, relatively speaking, they’d been around a long time. However, the first run of arrays was still in gen0. Inspired, I ran !gcroot on the address of one of those arrays. This is what I found:

HANDLE(Strong):23811d8:Root:0xa655000(System.Object[])->
0x26de25c(System.Web.UbyteBufferAllocator)->
0x26de274(System.Collections.Stack)->
0x293b12c(System.Object[])->0x2861fdc(System.Byte[])

This did not look at all familiar. This is saying that the byte array is rooted by an instance of a framework type called System.Web.UbyteBufferAllocator, which is itself rooted in a static field somewhere. UbyteBufferAllocator is an internal type, so none of our code could be creating it.

I turned away from WinDbg for a minute and opened up Reflector. I took a look at where this class is used, and mercifully, there weren’t too many possibilities. The parameters to the UbyteBufferAllocator constructor determine the size of the byte arrays that it will create and maintain.

Checking the size of the allocators being created in each of these cases, I found the match. The type initializer for System.Web.HttpResponseBufferElement owned the UbyteBufferAllocator that was creating these arrays.

I took a look at the implementation of the allocator class. It maintains a pool of byte arrays and hands them out to the HttpResponseBufferElement type. It only pools a small number of buffers, allocating brand-new ones when demand outstrips the pool.
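
Roughly speaking, and this is my own sketch rather than the framework’s actual code, the allocator behaves something like this:

using System.Collections;

// Rough approximation of a pooled buffer allocator in the spirit of
// System.Web.UbyteBufferAllocator. Not the real implementation.
class PooledBufferAllocator
{
    private readonly int bufferSize;   // e.g. roughly 31K for response buffers
    private readonly int maxFree;      // only a handful of buffers are ever pooled
    private readonly Stack freeList = new Stack();

    public PooledBufferAllocator(int bufferSize, int maxFree)
    {
        this.bufferSize = bufferSize;
        this.maxFree = maxFree;
    }

    public byte[] GetBuffer()
    {
        lock (freeList)
        {
            if (freeList.Count > 0)
                return (byte[])freeList.Pop();
        }
        // Pool is empty: hand out a brand-new buffer. Under heavy demand most
        // buffers are created here and will never make it back into the pool.
        return new byte[bufferSize];
    }

    public void ReuseBuffer(byte[] buffer)
    {
        lock (freeList)
        {
            if (freeList.Count < maxFree)
                freeList.Push(buffer);   // anything beyond maxFree is just dropped
        }
    }
}

With ~31K buffers being churned out for every chunk of a huge response, it’s not hard to see where 400MB of byte arrays could come from.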

The HttpResponseBufferElement class itself is only used from a few places:

  • System.Web.HttpWriter.BufferData
  • System.Web.HttpWriter.BufferResource
  • System.Web.HttpWriter.FlushCharBuffer

This revelation made my job quite a bit easier.

This machine serves clients that, for the most part, are talking to it using .NET Remoting. The responses to those calls would not be returned using an HttpWriter. There are, however, a handful of web services hosted on the server. These write back XML data to their clients using an HttpWriter object.

To confirm this, I dumped out the data in the byte arrays using the dc command. I guess if I had been thinking a little more clearly I might have tried that right away. The data was definitely a web service response.

0:000> dc 0x46c5cac8
46c5cac8  02012c3c 00007c00 69766564 30313e64  <,...|..barid>10
46c5cad8  39383335 6b2f3c38 65647965 3e646976  53898</foobarid>
46c5cae8  79656b3c 45766544 746e6576 65707954  <foobarEventType
46c5caf8  656d614e 6f74533e 53206b63 74696c70  Name>Foobar data
46c5cb08  61262073 203b706d 6e676953 63696669  s & Signific
46c5cb18  20746e61 636f7453 6944206b 65646976  ant Foobar Data
46c5cb28  3c73646e 79656b2f 45766544 746e6576  nds</foobarEvent
46c5cb38  65707954 656d614e 656b3c3e 76654479  TypeName><foobar

All of this would indicate that the server tried to push out a VERY big response in the recent past. The .NET 1.1 GC was evidently crippled by this, since it was not collecting the unrooted objects that had made it into gen2. It was reproducible enough to occur on several servers at once. This is, without a doubt, a bug in the GC algorithm or the execution engine.

This may be fixed in 2.0. I won’t be able to tell you because I don’t intend to test it.

We did some more investigation to figure out exactly which web service it was and who was using it, with methods that won’t make sense to anyone without an intimate knowledge of the app.

You may recall that at the beginning of the article I said,

The problem occurred very early in the morning, seemingly on many servers at once. It happened again later in the day on just one server.

This turned out to be significant.

In the early morning hours, our India QA team was stress testing the functionality that uses the problem web service, and came across this bug. Evidently thinking that the server going down was unrelated, the tester tried again and again on other servers.

Later in the day, the developer who owns the functionality tried to reproduce the bug that the tester entered, and brought another server down. A few hours after that I had tracked it down in WinDbg.

The fix, on our end anyway, was to rewrite a single SQL query that was returning truly epic amounts of unnecessary data.

Boom!


Debugging ASP and Visual Basic Applications, Some Assembly Required
November 2nd, 2005

We had our production release a few days ago, which means some random issues any way you slice it. It’s been much more serious in the past, but I looked at an interesting IIS hang dump today.

Although most of the debugging I’ve gone through here has been managed code, the dump in question is from a classic ASP/COM application. In our production environment, we run a legacy application side-by-side with a new ASP.NET application. To the user, it’s all the same app; to us, it’s a bit of an integration nightmare.

In this particular case, the server was unresponsive and not serving any requests. We took a dump of all of the worker processes before being forced to do an iisreset.

The first thing I did after opening the dump was to take a quick look at the stacks on all of the threads to get an idea of the overall activity. I noticed that there were a ton (ok, about twenty) with a stack more or less like this:

0:022> k
ChildEBP RetAddr
030de954 7c822114 ntdll!KiFastSystemCallRet
030de958 77e6711b ntdll!NtWaitForMultipleObjects+0xc
030dea00 7739cd08 kernel32!WaitForMultipleObjectsEx+0x11a
030dea5c 77697483 user32!RealMsgWaitForMultipleObjectsEx+0x141
030dea84 776974f2 ole32!CCliModalLoop::BlockFn+0x80
030deaac 7778866b ole32!ModalLoop+0x5b
030deac8 77788011 ole32!ThreadSendReceive+0xa0
030deae4 77787ed7 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x112
030debc4 776975b8 ole32!CRpcChannelBuffer::SendReceive2+0xc1
030debe0 7769756a ole32!CCliModalLoop::SendReceive+0x1e
030dec4c 776c4eee ole32!CAptRpcChnl::SendReceive+0x6f
030deca0 77ce127e ole32!CCtxComChnl::SendReceive+0x91
030decbc 77ce13ca rpcrt4!NdrProxySendReceive+0x43
030df0a4 77d0c947 rpcrt4!NdrClientCall2+0x206
030df0bc 77d0c911 oleaut32!IDispatch_RemoteInvoke_Proxy+0x1c
030df37c 6b61c892 oleaut32!IDispatch_Invoke_Proxy+0xb6
WARNING: Stack unwind information not available. Following frames may be wrong.
030df3e4 6b61fb5c vbscript!DllRegisterServer+0x8285
00000000 00000000 vbscript!DllRegisterServer+0xb54f

This is an automation call, either to a local or a remote COM server. Notice that we’re warned that the bottom two frames on the stack “may be wrong.” This is because there is optimized code running that does not set up a clean stack frame.

That is, code compiled without the optimizer will usually do something like this at the start of each function:

push   ebp           ; Save the old stack base
mov    ebp, esp      ; Stack base becomes the current top of the stack
sub    esp, 0xc      ; Save space for local variables
mov    eax, [ebp+8]  ; example reference to one of the parameters

Optimized functions can omit that frame setup (frame pointer omission) and refer to parameters and local variables relative to the ESP register instead of the EBP register. All of this matters to us now because it makes it significantly harder for the debugger to construct a clean stack trace.

So anyway, we’re not 100% sure at this point that the call is originating on an .asp page (executing in vbscript.dll), but since that’s pretty much all this process does, it’s more than a safe bet.

How do we start digging into what these calls are? Well, a good starting point is SIEExtPub, which is a debugger extension just for COM stuff. It’s maintained by and for Microsoft’s SIE group, which employs some of the most badass engineers you will ever meet.

SIEExtPub has a !comcalls function which purports to show the automation calls on all threads. This was the first thing I tried, and in a similar situation it’s what I would recommend.

Unfortunately, this didn’t get me anywhere. The output for all of the threads looked like this:

Thread 49 - STA
Target Process ID: 13a82170 = 329785712
Target Thread  ID: d48526a2  (STA - Possible junk values)

It’s possible I don’t totally understand the output, but the reason I’m saying this looks bogus is that the Process ID and Thread ID values are outside the range I’m used to seeing. Typically (although I’m not sure it’s by rule), PIDs and TIDs are word values (i.e., not more than 0xffff).

It was time for a radical re-evaluation of the whole scene. I decided to dissect one of the function calls, as close to “our code” as I could get. That would be this function:

030df37c 6b61c892 oleaut32!IDispatch_Invoke_Proxy+0xb6

Which is the last one on the stack before we are warned that “the following frames might be wrong.” Below that, it’s likely to get dicey.

This isn’t a documented API function, but I’m going to guess that its parameters will closely follow those to IDispatch::Invoke or possibly IDispatchEx::InvokeEx since this is being called from a script. I’ve done enough C++ COM coding to know what to expect for those.

I googled the name of the function too, and found this definition in the ReactOS source:

HRESULT CALLBACK IDispatch_Invoke_Proxy(
        IDispatch* This,
        DISPID dispIdMember,
        REFIID riid,
        LCID lcid,
        WORD wFlags,
        DISPPARAMS* pDispParams,
        VARIANT* pVarResult,
        EXCEPINFO* pExcepInfo,
        UINT* puArgErr)

Which is basically the same as IDispatch::Invoke. I decided to take a stab at the parameters being passed. I needed to dump the stack frame out to see them. To do that, I grabbed the child EBP pushed onto the stack at that call:

ChildEBP RetAddr
030df0bc 77d0c911 oleaut32!IDispatch_RemoteInvoke_Proxy+0x1c

And I used that with the dds (dump dwords with symbols) command:

0:022> dds 030df0bc
030df0bc  030df37c
030df0c0  77d0c911 oleaut32!IDispatch_Invoke_Proxy+0xb6  ; return address
030df0c4  141004b4                                       ; this
030df0c8  60030005                                       ; dispIdMember
030df0cc  6b655340 vbscript!DllRegisterServer+0x40d33    ; &IID_NULL
030df0d0  00000409                                       ; en-US
030df0d4  00020001                                       ; DISPATCH_METHOD
030df0d8  030df438                                       ; pDispParams
030df0dc  030df360                                       ; pVarResult
030df0e0  030df454                                       ; pExcepInfo
030df0e4  030df428                                       ; puArgErr

I’d like to point out a few things here before moving on. First, notice that it gives us some text next to the third parameter to the function. It does this because the address is within the vbscript.dll module (meaning it’s a constant or global variable). The name it gives us isn’t very helpful, though, because the only symbols we have for vbscript are export symbols. Unless you work for Microsoft, in which case this whole thing should be much, much easier for you.

If you read the documentation for Invoke, you’ll notice that the riid parameter is reserved and must be set to IID_NULL. As a sanity check that what we’re looking at makes sense, remember that IID_NULL would be compiled into the DLL as a constant (bing). We can dump that GUID-sized memory chunk out and see that it is indeed NULL (bing):

0:022> dd 6b655340 l4
6b655340  00000000 00000000 00000000 00000000

And finally, the next parameter should be a locale and it is set to 0x409. If you handle locales a good bit you might recognize that as 1033 decimal, or en-US (bing).

Generally speaking, I think it’s a good idea to stop every now and then when debugging and do sanity checks like that.

So, moving on, if we want to look at the parameters being passed to the function we should dig into the pDispParams parameter (number six). This is a pointer to a DISPPARAMS structure, which can be found in oaidl.h.

typedef struct tagDISPPARAMS
{
    VARIANTARG *rgvarg;
    DISPID *rgdispidNamedArgs;
    UINT cArgs;
    UINT cNamedArgs;
} DISPPARAMS;

The first field in the structure is the array of arguments, and the third is the length of that array. If we dump out that parameter,

0:022> dd 030df438 l4
030df438  057a40f0 00000000 00000002 00000000

The dwords are the fields of the DISPPARAMS structure, in order. We can see from cArgs that there are two arguments. The first field in the structure points to the array of VARIANTs.
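
Just to make that layout concrete, here’s a rough managed-code sketch of reading the same structure from a raw pointer (the struct declaration and helper are mine, not anything from the app; pDispParams would be the 0x030df438 address above):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct DISPPARAMS
{
    public IntPtr rgvarg;             // first dword: pointer to the VARIANT array
    public IntPtr rgdispidNamedArgs;  // second dword
    public int cArgs;                 // third dword: number of arguments (2 in our dump)
    public int cNamedArgs;            // fourth dword
}

class DispParamsReader
{
    public static IntPtr FirstVariant(IntPtr pDispParams, out int argCount)
    {
        DISPPARAMS dp = (DISPPARAMS)Marshal.PtrToStructure(pDispParams, typeof(DISPPARAMS));
        argCount = dp.cArgs;
        return dp.rgvarg;   // points at the first 16-byte VARIANTARG
    }
}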

As a quick review, a VARIANT structure is used heavily in COM/VB and looks something like this:

typedef struct tagVARIANT
{
    VARTYPE vt;
    WORD reserved1;
    WORD reserved2;
    WORD reserved3;
    union
    {
        BSTR bstrVal;          // the value itself, e.g. a BSTR for VT_BSTR
        IDispatch *pdispVal;   // ...or an interface pointer, a long, a double, etc.
    };
} VARIANT;

The logical thing to do now is to dump out the array:

0:022> dd 057a40f0 L8
057a40f0  006f0008 00790064 06f50524 40e2e04f
057a4100  012d0008 6b60dc2d 06f504c8 0134d1a8

In each line, the low word of the first dword (0x0008) is the variant type (vt) and the variant value is the third dword. You can look it up in the headers, but I know from memory that 8 is the variant type for a BSTR.

A BSTR is the type of string used by Visual Basic / ASP internally (and is used widely in COM in general). It is a length-prefixed string, but also null-terminated.
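
If you ever want to convince yourself of that layout, here’s a quick managed-code sketch (just an illustration, nothing to do with the dump) that pokes at a real BSTR:

using System;
using System.Runtime.InteropServices;

class BstrLayout
{
    static void Main()
    {
        // Allocate a real BSTR for a sample string.
        IntPtr bstr = Marshal.StringToBSTR("A query string");

        // The four bytes immediately before the pointer hold the length in
        // bytes (not characters); the characters themselves are UTF-16 and
        // followed by a null terminator.
        int byteLength = Marshal.ReadInt32(bstr, -4);
        Console.WriteLine("{0} bytes, {1} characters: {2}",
            byteLength, byteLength / 2, Marshal.PtrToStringBSTR(bstr));

        Marshal.FreeBSTR(bstr);
    }
}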

I dumped out the string parameters like so:

0:022> du 06f504c8
06f504c8  "A URL"

0:022> du 06f50524
06f50524  "A query string"

(No, these aren’t the real parameters. I’m obfuscating them for security reasons).

This is where I stopped, because the strings were very familiar to me and I knew immediately the function being called.

If I hadn’t narrowed the issue down at this point I might have tried looking in the dumps of other processes for activity, and/or tried to look for familiar strings further down on the stack. The issue isn’t totally resolved at this point, but I at least have somewhere to go for further instrumentation and other measures.

Hopefully, somebody found this slightly interesting. Until next time, amigos. If you’re out on your bike tonight, please, wear white.


Finally, This Site Produces Some Results
September 30th, 2005

In my last post, I complained about an obscure Excel crash. Well, this morning the client got back to me saying that he opened up the original workbook (with no workarounds in place), and obviously, got the crash.

However, something was different this time. Dr. Watson showed him a link for more information, which he followed and found a download link for Office 2003 SP2! Released on September 27th! It fixed the issue! Less than a week after I lost a whole day debugging and working around it.

Thanks to Microsoft for monitoring the blogosphere closely and for being so responsive to my issue. Hopefully, the legion of call center workers on The Sub-Continent who refuse to escalate my other premier support case to anyone even remotely familiar with the Excel code base helped in churning out this incredibly timed hotfix.