Earlier this week, we had a production issue with application servers that seized up and stopped serving requests. On one impacted server,
- The private bytes counter for the w3wp process was very high for this type of server. The process is usually under 400MB, but at the time of the problem it was about 1.2GB.
- The CPU utilization was at 0%.
- The server immediately returned an OutOfMemoryException to all Remoting callers.
- The server was down for fifteen minutes with this condition and didn’t recover without an iisreset. We did of course get a hang dump first.
The problem occurred very early in the morning, seemingly on many servers at once. It happened again later in the day on just one server.
The most obvious thing to look for here is a memory leak in the application. Since this is a managed application, that could mean one of a few things:
1. A lot of memory tied up in objects that are accidentally rooted, and therefore correctly not being collected. For example, you could have a static Hashtable somewhere that just keeps accumulating references, as in the sketch after this list.
2. Not a lot of objects allocated, but many objects pinned. Since the GC can’t move these objects, they could be causing heap fragmentation. In slightly less technical terms, we’ve got plenty of free memory but none of it in large enough blocks to be useful.
3. Memory being leaked in calls into unmanaged code: COM-Interop, P/Invoke, and so forth.
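For the record, here is a minimal sketch of what the first two failure modes typically look like. This is illustrative code only, not code from the application in question:

// Illustrative sketch only -- not code from the application in question.
using System.Collections;
using System.Runtime.InteropServices;

class LeakSketches
{
    // Case 1: an accidentally rooted collection. Everything added here stays
    // reachable from a static root, so the GC is correctly keeping it all alive.
    static Hashtable cache = new Hashtable();

    public static void Remember(string key, byte[] payload)
    {
        cache[key] = payload;   // nothing ever removes entries, so this only grows
    }

    // Case 2: pinned buffers. While a buffer is pinned for unmanaged code, the GC
    // cannot move it; enough pinned objects scattered through the heap can leave
    // plenty of free memory but no usefully large contiguous blocks.
    public static GCHandle PinForNativeCall(byte[] buffer)
    {
        return GCHandle.Alloc(buffer, GCHandleType.Pinned);   // caller must call Free()
    }
}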
I know the application well enough to say that both (2) and (3) are pretty unlikely. We’ve tried to keep it as purely managed as possible, so there’s not much COM-Interop and few reasons to ever allocate a GCHandle.
Curiously enough, the answer turned out to be none of these possibilities.
The obvious first step was to load SOS and see just what was insisting on having all of this memory. The way to do that with SOS is the “!dumpheap -stat” command, which lists, for each type, its method table address, instance count, and total bytes. This was the tail end of the output.
...
0x020126b0 64,951 41,967,420 System.Int32[]
0x0e5e563c 1,059,033 42,361,320 System.Data.DataRow
0x02012970 63,235 47,628,552 System.Collections.Hashtable/bucket[]
0x0201209c 130,914 54,505,988 System.Object[]
0x79b94638 2,441,756 309,847,632 System.String
0x02012c3c 12,703 411,828,216 System.Byte[]
The 300MB of strings is not necessarily out of the ordinary; this is, after all, a web application. The 400MB of byte arrays, however, is very strange indeed. Stranger still, that huge amount of space is spread across relatively few arrays.
Doing some quick math (411,828,216 bytes across 12,703 arrays), I saw that the average byte array size was in the neighborhood of 32KB, and I had absolutely no idea what that could be.
I started what could become a grueling process of looking at an awful lot of byte arrays. The !dumpheap -type System.Byte[] command, aside from taking a very long time, printed out many long runs of arrays that looked like this:
...
0x027c8b68 0x02012c3c 31,756 2 System.Byte[]
0x027d0774 0x02012c3c 31,756 2 System.Byte[]
0x027d8380 0x02012c3c 31,756 2 System.Byte[]
0x027dff8c 0x02012c3c 31,756 2 System.Byte[]
0x027e7b98 0x02012c3c 31,756 2 System.Byte[]
0x027ef7a4 0x02012c3c 31,756 2 System.Byte[]
0x027f73b0 0x02012c3c 31,756 2 System.Byte[]
0x027fefbc 0x02012c3c 31,756 2 System.Byte[]
...
These were interspersed with shorter runs of basically random sizes. At this point I was sure I had at least located the objects causing the problem, but I still needed to find out where they were coming from.
The weird thing was that I couldn’t find roots for any of these arrays. I was starting to think something was wrong with !gcroot, because aside from also taking a very long time, it wasn’t finding me any roots.
I noticed that almost all of the arrays were in GC generation 2 (gen2), which indicates that, relatively speaking, they’d been around a long time. However, the first run of the arrays was still in gen0. Inspired, I ran !gcroot on the address of one of those arrays. This is what I found:
HANDLE(Strong):23811d8:Root:0xa655000(System.Object[])->
0x26de25c(System.Web.UbyteBufferAllocator)->
0x26de274(System.Collections.Stack)->
0x293b12c(System.Object[])->0x2861fdc(System.Byte[])
This did not look at all familiar. This is saying that the byte array is rooted by an instance of a framework type called System.Web.UbyteBufferAllocator, which is itself rooted in a static field somewhere. UbyteBufferAllocator is an internal type, so none of our code could be creating it.
I turned away from WinDbg for a minute and opened up Reflector. I took a look at where this class is used, and mercifully, there weren’t too many possibilities. The parameters to the UbyteBufferAllocator constructor determine the size of the byte arrays that it will create and maintain.
Checking the size of the allocators being created in each of these cases, I found the match. The type initializer for System.Web.HttpResponseBufferElement owned the UbyteBufferAllocator that was creating these arrays.
I took a look at the implementation of the allocator class. It maintains a pool of byte arrays and provides them to the HttpResponseBufferElement type. It only pools a small number of buffers, yielding new buffers if the demand for them is very high.
The HttpResponseBufferElement class itself is only used from a few places:
- System.Web.HttpWriter.BufferData
- System.Web.HttpWriter.BufferResource
- System.Web.HttpWriter.FlushCharBuffer
This revelation made my job quite a bit easier.
This machine serves clients that, for the most part, talk to it using .NET Remoting. The responses to those calls would not be returned using an HttpWriter. There are, however, a handful of web services hosted on the server. These write back XML data to their clients using an HttpWriter object.
To confirm this, I dumped out the data in the byte arrays using the dc command. I guess if I had been thinking a little more clearly I might have tried that right away. The data was definitely a web service response.
0:000> dc 0x46c5cac8
46c5cac8 02012c3c 00007c00 69766564 30313e64 <,...|..barid>10
46c5cad8 39383335 6b2f3c38 65647965 3e646976 53898</foobarid>
46c5cae8 79656b3c 45766544 746e6576 65707954 <foobarEventType
46c5caf8 656d614e 6f74533e 53206b63 74696c70 Name>Foobar data
46c5cb08 61262073 203b706d 6e676953 63696669 s & Signific
46c5cb18 20746e61 636f7453 6944206b 65646976 ant Foobar Data
46c5cb28 3c73646e 79656b2f 45766544 746e6576 nds</foobarEvent
46c5cb38 65707954 656d614e 656b3c3e 76654479 TypeName><foobar
All of this would indicate that the server tried to push out a VERY big response in the recent past. The 1.1 GC was evidently crippled by this, since it was not collecting the unrooted objects in gen2. It was reproducible enough to occur on several servers at once. This is, without a doubt, a bug in the GC algorithm or the execution engine.
This may be fixed in 2.0. I won’t be able to tell you because I don’t intend to test it.
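For reference, the behavior you’d normally count on is that once an object loses its last root, a full collection reclaims it even out of gen2. Here is a minimal sketch of that expectation, an illustration of normal GC behavior rather than a repro of the bug:

using System;

class Gen2CollectionSketch
{
    static void Main()
    {
        byte[] buffer = new byte[32 * 1024];    // roughly the size of the arrays in our dump

        // Survive a couple of collections so the array is promoted to gen2.
        GC.Collect();
        GC.Collect();
        Console.WriteLine("Generation: " + GC.GetGeneration(buffer));   // typically prints 2

        buffer = null;                          // drop the only root

        GC.Collect();                           // a full collection should reclaim gen2 garbage
        GC.WaitForPendingFinalizers();
        Console.WriteLine("Heap after full collection: " + GC.GetTotalMemory(true) + " bytes");
    }
}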
We did some more investigation to figure out exactly which web service it was and who was using it, with methods that won’t make sense to anyone without an intimate knowledge of the app.
You may recall that at the beginning of the article I said,
The problem occurred very early in the morning, seemingly on many servers at once. It happened again later in the day on just one server.
This turned out to be significant.
In the early morning hours, our India QA team was stress testing the functionality that uses the problem web service, and came across this bug. Evidently thinking that the server going down was unrelated, the tester tried again and again on other servers.
Later in the day, the developer who owns the functionality tried to reproduce the bug that the tester entered, and brought another server down. A few hours after that I had tracked it down in WinDbg.
The fix, on our end anyway, was to correct a single SQL query that was returning a truly epic amount of unnecessary data.
Boom!