Posts Tagged ‘Debugging’

PGProxy: A Testing Proxy for Postgres

I've released a package that I've been using for well over a year now for the purpose of writing functional test suites against applications using Postgres. The executive summary is that PGProxy allows you to write tests that are transactionally isolated from one another, without doing anything special in your application code.

This project is aimed more at functional tests of a website (using something like Selenium) than it is for unit tests of a single class or module. In those cases, using mock objects or other strategies is more viable. But, you could certainly use PGProxy in those scenarios as well.

You can get it on github here: http://github.com/mcfunley/pgproxy.

PGProxy is written in Python using Twisted, and has its own extensive set of unit and functional tests.


Why You Would Want This

If you have a sufficiently large set of functional tests written by a sufficiently large team, eventually naively-written tests will begin to interfere with one another. A really simple example of two tests that would interfere with one another would be:

  • Test A: Tests a workflow for user "steve"
  • Test B: Tests deleting the account for user "steve"

Clearly, if we don't take any special precautions in our test suite, Test A will never succeed if it is run after Test B. And if we require that Test A always runs before Test B, then we're forced to recreate our fixture database in between test runs. And this is just two tests–when you're talking about thousands, the potential interactions can be huge. Not to mention that depending on big your fixture databases are, creation at the start of every run can be a pain.

So eventually you will probably decide it would be better to restore the fixture data to a known state in between test cases. There are a few different ways to try to accomplish this:

  1. Restore the fixture database between test cases using something like CREATE DATABASE .. WITH TEMPLATE.
  2. Have each test be responsible for undoing whatever it does.
  3. Have each test be responsible for creating any data that it's going to use.
  4. Don't write tests for user "steve," but rather locate a user in the fixture data that meets the criteria that you need.
  5. Make each test work inside of a transaction, and roll the transaction back when the test is completed.

There are problems with all of these approaches. 1) is very slow. 2) is very tedious for developers and error prone. 3) is similarly tedious and can be slow, depending on how much logic creating your entities entails. 4) is really just moving the goalposts, because tests are still going to interfere with each other. 5) works to the extent that you can have all of your database access code share a connection and to the extent that your code does not try to use its own transactions.

But! I am writing this to tell you about an exciting new option, namely, "do something crazy." PGProxy is that crazy thing, and it works pretty well.


How it Works

As mentioned above, your test case can only work inside of a transaction if you are able to use a single connection per database per test case. If your test case is making calls to multiple processes that all want to use your fixture database, that is a pretty difficult thing to do.

PGProxy solves the problem by, you guessed, it, proxying all of your database connections. So if you have a PHP site running in Apache and a scala service running in Jetty, they can now share database connections in your tests. And consequently, they can share a transaction.

The other issue that comes up in using transactions to make test cases is that it's pretty common for the code that you're testing to want to use its own transactions. PGProxy solves this by rewriting transaction usage within the test case into SAVEPOINT usage. In other words, if you have a test that runs this SQL:

BEGIN;
update users set username='chuck' where username='steve';
COMMIT;
BEGIN;
update users set password='foo' where username='chuck';
ROLLBACK;

PGProxy will rewrite that to this:

BEGIN; -- my test case
SAVEPOINT x;
update users set username='chuck' where username='steve';
RELEASE SAVEPOINT x;
SAVEPOINT y;
update users set password='foo' where username='chuck';
ROLLBACK TO SAVEPOINT y;
ROLLBACK; -- my test case


Running the Proxy

There are a few ways to run the proxy. If you are writing your test suite using python (ie, using unittest), you can set up your test runner like this:

from __future__ import with_statement
import pgproxy

this_dir = os.path.realpath(os.path.dirname(__file__))
pidfile = os.path.join(this_dir, 'pgproxy.pid')
logfile = os.path.join(this_dir, 'pgproxy.log')

def run():
    # this will shut down the proxy when the tests complete.
    with pgproxy.run(pidfile=pidfile, logfile=logfile):
        run_test_suite()

def run_test_suite():
    # this should actually run your tests
    pass

if __name__ == '__main__':
    run()

Or you can use something like this script to start a standalone pgproxy process:

#! /usr/bin/env python
import pgproxy
import os

this_dir = os.path.realpath(os.path.dirname(__file__))
pidfile = os.path.join(this_dir, 'pgproxy.pid')
logfile = os.path.join(this_dir, 'pgproxy.log')

pgproxy.run(listenPort=5433, serverAddr=('localhost', 5432),
            pidfile=pidfile, logfile=logfile)

In both of these cases, PGProxy is configured to accept connections on port 5433, and to connect to the Postgres server running on port 5432. In these examples you would tell your application to connect to port 5433.

In order to run PGProxy, you need Twisted version 8.1.0 or later.


Test Harness Integration

PGProxy accepts two special queries that signal the start and the end of tests. Your test suite will need to invoke these in setUp and tearDown (or whatever the equivalents are in the language / framework that you're using). Here's a unittest example:

class TestCase(unittest.TestCase):
    def setUp(self):
        self.query("BEGIN TEST '%s'" % self._testMethodName)

    def tearDown(self):
        self.query("ROLLBACK TEST '%s'" % self._testMethodName)

    def test_something(self):
        # now this has a transaction and can't do any serious damage
        # to the fixture data.
        pass

As you may have noticed above, the BEGIN/ROLLBACKS are sent to postgres with comments stating which test is running, which can be pretty handy if you find yourself needing to look at the postgres logs to debug something. Here you would see:

BEGIN; -- test_something
ROLLBACK; -- test_something

Since setUp and tearDown are frequently overridden by developers for other purposes, I generally like to use a metaclass to wrap test cases in transactions instead. This way if a developer forgets to call the base test case's setUp method, it's no big deal for the rest of the suite. I'll leave that as an exercise. You get the idea.


A Word About Twisted

Lord knows I have a complicated opinion of all things Twisted, and maybe someday I will write something about that. And by "someday" I mean I am almost certainly never going to, because I have enough trouble staying out of nerd fights on the internet.

But I have to say that for this project, with the precise set of requirements that it had, and taking as a given my pre-existing wealth of experience with Twisted, things worked out great. This was a from-scratch rewrite of my first version, which was written using asyncore. The asyncore version was riddled with obscure race conditions, and it turned out to be much easier to just rewrite the damned thing using an event-driven framework than it ever was to debug the original. There was not a better choice of Python framework for this project, though I did toy with the idea of using scala. Anyway, I hope this praise for Twisted doesn't come across as excessively faint.


The End

Enjoy! Don't hesitate to drop me a line if you find this useful or have bug reports.


How to Debug Python in GDB

Although I haven't found this to be necessary nearly as often as Windbg was for me on Windows, it's still somewhat handy to be able to look at a Python daemon or multithreaded program in GDB. I've had to set this up a few times now, and I've forgotten the steps each time, so here they are in one place.

Step 1 - Get a debug build of Python.

You need a debug version of the interpreter. This is what I do from source on *nix:

./configure --prefix=/usr/local/dbg
make OPT=-g
sudo make install

sudo ln -s /usr/local/bin/dbgpy /usr/local/dbg/bin/python

Step 2 - Fix the botched .gdbinit that comes with the Python source.

The Python source comes with a .gdbinit file (Misc/gdbinit), but it's broken in 2.5. Apply these changes, which will fix pystack to 1) work and 2) work on threads other than the main thread.

@@ -119,8 +122,8 @@

 # print the entire Python call stack
 define pystack
-    while $pc < Py_Main || $pc > Py_GetArgcArgv
-        if $pc > PyEval_EvalFrame && $pc < PyEval_EvalCodeEx
+    while ($pc < Py_Main || $pc > Py_GetArgcArgv) && ($pc < t_bootstrap || $pc > thread_PyThread_start_new_thread)
+        if $pc > PyEval_EvalFrameEx && $pc < PyEval_EvalCodeEx
         pyframe
         end
         up-silently 1
@@ -130,8 +133,8 @@

 # print the entire Python call stack - verbose mode
 define pystackv
-    while $pc < Py_Main || $pc > Py_GetArgcArgv
-        if $pc > PyEval_EvalFrame && $pc < PyEval_EvalCodeEx
+    while ($pc < Py_Main || $pc > Py_GetArgcArgv) && ($pc < t_bootstrap || $pc > thread_PyThread_start_new_thread)
+        if $pc > PyEval_EvalFrameEx && $pc < PyEval_EvalCodeEx
         pyframev
         end
         up-silently 1

Note that the locals still won't work. I'll post a diff that fixes those when I get around to doing that. In the meantime it's not too hard to poke through the python structures to get them.

On OS X you can just copy the result of this to ~/.gdbinit. Some linux distros might be cranky about where you put this and where you launch gdb.

Step 3 - Launch your program with the debug interpreter and attach.

From the command line you can attach with "gdb /usr/local/bin/dbgpy [pid]." I like gdba mode in emacs (see below - M-x gdba, then the same kind of command line).

gdba mode in Emacs

Why is my application hanging in weird places?

Here's a problem:

  1. You have an application that's hanging permanently or temporarily.
  2. The hang does not occur in any synchronization code that you have written yourself.
  3. You are writing a windows forms application, or any kind of application using COM interop.

Let me present an educated guess as to what is happening. Note to people getting here by googling: there are actually many other reasons this might happen in addition to the one I am going to tell you about. I merely suggest that you spend more time thinking before being led too far down a rabbit hole by what I am about to write. Experience has taught me that people tend to lock on to one internet analysis (say, an unrelated KB article mentioning the same exception text) and waste a lot of time on it. None of the other sites seem to be nice enough to explicitly warn you about your own tendencies–I am, you should thank me.

With those disclaimers, here is my psychic debugging response:

  1. You are inadvertently calling a method on an STA object created on a different thread.
  2. That other thread is busy doing something else.

For people that managed (har har) to avoid COM development, or stayed completely in the insulated VB6 world, I will now explain briefly everything you really need to know about threading models in this, our fantastic futuristic age.

Methods on single threaded apartment (STA) objects are synchronized. The object receives all calls to its methods on the thread that owns it. This is achieved by creating a hidden window that receives messages for any calls made on those objects. Multi-threaded apartment (MTA) objects behave more or less like any .NET object you create: calls are not automatically synchronized, and you are responsible for doing any synchronization necessary yourself.

So in order for calls to STA objects to work correctly, the thread that owns them has to be free to run its message loop a sufficient percentage of the time. Running a computationally-intensive process on an STA thread is a great way to hang the other threads in your application, if you are careless or ignorant of where the STA calls are.

If you attach with Windbg and dump out the native stacks (~*k), you will probably see threads in calls to WaitOne that got there through functions named things like "SendReceive" above some interop marshalling code if this is indeed your problem. If you are lucky you might see a function called "GetToSTA," however, the stacks unfortunately do not always make things that easy.

This is one of those things that I have seen a bunch of times, but seems to be on the radar of an increasingly small number of people. It often takes me longer than it should to figure it out, too. If you are doing any kind of practical Windows development, you still need to understand the basics of STA/MTA. That's because any development that is "practical" will almost necessarily involve interoperating with old code.

Here's a related post that I wrote, about how STA objects can result in blocked finalizer threads.

Crazy Hacker Tricks

Let me be the first to say that this is probably a really bad idea, unless you are very desperate.

On the other hand, this was really fun to write.

Today I wrote some code that patches the first few bytes of user32!MessageBoxExW, in order to keep a pesky third-party library from showing a prompt when nobody was around to click on it. This overwrites the start of that function with a relative jump to one of my own, which then forwards to a managed function that logs what the message box was trying to tell us.

Actually, this was for a friend's program. I would probably shy away from doing this if I weren't only indirectly responsible.

void InstallHook()
{
    HMODULE hUser32 = LoadLibrary(L"user32.dll");
    PROC pMsgBox = GetProcAddress(hUser32, "MessageBoxExW");
    FailIf(pMsgBox == NULL);

    // Need to change the memory protection for the
    // (read/execute) page in user32.dll before we
    //write to it.
    MEMORY_BASIC_INFORMATION mbi = {0};
    VirtualQuery(pMsgBox, &mbi,
        sizeof(MEMORY_BASIC_INFORMATION));
    FailIf(!VirtualProtect(mbi.BaseAddress, mbi.RegionSize,
        PAGE_READWRITE, &mbi.Protect));

    // Copy in the jump instruction... you could of course
    // opt to not write this in assembly, but if we've gone this
    // far, what's the difference?
    __asm
    {
         // edi <- user32!MessageBoxExW
        mov edi, pMsgBox;              

        // write the opcode for JMP rel32
        mov byte ptr [edi], 0xe9;       

        // eax <- _MessageBoxEx, then
        // eax <- _MessageBoxEx - user32!MessageBoxExW
        mov eax, offset _MessageBoxEx;
        sub eax, edi;                   

        // One byte for jmp, four for the address.
        sub eax, 5;
        inc edi;
        stosd;
    };

    // Restore the old protection
    DWORD oldProtect = 0;
    FailIf(!VirtualProtect(mbi.BaseAddress,
        mbi.RegionSize, mbi.Protect, &oldProtect));
}

With that in place it's not hard to erect some managed scaffolding around it, and nobody needs to know your secret. Until they try to run it on an x64, I guess:

static void Main(string[] args)
{
    HookMonkey.Install(delegate(MessageBoxInfo mbi)
    {
        Console.WriteLine(mbi.Text);
        return DialogResult.OK;
    });
    MessageBox.Show("foo bar");
}

You can download some sample code here.

Debugging and Symbols FAQ

This is something I typed up internally, to help resolve the confusion that precipitates when Visual Studio begins stepping through comments. Hopefully this will be helpful to someone else.

What is required for source mode debugging?

  • Binaries (.dll, .exe, .ocx, …).
  • Symbols (the .pdb file).
  • Source files.

The symbol files tell your debugger how points in the binary relate to the source files. When you set a breakpoint in a source file, the debugger uses the symbol file to look up the corresponding instruction in the binary.

I have all of these things. Why is my debugger behaving strangely or not working at all?

Well, there are a bunch of different things that can go wrong.

Your symbols might be mismatched.

You can't (as an example) use the symbols for version 2.0 of a DLL to debug version 1.0 of that DLL. Perhaps, in this case, you have copied a new binary into your run directory without copying the corresponding .pdb file. It's also possible that your IDE or compiler did that for you.

Some debuggers, like WinDbg, will NOT load a mismatched PDB unless you force this to happen. Others, like Visual Studio 2003, will do this without so much as a complaint. That can be very confusing in some situations.

The module you are trying to debug is not loaded into the process.

If you are attempting to hit a breakpoint, a good sanity check is to make sure that the module you are trying to debug is being loaded. If it isn't, you are experiencing something other than a PDB problem. For example, your application could be experiencing an exception before your code is called. Make sure you have "break on exceptions" turned on.

In Visual Studio, you can look at the "modules" window to make sure your binary is loaded. The corresponding command in WinDbg is lm.

You are attached with the wrong debugger.

This is another non-symbol issue. There are many different kinds of debuggers:

  • Native debuggers: cdb, ntsd, windbg, Visual Studio.
  • CLR debuggers: cordbg, mdbg, deblector, Visual Studio.
  • Oddballs: the VB6 IDE, VS.NET's script debugger, etc.

You should make sure you are attaching to or starting your process using the right kind of debugger. Visual Studio gives you a dialog to choose the appropriate debugger(s).

The source file is not recognized.

Say I build a .dll from c:\foo\x.cpp, and you take that library from me and try to set a breakpoint in c:\bar\x.cpp. You might need to tell the debugger that your source path is different. Visual Studio is usually pretty good at figuring this out on its own and/or asking you when it is having a problem.

The _NT_SOURCE_PATH environment variable is honored as a search path for source files by most of the Windows debuggers. The .srcpath command in WinDbg/CDB can be used to display and change the source search path in those debuggers.

The symbols are corrupt.

This is rare, but it can happen. The fix is usually to just rebuild the binary and regenerate the PDB.

The binary is not debuggable.

.NET assemblies have the added limitation that they must be built in /debug mode for conventional CLR debugging to work. You can tell if a module is debuggable by opening it up in Reflector or ILDASM. You should see this attribute:

[assembly: Debuggable(true, true)]

Where does the debugger get the symbols?

Most Windows debuggers work like this:

  • The folder containing the binary is searched for a matching PDB.
  • The debugger's symbol search path is then searched. Most Windows debuggers (including Visual Studio) take their default search path from the _NT_SYMBOL_PATH environment variable.

How can I tell where the debugger is getting its symbols?

In WinDbg/CDB, the !sym noisy command can be used to show verbose output when the debugger is trying to resolve symbols. Visual Studio 2005 shows the symbol path as a column in the Modules window.

I am not aware of a simple way to identify where Visual Studio 2003 is loading its symbols from. The easiest way might be to use the handle utility from Sysinternals:

C:\src>handle foo.pdb

Handle v3.01
Copyright (C) 1997-2005 Mark Russinovich
Sysinternals - www.sysinternals.com

devenv.exe      pid: 4832   C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\Temp
orary ASP.NET Files\someapp\dbf846f9\1ab0397c\assembly\dl2\c33a0bcb\eca89e48_3
b78c601\foo.PDB

In this case ASP.NET has copied the binary and the PDB to a temporary location. This is another way a PDB can wind up being mismatched.

How are PDB's matched to binaries?

The short version: the PDB contains a date stamp and a checksum which is matched against the binaries.

How can I tell if a PDB matches a binary?

The symchk utility that comes with the Debugging Tools for Windows can be used to do this. There are many options for this utility, so you should read the output of symchk /?.

Some Twists on Blocked Finalizers

Blocked finalizer threads have gotten some recent publicity on Tess Ferrandez's blog. I recently ran into this myself, although the particulars were slightly different.

Our problem was manifesting itself in a very long-running console application that both connected to a database and made web service calls. Eventually the app would start hitting OutOfMemoryExceptions. (Completely unrelated to these OutOfMemoryExceptions. We have a high incidence of OutOfMemoryExceptions around here.)

Fixing the Wrong Problem and Moving the Goalposts

My initial reaction to this problem was (sensibly, I think) to attempt to reduce the amount of memory that the application was using. I admit that I was also focused on different problems at the time, so any makeshift solution I could get to work here would have been just fine with me.

We threw in some buffer pools, et cetera, and managed to get the app running slightly longer, but the OOM's always resurfaced. Every time they did, the heap looked completely different due to the semi-drastic changes we were making to the app's memory profile. After the third or maybe fourth change it dawned on me that perhaps we were just confusing the real issue.

The Proverbial Lightbulb Over the Head

The turning point came when I noticed that most of the overly-abundant objects on the heap in a particular dump were types that were likely to be associated with finalizers.

0:000> !dumpheap -stat
------------------------------
...
2,431,972    48,639,440 System.Threading.WaitHandle/__WaitHandleHandleProtector
...

This hadn't been the case in the previous dumps. Those were eaten alive by byte arrays and strings. By this point, we had done such a fantastic job of optimizing the application for memory usage that the problem came more clearly into focus.

The Road to Victory

At this point I used !threads in SOS to locate the finalizer thread (some useless info trimmed out here):

0:000> !threads
ID APT   Exception
0  STA System.OutOfMemoryException
4  MTA (Finalizer)

Thread 4 was stuck in a WaitForSingleObject and, not unrelated to the problem here, the native stacks of almost all of the other threads in the process were in some variation of "WaitUntilGCComplete." Probably should have noticed that sooner. So sue me.

Anyway, the top of the finalizer's stack was interesting:

0:004> ~4k
ChildEBP RetAddr
00f0f658 7c822124 ntdll!KiFastSystemCallRet
00f0f65c 77e6baa8 ntdll!NtWaitForSingleObject+0xc
00f0f6cc 77e6ba12 kernel32!WaitForSingleObjectEx+0xac
00f0f6e0 776c54ef kernel32!WaitForSingleObject+0x12
00f0f6fc 77789905 ole32!GetToSTA+0x6f
00f0f71c 77787ed7 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xcb

The finalizer is waiting to make a call over to an STA thread. Why's it doing that? And wait, what STA thread? Funny you should ask.

Visual Basic Complications

No technical problem at my post-VBScript company would be complete without Visual Basic putting a unique spin on things. This one is no exception. The authors of this app are fairly oblivious to COM apartments and didn't intentionally create any STA threads. They also weren't using COM Interop directly.

So how did thread zero end up being an STA thread (the astute among you may have noticed this in the !threads output above)? The MSDN documentation seems to clearly say that MTA is the default.

The reason was that the app was written in Visual Basic. If the Main() method in a VB program is lacking the STAThreadAttribute and the MTAThreadAttribute, the VB compiler will stick an STAThreadAttribute on it for you when the MSIL is emitted (the C# compiler doesn't do this). Presumably–and I'm guessing here but I think this is a good guess–this is done for compatibility with VB6.

The main thread of this application was sleeping the vast majority of the time. Admittedly, this is not a fantastic example of multithreaded design, but it wouldn't have caused any problems if the thread were MTA. As Raymond has pointed out, sleeping on an STA thread can have some ill effects.

The Fix

This problem was solved by decorating the main method with <MTAThread()>. This is possibly the smallest fix to an apparently massive problem I have ever personally witnessed.

How to Notice This Problem Much Sooner

There are a few ways this could have been less painful. The easiest might have been to check the output of SOS's !finalizequeue command. If I had done that I would have seen the following (again trimmed for clarity):

Heap 0
generation 0 has 0 finalizable objects
generation 1 has 0 finalizable objects
generation 2 has 11,212 finalizable objects
Ready for finalization 1,163,295 objects

Which does not look healthy. Looking more closely at the native stacks would have also probably yielded a solution on day one.

Please Stop Spamming the Debugger Output

When I'm writing a GUI or a multithreaded server application, I make heavy use of OutputDebugString (and various API's that map to it). This is a very useful tool when the act of stepping through an application has side effects that drastically change the experiment. Unfortunately, it looks everyone else likes this API, too.

I try to be courteous and pay my developer taxes here–when my application is deployed to customers, rest assured that it will not be writing debugger output by default. I put my calls to OutputDebugString on a switch in one way or another, or I just compile them out in release mode.

(This is not just being kind to anyone trying to debug something else on a machine running my application. I need to do this to avoid embarrassment. I curse quite a bit in my traces.)

Not everybody is taking the trouble to do this. To wit:

Debugger output spam from Visual Studio

If I open up dbmon or debugview and wave my mouse around a little, i'm inundated with messages from these applications:

  • Visual Studio 2005 [by far the worst].
  • Visual SourceSafe [a close second place].
  • Trillian
  • The remnants of Symantec products that I have not yet succeeded in disabling.

I find it particularly annoying that my development tools are hampering my development in this way.

Debugger Exercise: Displaying a Function’s Return Value

I found myself needing to automatically manipulate the return value of a managed function in release code today. I thought this would make an interesting little writeup.

For the purposes of the demonstration, let's use this example program:

class Program
{
    static void Main()
    {
        while (!Console.KeyAvailable)
        {
            Console.WriteLine(FooBar(false));
            Console.WriteLine(FooBar2());
        }
    }

    static string FooBar(bool b)
    {
        if (b)
        {
            return "b";
        }
        return "a";
    }

    static string FooBar2()
    {
        return FooBar(true);
    }
}

The function we're interested in will be FooBar. I've made things interesting by calling the function from two places–let's assume we're not sure where the function is called from. We'll make things more interesting still by accepting the limitation that we won't cheat and set a breakpoint on the ret instruction in FooBar. The function I needed to instrument today was a little monolithic, so let's take it as a given that we need to be able to handle a function that might be long and might have more than one exit point.

Compile this program and start it in WinDbg. Start by loading SOS:

0:003> .loadby sos mscorwks

The first thing we'll do is set a breakpoint on FooBar. (I'm going to do this manually so that these steps will work with the CLR v1.1. The new version of SOS makes this slightly easier. I'm also going to cover this quickly, so if you want a clearer explanation of what's going on in this next step see Eran Sandler's writeup here.)

To set the breakpoint, let the program run for a second so that the method is compiled by the JIT. Then we need to find and dump out the MethodTable for the class:

0:003> !name2ee retaddr!retaddr.Program
Module: 00912d5c (retaddr.exe)
Token: 0x02000002
MethodTable: 00913140
EEClass: 009112ec
Name: retaddr.Program

0:003> !dumpmt -md 00913140
EEClass: 009112ec
Module: 00912d5c
Name: retaddr.Program
mdToken: 02000002
BaseSize: 0xc
ComponentSize: 0x0
Number of IFaces in IFaceMap: 0
Slots in VTable: 8
--------------------------------------
MethodDesc Table
   Entry MethodDesc      JIT Name
...
00d00070   00913120      JIT retaddr.Program.Main()
00d000d8   00913128      JIT retaddr.Program.FooBar(Boolean)
00d00130   00913130      JIT retaddr.Program.FooBar2()
...

Now that we have the MethodTable, we need to dump out the MethodDesc for FooBar to get the virtual address of the jitted code.

0:003> !dumpmd 00913128
Method Name: retaddr.Program.FooBar(Boolean)
Class: 009112ec
MethodTable: 00913140
mdToken: 06000002
Module: 00912d5c
IsJitted: yes
m_CodeOrIL: 00d000d8

The compiled method begins at m_CodeOrIL (this is called "Method VA" in SOS 1.1). We can set a breakpoint there with "bp 00d000d8."

Here comes the trick. We know that the return address to the method should be on the top of the stack when the method is called. In other words, the ESP register points at the return address. We can set a dynamic breakpoint on that return address using a command like this:

bp 00d000d8 "bp poi(@esp); g"

If you try this, you'll notice that we break right after FooBar has returned to its caller. As with any function obeying one of the standard calling conventions, the return value is stored in the EAX register. We can give the dynamic breakpoint a command of its own that manipulates this value like this:

bp 00d000d8 "bp poi(@esp) \"!do -nofields @eax; g\"; g"

If we run the sample program now, we'll see this output repeating itself:

breakpoint 1 redefined
Name: System.String
MethodTable: 790fa3e0
EEClass: 790fa340
Size: 20(0x14) bytes
String: a

breakpoint 2 redefined
Name: System.String
MethodTable: 790fa3e0
EEClass: 790fa340
Size: 20(0x14) bytes
String: b

Voila.

Beware the OutOfMemoryException Red Herring

Low memory conditions have a way of executing lightly-or-never-tested code paths. Bugs in these paths can sometimes cast a penumbra over the real reason your application is crashing.

Worse, the resulting exceptions or crashes are likely to be more or less nondeterministic. Worse still, the most obvious steps with SOS rarely work once the CLR or the IIS worker process has started to OOM and has tossed you an access violation.

As a concrete example–this is a !clrstack from an application that appears to be crashing with an IOException.

0:026> !clrstack
Thread 26
ESP         EIP
0x10e3ea90  0x77e55dea [FRAME: GCFrame]

This doesn't tell us much, but if we look at the native stack:

0:026> k
ChildEBP RetAddr
10e3e9e8 79216aed kernel32!RaiseException+0x53
WARNING: Stack unwind information not available. Following frames may be wrong.
10e3ea40 79216a70 mscorsvr!GetAssemblyMDImport+0x2e1d4
10e3ea68 79216a24 mscorsvr!GetAssemblyMDImport+0x2e157
10e3ea78 79265ec2 mscorsvr!GetAssemblyMDImport+0x2e10b
10e3eab4 7924e9d3 mscorsvr!CoEEShutDownCOM+0x12dc1
10e3eac8 7c82eeb2 mscorsvr!GetMetaDataPublicInterfaceFromInternal+0x7e73
10e3eaec 7c82ee84 ntdll!ExecuteHandler2+0x26
10e3eb94 7c82ecc6 ntdll!ExecuteHandler+0x24
10e3eb94 77e55dea ntdll!KiUserExceptionDispatcher+0xe
10e3eee0 79216aed kernel32!RaiseException+0x53
10e3ef38 7924c8d0 mscorsvr!GetAssemblyMDImport+0x2e1d4
10e3efe8 791b3208 mscorsvr!GetMetaDataPublicInterfaceFromInternal+0x5d70
10e3eff4 791b3ad7 mscorsvr!Ordinal76+0x3208
10e3f108 791d3ef0 mscorsvr!Ordinal76+0x3ad7
10e3f1c4 791d3fa4 mscorsvr!GetCompileInfo+0x13e3
10e3f1ec 7931392f mscorsvr!GetCompileInfo+0x1497
10e3f21c 791cc3c8 mscorsvr!ReleaseFusionInterfaces+0x5f3d7
10e3f264 793139e1 mscorsvr!Ordinal76+0x1c3c8
10e3f2d8 7924a3f2 mscorsvr!ReleaseFusionInterfaces+0x5f489
10e3f2f0 791d4096 mscorsvr!GetMetaDataPublicInterfaceFromInternal+0x3892

We can see from the frames in the middle (the lower kernel32!RaiseException through the subsequent ntdll calls) that the current exception is being raised from inside another exception handler (a catch block, in managed terms). The current exception doesn't appear to reflect this:

0:026> !cen
System.IO.IOException (0xbab602fc)

0:026> !do 0xbab602fc
Name: System.IO.IOException
-----------------
Exception bab602fc in MT 79bd1234: System.IO.IOException

But with some wizardry, we can coerce the relevant information. We can hypothesize that the original exception would be passed as a parameter somewhere before the original call to kernel32!RaiseException. Isolating just that frame on the stack, and the frame before it:

0:026> kb
…
10e3eee0 79216aed e0434f4d 00000001 00000000 kernel32!RaiseException+0x53
…
10e3ef38 7924c8d0 bab5ce60 00000000 bab5ce60 mscorsvr!GetAssemblyMDImport+0x2e1d4

The address 0xbab5ce60 (second line) looks like a good candidate, being close to the address of the current exception while not being equal to it.

0:026> !do bab5ce60
Name: System.IO.IOException
-----------------
Exception bab5ce60 in MT 79bd1234: System.IO.IOException
_message: Unable to create a transport connection.
_innerException:
   Exception 06c6003c in MT 79b96cec: System.OutOfMemoryException
   _remoteStackTraceString:

Bingo. (Mind you, we haven't taken any steps yet to actually fix the problem.) This was relatively easy, since all of the relevant information could be reconstructed from the offending stack. In other situations, you may need to !DumpAllExceptions to see what has been happening recently in the application.

Note: a very good hint that you are looking at an OOM situation is that your crash dump is exactly two gigabytes in size.

The Worst Possible Way to Handle Exceptions

The worst possible way to "handle" exceptions is to show the user a message box with __FILE__ as the text. This is extremely poor form.

Intel VTune error dialog

Apparently, Intel did not get the memo. Worse, this happens as many as ten times when I open Visual Studio. Why do I feel like I was mistakenly given a debug build of VTune?