March 13th, 2010

Hung perl backtrace

Another weird Perl problem at work. A script that connects to a router via ssh to get some stats was taking 20 seconds longer to run on the new server than the old server. The extra 20 seconds weren't showing up in profiling anywhere I could find. I started printf-debugging and ran into something like this:

sub foo {
    # ...
    print "XXX 1\n";
    $x = bar();
    print "XXX 3\n";
    return $x;
}

sub bar {
   # ...
   print "XXX 2\n";
   return 42;
}

The output would be something like:

XXX 1
XXX 2
20 seconds pass
XXX 3

What the hell could be taking 20 seconds with no obvious statements in between? To figure that out, I changed the debugging slightly:

sub foo {
    # ...
    local $SIG{USR1} = sub { Carp::cluck('wtf') };
    print "XXX 1 $$\n";
    $x = bar();
    print "XXX 3\n";
    return $x;
}

sub bar {
   # ...
   print "XXX 2\n";
   return 42;
}

With that in place, it looked something like this:

XXX 1 12840
XXX 2

Then, during the 20 second delay, I did a kill -USR1 12840, and lo and behold, the problem was in Expect::DESTROY, which ran when the Expect object went out of scope (hence no obvious code executing between my print statements). It was trying for 20 seconds to do a "graceful" close on the spawned ssh process, which hadn't actually terminated. The fix was to send an "exit" through the spawned process so it quit on its own. 20 seconds eliminated.

Took me two hours to find it, though.

I interrupt my Lisp programs all the time to see where they are in a backtrace, but this is the first time I've done it for a Perl program.