Debugging Wily Stack Overflows in MPI Programs

I recently was debugging a problem in MPICH2 with the new MPI-3 nonblocking collectives implementation that involved a stack overflow. The code in question was usually blowing up with a SIGABRT that came from a stack trace ending in __stack_chk_fail. This bug was particularly nasty because it was a Heisenbug: it kept changing the failure location and mode whenever I would insert debugging code in order to track it down (hence the aforementioned “usually”).

For reasons I will explain later, my usual trick for this sort of thing did not help. Adding in those macros would cause the bug to vanish, or to reappear in a different function that had not been instrumented with MPIU_SG_ macros.

Luckily, I ran across this very helpful blog post that eventually helped me track down the real problem. My basic approach was the same as the one outlined there, but it needed modification due to the pain inherent in debugging a parallel MPI application. This particular bug required running at least 4 processes in order to manifest itself, and it would only appear after several iterations within the test program. So I used mpiexec + xterm + gdb + a gdb script file to handle the job.

First, here’s an example stack trace of the failure (using valgrind to easily obtain stack traces):

% mpiexec -l -n 4 valgrind -q ./icbarrier |& sort_labeled_output
=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Abort trap (signal 6)
[1] ==47313== 
[1] ==47313== Process terminating with default action of signal 6 (SIGABRT)
[1] ==47313==    at 0x10027D616: __kill (in /usr/lib/libSystem.B.dylib)
[1] ==47313==    by 0x1003022CB: __stack_chk_fail (in /usr/lib/libSystem.B.dylib)
[1] ==47313==    by 0x10005D9D3: MPIDI_CH3I_Progress (ch3_progress.c:539)
[1] ==47313==    by 0x100109B0C: MPIR_Wait_impl (wait.c:67)
[1] ==47313==    by 0x100109FFA: PMPI_Wait (wait.c:171)
[1] ==47313==    by 0x100004506: MPI_Barrier (nbc_pmpi_adapter.c:24)
[1] ==47313==    by 0x10000141D: main (icbarrier.c:37)
[2] ==47314== 
[2] ==47314== Process terminating with default action of signal 6 (SIGABRT)
[2] ==47314==    at 0x10027D616: __kill (in /usr/lib/libSystem.B.dylib)
[2] ==47314==    by 0x1003022CB: __stack_chk_fail (in /usr/lib/libSystem.B.dylib)
[2] ==47314==    by 0x10005D9D3: MPIDI_CH3I_Progress (ch3_progress.c:539)
[2] ==47314==    by 0x100109B0C: MPIR_Wait_impl (wait.c:67)
[2] ==47314==    by 0x100109FFA: PMPI_Wait (wait.c:171)
[2] ==47314==    by 0x100004506: MPI_Barrier (nbc_pmpi_adapter.c:24)
[2] ==47314==    by 0x10000143B: main (icbarrier.c:45)
[3] ==47315== 
[3] ==47315== Process terminating with default action of signal 6 (SIGABRT)
[3] ==47315==    at 0x10027D616: __kill (in /usr/lib/libSystem.B.dylib)
[3] ==47315==    by 0x1003022CB: __stack_chk_fail (in /usr/lib/libSystem.B.dylib)
[3] ==47315==    by 0x10005D9D3: MPIDI_CH3I_Progress (ch3_progress.c:539)
[3] ==47315==    by 0x100109B0C: MPIR_Wait_impl (wait.c:67)
[3] ==47315==    by 0x100109FFA: PMPI_Wait (wait.c:171)
[3] ==47315==    by 0x100004506: MPI_Barrier (nbc_pmpi_adapter.c:24)
[3] ==47315==    by 0x10000143B: main (icbarrier.c:45)

So I took a quick gander at the assembly for MPIDI_CH3I_Progress to find the canary value initialization and comparison code:

% objdump -d icbarrier > icbarrier.s
% sed -n -e '/MPIDI_CH3I_Progress>:/,/^$/ p' icbarrier.s | head -n 15
000000010005c7f0 <_MPIDI_CH3I_Progress>:
   10005c7f0:   55                      push   %rbp
   10005c7f1:   48 89 e5                mov    %rsp,%rbp
   10005c7f4:   41 57                   push   %r15
   10005c7f6:   41 56                   push   %r14
   10005c7f8:   41 55                   push   %r13
   10005c7fa:   41 54                   push   %r12
   10005c7fc:   53                      push   %rbx
   10005c7fd:   48 81 ec 08 02 00 00    sub    $0x208,%rsp
   10005c804:   48 89 bd 10 fe ff ff    mov    %rdi,-0x1f0(%rbp)
   10005c80b:   89 b5 0c fe ff ff       mov    %esi,-0x1f4(%rbp)
   10005c811:   48 8b 05 60 c8 10 00    mov    0x10c860(%rip),%rax        # 100169078 <_pvars+0x78>
   10005c818:   48 8b 10                mov    (%rax),%rdx
   10005c81b:   48 89 55 c8             mov    %rdx,-0x38(%rbp)
   10005c81f:   31 d2                   xor    %edx,%edx

So in this code it looks like the offset is -0x38 rather than -0x14. Let’s find the two addresses at which we will set our breakpoints:

% sed -n -e '/MPIDI_CH3I_Progress>:/,/^$/ p' icbarrier.s | grep -- '-0x38(%rbp)'
   10005c81b:   48 89 55 c8             mov    %rdx,-0x38(%rbp)
   10005ce61:   48 8b 55 c8             mov    -0x38(%rbp),%rdx

And now for our gdb script:

% cat gdb.script
break *0x10005c81b
break *0x10005ce61
commands 1
silent
set variable $cow = (unsigned long*)($rbp - 0x38)
watch *$cow
continue
end
set variable $count = 3
commands 2
silent
delete $count
set variable $count = $count + 1
continue
end
run

% mpiexec -n 4 xterm -e gdb -x gdb.script ./icbarrier

That command launches four xterm windows, each containing a single gdb instance that was running icbarrier as the inferior. Here’s the resulting xterm+gdb screenshot for the process that hit the watchpoint first:

Unlike in many situations, this backtrace alone was not sufficient for me to figure out what the real bug was. But, given that I changed some high level code (the nonblocking version of barrier, MPIX_Ibarrier), I knew it was extremely unlikely that I had tickled some long-latent bug in the frequently used routine MPIDI_CH3U_Request_unpack_uebuf. So after a few “WTFs” and some further head scratching, I took a hard look at the upper level code (error checking and boring code omitted for clarity):

int MPIR_Ibarrier_inter(MPID_Comm *comm_ptr, MPID_Sched_t s)
{
    int mpi_errno = MPI_SUCCESS;
    int size, rank, root;
    int buf = 0;

    /* do a barrier on the local intracommunicator */
    mpi_errno = MPIR_Ibarrier_intra(comm_ptr->local_comm, s);
    MPID_SCHED_BARRIER(s);

    /* rank 0 on each group does an intercommunicator broadcast to the
       remote group to indicate that all processes in the local group
       have reached the barrier. We do a 1-byte bcast because a 0-byte
       bcast will just return without doing anything. */

    /* first broadcast from left to right group, then from right to
       left group */
    if (comm_ptr->is_low_group) {
        root = (rank == 0) ? MPI_ROOT : MPI_PROC_NULL;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);
        MPID_SCHED_BARRIER(s);

        /* receive bcast from right */
        root = 0;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);
    }
    else {
        /* receive bcast from left */
        root = 0;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);        MPID_SCHED_BARRIER(s);        /* bcast to left */
        root = (rank == 0) ? MPI_ROOT : MPI_PROC_NULL;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);
    }

fn_exit:
    return mpi_errno;
fn_fail:
    goto fn_exit;
}

This function is called, adding operations to a communication schedule, then an MPI_Wait call is made later to actually force the bulk of the schedule to complete.

Do you see the bug? I’d be surprised if you did, since there are few people familiar with the style of MPICH2 code, and even fewer people who know how the NBC implementation works. But if you do, give yourself a pat on the back, and then please send me bug reports and patches for my other bugs :)

Take a closer look at the buffer used for the intercomm broadcast. It’s a stack allocated variable, which worked great in the blocking version of this routine. Unfortunately, this is a nonblocking barrier, so this routine exits before the ibcast is actually ever performed. So now we’ve managed to schedule a broadcast to occur that will overwrite a location on the stack that no longer actually refers to int buf;. Instead, if we are unlucky it will refer to some random piece of a different function’s stack frame. In the backtraces above, this random piece was the -fstack-protector/SSP canary memory. The bug moved around when I would insert debugging variables because they would affect the specific location of the canary value in memory, sometimes yielding a corrupted parameter or local variable instead.

Changing that buffer to be malloced instead fixed the problem.