Open IB deadlocks with coalescing and multiple pending sends #343

@abouteiller

Description

When a large number of Isend requests are posted, the application deadlocks when using the OpenIB BTL. The issue is not sensitive to the size of the messages (it deadlocks even with size=0); it seems to depend only on the number of pending sends. Other BTLs are immune to the problem. When the deadlock happens, I have observed that the send sequence numbers and the sequence numbers expected at the receiver do not match.

Open MPI trunk rev: 955f3c2

The code below illustrates the issue (when nmsg is large enough, roughly >= 2000).
There are 2 known ways to work around the issue:

  • Using the MCA option -mca btl_openib_use_message_coalescing false fixes the issue.
  • Using blocking sends (as in the #if 0 case) also avoids it, so the problem is only the number of pending sends, not the number of pending receives.
        if( rank % 2 ) {
            /* Odd ranks: pre-post every receive, then sync with the sender */
            for( k = 0; k < nmsg; k++ ) {
                MPI_Irecv( &(rb[k*msgsize]), msgsize, MPI_CHAR,
                           rank-1, 0, comm, &rreqs[k] );
            }
            MPI_Barrier( comm );
            MPI_Waitall( nmsg, rreqs, MPI_STATUSES_IGNORE );
        }
        else {
            /* Even ranks: start sending only once the receives are posted */
            MPI_Barrier( comm );
            for( k = 0; k < nmsg; k++ ) {
#if 0
                /* Blocking sends: no deadlock */
                MPI_Send( &(sb[k*msgsize]), msgsize, MPI_CHAR,
                           rank+1, 0, comm );
#else
                /* Non-blocking sends: deadlocks once nmsg is large enough */
                MPI_Isend( &(sb[k*msgsize]), msgsize, MPI_CHAR,
                           rank+1, 0, comm, &sreqs[k] );
#endif
            }
            MPI_Waitall( nmsg, sreqs, MPI_STATUSES_IGNORE );
        }
