Closed
Description
When posting a large number of MPI_Isend calls, the application deadlocks when using the OpenIB BTL. The issue is not sensitive to the size of the messages (it deadlocks even with size=0); it seems to depend only on the number of pending sends. Other BTLs are immune to the problem. When the deadlock happens, I have observed that the sequence numbers on the send side do not match the sequence numbers expected at the receiver.
Open MPI trunk rev: 955f3c2
The code below illustrates the issue when nmsg is large enough (roughly >= 2000).
There are two known ways to work around the issue:
- Using the MCA option -mca btl_openib_use_message_coalescing false fixes the issue.
- Using blocking sends (as in the #if 0 case) also avoids the deadlock, so the problem is only the number of pending sends, not the number of pending receives.
if( rank % 2 ) {
    for( k = 0; k < nmsg; k++ ) {
        MPI_Irecv( &(rb[k*msgsize]), msgsize, MPI_CHAR,
                   rank-1, 0, comm, &rreqs[k] );
    }
    MPI_Barrier( comm );
    MPI_Waitall( nmsg, rreqs, MPI_STATUSES_IGNORE );
}
else {
    MPI_Barrier( comm );
    for( k = 0; k < nmsg; k++ ) {
#if 0
        MPI_Send( &(sb[k*msgsize]), msgsize, MPI_CHAR,
                  rank+1, 0, comm );
#else
        MPI_Isend( &(sb[k*msgsize]), msgsize, MPI_CHAR,
                   rank+1, 0, comm, &sreqs[k] );
#endif
    }
    MPI_Waitall( nmsg, sreqs, MPI_STATUSES_IGNORE );
}