
prov/shm: In-progress send via CMA (iov protocol) blocks following sends #9853

Open
wenduwan opened this issue Mar 4, 2024 · 9 comments

@wenduwan (Contributor) commented Mar 4, 2024

Describe the bug
We have observed an MPI application hang in shm between 2 processes where:
Receiver:

  • posts 8 B recv with tag 0 (Recv1)
  • posts 8 KB recv with tag 0 (Recv2)

Sender:

  • Process 1 posts 8 B send with tag 0 (Send1)
  • Process 1 posts 8 KB send with tag 1 (Send2)
  • Process 1 posts 8 KB send with tag 0 (Send3)

While the application waits for all requests to complete, we observed:

  • Recv1, Send1 completed
  • Recv2, Send2 and Send3 are unable to complete

Upon investigation, we found that Send2 was using prov/shm's iov protocol via CMA and was stuck in progress because no matching recv had been posted. Furthermore, it blocked subsequent send operations, i.e. Send3.

To Reproduce
I will provide a simpler reproducer later.
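
In the meantime, the pattern above corresponds roughly to the sketch below (hypothetical and untested; assumes two MPI ranks using nonblocking tagged point-to-point):

```c
/* Hypothetical reproducer sketch (untested): rank 0 is the receiver,
 * rank 1 is the sender, following the pattern described above. */
#include <mpi.h>

int main(int argc, char **argv)
{
    char small[8] = {0}, big[8192] = {0};
    MPI_Request req[3];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Irecv(small, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[0]);  /* Recv1 */
        MPI_Irecv(big, 8192, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[1]); /* Recv2 */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  /* Recv2 hangs when the bug triggers */
    } else if (rank == 1) {
        MPI_Isend(small, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[0]);  /* Send1, tag 0 */
        MPI_Isend(big, 8192, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &req[1]); /* Send2, tag 1:
                                                                          no matching recv,
                                                                          pending by design */
        MPI_Isend(big, 8192, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[2]); /* Send3, tag 0 */
        MPI_Wait(&req[0], MPI_STATUS_IGNORE); /* Send1 completes */
        MPI_Wait(&req[2], MPI_STATUS_IGNORE); /* Send3 hangs: blocked behind Send2 */
        /* req[1] is intentionally left pending, so this rank will not
         * finalize cleanly; the point is to observe whether Send3 completes. */
    }

    MPI_Finalize();
    return 0;
}
```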

Expected behavior
My expectation is that send operations should be progressed and completed independently, without blocking subsequent sends.

In this case, we should at least ensure that Recv1, Recv2, Send1 and Send3 all complete.


Environment:
Multiple OS including Amazon Linux 2 and Ubuntu 22.04


wenduwan added the bug label Mar 4, 2024
@hppritcha (Contributor) commented:

Can you check if you see this issue with the Open MPI 4.1.x release stream? The CMA usage has been around for a long time, so I'm surprised the btl/vader (aka sm) BTL isn't handling this sort of case correctly.

@shijin-aws (Contributor) commented:

I talked with @aingerson and she pointed out that changing https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_progress.c#L221 to continue may work. But IIUC we need more changes to shift the cirque entry pointer to the next entry. Currently the loop always polls the head of the cirque and discards the progressed entry to move forward.
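
To illustrate the restriction, here is a self-contained toy version of that loop (hypothetical names and simplified logic, not the actual smr_progress code):

```c
/* Toy model of head-of-line blocking in a circular command queue.
 * Names and structure are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define QSIZE 8

struct entry {
    int  id;
    bool can_complete; /* false models a CMA iov send with no matching recv */
};

struct cirque {
    struct entry ring[QSIZE];
    int head, count;
};

/* Current behavior: poll only the head; a stuck head blocks everything. */
static void progress_head_only(struct cirque *q)
{
    while (q->count > 0) {
        struct entry *e = &q->ring[q->head];
        if (!e->can_complete)
            break;                      /* head-of-line blocking */
        printf("completed entry %d\n", e->id);
        q->head = (q->head + 1) % QSIZE;
        q->count--;
    }
}

/* Sketch of a fix: scan past stuck entries instead of stopping at the
 * head.  A plain 'continue' in the original loop would just respin on
 * the same head; the cursor has to advance past the stuck entry. */
static void progress_scan(struct cirque *q)
{
    for (int i = 0; i < q->count; i++) {
        struct entry *e = &q->ring[(q->head + i) % QSIZE];
        if (e->can_complete && e->id >= 0) {
            printf("completed entry %d\n", e->id);
            e->id = -1;                 /* mark done; slot reclaimed later */
        }
    }
}

int main(void)
{
    struct cirque q = { .ring = {
        { 1, true  },   /* Send1: completes             */
        { 2, false },   /* Send2: stuck (no match yet)  */
        { 3, true  },   /* Send3: should still complete */
    }, .head = 0, .count = 3 };

    progress_head_only(&q); /* completes only entry 1 */
    progress_scan(&q);      /* completes entry 3 despite stuck entry 2 */
    return 0;
}
```

The toy scan glosses over the real complication noted above: once entries can complete out of order, ring slots can no longer be reclaimed by simply advancing the head.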

@shijin-aws (Contributor) commented Mar 4, 2024

> Can you check if you see this issue with the Open MPI 4.1.x release stream? The CMA usage has been around for a long time, so I'm surprised the btl/vader (aka sm) BTL isn't handling this sort of case correctly.

I don't think it's related to the Open MPI version. It's a restriction in the libfabric shm provider's CMA protocol implementation, and it can currently be exposed by running Open MPI with the EFA provider in libfabric >= 1.19 (which uses the shm provider as a peer provider and offloads unexpected message handling to shm completely).

@hppritcha (Contributor) commented:

> I don't think it's related to the Open MPI version. It's a restriction in the libfabric shm provider's CMA protocol implementation, and it can currently be exposed by running Open MPI with the EFA provider in libfabric >= 1.19 (which uses the shm provider as a peer provider and offloads unexpected message handling to shm completely).

Oh, sorry, I missed that.

@shijin-aws (Contributor) commented:

The customer is currently not blocked by this issue after modifying their application to not wait on send completions before posting receives. But this should be fixed after switching to the new shm developed by @aingerson, likely in libfabric 1.22 or 2.0

@wenduwan (Contributor, Author) commented:

I should mention that the user is lucky in this case to be able to mitigate by modifying the application - I wouldn't be surprised if they run into a scenario where they get stuck again - so we should fix this soon.

@aingerson (Contributor) commented:

@wenduwan Thanks for the context. We will work on getting the new shm ready. We just need to root cause the inline performance issues and then we should be ready to move.

@shefty (Member) commented Mar 14, 2024

Can you disable the offending protocol to avoid the hang?

@wenduwan (Contributor, Author) commented:

> Can you disable the offending protocol to avoid the hang?

We can. Switching to SAR is an alternative, though with a (slight) performance penalty, which is why we did not recommend it to the user.
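
For reference, a sketch of that workaround (the parameter name below is an assumption taken from the fi_shm(7) man page; verify it against the libfabric version in use):

```c
/* Workaround sketch: steer prov/shm away from the CMA iov path so that
 * large messages fall back to SAR.  FI_SHM_DISABLE_CMA is assumed from
 * the fi_shm(7) man page; verify before relying on it. */
#include <stdlib.h>

static void force_shm_sar(void)
{
    /* Must run before the provider reads its parameters, i.e. before
     * fi_getinfo() / MPI_Init(). Equivalent to exporting the variable
     * in the launch environment. */
    setenv("FI_SHM_DISABLE_CMA", "1", 1);
}
```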
