Use the same stream as the OpenMM context when replaying a CUDA graph. #122
Conversation
Use a non-default stream for CUDA graph capturing.
Sadly this was not enough and the error popped up again. The workaround is `force.setProperty("CUDAGraphWarmupSteps", "5")`; I have no idea why 5 is the magic number. Requiring more than 1 (the default) does not surprise me (for instance, DDP requires 11). The error that is generated makes me think warmup is indeed the issue here, since this warning goes away when the reproducer runs:

[W manager.cpp:335] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(function runCudaFusionGroup)

I guess torch is sometimes choosing a non-graphable path during the first calls. It sounds like torch waits for a few calls before trying to pass the model through nvFuser, and by capturing the CUDA graph early we break this. I am not sure how to go about this; perhaps increasing the default warmup steps to something like 10 would be sensible? I could also rewrite it so that, instead of TorchForce calling the torch model X times during initialization, it tracks the number of times it has been called and only captures after X calls. I am more inclined to leave it as is, since whatever torch does to warm up might only happen if there is no other code in between calls.
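For reference, a minimal sketch of the workaround discussed above, assuming the string-based `setProperty` API of openmm-torch's `TorchForce` (the model file name and the `useCUDAGraphs` property are placeholders/assumptions here, only `CUDAGraphWarmupSteps` is taken from this thread):

```python
# Minimal sketch, not the exact reproducer from the linked issue.
import os

# Optionally surface nvFuser codegen failures instead of silently falling back,
# as the warning above suggests (must be set before torch JIT runs the model).
os.environ["PYTORCH_NVFUSER_DISABLE"] = "fallback"

from openmmtorch import TorchForce

force = TorchForce("model.pt")              # hypothetical TorchScript model file
force.setProperty("useCUDAGraphs", "true")  # assumed property name for enabling graphs
# Run the model several times before capturing, so torch has settled on a
# graphable code path; 5 was the value that worked in the reproducer above.
force.setProperty("CUDAGraphWarmupSteps", "5")
```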
@RaulPPelaez good work! I'll test the fix.
@RaulPPelaez I can preliminarily confirm that the fix works. I'll run a complete simulation to verify definitively.
Yes, I agree. You can leave it as it is. |
Looks good! Just be explicit in the docs.
Looks good.
Issue torchmd/torchmd-net#220 uncovered a race condition in the current graph replay mechanism. By asking torch for a CUDA stream, we risk a later OpenMM workload running on a different stream and thus reading data that is not yet ready.
In this PR I make sure that the CUDA graph replay uses the same stream as the OpenMM context, while ensuring that CUDA graph capture occurs in a non-default stream (a requirement for capture).
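The pattern is sketched below with the PyTorch Python API rather than the C++ code this PR actually touches; stream and graph names are illustrative. The idea is to capture on a side (non-default) stream, but always replay on whatever stream the surrounding workload (here, the OpenMM context) is using.

```python
# Conceptual sketch of the fix; the PR does the equivalent in C++ inside TorchForce.
import torch

model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(8, 16, device="cuda")

main_stream = torch.cuda.Stream()     # stands in for the OpenMM context's stream
capture_stream = torch.cuda.Stream()  # capture must happen on a non-default stream

# Warm up on the capture stream before capturing.
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    for _ in range(5):
        y = model(x)
torch.cuda.synchronize()

# Capture the graph on the non-default stream.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph, stream=capture_stream):
    y = model(x)

# Replay on the same stream the rest of the workload uses, so later kernels on
# that stream see the graph's outputs without a race.
with torch.cuda.stream(main_stream):
    graph.replay()
main_stream.synchronize()
print(y.sum().item())
```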
@raimis please confirm this fixes the original issue in your environment.