Use the same stream as the OpenMM context when replaying a CUDA graph. #122
Conversation
Use a non-default stream for CUDA graph capturing.
Sadly this was not enough and the error popped up again. The workaround is `force.setProperty("CUDAGraphWarmupSteps", "5")`; I have no idea why 5 is the magic number. Requiring more than 1 (the default) does not surprise me (for instance, DDP requires 11). The error that is generated makes me think warmup is indeed the issue here, since this warning goes away when the reproducer runs:

[W manager.cpp:335] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(function runCudaFusionGroup)

I guess torch is sometimes choosing a non-graphable path during the first calls. It sounds like torch waits for a few calls before trying to pass the model through nvFuser, and by capturing the CUDA graph early we break this. I am not sure how to go about this; perhaps increasing the default warmup steps to something like 10 would be sensible? I could also rewrite it so that, instead of TorchForce calling the torch model X times during initialization, it tracks the number of times it has been called and only captures after X calls. I am more inclined to leave it as is, since whatever torch does to warm up might only happen if there is no other code in between calls.
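For reference, a minimal sketch of the workaround discussed above, assuming the string-based `setProperty` API of openmm-torch's `TorchForce` (the model file name and the `useCUDAGraphs` property are placeholders/assumptions here, only `CUDAGraphWarmupSteps` is taken from this thread):

```python
# Minimal sketch, not the exact reproducer from the linked issue.
import os

# Optionally surface nvFuser codegen failures instead of silently falling back,
# as the warning above suggests (must be set before torch JIT runs the model).
os.environ["PYTORCH_NVFUSER_DISABLE"] = "fallback"

from openmmtorch import TorchForce

force = TorchForce("model.pt")              # hypothetical TorchScript model file
force.setProperty("useCUDAGraphs", "true")  # assumed property name for enabling graphs
# Run the model several times before capturing, so torch has settled on a
# graphable code path; 5 was the value that worked in the reproducer above.
force.setProperty("CUDAGraphWarmupSteps", "5")
```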
@RaulPPelaez good work! I'll test the fix.
@RaulPPelaez I can preliminarily confirm that the fix works. I'll run a complete simulation to verify definitively.
Yes, I agree. You can leave it as it is. |
Looks good! Just be explicit in the docs.
Looks good.
Issue torchmd/torchmd-net#220 uncovered a race condition in the current graph replay mechanism. By asking torch for a CUDA stream, we risk a later OpenMM workload running on a different stream and thus reading data that is not yet ready.
In this PR I make sure that the CUDA graph replay uses the same stream as the OpenMM context, while ensuring that CUDA graph capture occurs in a non-default stream (a requirement for capture).
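The pattern is sketched below with the PyTorch Python API rather than the C++ code this PR actually touches; stream and graph names are illustrative. The idea is to capture on a side (non-default) stream, but always replay on whatever stream the surrounding workload (here, the OpenMM context) is using.

```python
# Conceptual sketch of the fix; the PR does the equivalent in C++ inside TorchForce.
import torch

model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(8, 16, device="cuda")

main_stream = torch.cuda.Stream()     # stands in for the OpenMM context's stream
capture_stream = torch.cuda.Stream()  # capture must happen on a non-default stream

# Warm up on the capture stream before capturing.
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    for _ in range(5):
        y = model(x)
torch.cuda.synchronize()

# Capture the graph on the non-default stream.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph, stream=capture_stream):
    y = model(x)

# Replay on the same stream the rest of the workload uses, so later kernels on
# that stream see the graph's outputs without a race.
with torch.cuda.stream(main_stream):
    graph.replay()
main_stream.synchronize()
print(y.sum().item())
```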
@raimis please confirm this fixes the original issue in your environment.