Making TorchForce CUDA-graph aware #103
Conversation
This is a MWE:

```python
import openmmtorch as ot
import torch
import openmm as mm
import numpy as np

class ForceModule(torch.nn.Module):
    def forward(self, positions):
        # return (torch.sum(torch.norm(positions, dim=1)), -2*positions)
        return (torch.sum(positions**2), -2*positions)

module = torch.jit.script(ForceModule())
torch_force = ot.TorchForce(module)
torch_force.setOutputsForces(True)
numParticles = 10
system = mm.System()
positions = np.random.rand(numParticles, 3)
for _ in range(numParticles):
    system.addParticle(1.0)
system.addForce(torch_force)
integ = mm.VerletIntegrator(1.0)
platform = mm.Platform.getPlatformByName('CUDA')
context = mm.Context(system, integ, platform)
context.setPositions(positions)
state = context.getState(getEnergy=True, getForces=True)
```

This is the exception printed if one tries to capture this module:

```
[W manager.cpp:329] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(function runCudaFusionGroup)
Traceback (most recent call last):
File "/shared/raul/openmm-torch/python/tests/graph.py", line 48, in <module>
state = context.getState(getEnergy=True, getForces=True)
File "/shared/raul/mambaforge/envs/openmmtorchtest/lib/python3.9/site-packages/openmm/openmm.py", line 10009, in getState
state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: TorchForce Failed to capture the model into a CUDA graph. Torch reported the following error:
The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__.py", line 8, in fallback_cuda_fuser
def forward(self: __torch__.ForceModule,
positions: Tensor) -> Tuple[Tensor, Tensor]:
_0 = (torch.sum(torch.pow(positions, 2)), torch.mul(positions, -2))
~~~~~~~~~ <--- HERE
return _0
Traceback of TorchScript, original code (most recent call last):
File "/shared/raul/openmm-torch/python/tests/graph.py", line 24, in fallback_cuda_fuser
"""
#return (torch.sum(torch.norm(positions,dim=1)), -2*positions)
return (torch.sum(positions**2), -2*positions)
~~~~~~~~~~~~ <--- HERE
RuntimeError: status != cudaStreamCaptureStatus::cudaStreamCaptureStatusInvalidated INTERNAL ASSERT FAILED at "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/work/c10/cuda/CUDACachingAllocator.cpp":1082, please report a bug to PyTorch.
```

Using the commented line instead results in no error.
Finish capturing before rethrowing if an exception occurred during capture
Same as my comments in #101 (comment), we should consider the alternatives for how to enable CUDA graphs. Could you comment on what you see as the advantages and disadvantages of the possibilities listed there, and why you chose this approach?
Pros of the current implementation:
Cons:
I believe adding a Property system will be beneficial in the future. There are surely other functionalities we could implement for TorchForce that we would want to turn on and off. As for other ways to use graphs, I can think of the following:
As for the mechanism to enable the CUDA graph functionality, I can think of some alternatives to the Property system:
For instance, whether this model can be captured depends on the value of some_flag at runtime:

```python
class ForceModule(torch.nn.Module):
    def forward(self, positions, some_flag):
        factor = 1
        if some_flag is not None:
            factor = 10
            torch.cuda.synchronize()  # synchronization is not allowed during capture
        return (factor * torch.sum(torch.norm(positions, dim=1)), -2*positions)
```

But if I know some_flag is None during the lifetime of TorchForce, I can enable CUDA graphs on it.
This reverts commit d20f4bf.
@peastman any more comments?
This may be reviewed again, @peastman.
This is looking great! Just a few more very minor comments, and then it should be ready to merge.
Looks good to me! @raimis do you have any more comments before we merge it?
This PR is a continuation of the work started by @raimis in #68.
CUDA graphs provide a way to group several operations into a single "graph". The benefit is that, by giving CUDA a set of guarantees (mainly static shapes and memory addresses, no synchronization, and no cuda* calls), it can avoid some overhead (mainly the time spent preparing kernel launches).
CUDA graphs shine with workloads that consist of many small kernel launches put together.
In its most basic form, a CUDA graph is constructed by "capturing" a stream: you do a dry run of the workload, which must happen entirely in that stream, and CUDA records everything that happens in it into a graph. The graph can then be replayed as many times as needed.
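To make the capture/replay workflow concrete, here is a minimal standalone sketch using PyTorch's torch.cuda.CUDAGraph API (the tensor names and the toy workload are illustrative only, not code from this PR):

```python
import torch

# Static input: during replay the graph always reads from this same address.
static_positions = torch.randn(10, 3, device="cuda")

# Warm up on a side stream so one-time initializations are not recorded.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    static_energy = torch.sum(static_positions**2)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture: the kernels are recorded into a graph instead of running normally.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_energy = torch.sum(static_positions**2)

# Replay: the recorded kernels run again on whatever data currently sits at the
# same addresses, so new inputs are copied into the captured tensors in place.
static_positions.copy_(torch.randn(10, 3, device="cuda"))
graph.replay()
print(static_energy)  # updated by the replay
```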
For TorchForce, the most evident use of this is to capture the forward and backward calls of the model into a graph.
For that, this PR aims to introduce the following changes:
For this, the solution proposed by @raimis in Support CUDA Graphs (#68), adding a property system to TorchForce, seems the most sensible.
Main changes in this PR go into this function:
`openmm-torch/platforms/cuda/src/CudaTorchKernels.cpp`, line 97 (at commit 769302a),
which performs three main operations:
Following from #101, only step 2 is introduced into a graph. The other two operations are essentially two additional kernel launches; in principle, they could also be included in the graph.
Right now, there are several synchronization barriers between each step. Also, I do not know what kind of guarantees `ContextSelector` in OpenMM provides (i.e. does it involve synchronization? cudaSetDevice?); I would need guidance on this.

Apart from this, I added the cherry-picked commits from #68 that implement the functionality to provide "Properties" to TorchForce. This includes modifying the constructors of TorchForce to accept an optional dictionary of properties, and the addition of `setProperty` and `getProperty` members.
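As a sketch only of how this could look from Python (the property name "useCUDAGraphs" and the string values are assumptions for illustration, not necessarily the final names in this PR):

```python
import openmmtorch as ot
import torch

module = torch.jit.script(ForceModule())  # ForceModule as in the MWE above

# Pass properties at construction time via the optional dictionary...
torch_force = ot.TorchForce(module, {"useCUDAGraphs": "true"})

# ...or toggle them afterwards with the new members.
torch_force.setProperty("useCUDAGraphs", "true")
print(torch_force.getProperty("useCUDAGraphs"))
```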
Caveats
Ungraphable operations
Many apparently innocuous operations are not graphable. For instance, this model is OK:
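A sketch consistent with the MWE earlier in this thread, where the norm-based model captured without error (the exact snippet intended here is an assumption):

```python
import torch

class GraphableModule(torch.nn.Module):
    def forward(self, positions):
        # Norm-based energy, as in the commented-out line of the MWE above;
        # that variant captured without error.
        return (torch.sum(torch.norm(positions, dim=1)), -2*positions)
```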
While this one is not:
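Correspondingly, a sketch of the failing case from the same MWE, where the power-based energy triggered the nvfuser fallback and invalidated the capture:

```python
import torch

class UngraphableModule(torch.nn.Module):
    def forward(self, positions):
        # This variant failed inside runCudaFusionGroup during stream capture
        # in the MWE above.
        return (torch.sum(positions**2), -2*positions)
```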
Luckily, CUDA and Torch are quite informative about which line is the offending one. Torch throws an exception that is easy to catch and handle.