Making TorchForce CUDA-graph aware #103

Merged · 47 commits · Apr 21, 2023

Conversation

@RaulPPelaez (Contributor) commented Mar 29, 2023

This PR is a continuation of the work started by @raimis in #68.
CUDA graphs provide a way to group several operations into a single "graph". The benefit is that, by giving CUDA a certain set of guarantees (mainly static shapes and memory addresses, no synchronization, and no cuda* calls), it can eliminate some of the overhead, mainly the time spent preparing kernel launches.
CUDA graphs shine with workloads that consist of many small kernel launches put together.

In its most basic form, a CUDA graph is constructed by "capturing" a stream. In essence you do a dry run of the workload, which must happen in the same stream, and CUDA records everything that happens in it into a graph. The graph can then be replayed as many times as needed.
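
In PyTorch, the capture/replay pattern looks roughly like the sketch below. This is only an illustration of the mechanism (the plugin does the equivalent through libtorch in C++); MyModule, numParticles and the number of warm-up iterations are placeholders of my own:

import torch

class MyModule(torch.nn.Module):  # hypothetical toy model
    def forward(self, positions):
        return torch.sum(torch.norm(positions, dim=1))

model = torch.jit.script(MyModule()).cuda()
numParticles = 10
static_pos = torch.randn(numParticles, 3, device="cuda")

# Warm up on a side stream before capturing, as recommended by the PyTorch docs
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        model(static_pos)
torch.cuda.current_stream().wait_stream(side)

# Capture: kernels launched inside this block are recorded into the graph, not executed
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_pos)

# Replay: refresh the static input in place, then launch the whole graph at once
static_pos.copy_(torch.rand(numParticles, 3, device="cuda"))
graph.replay()
# static_out now holds the result for the new positions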

For TorchForce, the most evident use of this is to capture the forward and backward calls into a graph.

For that, this PR aims to introduce the following changes:

  • Capture the model into a graph
    • Add members to store graph objects
    • Guard the functionality with a boolean flag and a macro. This flag is forced to false if the PyTorch version is too old.
    • Let the user control whether graphs are enabled.
      For this, the solution proposed by @raimis in Support CUDA Graphs #68, adding a property system to TorchForce, seems the most sensible.
      • Implement the property system
  • Fail with a meaningful and recoverable error if the model fails to capture.
    • Torch throws a verbose exception pointing to the line that failed to capture; this is rethrown as an OpenMM exception and can be handled without issue.
  • Add tests for the new functionality
    • CUDA graph tests
    • Set property tests

Main changes in this PR go into this function:

double CudaCalcTorchForceKernel::execute(ContextImpl& context, bool includeForces, bool includeEnergy) {

which performs three main operations:

  1. Copy positions and box vectors from OpenMM into torch tensors
  2. Execute the model to get energies and/or forces
  3. Return energy and, if required, add forces to OpenMM

Following on from #101, only step 2 is introduced into a graph. The other two operations are essentially two additional kernel launches; in principle, they could also be included in the graph.
Right now, there are several synchronization barriers between the steps. Also, I do not know what kind of guarantees ContextSelector in OpenMM provides (i.e. does it involve synchronization? cudaSetDevice?), so I would need guidance on this.
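
For reference, here is a hedged Python analogue of what capturing step 2 amounts to (the real implementation does this with the equivalent libtorch calls inside CudaCalcTorchForceKernel::execute; EnergyModule is a placeholder model of my own, and including torch.autograd.grad in the captured region is what pulls the backward pass into the graph):

import torch

class EnergyModule(torch.nn.Module):  # hypothetical energy-only model
    def forward(self, positions):
        return torch.sum(torch.norm(positions, dim=1))

module = torch.jit.script(EnergyModule()).cuda()
numParticles = 10
posTensor = torch.zeros(numParticles, 3, device="cuda", requires_grad=True)

# (warm-up on a side stream omitted here for brevity; see the sketch above)
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    energy = module(posTensor)                         # step 2: forward pass...
    forces, = torch.autograd.grad(energy, posTensor)   # ...and backward pass, inside the same graph

# Steps 1 and 3 (the copies to/from the OpenMM buffers) remain outside the graph:
with torch.no_grad():
    posTensor.copy_(torch.rand(numParticles, 3, device="cuda"))  # step 1: refresh input in place
graph.replay()                                                    # step 2: replay forward + backward
# step 3: 'energy' and 'forces' now hold the results for the new positions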

Apart from this, I added the cherry-picked commits from #68 that implement the functionality to provide "Properties" to TorchForce. This includes modifying the constructors of TorchForce to accept an optional dictionary of properties, and adding setProperty and getProperty members.
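
From Python, the resulting interface looks roughly like this (the property name useCUDAGraphs comes from the discussion below; treat the exact accepted values as an assumption on my part):

import torch
import openmmtorch as ot

class ForceModule(torch.nn.Module):
    def forward(self, positions):
        return (torch.sum(torch.norm(positions, dim=1)), -2 * positions)

module = torch.jit.script(ForceModule())

# Enable graph capture at construction time via the optional property dictionary...
torch_force = ot.TorchForce(module, {'useCUDAGraphs': 'true'})

# ...or toggle it later on an existing force
torch_force.setProperty('useCUDAGraphs', 'false')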

Caveats

Ungraphable operations

Many apparently innocuous operations are not graphable. For instance, this model is fine:

class ForceModule(torch.nn.Module):

    def forward(self, positions):
        return (torch.sum(torch.norm(positions,dim=1)), -2*positions)

While this one is not:

class ForceModule(torch.nn.Module):

    def forward(self, positions):
        return (torch.sum(positions**2), -2*positions)

Luckily, CUDA and Torch are quite informative about which line is the offending one. Torch throws an exception that is easy to catch and handle.

@RaulPPelaez (Contributor, Author)

This is an MWE:

import openmmtorch as ot
import torch
import openmm as mm
import numpy as np

class ForceModule(torch.nn.Module):
    def forward(self, positions):
        #return (torch.sum(torch.norm(positions,dim=1)), -2*positions)
        return (torch.sum(positions**2), -2*positions)

module = torch.jit.script(ForceModule())
torch_force = ot.TorchForce(module)
torch_force.setOutputsForces(True)
numParticles = 10
system = mm.System()
positions = np.random.rand(numParticles, 3)
for _ in range(numParticles):
    system.addParticle(1.0)
system.addForce(torch_force)
integ = mm.VerletIntegrator(1.0)
platform = mm.Platform.getPlatformByName('CUDA')
context = mm.Context(system, integ, platform)
context.setPositions(positions)
state = context.getState(getEnergy=True, getForces=True)

This is the exception printed if one tries to capture this module:

[W manager.cpp:329] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
 (function runCudaFusionGroup)
Traceback (most recent call last):
  File "/shared/raul/openmm-torch/python/tests/graph.py", line 48, in <module>
    state = context.getState(getEnergy=True, getForces=True)
  File "/shared/raul/mambaforge/envs/openmmtorchtest/lib/python3.9/site-packages/openmm/openmm.py", line 10009, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: TorchForce Failed to capture the model into a CUDA graph. Torch reported the following error:
The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 8, in fallback_cuda_fuser
  def forward(self: __torch__.ForceModule,
    positions: Tensor) -> Tuple[Tensor, Tensor]:
    _0 = (torch.sum(torch.pow(positions, 2)), torch.mul(positions, -2))
                    ~~~~~~~~~ <--- HERE
    return _0

Traceback of TorchScript, original code (most recent call last):
  File "/shared/raul/openmm-torch/python/tests/graph.py", line 24, in fallback_cuda_fuser
        """
        #return (torch.sum(torch.norm(positions,dim=1)), -2*positions)
        return (torch.sum(positions**2), -2*positions)
                          ~~~~~~~~~~~~ <--- HERE
RuntimeError: status != cudaStreamCaptureStatus::cudaStreamCaptureStatusInvalidated INTERNAL ASSERT FAILED at "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/work/c10/cuda/CUDACachingAllocator.cpp":1082, please report a bug to PyTorch. 

Using the commented line instead results in no error.

@RaulPPelaez (Contributor, Author)

All the functionality for this PR is implemented and tested. Please review, @raimis @peastman.

@RaulPPelaez changed the title from "[WIP] Making TorchForce CUDA-graph aware" to "Making TorchForce CUDA-graph aware" on Mar 30, 2023
@raimis requested review from raimis and peastman on March 30, 2023 at 11:27
@peastman (Member)

As in my comments in #101 (comment), we should consider the alternatives for how to enable CUDA graphs. Could you comment on what you see as the advantages and disadvantages of the possibilities listed there, and why you chose this approach?

@RaulPPelaez (Contributor, Author)

Pros of the current implementation:

  • Non-intrusive. If one does not care about CUDA graphs, nothing changes in the interface.
  • Explicit in enabling the functionality. The user has to write useCUDAGraphs when creating TorchForce or when setting a property, and must therefore know what they are doing.
  • Easy recovery, with informative errors, if the model is not capturable. One can just try to capture and fall back to eager mode (via setProperty) if it does not work; see the sketch after this list.
  • Compatible with other kinds of graph construction. Since the default is eager mode, one can capture the model into a graph in some other way and pass it to TorchForce (losing only some kernels, such as the backward pass in energy-only mode).
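
A sketch of that recovery pattern, reusing the context and torch_force from the MWE above (and assuming the property can be changed after construction and that the Context needs to be reinitialized for the change to take effect):

import openmm as mm

try:
    # The first evaluation triggers graph capture and raises if the model is not capturable
    state = context.getState(getEnergy=True, getForces=True)
except mm.OpenMMException:
    # Fall back to eager execution and retry
    torch_force.setProperty('useCUDAGraphs', 'false')
    context.reinitialize(preserveState=True)
    state = context.getState(getEnergy=True, getForces=True)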

Cons

  • Only the model itself is graphed, not the copies to/from the OpenMM containers. Thus, TorchForce is not itself a graph node and cannot easily be included in larger graphs.
  • Requires adding a Properties infrastructure. This is an additional dev burden that we might or might not want.

I believe adding a Property system will be beneficial in the future. There are surely other functionalities we will implement for TorchForce that we would like to be able to turn on and off.

As for other ways to use graphs, I can think of the following:

  1. Do not touch TorchForce: we could tweak TorchForce so that calls to execute can be put through a graph. This would require ensuring that the copies to/from the OpenMM arrays are graphable (I do not know how to do that). Since TorchForce is supposed to be called by OpenMM, I do not think this is a real possibility, i.e. there is nowhere to capture the thing.
  2. Create a CudaGraphableTorchForce wrapper class: in essence, a copy of TorchForce that does what this PR does now. I believe this is too much code and dev burden. It would also hinder the addition of new functionality in the future.
  3. Let the user do it: graph the forward call inside the model using PyTorch. This has several downsides:
    • Requires modifying the model, which might not be possible (for instance if the user does torch.load of a .pt file).
    • Would only capture the forward call (currently the backward pass is also included).

As for the mechanism to enable the CUDA graph functionality, I can think of some alternatives to the Property system:

  1. A member function called "enableCUDAGraphs()". Cons:
    • It only allows communicating "on/off"; what if we want more graph modes in the future, for instance "only_model" or "nowarmup"?
    • It requires the same level of documentation as the current approach while being less extensible.
  2. Some global state flag. This seems far removed from the rest of the interface and would also prevent per-instance enabling/disabling.
  3. Communicate the CUDA graph enabling via the model:
    • Requires modification of the model (which might not be possible).
    • Some models might not be compatible with CUDA graphs in general but could still be graphed by TorchForce. A model might have lots of branching, making it "incompatible with CUDA graphs", but as long as the model is guaranteed to follow the same path every time while owned by TorchForce, it is fine to graph it.
      For instance, this is not graphable in general:
class ForceModule(torch.nn.Module):
    def forward(self, positions, some_flag):
        factor = 1
        if some_flag is not None:
            factor = 10
            torch.cuda.synchronize()
        return (factor * torch.sum(torch.norm(positions, dim=1)), -2 * positions)

But if I know some_flag is None during the lifetime of TorchForce, I can enable CUDA graphs on it.

@raimis (Contributor) commented Apr 14, 2023

@peastman any more comments?

@raimis mentioned this pull request on Apr 14, 2023
@RaulPPelaez (Contributor, Author)

This may be reviewed again @peastman

@peastman (Member) left a comment

This is looking great! Just a few more very minor comments, and then it should be ready to merge.

@peastman (Member)

Looks good to me! @raimis do you have any more comments before we merge it?

@raimis merged commit b76deb4 into openmm:master on Apr 21, 2023