This plugin adds the Metal platform that accelerates OpenMM on Metal 3 GPUs. It supports Apple, AMD, and Intel GPUs running macOS Ventura or higher. The plugin uses Apple's OpenCL compatiblity layer (cl2Metal
) to translate OpenCL kernels to AIR.
Vendors:
- Apple Mac GPUs are supported in current and future releases.
- AMD and Intel GPUs are in long-term service mode. The v1.0.0 tag on GitHub works, but newer versions are failing tests on Intel Macs.
In Finder, create an empty folder and right-click it. Select New Terminal at Folder
from the menu, which launches a Terminal window. Copy and enter the following commands one-by-one.
conda install -c conda-forge openmm
git clone https://github.com/openmm/openmm
cd openmm && git fetch && git checkout 8.1_branch
cd ../
git clone https://github.com/philipturner/openmm-metal
cd openmm-metal
# requires password
bash build.sh --install --quick-tests
Next, you will benchmark OpenCL against Metal. In your originally empty folder, enter openmm/examples
. Arrange the files by name and locate benchmark.py
. Then, open the file with TextEdit or Xcode. Modify the top of the script as shown below:
from __future__ import print_function
import openmm.app as app
import openmm as mm
import openmm.unit as unit
from datetime import datetime
import argparse
import os
# What you need to add:
mm.Platform.loadPluginsFromDirectory(mm.Platform.getDefaultPluginsDirectory())
# Rest of the code:
def cpuinfo():
"""Return CPU info"""
Press Cmd + S
to save. Back at the Terminal window, type the following commands:
cd ../
cd openmm
cd examples
python3 benchmark.py --test apoa1rf --seconds 15 --platform OpenCL
python3 benchmark.py --test apoa1rf --seconds 15 --platform HIP
OpenMM's current energy minimizer hard-codes checks for the CUDA
, OpenCL
, and HIP
platforms. The Metal backend is currently labeled HIP
everywhere to bypass this limitation. The plugin name will change to Metal
after the next OpenMM version release, which fixes the issue. To prevent source-breaking changes, check for both the HIP
and Metal
backends in your client code.
Due to its dependency on OpenMM 8.0.0, the Metal plugin can't implement an optimization inside the main code base that speeds up small systems. Therefore, there is an environment variable that can force-disable the usage of nearest neighbor lists inside Metal kernels. Turning off the neighbor list can substantially improve simulation speed for systems with under 3000 atoms.
export OPENMM_METAL_USE_NEIGHBOR_LIST=0 # accepted, force-disables neighor list
export OPENMM_METAL_USE_NEIGHBOR_LIST=1 # accepted, forces usage of neighbor list
export OPENMM_METAL_USE_NEIGHBOR_LIST=2 # runtime crash
unset OPENMM_METAL_USE_NEIGHBOR_LIST # accepted, automatically chooses whether to use the list
Scaling behavior (Water Box, Amber Forcefield, PME):
Atoms | Force No List | Force Use List | Auto |
---|---|---|---|
6 | 1380 ns/day | 1120 ns/day | 1380 ns/day |
21 | 1260 ns/day | 1120 ns/day | 1260 ns/day |
96 | 1210 ns/day | 1040 ns/day | 1210 ns/day |
306 | 1010 ns/day | 860 ns/day | 1010 ns/day |
774 | 739 ns/day | 637 ns/day | 739 ns/day |
2661 | 490 ns/day | 438 ns/day | 490 ns/day |
4158 | 347 ns/day | 340 ns/day | 340 ns/day |
The above table is not a great example of the speedup possible by eliminating the sequential throughput bottleneck. Typically, this will provide an order of magnitude speedup. As in, not 18% speedup, but 18x speedup. Why that is not happening needs to be investigated further.
The Metal plugin can use the OpenCL queue profiling API to extract performance data about kernels. This API prevents batching of commands into command buffers, and reports the GPU start and end time of each buffer. The results are written to stdout in the JSON format used by https://ui.perfetto.dev.
export OPENMM_METAL_PROFILE_KERNELS=0 # accepted, does not profile
export OPENMM_METAL_PROFILE_KERNELS=1 # accepted, prints data to the console
export OPENMM_METAL_PROFILE_KERNELS=2 # runtime crash
unset OPENMM_METAL_PROFILE_KERNELS # accepted, does not profile
By default, energy summation is serialized among a single threadgroup. The reduceEnergy
kernel consumes a significant proportion of execution time for small systems. You can make reduction occur across more than one threadgroup with the following variable.
export OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS=1 # accepted, 1 threadgroup used
export OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS=3 # accepted, 3 threadgroups used
export OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS=-1 # runtime crash
unset OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS # accepted, 1024 threadgroups used
Scaling behavior (Water Box, Amber Forcefield, No Cutoff):
- top of table shows number of threadgroups
- cells show time to finish the
reduceEnergy
kernel
Atoms | OpenMM OpenCL | Metal TG=1 | Metal TG=2 | Metal TG=4 | Metal TG=8 |
---|---|---|---|---|---|
96 | ~65 µs | 9 µs | 7 µs | 6 µs | 6 µs |
306 | ~65 µs | 9 µs | 7 µs | - | - |
774 | ~65 µs | 9 µs | 7 µs | - | - |
2661 | ~65 µs | 9 µs | 7 µs | - | - |
4158 | ~65 µs | 9 µs | 7 µs | 6 µs | 6 µs |
You can increase the precision of energy measurements by setting METAL_REDUCE_ENERGY_THREADGROUPS
to a very large number. The GPU splits the energy into a large number of partial sums, which the CPU casts from FP32 -> FP64 and accumulates in mixed precision. An improvement in energy conservation has not yet been observed, because the tested system was too small to occupy many threadgroups. It is theorized to take effect with very large systems.
A recent study proved there is a ~4x reduction in standard deviation of singlepoint energies, provided a massive number of threadgroups is used. In addition the latency appeared to remain at 6 µs in Perfetto. Even with 1024 threadgroups dispatched, the execution time of reduceEnergy
was about 1/1000 of the time spent computing forces. In light of this study, the default parameter for OPENMM_METAL_REDUCE_ENERGY_THREADGROUPS
was changed from 1 to 1024.
https://gist.github.com/philipturner/20df5482f75b07d91466b972a1a46cd5
At the several million atom range, OpenMM starts to experience
export OPENMM_METAL_USE_LARGE_BLOCKS=0 # accepted, force-disables large blocks
export OPENMM_METAL_USE_LARGE_BLOCKS=1 # accepted, forces usage of large blocks
export OPENMM_METAL_USE_LARGE_BLOCKS=2 # runtime crash
unset OPENMM_METAL_USE_LARGE_BLOCKS # accepted, uses large blocks
Scaling behavior (Water Box, Amber Forcefield, PME):
- s/100K/500 means "seconds per 100,000 atoms per 500 steps"
Water Box Sizes | Atoms | s/100K/500 (32-blocks) | s/100K/500 (1024-blocks) |
---|---|---|---|
5.0 nm | 12255 | 4.29 | 4.21 |
10.0 nm | 98880 | 2.02 | 2.02 |
15.0 nm | 335025 | 1.95 | 1.90 |
20.0 nm | 792450 | 2.03 | 1.92 |
25.0 nm | 1550160 | 2.35 | 2.09 |
30.0 nm | 2682600 | 2.61 | 2.32 |
Quick tests:
Apple | AMD | Intel | |
---|---|---|---|
Last tested version | 1.1.0-dev | 1.0.0 | 1.0.0 |
Test | Apple FP32 | AMD FP32 | Intel FP32 |
---|---|---|---|
CMAPTorsion | ✅ | ✅ | ✅ |
CMAPMotionRemover | ✅ | ✅ | ✅ |
CMMotionRemover | ✅ | ✅ | ✅ |
Checkpoints | ✅ | ✅ | ✅ |
CompoundIntegrator | ✅ | ✅ | ✅ |
CustomAngleForce | ✅ | ✅ | ✅ |
CustomBondForce | ✅ | ✅ | ✅ |
CustomCVForce | ✅ | ✅ | ✅ |
CustomCentroidBondForce | ✅ | ✅ | ✅ |
CustomCompoundBondForce | ✅ | ✅ | ✅ |
CustomExternalForce | ✅ | ✅ | ✅ |
CustomGBForce | ✅ | ✅ | ✅ |
CustomHbondForce | ✅ | ✅ | ✅ |
CustomTorsionForce | ✅ | ✅ | ✅ |
DeviceQuery | ✅ | ✅ | ✅ |
FFT | ✅ | ✅ | ❌ |
GBSAOBCForce | ✅ | ✅ | ✅ |
GayBerneForce | ✅ | ✅ | ❌ |
HarmonicAngleForce | ✅ | ✅ | ✅ |
HarmonicBondForce | ✅ | ✅ | ✅ |
MultipleForces | ✅ | ✅ | ✅ |
PeriodicTorsionForce | ✅ | ✅ | ✅ |
RBTorsionForce | ✅ | ✅ | ✅ |
RMSDForce | ✅ | ✅ | ✅ |
Random | ✅ | ✅ | ✅ |
Settle | ✅ | ✅ | ✅ |
Sort | ✅ | ✅ | ✅ |
VariableVerlet | ✅ | ✅ | ✅ |
AmoebaExtrapolatedPolarization | ✅ | ✅ | ❌ |
AmoebaGeneralizedKirkwoodForce | ✅ | ✅ | ❌ |
AmoebaMultipoleForce | ✅ | ✅ | ❌ |
AmoebaTorsionTorsionForce | ✅ | ✅ | ✅ |
HippoNonbondedForce | ✅ | ✅ | ❌ |
WcaDispersionForce | ✅ | ✅ | ✅ |
DrudeForce | ✅ | ✅ | ✅ |
Long tests:
Apple | AMD | Intel | |
---|---|---|---|
Last tested version | 1.1.0-dev | 1.0.0 | 1.0.0 |
Test | Apple FP32 | AMD FP32 | Intel FP32 |
---|---|---|---|
AndersenThermostat | ✅ | ✅ | ✅ |
CustomManyParticleForce | ✅ | ✅ | ✅ |
CustomNonbondedForce | ✅ | ✅ | ✅ |
DispersionPME | ✅ | ✅ | ❌ |
Ewald | ✅ | ✅ | ❌ |
LangevinIntegrator | ✅ | ✅ | ✅ |
LangevinMiddleIntegrator | ✅ | ✅ | ✅ |
LocalEnergyMinimizer | ✅ | ❌ | ✅ |
MonteCarloFlexibleBarostat | ✅ | ✅ | ✅ |
NonbondedForce | ✅ | ✅ | ❌ |
VerletIntegrator | ✅ | ✅ | ✅ |
VirtualSites | ✅ | ✅ | ✅ |
DrudeNoseHoover | ✅ | ✅ | ✅ |
Very long tests:
Apple | AMD | Intel | |
---|---|---|---|
Last tested version | 1.1.0-dev | n/a | n/a |
Test | Apple FP32 | AMD FP32 | Intel FP32 |
---|---|---|---|
BrownianIntegrator | ✅ | - | - |
CustomIntegrator | ✅ | - | - |
MonteCarloAnisotropicBarostat | ✅ | - | - |
MonteCarloBarostat | ✅ | - | - |
NoseHooverIntegrator | ✅ | - | - |
VariableLangevinIntegrator | ✅ | - | - |
DrudeLangevinIntegrator | ✅ | - | - |
DrudeSCFIntegrator | ✅ | - | - |
Releases:
- v1.0.0
- OpenMM 8.0.0
- Initial release, bringing proper support for Mac GPUs
- v1.1.0
- OpenMM 8.1.0
- AMD and Intel go into long-term service
- Remove "HIP" platform name workaround, introduce "Metal"
- Remove double and mixed precision
- v1.2.0
- OpenMM 8.2.0
- openmm/openmm#4346
- openmm/openmm#4348
The Metal Platform uses OpenMM API under the terms of the MIT License. A copy of this license may be found in the accompanying file MIT.txt.
The Metal Platform is based on the OpenCL Platform of OpenMM under the terms of the GNU Lesser General Public License. A copy of this license may be found in the accompanying file LGPL.txt. It in turn incorporates the terms of the GNU General Public License, which may be found in the accompanying file GPL.txt.
The Metal Platform uses VkFFT by Dmitrii Tolmachev under the terms of the MIT License. A copy of this license may be found in the accompanying file MIT-VkFFT.txt.