ROCm Sparse Marlin Kernels #1206

petrex · 2024-10-31T21:44:15Z

[WIP] Built on top of #1201. This pull request adds ROCm (Radeon Open Compute) support for the sparse Marlin kernel alongside the existing CUDA path, enabling the code to run on AMD GPUs.

The main changes involve conditional compilation to handle differences between CUDA and ROCm, as well as adding ROCm-specific intrinsics for MI300x.

Co-author: @lcskrishna

Key changes include:

ROCm Support in `setup.py`:

HIP kernel generation

Conditional Compilation in CUDA Source Files:

Added conditional compilation directives to exclude certain code for ROCm and include ROCm-specific implementations.

ROCm-specific Implementations:

Implemented ROCm-specific versions of functions and macros that are different from their CUDA counterparts, ensuring compatibility and performance on AMD GPUs.

Validation and benchmarking across workloads on MIxxx GPUs

ROCm build infrastructure

[ROCm] Enable Tiled layout extension and minor changes to setup
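The conditional-compilation approach described above can be illustrated with a small host-compilable sketch. It is not taken from the PR: the guard macro `USE_ROCM` is an assumption about how the build marks a ROCm target (`__HIP_PLATFORM_AMD__` is the macro HIP itself defines for AMD targets), and the names `kWarpSize`, `kIsRocm`, `warps_per_block`, and `lane_id` are hypothetical. The idea is to keep backend differences behind one guard and write the shared logic once.

```cpp
// Hypothetical sketch of the #ifdef pattern; not the PR's actual code.
// USE_ROCM is assumed to be defined by the build when targeting AMD GPUs.
#if defined(USE_ROCM) || defined(__HIP_PLATFORM_AMD__)
// AMD CDNA GPUs such as MI300X execute 64-wide wavefronts.
constexpr int kWarpSize = 64;
constexpr bool kIsRocm = true;
#else
// NVIDIA GPUs execute 32-wide warps.
constexpr int kWarpSize = 32;
constexpr bool kIsRocm = false;
#endif

// Shared logic, written once against the backend-neutral constant.
// Rounds up so a partially filled warp still counts as one warp.
constexpr int warps_per_block(int threads_per_block) {
  return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Index of a thread's lane within its warp (CUDA) or wavefront (ROCm).
constexpr int lane_id(int thread_idx) { return thread_idx % kWarpSize; }

// 256 threads divide evenly into whole warps on either backend.
static_assert(warps_per_block(256) * kWarpSize == 256,
              "256 threads should fill whole warps");
```

Compiled without the guard macro, this takes the CUDA branch; passing `-DUSE_ROCM` to the compiler selects the ROCm branch without touching the shared functions.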

pytorch-bot · 2024-10-31T21:44:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1206

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 11 New Failures

As of commit 00bc94d with merge base ce4822b:

NEW FAILURES - The following jobs have failed:

Build Linux Wheels / build / upload / manywheel-py3_9-cuda11_8 (gh)
Build Linux Wheels / build / upload / manywheel-py3_9-cuda12_1 (gh)
Build Linux Wheels / build / upload / manywheel-py3_9-cuda12_4 (gh)
Build M1 Wheels / pytorch/ao / upload / wheel-py3_9-cpu (gh)
##[error]The process '/usr/bin/git' failed with exit code 1
Build M1 Wheels / pytorch/ao / wheel-py3_9-cpu (gh)
##[error]The process '/usr/bin/git' failed with exit code 1
Run Float8 Tests / test (SM-89, linux.g6.4xlarge.experimental.nvidia.gpu, --pre torch --index-url https://download.p... / linux-job (gh)
RuntimeError: Command docker exec -t 1f7465b3eb9aa93e43c773a8302063c2f96eefff9b938fc8babafea4ade1734f /exec failed with exit code 1
Run Regression Tests / test (CPU Nightly, linux.4xlarge, --pre torch --index-url https://download.pytorch.org/whl/nightl... / linux-job (gh)
test/prototype/test_parametrization.py::TestFakeSparsity::test_jit_trace
Run Regression Tests / test (CUDA 2.3, linux.g5.12xlarge.nvidia.gpu, torch==2.3.0, cuda, 12.1) / linux-job (gh)
RuntimeError: Command docker exec -t 9d43d3322cfdfc3745486298fb01c74751280b18b3ef420175ee250c17f60516 /exec failed with exit code 1
Run Regression Tests / test (CUDA 2.4, linux.g5.12xlarge.nvidia.gpu, torch==2.4.0, cuda, 12.1) / linux-job (gh)
RuntimeError: Command docker exec -t fcf3c67fc4e261d69f1a63889d0b855ded539392483b54d7425a54eab29d0fae /exec failed with exit code 1
Run Regression Tests / test (CUDA 2.5, linux.g5.12xlarge.nvidia.gpu, torch==2.5.0 --index-url https://download.pytorch.o... / linux-job (gh)
RuntimeError: Command docker exec -t 9e3d147090aade825ff2c23f5c24053aadc8b0dac7368b45843121871ddc9ade /exec failed with exit code 1
Run Regression Tests / test (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://download.pytorc... / linux-job (gh)
RuntimeError: Command docker exec -t 9b51cc6131f66dae42099f5406a5129ffe5656071234904948ff0cdc89ffbbf6 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

msaroufim · 2024-11-04T05:24:04Z

setup.py

@@ -46,9 +46,11 @@ def read_version(file_path="version.txt"):
    CUDAExtension,


I think you might enjoy stack based PR development https://github.com/modularml/stack-pr

msaroufim · 2024-11-04T05:25:32Z

torchao/csrc/cuda/sparse_marlin/mem.h

@@ -19,6 +19,28 @@
 #include "base.h"

 namespace torchao {
+


@xw285cornell looking for some quick advice: do you recommend we support AMD by adding conditional compilation flags to our existing CUDA kernels, or would you be OK with some more copy-paste?

Chatted offline and indeed ifdefs are the way to go

msaroufim · 2024-11-04T20:32:26Z

Do you have performance numbers by any chance relative to fp16? Want to make sure the performance improvements are competitive with CUDA.

petrex · 2024-11-05T17:17:32Z

> Do you have performance numbers by any chance relative to fp16? wanna make sure the performance improvements are competitive with CUDA

Still WIP, but could you share the benchmark you are using? I will try it on MI300X when the PR is ready.

msaroufim · 2024-11-05T19:10:06Z

OK, holler at me again whenever you need a review. Really excited to see this land.

drisspg · 2024-11-05T23:06:03Z

Benchmarking is a little ad hoc; the best place to verify this today is https://github.com/pytorch/ao/blob/main/torchao/_models/llama/generate.py

lcskrishna and others added 20 commits October 16, 2024 05:19

enable build for rocm for fp6_llm

6d92e40

Merge pull request #1 from lcskrishna/cl/rocm-enablement

14b3fce

ROCm build infrastructure

add sparse_marlin kernel to the build

d2aadf2

drop .h from conversion

3f31e4e

cp_asyc4_pred_zfill() AMD implementation

7139bf1

implement matching mem utility with amd GCN isa

2e389f1

implement mma util with amd gcn isa

8b307d5

enable rocm path

9c918f7

update copy from global to lds

76ff70a

enable tiled layout extension

f1a22cf

fix build error related to option

0bef6ca

require rocm 6.2

893ae03

implement cvta_to_shared()

362d3cc

consolidate code with cvta_to_shared()

5c7d77b

enable tensor tiled layout extension with successful compilation

a0d3788

enable successful build

e4e654d

clean-up

3e2c6a1

Merge pull request #3 from lcskrishna/csrikris_enable_tensor_tile

c86880e

[ROCm] Enable Tiled layout extension and minor changes to setup

fix potential memory access issue

91d3c75

Merge branch 'rocm_enablement_staging' into rocm_sparse_marlin

00bc94d

pytorch-bot bot added the module: rocm label Oct 31, 2024

facebook-github-bot added the CLA Signed label Oct 31, 2024

msaroufim requested review from msaroufim and removed request for msaroufim November 2, 2024 22:51

msaroufim reviewed Nov 4, 2024

jcaip mentioned this pull request Nov 11, 2024

AMD integration tracker #1260

Open

1 task
