[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI#38757
[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI#38757mikaylagawarecki wants to merge 21 commits into
Conversation
e65aea2 to
44a3cbc
Compare
There was a problem hiding this comment.
Code Review
This pull request ports several CUDA kernels, including activation, AWQ, AllSpark, MLA, and Hadacore, to the stable ABI extension (_C_stable_libtorch) and enables this extension for ROCm (HIP). Key changes include the introduction of a device property cache using raw CUDA/HIP APIs, updating headers to use header-only Torch utilities, and refactoring source files to use torch::stable::Tensor. Review feedback identified a compilation error in torch_utils.h where std::once_flag cannot be used with std::deque::resize due to its non-copyable nature, and logic errors in hadacore_transform related to in-place processing and padding.
| #include <deque> | ||
| #include <mutex> | ||
| #include <string> | ||
| #include <vector> | ||
|
|
||
| // Stable ABI equivalent of TORCH_CHECK_NOT_IMPLEMENTED. | ||
| #define STD_TORCH_CHECK_NOT_IMPLEMENTED(cond, ...) \ | ||
| STD_TORCH_CHECK(cond, "NotImplementedError: ", __VA_ARGS__) | ||
|
|
||
| // Device properties cache for stable ABI compatibility. | ||
| // Uses raw CUDA/HIP APIs instead of ATen functions. | ||
| // Using inline ensures a single instance across all translation units. | ||
| inline std::deque<std::once_flag> device_flags; | ||
| inline std::vector<cudaDeviceProp> device_properties; | ||
| inline std::once_flag vectors_init_flag; | ||
|
|
||
| inline void do_init_device_vectors() { | ||
| int device_count; | ||
| cudaError_t err = cudaGetDeviceCount(&device_count); | ||
| if (err != cudaSuccess) { | ||
| STD_TORCH_CHECK(false, "cudaGetDeviceCount failed: " + | ||
| std::string(cudaGetErrorString(err))); | ||
| } | ||
| device_flags.resize(device_count); | ||
| device_properties.resize(device_count); |
There was a problem hiding this comment.
The use of std::deque<std::once_flag> and calling resize() on it will cause a compilation error because std::once_flag is non-copyable and non-movable. std::deque::resize requires the element type to be MoveInsertable. A better approach is to use a std::unique_ptr<std::once_flag[]> to manage the flags dynamically after querying the device count. This requires including <memory>.
#include <memory>
#include <mutex>
#include <string>
#include <vector>
// Stable ABI equivalent of TORCH_CHECK_NOT_IMPLEMENTED.
#define STD_TORCH_CHECK_NOT_IMPLEMENTED(cond, ...) \
STD_TORCH_CHECK(cond, "NotImplementedError: ", __VA_ARGS__)
// Device properties cache for stable ABI compatibility.
// Uses raw CUDA/HIP APIs instead of ATen functions.
// Using inline ensures a single instance across all translation units.
inline std::unique_ptr<std::once_flag[]> device_flags;
inline std::vector<cudaDeviceProp> device_properties;
inline std::once_flag vectors_init_flag;
inline void do_init_device_vectors() {
int device_count;
cudaError_t err = cudaGetDeviceCount(&device_count);
if (err != cudaSuccess) {
STD_TORCH_CHECK(false, "cudaGetDeviceCount failed: " +
std::string(cudaGetErrorString(err)));
}
device_flags = std::make_unique<std::once_flag[]>(device_count);
device_properties.resize(device_count);
}There was a problem hiding this comment.
see my comment on [5/n] #38671 (comment)
| hadacore::run_fht<SCALAR_TYPE>(x.data_ptr(), x.data_ptr(), x.numel(), had_size, stream); | ||
| }); | ||
|
|
||
| if (numel % 256 != 0) { | ||
| out = out.narrow(0, 0, numel / had_size); | ||
| out = torch::stable::narrow(out, 0, 0, numel / had_size); | ||
| } | ||
|
|
||
| if (inplace && out.data_ptr() != x.data_ptr()) { | ||
| x.copy_(out.view(res_shape)); | ||
| torch::stable::copy_(x, torch::stable::view(out, res_shape)); |
There was a problem hiding this comment.
There are critical logic errors in hadacore_transform regarding in-place operations and padding:
- Incorrect Output Pointer: The kernel
run_fhtis called withx.data_ptr()as the output pointer (line 799). Ifinplaceis false, the input tensorxis modified, and the returned tensoroutremains uninitialized. - Incorrect Copy-back Logic: When
inplaceis true and padding occurs,xis reassigned to a new padded tensor. The result is written to this new tensor. However, line 807 copies fromout(the original tensor) tox(the result tensor), which overwrites the computed result with the original input data. It should copy fromxback toout.
To fix this, the kernel should write to out.data_ptr(). If inplace is true and padding occurred, the result should be copied from out (the padded result) back to the original input storage (after narrowing).
There was a problem hiding this comment.
pre-existing?
969bfb0 to
60e21ce
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pure move, no code changes. Preparatory step for stable ABI migration. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pure move, no code changes. Preparatory step for stable ABI migration. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Restructure the stable ABI extension build so it compiles on both CUDA and HIP: - Widen outer guard to include HIP - Move CUDA-only sources (CUTLASS, FP4, AWQ, permute_cols) into a CUDA-conditional block - Gate USE_CUDA / CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL to CUDA; define USE_ROCM for HIP - Link PyTorch's bundled libamdhip64.so on ROCm to avoid a dual HIP runtime (from 985769a) - Enable _C_stable_libtorch in setup.py for HIP builds Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Move 9 basic activation ops (silu_and_mul, mul_and_silu, gelu_and_mul, gelu_tanh_and_mul, fatrelu_and_mul, swigluoai_and_mul, gelu_new, gelu_fast, gelu_quick) from the _C extension to _C_stable_libtorch. Convert ATen types/APIs to stable ABI equivalents: - torch::Tensor -> torch::stable::Tensor - ATen device guard/stream -> stable accelerator APIs - VLLM_DISPATCH_FLOATING_TYPES -> VLLM_STABLE_DISPATCH_FLOATING_TYPES - data_ptr -> mutable_data_ptr Quantized activation ops (silu_and_mul_quant, persistent_masked_m_silu_mul_quant) remain in _C. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
| @@ -1,17 +1,20 @@ | |||
| #include <cuda_fp16.h> | |||
| #include <cuda_runtime.h> | |||
|
|
|||
| #include <torch/all.h> | |||
| #include <c10/cuda/CUDAGuard.h> | |||
| #include "../../../cuda_compat.h" | |||
There was a problem hiding this comment.
changes to use relative paths are due to hipification errors I ran into after moving files into the libtorch_stable subdirectory
| torch::Tensor& out, // [..., d] | ||
| torch::Tensor const& input, // [..., d] | ||
| torch::Tensor const& scale, // various shapes | ||
| std::optional<std::tuple<int64_t, int64_t>> |
There was a problem hiding this comment.
schema change here is bc stableivalue conversions don't support tuple, from the user perspective in python there is no change, also we assert the size below on 219
| void gelu_and_mul(torch::Tensor& out, torch::Tensor& input); | ||
|
|
||
| void gelu_tanh_and_mul(torch::Tensor& out, torch::Tensor& input); | ||
|
|
||
| void fatrelu_and_mul(torch::Tensor& out, torch::Tensor& input, |
There was a problem hiding this comment.
note: some declarations are not removed a the cpu build is using the declarations from ops.h too (tests will fail if we remove them)
There was a problem hiding this comment.
Can you track these in the PR description?
675961a to
e8f3b67
Compare
Migrate static_scaled_fp8_quant, dynamic_scaled_fp8_quant, and dynamic_per_token_scaled_fp8_quant from _C to _C_stable_libtorch. Shared headers (common.cuh, utils.cuh) updated to work with both targets: utils.cuh uses torch::headeronly types; common.cuh uses Schema changed from (int,int)? to int[]? for group_shape to work with TORCH_BOX (std::tuple is not trivially copyable). Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Migrate gptq_gemm and gptq_shuffle from _C to _C_stable_libtorch. Key conversions: torch::Tensor -> torch::stable::Tensor, cuBLAS handle via get_current_cuda_blas_handle(), device().is_meta() check via DeviceType::Meta comparison, tensor creation via new_zeros/empty. Also removes #ifndef USE_ROCM guard on get_current_cuda_blas_handle() in torch_utils.h — hipify handles cublas_v2.h -> hipblas/hipblas.h and cublasHandle_t -> hipblasHandle_t automatically. Sub-headers (compat.cuh, matrix_view.cuh, qdq_*.cuh) are pure CUDA device code and need no changes. Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Claude Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Claude Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
e8f3b67 to
deea661
Compare
| void gelu_and_mul(torch::Tensor& out, torch::Tensor& input); | ||
|
|
||
| void gelu_tanh_and_mul(torch::Tensor& out, torch::Tensor& input); | ||
|
|
||
| void fatrelu_and_mul(torch::Tensor& out, torch::Tensor& input, |
There was a problem hiding this comment.
Can you track these in the PR description?
|
|
||
| #ifdef USE_ROCM | ||
| #include <hip/hip_runtime.h> | ||
| #include <hip/hip_bf16.h> |
There was a problem hiding this comment.
Concrete error without these is
AILED: [code=1] CMakeFiles/_C_stable_libtorch.dir/csrc/libtorch_stable/activation_kernels.hip.o
ccache /opt/rocm/lib/llvm/bin/clang++ -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_C_stable_libtorch -DTORCH_TARGET_VERSION=0x020A000000000000ULL -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_PROF_API=1 -DUSE_ROCM -DUSE_RPC -DUSE_TENSORPIPE -D_C_stable_libtorch_EXPORTS -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -I/data/users/mg1998/vllm/build/temp.linux-x86_64-cpython-312/csrc -isystem /home/mg1998/.conda/envs/pytorch/include/python3.12 -isystem /home/mg1998/.conda/envs/pytorch/lib/python3.12/site-packages/torch/include -isystem /home/mg1998/.conda/envs/pytorch/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/fbcode/platform010/lib/rocm-6.4.2/include/hiprand -isystem /usr/local/fbcode/platform010/lib/rocm-6.4.2/include/rocrand -Wno-unused-result -O2 -g -DNDEBUG -std=gnu++17 --offload-arch=gfx942 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -DUSE_ROCM -DENABLE_FP8 -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF_OPERATORS__ -Werror=unused-variable -fno-gpu-rdc -DTORCH_HIP_VERSION=701 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT CMakeFiles/_C_stable_libtorch.dir/csrc/libtorch_stable/activation_kernels.hip.o -MF CMakeFiles/_C_stable_libtorch.dir/csrc/libtorch_stable/activation_kernels.hip.o.d -o CMakeFiles/_C_stable_libtorch.dir/csrc/libtorch_stable/activation_kernels.hip.o -x hip -c /data/users/mg1998/vllm/build/temp.linux-x86_64-cpython-312/csrc/libtorch_stable/activation_kernels.hip
In file included from /data/users/mg1998/vllm/build/temp.linux-x86_64-cpython-312/csrc/libtorch_stable/activation_kernels.hip:8:
/data/users/mg1998/vllm/build/temp.linux-x86_64-cpython-312/csrc/libtorch_stable/../hip_vec_utils.cuh:78:28: error: unknown type name '__hip_bfloat162'; did you mean 'hip_bfloat16'?
78 | struct PackedTypeConverter<__hip_bfloat162> {
| ^~~~~~~~~~~~~~~
| hip_bfloat16
/opt/rocm/include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
57 | struct hip_bfloat16
| ^
In file included from /data/users/mg1998/vllm/build/temp.linux-x86_64-cpython-312/csrc/libtorch_stable/activation_kernels.hip:8:
/data/users/mg1998/vllm/build/temp.linux-x86_64-cpython-312/csrc/libtorch_stable/../hip_vec_utils.cuh:79:16: error: unknown type name '__hip_bfloat16'; did you mean 'hip_bfloat16'?
79 | using Type = __hip_bfloat16;
| ^~~~~~~~~~~~~~
| hip_bfloat16
/opt/rocm/include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
57 | struct hip_bfloat16
| ^
Ah concretely it seems like there is a bug here https://github.com/pytorch/pytorch/blob/main/torch/headeronly/util/BFloat16.h#L15-L17, the #include <cuda_bf16.h> gets hipified but the !defined(USE_ROCM) doeesn't
We defined USE_ROCM for _C_stable_libtorch to expose some of the shims that are gated :/
| if(VLLM_GPU_LANG STREQUAL "CUDA") | ||
| list(APPEND VLLM_STABLE_EXT_SRC | ||
| "csrc/cuda_utils_kernels.cu" | ||
| "csrc/cutlass_extensions/common.cpp" |
There was a problem hiding this comment.
how come this common.cpp needs to move to if CUDA?
There was a problem hiding this comment.
edit: hmm wait just to confirm you were referring to csrc/cuda_utils_kernels.cu not common.cpp (which is correctly CUDA-only) right?
technically csrc/cuda_utils_kernels.cu should be shared cuda/rocm, but there's some issues with it being in sources for both extensions when building on rocm, so I want to punt that problem which will be solved when we fully migrate it out of _C
The error looks like
CMake Error:
Running
'/home/mg1998/.conda/envs/pytorch/bin/ninja' '-C' '/data/users/mg1998/vllm/build/temp.linux-x86_64-cpython-312' '-t' 'recompact'
failed with:
ninja: error: build.ninja:713: multiple rules generate csrc/hip_utils_kernels.hip
| # also _is_hip() once https://github.com/vllm-project/vllm/issues/35163 is | ||
| # fixed | ||
| if _is_cuda(): | ||
| if _is_cuda() or _is_hip(): |
| # If PyTorch doesn't bundle libamdhip64 (built from source against system | ||
| # ROCm), there is only one copy in the process and no action is needed — | ||
| # the HIP compiler already links the system libamdhip64 automatically. | ||
| if(VLLM_GPU_LANG STREQUAL "HIP") |
There was a problem hiding this comment.
to improve my understanding, this code basically specifically picks out the amdhip64 that pytorch bundles in order to have deterministic correct results and not get corrupted?
There was a problem hiding this comment.
yea concretely there seem to be two cases
- pip installed pytorch on rocm (local) -- torch bundles libamdhip64, if we have raw
hipFoocalls in vllm they will use the system rocm, but calls intohipFoofrom libtorch itself will use torch's bundled libamdhip64 and there will be two device contexts -- we get gpucore dumps - rocm pytorch docker image (e.g. vllm CI) -- torch does not bundle libamdhip64, we are good.
|
This pull request has merge conflicts that must be resolved before it can be |
|
This pull request has merge conflicts that must be resolved before it can be |
|
This PR will be landed in #42663 and can be closed |
|
Superseded by newer PRs. |
Stacked on #38671, only the top 11 commits are relevant. Commits to review https://github.com/vllm-project/vllm/pull/38757/changes/2c13410412de95b648e9cd8562431dbe9481f9ee..deea6618c38afb4735b442c61e2697c273654292
Note: some declarations are not deleted from csrc/ops.h despite being moved to csrc/libtorch_stable/ops.h. This is because the CPU build also uses these declarations. These are
Purpose
#26946
Test Plan
pytest tests/kernels/core/test_activation.pypytest tests/kernels/quantization/test_ggml.pypytest tests/kernels/quantization/test_fp8_quant.pypytest tests/kernels/quantization/test_int8_quant.pyTest Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.