In-Tree AMD Zen CPU Backend via zentorch [1/N] #35970
tlrmchlsmth merged 11 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for AMD Zen CPUs via zentorch, including new Docker build targets, platform detection logic, and GEMM dispatching through zentorch. The changes are well-structured and include comprehensive tests. My review focuses on a few key areas for improvement: reducing Docker image size by cleaning up build dependencies, eliminating code duplication between setup.py and the runtime platform detection, and correcting the configuration for torch.compile's caching mechanism to prevent potential cache corruption. These changes will improve maintainability and ensure correctness.
Hi @amd-lalithnc, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Co-authored-by: Chinmay-Kulkarni-AMD <Chinmay.Kulkarni@amd.com> Change-Id: I51b2aa53b66937c6c388b50db09072ba83ec44de
Force-pushed from 6b0a20a to a23c37c
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: I223526ad52e11a94787f1e7770a63924ec2e8bb1
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: I09b760c50c62dbafcc910e46664754b56a8dbd57
tlrmchlsmth
left a comment
I think we can simplify the selection of the zentorch dependency
My suggestion would be to drop the auto-detection from setup.py, removing the _is_amd_zen_cpu() import, the VLLM_ZENTORCH_INSTALL env var, and the conditional requirements.append("zentorch"), and keep only the "zen" extra. This cleans up logic across setup.py and envs.py, removes a compile factor that shouldn't exist, and makes the dependency management explicit and predictable.
docker/Dockerfile.cpu
Outdated
######################### AMD SOURCE BUILD IMAGE #########################
FROM vllm-openai AS vllm-openai-amd-source
What is this needed for? Let's remove it to keep the upstream build simple
Removed the PyPI-based target, as the build from source is a customer requirement.
How long does it take to build from source?
Let's install zentorch from pypi with this image in order to reduce resources used in vLLM's CI and to make it quicker and easier for folks to build the image. We don't need to build zentorch from source in upstream vLLM, and like I said, better to keep it simpler and easier.
Hi @tlrmchlsmth, the source build takes a few minutes (it depends on the machine configuration, though). Overall the build time is much less than the vLLM-from-source build time. Also, going forward, when we support running the vLLM + zentorch CI on supported AMD CPU instances, we would like the Dockerfile to support this target. Should we also try this with the PyPI-based flow, or is a source-based build OK for this?
Hi @tlrmchlsmth, we have removed the source build in the latest update; we are only doing a PyPI-based build in the Docker image as well. We can bring back the source build in future PRs if there are specific requests for it.
vllm/model_executor/layers/utils.py
Outdated
layer.cpu_linear = (
    lambda x, weight, bias: torch.ops.zentorch.zentorch_linear_unary(
        x, zen_weight, bias, is_weight_prepacked=is_prepacked
    )
)
This captures zen_weight and is_prepacked by reference, but these are reassigned in the enclosing scope. It looks like it does the right thing for now, but it is fragile. Let's capture by value.
Suggested change:
layer.cpu_linear = (
    lambda x, weight, bias, _w=zen_weight, _p=is_prepacked:
        torch.ops.zentorch.zentorch_linear_unary(x, _w, bias, is_weight_prepacked=_p)
)
Hi @tlrmchlsmth, we had tried this before submitting the PR, but it fails in the aot_compile flow of PyTorch due to deviation from the op schema. The AOTAutogradCache expects the op schema (x, weight, bias, is_weight_prepacked) to be maintained, whereas with the additional arguments passing the modified weight by value it fails with the following error:
TypeError: DefaultsSource.__init__() takes from 3 to 4 positional arguments but 6 were given
Another related change, passing only the weight (and as a result the bias and is_weight_prepacked) by value, resulted in an FX graph propagation error with Dynamo (most likely due to the schema mismatch again).
I am guessing this is also why the VLLM_CPU_SGL_KERNEL path currently passes the packed weight and bias by reference.
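For context, the capture-by-reference pitfall discussed in this thread is ordinary Python late binding, independent of torch. A minimal torch-free illustration (all names here are ours, not from the PR):

```python
def make_closures():
    """Build two lists of zero-arg callables over a loop variable."""
    fns_ref, fns_val = [], []
    for i in range(3):
        # Captures i by reference: every lambda sees the final value of i.
        fns_ref.append(lambda: i)
        # Captures the current value of i via a default argument.
        fns_val.append(lambda _i=i: _i)
    return fns_ref, fns_val

ref, val = make_closures()
print([f() for f in ref])  # [2, 2, 2]
print([f() for f in val])  # [0, 1, 2]
```

The default-argument trick is the standard capture-by-value idiom, which is exactly what the review suggestion above applies; the thread explains why it conflicts with the op schema expected by AOTAutogradCache in this particular case.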
ah, OK then. Thanks for the explanation.
cc @ProExpertProg in case you have any ideas (but we should handle any improvements in a future PR)
I think getting a more minimal torch repro would be helpful, not sure I understand exactly what's going on
Can we add a vLLM issue link here, so the torch team can investigate?
vllm/platforms/zen_cpu.py
Outdated
cls._patch_fxgraphcache_pickle()

@classmethod
def _patch_fxgraphcache_pickle(cls):
    """Backport mainline ValueError fix to FxGraphCachePickler.dumps()."""
    import pickle

    from torch._inductor.codecache import FxGraphCachePickler

    original_dumps = FxGraphCachePickler.dumps
    if hasattr(original_dumps, "_zen_patched"):
        return

    def patched_dumps_method(self, obj):
        import logging as _logging

        from torch._inductor.codecache import BypassFxGraphCache

        _logger = _logging.getLogger("torch._inductor.codecache")
        try:
            self.dump(obj)
            return self._stream.getvalue()
        except (TypeError, AttributeError, pickle.PicklingError, ValueError) as e:
            _logger.warning("Failed to pickle cache key", exc_info=True)
            raise BypassFxGraphCache("Failed to pickle cache key") from e
        finally:
            self._stream.seek(0)
            self._stream.truncate(0)
Patching like this, by replacing the function definition wholesale, is brittle. IMO it would be better (less prone to future breakage) to wrap the original function, like this:
Suggested change:
@classmethod
def _patch_fxgraphcache_pickle(cls):
    from torch._inductor.codecache import BypassFxGraphCache, FxGraphCachePickler

    original_dumps = FxGraphCachePickler.dumps
    if hasattr(original_dumps, "_zen_patched"):
        return

    def patched_dumps(self, obj):
        try:
            return original_dumps(self, obj)
        except ValueError as e:
            raise BypassFxGraphCache("Failed to pickle cache key") from e

    patched_dumps._zen_patched = True
    FxGraphCachePickler.dumps = patched_dumps
Why do we need to patch inductor anyway?
When VLLM_ZENTORCH_WEIGHT_PREPACK=1 (default), zentorch prepacks weights into ZenDNN blocked-layout tensors at load time. These tensors raise ValueError when FxGraphCachePickler.dumps() tries to serialize the graph cache key, because PyTorch 2.10's dumps() doesn't catch ValueError.
PyTorch mainline already fixed this in pytorch/pytorch#176557 (merged 2026-03-04) by adding ValueError to the except clause. Our patch is a thin backport of that fix for 2.10 users, scoped to 2.10 <= version < 2.11, and will be removed once we drop 2.10 support.
Can we move this to the env_override.py file with other pytorch patches?
Seems like a good fix either way, no need to be zen-only
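The wrap-and-reraise pattern suggested above can be sketched with stand-in classes (Pickler and BypassCache here are illustrative substitutes, not the real torch._inductor types):

```python
class Pickler:
    """Stand-in for FxGraphCachePickler; the real class lives in torch._inductor.codecache."""
    def dumps(self, obj):
        if obj is None:
            raise ValueError("unsupported object")
        return repr(obj).encode()

class BypassCache(Exception):
    """Stand-in for BypassFxGraphCache."""

original_dumps = Pickler.dumps

def patched_dumps(self, obj):
    # Delegate to the original implementation and translate only the one
    # new failure mode; all other behavior is preserved unchanged.
    try:
        return original_dumps(self, obj)
    except ValueError as e:
        raise BypassCache("Failed to pickle cache key") from e

patched_dumps._zen_patched = True  # idempotence guard against double-patching
if not hasattr(Pickler.dumps, "_zen_patched"):
    Pickler.dumps = patched_dumps
```

Because the wrapper calls the captured original rather than re-implementing its body, future upstream changes to the stream handling or logging inside dumps() are picked up automatically; only the exception translation is pinned.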
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: I3e294003184afa678e3ece0e97ec0561fc619290
# (Zen CPU backend) eagerly prepack weights into ZenDNN blocked layout
# at model load time. Eliminates per-inference layout conversion overhead.
"VLLM_ZENTORCH_WEIGHT_PREPACK": lambda: bool(
    int(os.getenv("VLLM_ZENTORCH_WEIGHT_PREPACK", "1"))
),
We're trying to keep the number of environment variables minimal in vLLM.
Why wouldn't someone want to pre-pack the weights? If there's no compelling reason, could we remove the env?
Hi @tlrmchlsmth, since this environment variable is enabled by default, it should be transparent for most users. Currently this is mostly a debug feature for enabling other kernel variants from the ZenDNN library backend.
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: I943a52eae3dfaefe33130c5869af47d7f89deb54
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: Ic12ec89133419214aa618d27cc855db4cefb8a01
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: I13513045dd88b93a5f68ac72e0bf6507fa1bea85
Force-pushed from d7408e7 to 9b77dfc
docker/Dockerfile.cpu
Outdated
######################### AMD PYPI IMAGE #########################
FROM vllm-openai AS vllm-openai-amd
Do we like this layer name, or would it be better to be more specific and call it vllm-openai-zen so that it's clearly not related to AMD GPUs or to the amd64 isa?
tlrmchlsmth
left a comment
This looks good to me now, thank you!
- Use is_torch_equal_or_newer instead of TorchVersion for version check - Add comment to is_zen_cpu() that is_cpu is also True - Remove import torch (no longer needed) - Remove zen_cpu.py pickle exemption (no longer imports pickle) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
The name vllm-openai-amd was ambiguous — it could be mistaken for AMD GPU (ROCm) support or the amd64 ISA. vllm-openai-zen is more specific and consistent with the [zen] extra and ZenCpuPlatform. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Chinmay-Kulkarni-AMD <Chinmay.Kulkarni@amd.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Change-Id: I51b2aa53b66937c6c388b50db09072ba83ec44de
Purpose
- ZenCpuPlatform, a new platform subclass of CpuPlatform, activates when an AMD EPYC CPU with AVX-512 is detected and zentorch is installed. Detection reads /proc/cpuinfo for AuthenticAMD + avx512 flags.
- GEMMs dispatch to zentorch::zentorch_linear_unary instead of the default oneDNN/torch linear path.
- Weights are prepacked via zentorch_weight_prepack_for_linear at model load time (controlled by VLLM_ZENTORCH_WEIGHT_PREPACK, default on). Eliminates per-inference layout conversion overhead.
- New environment variables: VLLM_ZENTORCH_WEIGHT_PREPACK (toggle weight prepacking) and VLLM_ZENTORCH_INSTALL (toggle auto-install of zentorch during pip install vllm). Also adds a pip install vllm[zen] extras entry.
- New Docker build targets: vllm-openai-amd (installs zentorch from PyPI) and vllm-openai-amd-source (builds zentorch from source).
- Works around a PyTorch 2.10 FxGraphCachePickler.dumps bug (missing ValueError catch) that causes torch.compile cache misses. Scoped to ZenCpuPlatform only and will be removed once PT 2.10 is no longer supported.
- Adds is_zen_cpu() to the base Platform interface (returns False by default), overridden to True in ZenCpuPlatform.
- Tests: 2 for op dispatch (zentorch_linear_unary invocation + weight removal) and 4 for platform detection logic.

Test Plan
Test 1: Platform Detection (tests/test_zen_cpu_platform_detection.py)
Tests _is_amd_zen_cpu() from vllm.platforms with mocked /proc/cpuinfo. Four cases.
Run with: pytest tests/test_zen_cpu_platform_detection.py -v

Test 2: Op Dispatch (tests/model_executor/test_cpu_unquantized_gemm_dispatch.py)
Tests dispatch_cpu_unquantized_gemm() from vllm.model_executor.layers.utils. Registers a mock zentorch_linear_unary op (via torch.library) if real zentorch isn't installed, then monkeypatches current_platform.is_zen_cpu to return True. Two cases:
- layer.cpu_linear(x, weight, bias) produces the same output as F.linear(x, weight, bias)
- with remove_weight=True, layer.weight.numel() == 0 after dispatch
Run with: pytest tests/model_executor/test_cpu_unquantized_gemm_dispatch.py -v

Test Result
Tested on an EPYC Zen4 server
test_zen_cpu_platform_detection.py::test_is_amd_zen_cpu_detects_amd_with_avx512 PASSED
test_zen_cpu_platform_detection.py::test_is_amd_zen_cpu_returns_false_for_amd_without_avx512 PASSED
test_zen_cpu_platform_detection.py::test_is_amd_zen_cpu_returns_false_for_intel_with_avx512 PASSED
test_zen_cpu_platform_detection.py::test_is_amd_zen_cpu_returns_false_when_cpuinfo_missing PASSED
test_cpu_unquantized_gemm_dispatch.py::test_dispatch_cpu_unquantized_gemm_uses_zentorch_on_zen PASSED
test_cpu_unquantized_gemm_dispatch.py::test_dispatch_cpu_unquantized_gemm_zen_remove_weight PASSED