[FIX_FOR_VLLM_CUSTOM=e31915063da3f6d6be6080040de28f5bb6945acd] Fix GraphCaptureOutput stub and MultiModalDataDict import path #1280
Conversation
…aphCaptureOutput alias and MultiModalDataDict import path

Bug 1: Gaudi's custom PyTorch build cherry-picked the rename of `GraphCaptureOutput` -> `CaptureOutput` before bumping to 2.12. Upstream vLLM's `env_override.py` imports `GraphCaptureOutput` when torch < 2.12, which fails on Gaudi. Fix: a `_torch_compat` shim creates the alias, loaded via `conftest.py` (tests) and a `.pth` file (production).

Bug 2: Upstream vLLM moved `MultiModalDataDict` from `vllm.multimodal.inputs` to `vllm.inputs`. Updated `deepseek_ocr.py`.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Pull request overview
This PR addresses upstream vLLM compatibility regressions that currently break Gaudi CI by adding a Torch startup shim for GraphCaptureOutput and updating the MultiModalDataDict import path for the DeepSeek OCR model integration.
Changes:
- Add a Torch compatibility shim to provide a stub `GraphCaptureOutput` for Gaudi's custom PyTorch builds.
- Ensure the shim is loaded in tests and (intended) at runtime via a `.pth` startup hook.
- Update the `MultiModalDataDict` import in `deepseek_ocr.py` to match the upstream vLLM module move.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `vllm_gaudi/models/deepseek_ocr.py` | Update `MultiModalDataDict` import to `vllm.inputs`. |
| `vllm_gaudi/_torch_compat.py` | Add shim that injects `torch._dynamo.convert_frame.GraphCaptureOutput` when missing. |
| `tests/conftest.py` | Import shim early in test startup. |
| `vllm_gaudi_torch_compat.pth` | Add Python startup hook intended to load shim before vllm imports. |
| `setup.py` | Install the `.pth` file via `data_files`. |
`vllm_gaudi_torch_compat.pth`:

```diff
@@ -0,0 +1 @@
+import vllm_gaudi._torch_compat
```
The .pth imports vllm_gaudi._torch_compat, which first executes vllm_gaudi/__init__.py. That module imports vllm_gaudi.platform, and platform.py imports vllm (from vllm import envs), so this startup hook can end up importing vllm before the shim runs—defeating the intended ordering and potentially re-triggering the original env_override.py failure.
Consider changing the .pth to import a standalone shim module that does not import the vllm_gaudi package (or refactor vllm_gaudi/__init__.py to avoid importing platform at import-time) so the patch can be applied without pulling in vllm.
Suggested change:

```diff
-import vllm_gaudi._torch_compat
+import vllm_gaudi_torch_compat
```
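A minimal sketch of what such a standalone `vllm_gaudi_torch_compat.py` module could look like (the module name comes from the suggestion above; the helper name `ensure_graph_capture_output` is hypothetical). The key property is that it imports nothing from `vllm_gaudi` or `vllm`, so the `.pth` hook cannot trigger the import chain it is trying to patch:

```python
# Sketch of a standalone vllm_gaudi_torch_compat.py. It deliberately avoids
# importing vllm_gaudi (and therefore vllm), so the .pth startup hook cannot
# re-trigger the env_override.py failure it is meant to prevent.
import types


def ensure_graph_capture_output(convert_frame: types.ModuleType) -> bool:
    """Inject a stub GraphCaptureOutput into the given module if missing.

    Returns True if a stub was installed, False if the attribute existed.
    """
    if hasattr(convert_frame, "GraphCaptureOutput"):
        return False

    class _GraphCaptureOutputStub:
        """Stand-in for the class renamed upstream to CaptureOutput."""

        @classmethod
        def get_runtime_env(cls):
            # Never called on Gaudi; exists only so env_override.py can
            # import and monkey-patch it without raising.
            return None

    convert_frame.GraphCaptureOutput = _GraphCaptureOutputStub
    return True


try:
    # Only patch when torch is importable; the shim is a no-op otherwise.
    import torch._dynamo.convert_frame as _cf
except ImportError:
    pass
else:
    ensure_graph_capture_output(_cf)
```

Factoring the injection into a function that takes the target module as an argument also makes the shim testable without a Gaudi torch build.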
| """Root-level conftest – ensures torch compatibility shims are applied | ||
| before any ``import vllm`` happens during the test session. | ||
| """ | ||
|
|
||
| import vllm_gaudi._torch_compat # noqa: F401 -- side-effect: patches GraphCaptureOutput alias |
This import executes vllm_gaudi/__init__.py before _torch_compat.py runs; __init__.py imports vllm_gaudi.platform which imports vllm (from vllm import envs). That means this conftest does not guarantee the shim is applied before any import vllm, contrary to the module docstring, and may fail in the same way the shim is trying to prevent.
To ensure ordering, import a shim that doesn't import the vllm_gaudi package (or make vllm_gaudi/__init__.py lazy) and then apply the patch from there.
```python
data_files=[
    # Install a .pth file so the torch compat shim runs at Python startup,
    # before ``import vllm`` triggers env_override.py.
    (".", ["vllm_gaudi_torch_compat.pth"]),
],
```
Installing the .pth via data_files=[('.', ...)] typically places it under the scheme's data directory (often sys.prefix), not the purelib/site-packages directory where Python actually processes .pth files at startup. If it doesn't land in site-packages, the runtime shim won't run and the original import vllm failure will persist.
Please adjust the installation approach so the .pth ends up in site-packages (or use an alternative startup mechanism that is reliably loaded from site-packages).
Suggested change:

```diff
-data_files=[
-    # Install a .pth file so the torch compat shim runs at Python startup,
-    # before ``import vllm`` triggers env_override.py.
-    (".", ["vllm_gaudi_torch_compat.pth"]),
-],
```
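The mismatch the reviewer describes can be checked directly: Python only scans site directories (site-packages) for `.pth` files at startup, while `data_files=('.', ...)` installs relative to the "data" scheme root, which is typically `sys.prefix`. A small sketch comparing the two locations:

```python
# Compare where .pth files must live (purelib / site-packages) with where
# data_files=('.', ...) actually installs (the "data" scheme root).
import sysconfig

purelib = sysconfig.get_path("purelib")  # site-packages: .pth files are processed here
data_root = sysconfig.get_path("data")   # where ('.') data_files typically lands

print("purelib :", purelib)
print("data    :", data_root)
```

On a typical virtualenv these differ (`<venv>/lib/pythonX.Y/site-packages` vs `<venv>`), which is exactly why the `.pth` installed via `data_files` would never be loaded.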
```python
        return None

    _cf.GraphCaptureOutput = _GraphCaptureOutputStub  # type: ignore[attr-defined]
except Exception:
```
This catches all exceptions and silently ignores them. Since this module is meant to protect imports, swallowing unexpected errors (e.g., API changes, AttributeError, RuntimeError during torch init) can make failures much harder to diagnose.
Prefer catching the specific expected failures (e.g., ImportError / ModuleNotFoundError) and, if you still want to proceed silently, consider at least logging at debug level for unexpected exception types.
Suggested change:

```diff
-except Exception:
+except (ImportError, ModuleNotFoundError):
```
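A sketch of the narrower error handling the comment suggests, assuming the shim's entry point is a single function (the name `apply_patch` is hypothetical): expected failures stay silent, while anything unexpected leaves a debug-level trace instead of vanishing.

```python
# Narrow exception handling for a startup shim: expected failures (torch
# absent, module layout changed) are silently skipped; unexpected errors are
# logged at debug level so they remain diagnosable without breaking startup.
import logging

logger = logging.getLogger("vllm_gaudi._torch_compat")


def apply_patch() -> None:
    try:
        import torch._dynamo.convert_frame as _cf
    except (ImportError, ModuleNotFoundError):
        return  # torch missing or restructured: nothing to patch
    try:
        if not hasattr(_cf, "GraphCaptureOutput"):
            _cf.GraphCaptureOutput = type("GraphCaptureOutput", (), {})
    except Exception:
        # Unexpected failure: don't break interpreter startup, but leave
        # a trace for later diagnosis.
        logger.debug("torch compat shim failed", exc_info=True)


apply_patch()
```

(`ModuleNotFoundError` is a subclass of `ImportError`, so listing both is redundant but matches the suggestion above and documents intent.)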
```python
class _GraphCaptureOutputStub:
    """Stub standing in for the removed GraphCaptureOutput class."""

    def get_runtime_env(self):  # type: ignore[override]
```
The docstring says the stub provides a get_runtime_env class-method, but the stub defines it as an instance method. Even if it's "never called", matching the documented/expected callable shape is safer because the upstream monkey-patch may treat it as a classmethod.
Define get_runtime_env as a @classmethod (or otherwise match the upstream signature) to avoid subtle incompatibilities.
Suggested change:

```diff
-    def get_runtime_env(self):  # type: ignore[override]
+    @classmethod
+    def get_runtime_env(cls):  # type: ignore[override]
```
Summary

Fixes two import errors introduced by recent upstream vLLM changes that break all CI tests on Gaudi HPU.

Bug 1: `GraphCaptureOutput` ImportError (blocks all tests)

Upstream vLLM PR #37234 (commit `e31915063da`) added a monkey-patch in `env_override.py`, guarded by `not is_torch_equal_or_newer("2.12.0")`, that imports `GraphCaptureOutput` from `torch._dynamo.convert_frame` and patches its `get_runtime_env` method. Gaudi's PyTorch build (2.9.0+hpu) cherry-picked the upstream PyTorch fix (pytorch/177558) which:

- renamed `GraphCaptureOutput` to `CaptureOutput`
- removed the `get_runtime_env` method (the class is now empty)

Since Gaudi's torch reports version < 2.12.0, vLLM's guard activates and the import fails.

Fix: Add a `_torch_compat.py` shim that creates a stub `GraphCaptureOutput` class with a no-op `get_runtime_env` method. The stub satisfies `env_override.py`'s import and monkey-patching without error. The patched method is never called at runtime because Gaudi's PyTorch already contains the underlying fix. The shim is loaded:

- in `tests/conftest.py` (before any `import vllm`)
- via a `.pth` file installed into site-packages

Bug 2: `MultiModalDataDict` ImportError (affects deepseek_ocr)

Upstream vLLM PR #35182 (commit `ba2f0acc2`) moved `MultiModalDataDict` from `vllm.multimodal.inputs` to `vllm.inputs`.

Fix: Update the import path in `vllm_gaudi/models/deepseek_ocr.py`.

Files Changed

- `vllm_gaudi/_torch_compat.py`: `GraphCaptureOutput` stub
- `tests/conftest.py`: imports the shim early in test startup
- `vllm_gaudi_torch_compat.pth`: `.pth` file for runtime shim loading
- `setup.py`: `data_files` for `.pth` installation
- `vllm_gaudi/models/deepseek_ocr.py`: `MultiModalDataDict` import path

HPU Verification
Tested on Gaudi3 pod (torch `2.9.0+hpu_1.23.0`, Python 3.12):

- `import vllm` succeeds (was crashing before)
- `pytest tests/unit_tests/ops/test_hpu_fused_moe.py`: 1 passed
- `deepseek_ocr.py` import reaches past the fixed line

Jira: Related to hourly CI triage findings.
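As a footnote to Bug 2: module moves like the `MultiModalDataDict` relocation can also be absorbed with a fallback-import pattern instead of a hard path update. The helper below (`import_from_first` is a hypothetical name, not part of this PR) tries candidate modules in order and takes the name from the first one that provides it; for this PR the candidates would be `["vllm.inputs", "vllm.multimodal.inputs"]`.

```python
# Hypothetical helper for surviving upstream module moves: try each candidate
# module in order and return the named attribute from the first that has it.
import importlib


def import_from_first(name, candidates):
    for mod_name in candidates:
        try:
            module = importlib.import_module(mod_name)
        except ImportError:
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of {candidates!r}")


# Demonstrated with stdlib modules: OrderedDict lives in collections,
# not collections.abc, so the second candidate wins.
OrderedDict = import_from_first("OrderedDict", ["collections.abc", "collections"])
```

The trade-off is that silent fallbacks can mask real breakage, so a direct import update (as done in this PR) is the simpler choice when the new location is known.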