[DRIVER][VLLM] Auto-retry kernel compilation with large GRF mode on build failure#6123
// Always print recovery message to stderr to follow up on the
// "L0 build module failed" error that was already printed.
std::cerr << "(I): Build failure recovered by retrying with large GRF "
I write to stderr because, at this point, stderr already contains the error message from IGC, and we need to tell the user that the issue has probably been fixed.
anmyachev
left a comment
Makes sense to me. Since this is a change in the driver, let's update your test to cover more cases.
    ("256", False, False),  # Explicit large GRF — compiles on first attempt
    ("128", False, True),  # Explicit small GRF — should fail, no retry
])
def test_auto_grf_on_build_failure(device, monkeypatch, capfd, grf_mode, expect_retry, expect_fail):
Let's test both paths: make_zebin and load_binary. I guess TRITON_XPU_GEN_NATIVE_CODE should help with this.
compileLevelZeroObjects(binary_ptr, binary_size, kernel_name, l0_device,
                        l0_context, build_flags(), is_spv);
if (PyErr_Occurred()) {
  // Retry also failed — propagate the error.
Should we raise the initial exception here to align with make_zebin?
@anmyachev I updated the PR based on your feedback, but GH doesn't show the changes for some reason.
OK, the PR got updated after 36ab67d.
anmyachev
left a comment
One small comment, everything else is LGTM!
May I know which triton-xpu release will contain this fix?
* src/main: [DRIVER][VLLM] Auto-retry kernel compilation with large GRF mode on build failure (#6123)
Problem
When a Triton kernel requires too many registers on Intel XPU, the IGC backend
compiler fails with "total scratch space exceeds HW supported limit" (PTSS).
The existing retry logic in load_binary() only handles the case where
compilation succeeds but has a high spill count (>1000). When compilation
fails entirely, the error propagates immediately without attempting large
GRF mode.
Users must manually add grf_mode='256' to work around this, which is not
discoverable and differs from NVIDIA, where such issues don't occur.
Fixes #3777
Solution
Extend the GRF retry logic in load_binary() (driver.c) to also cover
complete build failures:
- When zeModuleCreate fails with ZE_RESULT_ERROR_MODULE_BUILD_FAILURE and no
  GRF mode was explicitly set, clear the error and retry compilation with
  -cl-intel-256-GRF-per-thread
- The same build failure was handled in the ocloc offline compilation path in
  make_zebin() (compiler.py)
The existing spill-count-based retry (successful compilation but >1000 spills)
is unchanged and still applies when the build-failure retry is not triggered.
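The control flow described above can be sketched in a few lines of Python. This is an illustration only, not the code from driver.c or compiler.py: `BuildFailure`, `compile_with_grf_retry`, and `fake_compile` are hypothetical names, the toy compiler simply fails unless the large-GRF flag is present, and re-raising the first error when the retry also fails is an assumption rather than confirmed behavior. The spill-count-based retry is not modeled.

```python
LARGE_GRF_FLAG = "-cl-intel-256-GRF-per-thread"


class BuildFailure(Exception):
    """Hypothetical stand-in for ZE_RESULT_ERROR_MODULE_BUILD_FAILURE."""


def compile_with_grf_retry(compile_fn, build_flags, grf_mode="default"):
    """Compile once; on a build failure with no explicit GRF mode, retry once
    with the large-GRF flag. Returns (module, retried)."""
    try:
        return compile_fn(build_flags), False
    except BuildFailure as first_error:
        if grf_mode != "default":
            # The user chose a GRF mode explicitly: respect it, no retry.
            raise
        retry_flags = f"{build_flags} {LARGE_GRF_FLAG}".strip()
        try:
            module = compile_fn(retry_flags)
        except BuildFailure:
            # Assumption: surface the first error if the retry also fails.
            raise first_error from None
        return module, True


def fake_compile(flags):
    """Toy compiler: only succeeds when the large-GRF flag is present."""
    if LARGE_GRF_FLAG not in flags:
        raise BuildFailure("total scratch space exceeds HW supported limit")
    return f"module built with [{flags}]"
```

With this toy compiler, `compile_with_grf_retry(fake_compile, "", "default")` recovers on the second attempt, a call with `grf_mode="256"` and the large-GRF flag already in `build_flags` succeeds without a retry, and a call with `grf_mode="128"` raises `BuildFailure` without retrying.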
Testing
Added parametrized test test_auto_grf_on_build_failure covering:
- grf_mode='default': build fails → auto-retries with large GRF → succeeds
- grf_mode='256': explicit large GRF → compiles directly, no retry
- grf_mode='128': explicit small GRF → fails, no retry (respects user choice)
When I run the original reproducer from #3777 I now get a pass: