[DRIVER][VLLM] Auto-retry kernel compilation with large GRF mode on build failure by Egor-Krivov · Pull Request #6123 · intel/intel-xpu-backend-for-triton

Egor-Krivov · 2026-02-17T15:36:15Z

Problem

When a Triton kernel requires too many registers on Intel XPU, the IGC backend
compiler fails with "total scratch space exceeds HW supported limit" (PTSS).
The existing retry logic in load_binary() only handles the case where
compilation succeeds but has a high spill count (>1000). When compilation
fails entirely, the error propagates immediately without attempting large
GRF mode.

Users must manually add grf_mode='256' to work around this, which is not
discoverable and differs from NVIDIA where such issues don't occur.

Fixes #3777

Solution

Extend the GRF retry logic in load_binary() (driver.c) to also cover
complete build failures:

* When zeModuleCreate fails with ZE_RESULT_ERROR_MODULE_BUILD_FAILURE
  and no GRF mode was explicitly set, clear the error and retry compilation
  with -cl-intel-256-GRF-per-thread
* If the retry succeeds, continue with the large-GRF kernel
* If the retry also fails, propagate the original error
* Print a recovery message to stderr so the user knows the initial error
  was handled
* Apply the same retry-on-failure pattern to the ocloc offline compilation
  path in make_zebin() (compiler.py)

The existing spill-count-based retry (successful compilation but >1000 spills)
is unchanged and still applies when the build-failure retry is not triggered.
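
The control flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `BuildFailure`, `build_with_grf_retry`, and `fake_compile` are hypothetical names that do not exist in Triton or Level Zero, and the real logic lives in load_binary() (driver.c) and make_zebin() (compiler.py).

```python
import sys

LARGE_GRF_FLAG = "-cl-intel-256-GRF-per-thread"


class BuildFailure(Exception):
    """Hypothetical stand-in for a backend build failure."""


def build_with_grf_retry(compile_fn, build_flags, grf_mode_set_by_user):
    """Compile once; on build failure, retry with large GRF unless the
    user explicitly chose a GRF mode.

    compile_fn(flags) is a hypothetical callable that returns a binary
    or raises BuildFailure.
    """
    try:
        return compile_fn(build_flags)
    except BuildFailure as original_error:
        if grf_mode_set_by_user:
            raise  # respect the explicit choice: no retry
        try:
            binary = compile_fn(build_flags + [LARGE_GRF_FLAG])
        except BuildFailure:
            # The retry failed as well: surface the first error.
            raise original_error from None
        print("(I): Build failure recovered by retrying with large GRF mode",
              file=sys.stderr)
        return binary


def fake_compile(flags):
    """Toy compiler: the kernel only builds when large GRF is requested."""
    if LARGE_GRF_FLAG not in flags:
        raise BuildFailure("total scratch space exceeds HW supported limit")
    return b"zebin"


binary = build_with_grf_retry(fake_compile, [], grf_mode_set_by_user=False)
```

With `grf_mode_set_by_user=True` the same call raises `BuildFailure` immediately, which corresponds to the "respects user choice" behaviour.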

Testing

Added parametrized test test_auto_grf_on_build_failure covering:

grf_mode='default': build fails → auto-retries with large GRF → succeeds
grf_mode='256': explicit large GRF → compiles directly, no retry
grf_mode='128': explicit small GRF → fails, no retry (respects user choice)
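
The three cases can be read as an expectation table of (grf_mode, expect_retry, expect_fail). The sketch below checks that table against a toy model in which the kernel builds only with 256 GRF. The `simulate` helper is hypothetical, and the tuple for the default case is inferred from the description above rather than copied from the test.

```python
# (grf_mode, expect_retry, expect_fail)
CASES = [
    ("default", True, False),  # inferred: fails first, retry succeeds
    ("256", False, False),     # explicit large GRF: builds on first attempt
    ("128", False, True),      # explicit small GRF: fails, no retry
]


def simulate(grf_mode):
    """Toy model of the retry policy. Returns (retried, failed)."""

    def builds(mode):
        # Assumption for this sketch: only large GRF fits the kernel.
        return mode == "256"

    if builds(grf_mode):
        return False, False
    if grf_mode != "default":
        return False, True  # explicit mode is respected, so no retry
    return True, not builds("256")


results = {mode: simulate(mode) for mode, _, _ in CASES}
for mode, expect_retry, expect_fail in CASES:
    assert results[mode] == (expect_retry, expect_fail), mode
```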

When I run the original reproducer from #3777 I now get a pass:

(triton) (312) jovyan@jupyter-ekrivov:~/triton/intel-xpu-backend-for-triton/issues/3777/ut$ python test_old.py 
L0 build module failed. Log: 
warning: [RetryManager] Start recompilation of the kernel
in kernel: 'sample_recovered_tokens_kernel'

error: total scratch space exceeds HW supported limit for kernel sample_recovered_tokens_kernel: 270848 bytes (max permitted PTSS 262144 bytes)
error: backend compiler failed build.

(I): Build failure recovered by retrying with large GRF mode for "sample_recovered_tokens_kernel"
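
For reference, the numbers in the error line above can be extracted and compared with a few lines of Python. The regular expression is an assumption based only on the single log line shown here, so other IGC versions may format the message differently.

```python
import re

LOG_LINE = (
    "error: total scratch space exceeds HW supported limit for kernel "
    "sample_recovered_tokens_kernel: 270848 bytes "
    "(max permitted PTSS 262144 bytes)"
)

# Assumed message shape: "<requested> bytes (max permitted PTSS <limit> bytes)"
PATTERN = re.compile(r"(\d+) bytes \(max permitted PTSS (\d+) bytes\)")


def scratch_overshoot(line):
    """Return (requested, limit, excess) in bytes, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    requested, limit = map(int, match.groups())
    return requested, limit, requested - limit


requested, limit, excess = scratch_overshoot(LOG_LINE)
print(f"requested={requested} limit={limit} excess={excess} bytes")
# prints: requested=270848 limit=262144 excess=8704 bytes
```

The limit is 256 KiB (262144 bytes), and the kernel exceeds it by 8704 bytes (8.5 KiB) in the default GRF mode.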


Egor-Krivov · 2026-02-17T15:38:15Z

+
+    // Always print recovery message to stderr to follow up on the
+    // "L0 build module failed" error that was already printed.
+    std::cerr << "(I): Build failure recovered by retrying with large GRF "


I write to stderr because, at this point, stderr already contains the error message from IGC, and we need to tell the user that the issue is probably fixed.

anmyachev

Makes sense to me. Since this is a change in the driver, let's update your test to cover more cases.

anmyachev · 2026-02-17T19:49:24Z

+                          ("256", False, False),  # Explicit large GRF — compiles on first attempt
+                          ("128", False, True),  # Explicit small GRF — should fail, no retry
+                          ])
+def test_auto_grf_on_build_failure(device, monkeypatch, capfd, grf_mode, expect_retry, expect_fail):


Let's test both make_zebin and load_binary. I guess TRITON_XPU_GEN_NATIVE_CODE should help with this.

anmyachev · 2026-02-17T19:50:43Z

+        compileLevelZeroObjects(binary_ptr, binary_size, kernel_name, l0_device,
+                                l0_context, build_flags(), is_spv);
+    if (PyErr_Occurred()) {
+      // Retry also failed — propagate the error.


Should we raise the initial exception here to align with make_zebin?

Egor-Krivov · 2026-02-18T12:16:56Z

@anmyachev I updated the PR based on your feedback, but GH doesn't show changes for some reason
main...egor/issue_3777

Egor-Krivov · 2026-02-18T13:10:19Z

OK, the PR got updated after 36ab67d.

anmyachev

One small comment, everything else is LGTM!

jikunshang · 2026-02-25T14:53:22Z

May I know which triton-xpu release will contain this fix?


Egor-Krivov added 2 commits February 17, 2026 14:35

Add compilation retry with large grf mode when compilation backend fa…

ad28b56

…iled, close #3777

Add message to stdout to improve UX

b02a10a

Egor-Krivov requested review from anmyachev, etiotto and whitneywhtsang and removed request for anmyachev February 17, 2026 15:36

Egor-Krivov commented Feb 17, 2026


anmyachev requested changes Feb 17, 2026


Fixes based on review

d7a8ba1

Some diff

36ab67d

Egor-Krivov requested a review from anmyachev February 18, 2026 13:10

anmyachev reviewed Feb 18, 2026


Comment thread third_party/intel/backend/driver.c Outdated

anmyachev approved these changes Feb 18, 2026


Less code

e99a74a

Egor-Krivov enabled auto-merge (squash) February 18, 2026 16:13

whitneywhtsang approved these changes Feb 18, 2026


anmyachev changed the title ~~[VLLM] Auto-retry kernel compilation with large GRF mode on build failure~~ [DRIVER][VLLM] Auto-retry kernel compilation with large GRF mode on build failure Feb 18, 2026

anmyachev disabled auto-merge February 18, 2026 17:32

anmyachev enabled auto-merge (squash) February 18, 2026 17:32

anmyachev merged commit 3b92b8f into main Feb 18, 2026
15 checks passed

anmyachev deleted the egor/issue_3777 branch February 18, 2026 19:42

whitneywhtsang mentioned this pull request Feb 19, 2026

Fix test_auto_grf_on_build_failure failure on LTS/CRI #6136

Closed

yma11 mentioned this pull request Feb 25, 2026

[BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 vllm-project/vllm#35298

Merged

5 tasks

wdziurdz pushed a commit that referenced this pull request Apr 7, 2026

Merge from main@3b92b8f3d3648476e8fb7b9ee1449df5ab079679

9d72923

* src/main: [DRIVER][VLLM] Auto-retry kernel compilation with large GRF mode on build failure (#6123)

anmyachev merged 5 commits into main from egor/issue_3777

4 participants
