feat: Intel Arc / XPU GPU support for Studio and Gemma training #6106
LeoBorcherding wants to merge 30 commits into
Adds first-class Intel XPU support to unsloth studio across hardware
detection, GPU selection, visibility reporting, telemetry, cache
clearing, OOM messaging, and the DAC autocast path, matching feature
parity with the existing CUDA and ROCm backends. This is a rebased
linear version of the leizhenyuan/zhenyuan_enable_studio branch with
three rounds of review-loop fixes layered in.
Originally authored by leizhenyuan. Rebased and review fixes by
danielhanchen.
Highlights:
hardware.py
- detect_hardware: promote DeviceType.XPU when torch.xpu is
available and CUDA is not, preserving CPU / MLX / ROCm paths.
- New helpers: _parse_ze_mask_roots, _resolve_xpu_smi_device_id,
_get_xpu_utilization (xpu-smi -m 0,1,3 with cached binary path
resolution and robust N/A parsing).
- _get_parent_visible_gpu_spec adds an XPU branch that honors
ZE_AFFINITY_MASK, including subdevice syntax (0.0,0.1) and
wildcard / unparseable masks, and mirrors the CUDA UUID / MIG
path by returning numeric_ids=None for subdevice masks so the
rest of the stack enumerates torch-visible ordinals and can
still shard.
- _backend_visible_devices_env routes XPU through ZE_AFFINITY_MASK
so /system-info and get_backend_visible_gpu_info report the
active mask instead of a stale or None CUDA_VISIBLE_DEVICES.
- get_visible_gpu_utilization and get_backend_visible_gpu_info
short-circuit on "no devices visible" masks (ZE_AFFINITY_MASK=""
or CUDA_VISIBLE_DEVICES="" / "-1") so the torch-ordinal
enumeration fallback does not report hidden devices.
- get_visible_gpu_count and get_device_map add XPU branches that
prefer torch.xpu.device_count() and fall back to ZE mask roots
(str.isdecimal() to reject Unicode superscripts).
- apply_gpu_ids routes XPU through ZE_AFFINITY_MASK but leaves
inherited CUDA_VISIBLE_DEVICES alone so hybrid NVIDIA+Intel
hosts that hid CUDA via CUDA_VISIBLE_DEVICES="" stay on XPU.
- auto_select_gpu_ids and prepare_gpu_selection allow XPU in the
accelerator guard and use "non_accelerator" as the selection
mode label for CPU / MLX.
- resolve_requested_gpu_ids error message covers "CUDA and Intel
XPU devices" and mentions ZE_AFFINITY_MASK for XPU.
- clear_gpu_cache XPU branch guards synchronize and empty_cache
with hasattr so older torch-xpu builds do not propagate
AttributeError.
- dataset_map_num_proc returns None only when
torch.xpu.is_initialized exists and returns True, so older
torch-xpu builds still get pre-init CPU dataset parallelism.
- get_package_versions reports torch.version.xpu alongside cuda.
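The mask handling described in the hardware.py notes can be illustrated with a small parser. This is a minimal sketch, not the PR's `_parse_ze_mask_roots`: the function name `parse_ze_mask_roots` and its return conventions (None for unresolvable masks, an empty list for a mask that hides everything) are assumptions made for illustration, and partial wildcards such as "0,1,*" are deliberately treated as unresolvable here.

```python
from typing import List, Optional


def parse_ze_mask_roots(mask: Optional[str]) -> Optional[List[int]]:
    """Return the distinct root device IDs named by a ZE_AFFINITY_MASK value.

    Hypothetical helper. Returns None when the mask is unset, a wildcard, or
    contains a token that cannot be parsed, and an empty list when the mask
    is set but names no device.
    """
    if mask is None:
        return None
    roots: List[int] = []
    for token in mask.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate empty, trailing, or doubled commas
        root = token.split(".", 1)[0]  # subdevice syntax "0.1" -> root "0"
        # str.isdecimal() rejects "²" (superscript two), which str.isdigit() accepts
        if not root.isdecimal():
            return None
        value = int(root)
        if value not in roots:
            roots.append(value)
    return roots
```

With this sketch, a subdevice mask such as "0.0,0.1" collapses to the single root 0, while "*" yields None so a caller can fall back to enumerating torch-visible ordinals.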
utils/utils.py
- format_error_message recognises Intel XPU OOM strings including
out_of_device_memory, out_of_host_memory, memory allocation
failed, and bare MemoryError instances, labelling messages as
"Intel GPU" when DeviceType is XPU.
core/inference/inference.py
- _generate_dac derives autocast device from model.device.type
instead of the global backend, clamps to cuda/xpu/mps/cpu, and
falls through to nullcontext on cpu/xpu float32 (unsupported by
torch.amp.autocast).
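The autocast decision in `_generate_dac` can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's code: `autocast_plan` is a hypothetical name, dtypes are passed as plain strings so the sketch runs without torch, and clamping unknown device types to "cpu" is an assumption about what "clamps to cuda/xpu/mps/cpu" means.

```python
from typing import Tuple

# Device types the commit message lists as valid autocast targets.
_AUTOCAST_DEVICES = ("cuda", "xpu", "mps", "cpu")


def autocast_plan(device_type: str, dtype_name: str) -> Tuple[str, bool]:
    """Return (device, use_autocast) for a model's device type and dtype.

    Hypothetical helper. use_autocast=False means the caller would use
    contextlib.nullcontext() instead of an autocast context.
    """
    # Assumption: unknown device types are clamped to "cpu".
    device = device_type if device_type in _AUTOCAST_DEVICES else "cpu"
    # Per the commit message, float32 on cpu/xpu skips autocast.
    if device in ("cpu", "xpu") and dtype_name == "float32":
        return device, False
    return device, True
```

The point of deriving the device from the model rather than from a global backend flag is that a model loaded on CPU on an XPU host still gets a CPU decision.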
core/inference/llama_cpp.py
- _start_process pins the llama-server subprocess via
ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere
so SYCL builds of llama-server respect the selected GPUs.
- unload_model clears the XPU cache via the backend-neutral
clear_gpu_cache helper.
- init_audio_codec / generate_audio_response retain the pre-PR
"cuda if torch.cuda.is_available() else cpu" fallback for the
SNAC / BiCodec / DAC codecs because those upstream codecs are
not yet validated on Intel XPU.
core/training/trainer.py
- clear_gpu_cache is hoisted to the module-level import block.
- _preprocess_snac_dataset / _preprocess_bicodec_dataset /
_preprocess_dac_dataset keep the pre-PR CPU fallback on
non-CUDA hosts for Spark-TTS BiCodec, SNAC, and OuteTTS DAC /
Whisper preprocessing.
tests/test_gpu_selection.py
- Replace TestXpuRejection with TestXpuSelection (positive tests
for auto_select_gpu_ids and prepare_gpu_selection on XPU).
- Update non-accelerator rejection assertion to the new
"CUDA and Intel XPU" wording.
- Update test_explicit_ids_are_rejected_for_uuid_parent_visibility
regex to match the broadened "uses non-numeric or subdevice"
error message.
tests/test_gpu_selection_sandbox.py
- Rename test_non_cuda_returns_none to
test_non_accelerator_returns_none and assert
selection_mode="non_accelerator".
for more information, see https://pre-commit.ci
Replace device-specific CUDA calls in GemmaFixedRotaryEmbedding with device-agnostic helpers to fix crashes when loading Gemma v1 models on Intel XPU systems.
Changes:
- torch.cuda.current_device() → get_current_device()
- torch.cuda.empty_cache() → clean_gpu_cache()
- torch.device(device) → torch.device(DEVICE_TYPE_TORCH, device)
Fixes "Torch not compiled with CUDA enabled" error on non-CUDA platforms.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…vation, FLAT hierarchy, backend-aware GGUF discovery
Six targeted changes following the 10-reviewer pass on this PR:
1. studio/backend/core/inference/llama_cpp.py
Remove env.pop("CUDA_VISIBLE_DEVICES", None) in the XPU llama-server
spawn branch. The pop contradicted the documented preservation in
apply_gpu_ids() for hybrid NVIDIA+Intel hosts where the parent sets
CUDA_VISIBLE_DEVICES="" to force Studio onto Intel/SYCL. Added a
comment referencing that design note.
2. studio/backend/core/inference/llama_cpp.py
Make _get_gpu_free_memory backend-aware. NVIDIA keeps the cheap
nvidia-smi fast path; AMD ROCm and Intel XPU fall through to
get_visible_gpu_utilization so GGUF offload selection has per-GPU
free memory on all backends, not just NVIDIA. Returns an empty list
only when no telemetry is available.
3. studio/backend/utils/hardware/hardware.py
detect_hardware now honors a non-empty ZE_AFFINITY_MASK as an
explicit XPU hint. On a hybrid host with both torch.cuda and
torch.xpu available, ZE_AFFINITY_MASK=... picks XPU before CUDA so
users can pin Intel without also setting CUDA_VISIBLE_DEVICES="".
4. studio/backend/utils/hardware/hardware.py
Add _xpu_hierarchy_is_composite() and gate the Level Zero mask
parsing on it. In the FLAT hierarchy (oneAPI 2024+ default)
numeric ZE_AFFINITY_MASK entries are tile or device handles, not
root GPU IDs. _get_parent_visible_gpu_spec now returns
supports_explicit_gpu_ids=False in FLAT so explicit gpu_ids are
rejected; _resolve_xpu_smi_device_id returns None so callers skip
"xpu-smi -d <wrong_tile>" and fall back to torch.xpu VRAM
telemetry. COMPOSITE retains the original root-ID mapping.
5. studio/backend/utils/hardware/hardware.py
Conservative XPU used_gb fallback. When torch.xpu lacks
mem_get_info, _torch_get_per_device_info now returns used_gb=None
instead of torch.xpu.memory_allocated (process-local and
misleading for multi-tenant placement). get_visible_gpu_utilization
guards against None during percent computation; downstream
auto_select_gpu_ids already filters None entries.
6. studio/backend/models/inference.py, studio/backend/models/training.py
gpu_ids Field descriptions now mention the Intel XPU
ZE_AFFINITY_MASK subdevice and FLAT-hierarchy restrictions
alongside the existing CUDA UUID/MIG note.
Testing:
- 96 PR-authored tests still pass (test_gpu_selection,
test_gpu_selection_sandbox, test_utils).
- 234 additional simulation tests across two isolated uv venvs
cover fix-fix interactions, malformed inputs, cross-platform
behavior (Linux, macOS, Windows), real subprocess env propagation,
JSON/Pydantic schema roundtrips, concurrency, and fuzz.
- 6/6 CUDA scenarios on 8xB200 hardware are byte-identical to the
pre-fix baseline. Zero CUDA regression.
- ROCm, MLX, and CPU regression tests all pass. AMD HIP_VISIBLE_DEVICES
and ROCR_VISIBLE_DEVICES propagation is unchanged.
Backend support after these changes: NVIDIA CUDA, AMD ROCm, Intel XPU
(both COMPOSITE and FLAT hierarchies), CPU, and Apple Silicon MLX.
…s, FLAT numerics
Follow-up to 6c55664 addressing the four findings from the second 10-reviewer pass. Each fix is narrow and additive; the CUDA, ROCm, MLX, and CPU paths remain byte-identical to the pre-round-2 baseline.
A. studio/backend/core/inference/llama_cpp.py
Replace `dict.get(a) or dict.get(b)` with explicit `is None` fallbacks when reading vram_total_gb / vram_used_gb in the generic free-memory path. An idle GPU with vram_used_gb=0.0 was being treated as missing telemetry and dropped, which pushed GGUF placement into the non-placement fallback whenever the best free card was fully idle. 9/10 reviewers in the round-2 pass flagged this; every suggestion was the same patch.
B. studio/backend/utils/hardware/hardware.py
Tighten the ZE_AFFINITY_MASK -> XPU detection hint from "any non-empty mask wins" to "non-empty mask plus one of: CUDA_VISIBLE_DEVICES explicitly hidden (empty / -1), or UNSLOTH_FORCE_XPU=1, or CUDA simply not available". Prevents a stray ZE_AFFINITY_MASK=0 inherited from unrelated Intel tooling from silently flipping existing hybrid CUDA deployments onto XPU. Intel-only hosts are unaffected (CUDA unavailable -> hint is honoured).
C. studio/backend/core/inference/llama_cpp.py
Refuse to use device indices returned from get_visible_gpu_utilization when index_kind != "physical". Relative ordinals (reported for subdevice / wildcard / UUID parent masks) are not safe to round-trip back into ZE_AFFINITY_MASK / CUDA_VISIBLE_DEVICES; returning [] lets the launcher skip placement and inherit the parent's visibility unchanged.
D. studio/backend/utils/hardware/hardware.py
Relax _get_parent_visible_gpu_spec in FLAT hierarchy. Per Intel's Level Zero device-hierarchy docs, numeric ZE_AFFINITY_MASK entries in FLAT mode ARE stable flat ordinals that torch.xpu honours 1-to-1 with the child process. Previously we blanket-rejected all FLAT numeric masks, which meant Studio could not do XPU auto-selection under the default oneAPI configuration.
Now:
- FLAT + numeric mask -> accept as numeric_ids
- FLAT + unset mask -> expose range(device_count)
- FLAT + subdevice "N.M" -> reject (invalid syntax in FLAT)
- FLAT + wildcard / non-numeric -> reject
- COMPOSITE semantics unchanged (numeric -> root IDs)
_resolve_xpu_smi_device_id continues to return None in FLAT because xpu-smi -d addresses root IDs, not flat ordinals; the telemetry path falls back to torch.xpu VRAM.
Testing:
- 96 PR-authored tests still pass (test_gpu_selection_sandbox, test_gpu_selection, test_utils).
- 264 simulation tests pass across two isolated uv venvs. The round-2 tests include explicit coverage for:
  Fix A: idle GPU preserved, None still skipped, 0.0 total ok
  Fix B: bare mask on hybrid -> CUDA wins; CVD="" or "-1" hint -> XPU; UNSLOTH_FORCE_XPU=1 -> XPU; CUDA unavailable -> XPU; UNSLOTH_FORCE_XPU=0 (literal) not activated
  Fix C: relative ordinals not returned; physical ordinals returned; missing index_kind treated as physical
  Fix D: FLAT numeric accepted; FLAT unset auto-enumerated; FLAT subdevice rejected; FLAT wildcard rejected; xpu-smi resolver still None in FLAT; COMPOSITE unchanged
- 6/6 CUDA scenarios on 8xB200 remain byte-identical to baseline.
- ROCm, MLX, CPU regression tests all pass.
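Fix A turns on a common Python pitfall: `or` treats 0.0 as falsy, so an idle GPU looks like missing telemetry. The sketch below reproduces the issue on a plain dict. The helper name `first_present` and the fallback key `used_gb` are assumptions for illustration; only `vram_total_gb` and `vram_used_gb` are named in the commit message.

```python
from typing import Optional


def first_present(entry: dict, *keys: str) -> Optional[float]:
    """Return the first value that is not None, so a legitimate 0.0 survives."""
    for key in keys:
        value = entry.get(key)
        if value is not None:
            return value
    return None


# A fully idle GPU: 0.0 GB used is real telemetry, not a missing value.
idle_gpu = {"vram_total_gb": 24.0, "vram_used_gb": 0.0, "used_gb": None}

# Buggy pattern: 0.0 is falsy, so `or` falls through to the second key.
buggy = idle_gpu.get("vram_used_gb") or idle_gpu.get("used_gb")

# Fixed pattern: only None counts as missing.
fixed = first_present(idle_gpu, "vram_used_gb", "used_gb")
```

Here `buggy` ends up as None and the GPU would be dropped from placement, while `fixed` keeps the 0.0 reading.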
…rd fallback
Follow-up to eebf077 addressing three findings from the third 10-reviewer pass and Gemini review 4119296322. All changes are in studio/backend/utils/hardware/hardware.py; CUDA/ROCm/MLX/CPU paths remain byte-identical to the pre-round-3 baseline.
E. _get_parent_visible_gpu_spec in FLAT hierarchy now refuses explicit gpu_ids.
Per Intel's "flattening-gpu-tile-hierarchy" doc and the Level Zero spec, numeric ZE_AFFINITY_MASK entries in FLAT mode are tile / device-handle ordinals, not physical GPU IDs -- accepting them as "Physical GPU indices" breaks the API contract documented in models/inference.py and models/training.py. numeric_ids is still populated so display and auto-selection enumerate devices, but supports_explicit_gpu_ids is False in FLAT regardless of mask presence. Users who need explicit tile-level selection can opt in with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE. COMPOSITE semantics are unchanged -- numeric masks still resolve to root GPU IDs. Flagged by 6/10 reviewers in the third pass.
F. detect_hardware restructured so UNSLOTH_FORCE_XPU=1 is a standalone trigger.
Previously the outer "if ze_mask:" gate swallowed the force knob unless ZE_AFFINITY_MASK was also set, which meant the documented override did not work on a common hybrid NVIDIA+Intel scenario. Now the preference check is:
    prefer_xpu = force_xpu or (ze_mask and (cuda_hidden or cuda_unavailable))
The "stray ZE_AFFINITY_MASK flips CUDA to XPU" protection from round 2 is preserved: a bare mask without CUDA-hidden still keeps CUDA. Flagged by 3/10 reviewers in the third pass.
G. get_visible_gpu_count torch-fallback path: only "*" wildcard counts as "all physical XPUs visible".
Other non-parseable masks (",,,", "GPU-uuid", random garbage) now return 0 visible devices rather than silently exposing the full fleet. The happy path (torch.xpu.device_count() succeeds) is unchanged. Suggested by the Gemini review bot.
Testing:
- 96 PR-authored tests still pass (test_gpu_selection_sandbox, test_gpu_selection, test_utils).
- 284 simulation tests pass across two isolated uv venvs. New test_round3_fixes.py covers each round-3 finding explicitly:
  Fix E: FLAT rejects gpu_ids via prepare_gpu_selection; FLAT unset also rejects; COMPOSITE still accepts; FLAT auto-select gracefully falls to inherit_parent_visible mode.
  Fix F: force_xpu alone picks XPU; force_xpu=0 does not; no-xpu-torch falls back to CUDA; force_xpu + mask works; CVD="" + mask path unchanged; bare mask still keeps CUDA.
  Fix G: "*" -> all_physical; ",,," -> 0; "GPU-abc" -> 0; "0,1,*" -> 2 (partial wildcard); subdevice -> 1 root; no mask unchanged; torch happy path bypasses fallback.
- 6/6 CUDA scenarios on 8xB200 byte-identical to pre-fix baseline.
- ROCm, MLX, CPU regression tests all pass.
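The preference expression from Fix F can be exercised in isolation. The sketch below covers only that expression, not the rest of detect_hardware (it ignores whether torch.xpu is actually available). The function signature, and the choice to read the three environment variables from a mapping, are assumptions for illustration.

```python
from typing import Mapping


def prefer_xpu(env: Mapping[str, str], cuda_available: bool) -> bool:
    """Evaluate the round-3 preference check against an environment mapping.

    Hypothetical helper: force_xpu or (ze_mask and (cuda_hidden or cuda_unavailable)).
    """
    # Only the literal "1" activates the force knob; "0" does not.
    force_xpu = env.get("UNSLOTH_FORCE_XPU") == "1"
    ze_mask = bool(env.get("ZE_AFFINITY_MASK", "").strip())
    cvd = env.get("CUDA_VISIBLE_DEVICES")
    # CUDA counts as hidden only when the variable is set to "" or "-1".
    cuda_hidden = cvd is not None and cvd.strip() in ("", "-1")
    cuda_unavailable = not cuda_available
    return force_xpu or (ze_mask and (cuda_hidden or cuda_unavailable))
```

A bare `ZE_AFFINITY_MASK=0` on a hybrid host evaluates to False here, which matches the "stray mask keeps CUDA" behaviour described above.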
Gemini's fourth review pass (pullrequestreview-4119484559) flagged six bare `except Exception: pass` blocks in hardware.py where silent failures could hide real driver or runtime problems. Each site now logs at DEBUG level with the exception message, so operators have a diagnostic trail when something goes wrong, while the control flow stays unchanged and the same fallback path is taken:
1. clear_gpu_cache() XPU branch: log "Failed to clear XPU cache"
2. _resolve_xpu_smi_device_id(): log "torch.xpu.current_device() probe failed"
3. _get_xpu_utilization() xpu-smi call: log "xpu-smi query failed"
4. _get_xpu_utilization() torch.xpu VRAM query: log "torch.xpu VRAM query failed"
5. get_visible_gpu_count() torch.xpu fallback: log and still count mask roots
6. get_visible_gpu_count() torch.cuda fallback: log and still use physical count
7. dataset_map_num_proc() is_initialized probe: log "torch.xpu.is_initialized() probe failed"
These are logging-only edits; every test in the PR's own suite, the two simulation harnesses (284 tests), and the 8x B200 CUDA behavior matrix (6/6 scenarios byte-identical) pass unchanged.
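The pattern applied at each of those sites can be sketched as follows. This is an illustration only: `probe_or_none` and the logger name are hypothetical, and the PR edits each site inline rather than through a shared helper.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("hardware_sketch")  # hypothetical logger name


def probe_or_none(probe: Callable[[], int], what: str) -> Optional[int]:
    """Run a best-effort probe; on failure log at DEBUG and take the fallback.

    The fallback (None) is the same one a bare `except Exception: pass`
    would have produced; only the diagnostic trail is new.
    """
    try:
        return probe()
    except Exception as exc:
        logger.debug("%s: %s", what, exc)
        return None
```

Because the message is emitted at DEBUG level, default logging configurations stay quiet and the control flow is unchanged.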
Round-4 review (10 parallel reviewers on PR unslothai#4724) produced a unanimous consensus finding plus two isolated but real regressions. This commit addresses all three.
[10/10] studio/backend/utils/hardware/hardware.py
_get_parent_visible_gpu_spec() on Intel XPU with ZE_FLAT_DEVICE_HIERARCHY unset (the oneAPI default FLAT hierarchy) and no ZE_AFFINITY_MASK was internally inconsistent: it returned numeric_ids=[0..N) with supports_explicit_gpu_ids=False. Downstream, get_visible_gpu_utilization() treated those ordinals as numeric_ids!=None, labeled them index_kind="physical", and llama.cpp's _get_gpu_free_memory() would round-trip tile/device-handle ordinals back into ZE_AFFINITY_MASK as if they were stable root-GPU IDs. Collapse the FLAT+no-mask case to numeric_ids=None so the telemetry path falls into its relative-ordinal branch and auto-selection uniformly returns inherit_parent_visible. Users who need explicit Intel selection opt in via ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE.
[1/10] studio/backend/core/inference/llama_cpp.py
LlamaCppBackend._get_gpu_free_memory() always probed nvidia-smi before falling back to the generic telemetry path. On a hybrid NVIDIA+Intel host running Studio in XPU mode, nvidia-smi returned NVIDIA indices and those ordinals were then used to build a ZE_AFFINITY_MASK for llama-server, pinning the server to the wrong device. Gate the nvidia-smi fast path on get_device() == CUDA and IS_ROCM=False so other backends skip straight to the backend-aware telemetry path.
[1/10] studio/backend/utils/utils.py
format_error_message() dropped the broad "memory"/"cuda" match in favor of specific substrings. That stopped matching common CUDA allocation failures such as "CUDA error: CUBLAS_STATUS_ALLOC_FAILED", so users saw the raw backend exception instead of the friendly "Not enough GPU memory" hint. Re-add the CUBLAS, cuda-error+alloc, and xpu-alloc patterns while keeping the matcher narrow enough that non-memory CUDA errors (e.g. "device-side assert triggered") still pass through unchanged.
Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174), sim suite v2 (128 incl. 14 new round-5 coverage), 8x B200 CUDA matrix (6/6 scenarios byte-identical to pre-fix baseline).
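A narrow out-of-memory matcher of the kind described for format_error_message() might look like the sketch below. It is not the PR's matcher: `looks_like_gpu_oom` is a hypothetical name, the substring list combines the patterns named in the commit messages with a generic "out of memory" check that is an assumption, and the real function also formats a user-facing message.

```python
def looks_like_gpu_oom(message: str) -> bool:
    """Return True when an error string looks like a GPU allocation failure."""
    text = message.lower()
    # Assumption: the generic phrase used by many allocator errors.
    if "out of memory" in text:
        return True
    # Patterns named in the PR's commit messages.
    if "cublas_status_alloc_failed" in text:
        return True
    if "out_of_device_memory" in text or "out_of_host_memory" in text:
        return True
    if "memory allocation failed" in text:
        return True
    # "cuda error" or "xpu" alone is not enough; require an allocation hint.
    if ("cuda error" in text or "xpu" in text) and "alloc" in text:
        return True
    return False
```

Requiring an allocation hint next to "cuda error" is what keeps unrelated failures such as device-side asserts from being relabelled as memory problems.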
Round-5 review (10 parallel reviewers on PR unslothai#4724) narrowed to two real correctness issues in the XPU path introduced by round-5's numeric_ids handling. This commit addresses both.
[4/10] studio/backend/utils/hardware/hardware.py -- get_device_map()
The "numeric_ids is None and visible_count > 1 -> balanced" heuristic was meant for CUDA UUID/MIG masks, but after round 5 collapsed FLAT no-mask XPU into numeric_ids=None, the heuristic started firing on the default Intel FLAT hierarchy and forced balanced sharding across tile handles even though the same module had already classified those ordinals as non-physical. Restrict the heuristic to CUDA so XPU stays on sequential loading unless the caller passes explicit gpu_ids (via ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE or prepare_gpu_selection).
[3/10] studio/backend/utils/hardware/hardware.py -- get_visible_gpu_utilization, get_backend_visible_gpu_info
FLAT numeric masks like ZE_AFFINITY_MASK="0,1" populate numeric_ids for telemetry display but have supports_explicit_gpu_ids=False because the tokens are tile/device-handle ordinals per Intel's Level Zero docs. Both helpers still labeled those ordinals index_kind="physical", which let LlamaCppBackend._get_gpu_free_memory() round-trip them back into ZE_AFFINITY_MASK as if they were stable root GPU IDs. Gate the "physical" label on parent_visible_spec["supports_explicit_gpu_ids"] and clear parent_visible_gpu_ids in the payload when it's False, so API consumers and llama.cpp skip the placement path.
Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174, one updated), sim suite v2 (141 incl. 13 new round-6 tests), 8x B200 CUDA matrix (6/6 scenarios byte-identical to round-5 baseline).
Round-7 review (10 parallel reviewers on PR unslothai#4724) produced a strong convergent finding (7/10 independent reviewers on the same underlying issue): the PR is titled "enable Studio for Intel GPU" but on the default Intel oneAPI runtime (ZE_FLAT_DEVICE_HIERARCHY=FLAT, no ZE_AFFINITY_MASK set), Studio still couldn't auto-select or shard across multiple visible XPUs. Users had to manually opt into COMPOSITE hierarchy for the PR's advertised feature to actually work.
The reviewers correctly noted that this is fixable without relaxing the "physical IDs only" contract of the public prepare_gpu_selection() API. torch.xpu ordinals are stable within a single worker process's inherited ZE_AFFINITY_MASK scope: Studio can safely generate them internally for auto-selection, narrow the mask via apply_gpu_ids(), and let the child process inherit the same torch.xpu device set. Round-trip stays 1:1 within the process boundary the ordinals were observed in.
[7/10] studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
When get_device()==XPU and supports_explicit_gpu_ids=False (FLAT hierarchy cases), instead of immediately bailing out with selection_mode="inherit_parent_visible" and selected=None, fall back to list(range(get_visible_gpu_count())) as the candidate set and continue into the VRAM-based selection logic. This path is flagged via a new metadata key xpu_relative_auto_select=True so telemetry stays introspectable. The public prepare_gpu_selection() API is unchanged and still refuses user-supplied explicit gpu_ids on these masks, because end users can't reason about FLAT tile handles; only Studio's own auto-selection (which controls the worker-process env) can use these ordinals safely.
[3/10] studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
The index_kind="relative" guard was rejecting XPU ordinals alongside CUDA UUID/MIG fallbacks. CUDA relative indices stay unsafe (the parent has hidden the physical ID mapping, so re-exporting them into CUDA_VISIBLE_DEVICES would re-expose hidden GPUs). XPU relative ordinals are stable for the current worker scope, so accept them and let VRAM-based GGUF placement work on default Intel FLAT hosts.
Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2 (153 incl. 11 new round-7 tests, 3 updated to new contract), 8x B200 CUDA matrix (6/6 scenarios byte-identical to round-5 baseline).
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts could use multiple visible XPUs via auto-selection and VRAM-based llama.cpp placement. Round-8 review (6/10 convergent finding on 10 parallel reviewers of PR unslothai#4724) showed that fallback triggered even when the parent process had already narrowed visibility via ZE_AFFINITY_MASK, which silently retargeted child processes onto different Level Zero handles than the parent exposed.
Example: parent sets ZE_AFFINITY_MASK="3,5" to expose tiles 3 and 5. torch.xpu in the parent sees ordinals 0 (=tile 3) and 1 (=tile 5). Round-7 auto_select returned [0, 1]; apply_gpu_ids() rewrote ZE_AFFINITY_MASK="0,1"; the child inherited that and saw tile 0 and tile 1, not the originally-exposed 3 and 5. Same regression on subdevice masks like "0.0,0.1".
studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
Gate xpu_relative_auto_select on parent_visible_spec["raw"] is None. When a parent mask is already set, defer to inherit_parent_visible so the child inherits the exact visibility the parent intended.
studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
Gate the relative-XPU acceptance on os.environ.get("ZE_AFFINITY_MASK") is None. Inherited masks (numeric "3,5", subdevice "0.0,0.1", wildcard "*") are preserved via the existing "return []" fall-through.
Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2 (162 incl. 9 new round-8 tests + 2 updated to the new inherited-mask contract), 8x B200 CUDA matrix (6/6 scenarios byte-identical).
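The "3,5" example can be replayed with plain list arithmetic. This sketch does not call Level Zero or torch; `child_visible_handles` is a hypothetical function that models a mask as a list of handle numbers and shows what a child would see once relative ordinals are written back verbatim.

```python
from typing import Dict, List


def child_visible_handles(parent_mask: str, relative_ordinals: List[int]) -> Dict[str, object]:
    """Model the round-7 regression: relative ordinals rewritten into the mask.

    Hypothetical helper. `intended` is what the parent meant to expose,
    `actual` is what the child sees after the mask is rewritten.
    """
    parent_handles = [int(tok) for tok in parent_mask.split(",") if tok.strip()]
    # Ordinal i in the parent refers to the i-th handle of the parent mask.
    intended = [parent_handles[i] for i in relative_ordinals]
    # Writing the ordinals back verbatim loses that mapping.
    rewritten_mask = ",".join(str(i) for i in relative_ordinals)
    actual = [int(tok) for tok in rewritten_mask.split(",") if tok.strip()]
    return {"intended": intended, "rewritten_mask": rewritten_mask, "actual": actual}


result = child_visible_handles("3,5", [0, 1])
```

For the parent mask "3,5" the intended handles are 3 and 5, but the rewritten mask "0,1" points the child at handles 0 and 1, which is the mismatch the round-8 gate avoids by leaving an existing parent mask alone.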
Gemini review 4120214370 caught that LlamaCppBackend._get_gpu_free_memory()
built its CUDA_VISIBLE_DEVICES filter with int(x.strip()) for x in
cvd.split(","), which raises ValueError on an empty token from a shell-
exported value like "0,1," (trailing comma) or "0,,1" (doubled comma).
The surrounding try/except swallowed the crash but left ``allowed=None``,
which silently disables the filter so every nvidia-smi GPU becomes a
placement candidate even though the user clearly narrowed visibility.
Skip empty tokens during the parse so trailing/doubled/leading commas
still produce a valid filter, matching the "if token.strip()" pattern
already used in utils/hardware/hardware.py.
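A tolerant parse along those lines could look like this. It is a sketch, not the PR's code: `parse_cuda_visible_devices` is a hypothetical name, and it assumes purely numeric entries (UUID-style entries such as "GPU-..." would raise ValueError and need separate handling).

```python
from typing import Optional, Set


def parse_cuda_visible_devices(cvd: Optional[str]) -> Optional[Set[int]]:
    """Parse CUDA_VISIBLE_DEVICES into a set of allowed indices.

    Returns None only when the variable is unset (no filter). An empty
    string yields an empty set, so "hide everything" stays distinct from
    "no mask".
    """
    if cvd is None:
        return None
    allowed: Set[int] = set()
    for token in cvd.split(","):
        token = token.strip()
        if not token:
            continue  # skip empty tokens from "0,1," or "0,,1"
        allowed.add(int(token))
    return allowed
```

Keeping None and the empty set distinct matters because a caller that treats both as "no filter" would offer every physical GPU as a placement candidate.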
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts could auto-select multiple visible XPUs. Round-9 review (4/10 convergent plus related findings on 10 parallel reviewers of PR unslothai#4724) showed that on multi-tile Intel devices (Data Center GPU Max, 2 tiles per root), the synthesized torch.xpu ordinals enumerate tiles rather than distinct root GPUs. Writing them back via apply_gpu_ids() then rewrote ZE_AFFINITY_MASK with tile handles -- narrowing the worker onto a subset of one card instead of spreading across the parent-visible set.
This revert + replacement keeps multi-XPU sharding working on default Intel FLAT without any environment-visible ordinal round-trip:
studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
Reverted R7-A. When the parent-visible spec cannot expose stable physical GPU IDs (FLAT no-mask, FLAT numeric, wildcard, subdevice), defer to inherit_parent_visible with selected_gpu_ids=None. No more synthesized worker-local ordinals written back into ZE_AFFINITY_MASK.
studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
Reverted R7-B. XPU relative telemetry is rejected for placement just like CUDA UUID/MIG. llama-server inherits the parent's mask unchanged and uses its own multi-device machinery.
studio/backend/utils/hardware/hardware.py -- get_device_map()
New path: when device==XPU and caller did not pass gpu_ids, return "balanced" whenever more than one torch.xpu device is visible (either via unresolved mask or FLAT numeric mask that populates numeric_ids but does not support explicit selection). HF Transformers uses the torch ordinals directly -- scope-local within the worker process, no ZE_AFFINITY_MASK rewrite, so tile-vs-root ambiguity cannot leak out. Explicit gpu_ids=[0] continues to produce sequential so the caller's deliberate single-device choice is respected.
Verified: PR tests (96 + 4 skipped), sim_pr4724 (174, one contract update), sim_fixes_v2 (183 incl. 15 new round-10 tests + 10 updated for the reverted contracts), 8x B200 CUDA matrix (6/6 scenarios byte-identical).
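The XPU branch of the new get_device_map() rule can be sketched as a small decision function. This is an illustration with assumptions: `pick_device_map` is a hypothetical name, the device is passed as a plain string, returning "balanced" for several explicit gpu_ids is an assumption, and the CUDA-specific heuristics of the real function are left out (non-XPU devices simply get "sequential" here).

```python
from typing import List, Optional


def pick_device_map(device: str, visible_count: int, gpu_ids: Optional[List[int]]) -> str:
    """Choose a Hugging Face style device_map string for an XPU worker.

    Hypothetical helper covering only the XPU rule from the commit message.
    """
    if gpu_ids is not None:
        # An explicit single-device choice is respected as sequential.
        # Assumption: several explicit IDs shard as "balanced".
        return "balanced" if len(gpu_ids) > 1 else "sequential"
    if device == "xpu" and visible_count > 1:
        # No mask rewrite: sharding uses worker-local torch ordinals only.
        return "balanced"
    return "sequential"
```

The decision never touches ZE_AFFINITY_MASK, which is the property the commit relies on to keep tile-versus-root ambiguity inside the worker process.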
Shorten multi-line design-rationale comments to 1-3 lines each. The surrounding code and commit history carry the full context; inline comments should state the rule, not re-derive the reasoning. Net reduction: ~107 comment lines removed across 2 files. No behavior change.
…el XPU Studio support)
…ing-intel-gpu-support-merge
Code Review
This pull request adds support for Intel XPU (Intel GPUs) to the backend, updating hardware detection, GPU selection, memory telemetry, and error formatting to handle Level Zero and ZE_AFFINITY_MASK environment variables. It also refactors GPU cache clearing and device string resolution across the codebase. However, a critical copy-paste/merge error was introduced in llama_cpp.py where a large block of code starting at line 2740 was duplicated inside the Hugging Face repository block, causing a syntax error and potentially starting llama-server twice.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0cd40c611
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    f"context: {effective_ctx}, "
    f"GPUs free: {gpus}, selected: {gpu_indices}, fit: {use_fit}"
    )
    except Exception as e:
Keep GPU-selection exception inside a try block
With this except at the same indentation level as the preceding if hf_repo: and no matching try, the module no longer parses; python -m py_compile studio/backend/core/inference/llama_cpp.py fails with SyntaxError: invalid syntax. This blocks any environment that imports or starts the llama.cpp backend, not just the new XPU path.
      cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
    - if cvd is not None:
    + if cvd is not None and cvd.strip():
Preserve empty CUDA masks when filtering nvidia-smi
When CUDA_VISIBLE_DEVICES is explicitly set to an empty string while the Studio process is still on the CUDA backend, this cvd.strip() guard skips filtering entirely, so the nvidia-smi fast path reports every physical GPU as available. That can make GGUF GPU selection choose and launch on GPUs the parent mask intentionally hid; parse the empty mask as an empty allowed set instead of treating it like an unset mask.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22e7f72e19
    # Intel XPU uses Level Zero and honors ZE_AFFINITY_MASK, not
    # CUDA_VISIBLE_DEVICES. Route XPU pinning through the correct env var
    # so worker subprocesses are actually restricted to the intended GPU.
    if get_device() == DeviceType.XPU:
Apply the CUDA mask before probing the backend
When a CUDA worker receives non-empty resolved_gpu_ids, this new get_device() check runs before CUDA_VISIBLE_DEVICES is set. get_device() calls detect_hardware(), which imports torch and queries CUDA device properties, so the CUDA runtime has already enumerated the original visible devices; setting CUDA_VISIBLE_DEVICES afterward in the same process will not reliably restrict the model load to the selected GPUs. This regresses explicit/auto GPU selection for CUDA jobs in the training and inference workers; decide the XPU branch from already-known environment/state or set the CUDA mask before any backend probe.
    continue

    free_mib = max(int((float(total_gb) - float(used_gb)) * 1024), 0)
    gpus.append((int(index), free_mib))
Pin GGUF XPU selections with ZE_AFFINITY_MASK
When this generic path runs on Intel XPU with physical indices, it now returns a non-empty GPU list that drives gpu_indices selection, but the llama-server launch later only writes CUDA_VISIBLE_DEVICES (and HIP/ROCR for ROCm), not ZE_AFFINITY_MASK. On multi-XPU COMPOSITE setups, Studio can therefore decide to pin a specific Intel GPU while the child process still sees the parent XPU visibility and may run on the wrong GPU or all XPUs; either avoid returning XPU placement here or propagate the selection through ZE_AFFINITY_MASK before spawning llama-server.
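One way to propagate a selection into the child process is to build its environment per backend before spawning. The sketch below is an assumption-laden illustration, not Studio's `_start_process`: `build_child_env` is a hypothetical function, backends are plain strings, and ROCm variables are omitted.

```python
from typing import Dict, List, Mapping


def build_child_env(parent_env: Mapping[str, str], backend: str, gpu_indices: List[int]) -> Dict[str, str]:
    """Return a copy of the parent environment with the GPU selection applied.

    Hypothetical helper. An empty selection leaves the environment untouched
    so the child inherits the parent's visibility.
    """
    env = dict(parent_env)
    if not gpu_indices:
        return env
    mask = ",".join(str(i) for i in gpu_indices)
    if backend == "xpu":
        # Level Zero reads ZE_AFFINITY_MASK; an inherited
        # CUDA_VISIBLE_DEVICES is deliberately left as-is.
        env["ZE_AFFINITY_MASK"] = mask
    else:
        env["CUDA_VISIBLE_DEVICES"] = mask
    return env
```

The returned dict would be passed as `env=` to `subprocess.Popen`, so the parent's own environment is never mutated.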
    # Leave inherited CUDA_VISIBLE_DEVICES alone -- removing it could
    # let the worker flip back to CUDA on hybrid hosts.
Hide CUDA when forcing XPU workers
When UNSLOTH_FORCE_XPU=1 is used on a hybrid CUDA+Intel host without an already-hidden CUDA_VISIBLE_DEVICES, Studio's hardware detector selects XPU, but this worker path leaves CUDA visible before importing Unsloth. The Unsloth package's own device detection still prefers torch.cuda.is_available() over torch.xpu.is_available(), so the model load can run on NVIDIA CUDA despite the Studio request being routed through XPU; the XPU branch needs to hide CUDA (or otherwise propagate the force signal into Unsloth's detector) before the training/inference worker imports the model stack.
Add back test_prepare_gpu_selection_rejects_gpu_ids_on_non_cuda_backend to TestRouteErrors (CPU backend still rejects; message now covers XPU too, so assertion checks the common prefix). Add TestXpuRejection with test_auto_select_returns_non_cuda_for_xpu covering the no-visible-devices XPU path where auto-select falls back to non_cuda mode. Companion positive case (explicit IDs accepted) lives in TestXpuSelection.test_prepare_gpu_selection_accepts_explicit_ids_on_xpu. Fixes lint CI "TARGET-MISSING" blockers.
pre-commit.ci reformatted llama_cpp.py and reverted two logic fixes:
- nvidia_eligible: restore XPU-aware guard (_detected != XPU) instead of
equality check (== CUDA) that breaks CI when device is undetected
- CVD filter: restore plain `if cvd is not None:` so empty string masks
("") still suppress all GPUs (revert .strip() regression)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf4b3bcfec
    return_value = {
        "raw": None,
        "numeric_ids": [],
        "supports_explicit_gpu_ids": False,
Update stale XPU auto-selection expectation
When this test runs with get_device patched to XPU, auto_select_gpu_ids() no longer takes the old non-CUDA early return; with this added supports_explicit_gpu_ids == False setup it now sets metadata["selection_mode"] to "inherit_parent_visible". The unchanged assertion for "non_cuda" below will fail in the backend GPU-selection test suite, so the expected mode or scenario needs to be updated.
Summary
Combines two community PRs adding Intel Arc (XPU) support. Closes #6105.
cc @danielhanchen @Datta0 @mmathew23
This branch merges them on top of current main with all conflicts resolved.
In Progress Testing
Co-authored-by: @leizhenyuan
Co-authored-by: @cheehook