Skip to content

feat: Intel Arc / XPU GPU support for Studio and Gemma training#6106

Draft
LeoBorcherding wants to merge 30 commits into
unslothai:mainfrom
LeoBorcherding:feat/intel-arc-xpu-support
Draft

feat: Intel Arc / XPU GPU support for Studio and Gemma training#6106
LeoBorcherding wants to merge 30 commits into
unslothai:mainfrom
LeoBorcherding:feat/intel-arc-xpu-support

Conversation

@LeoBorcherding

@LeoBorcherding LeoBorcherding commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Combines two community PRs adding Intel Arc (XPU) support. Closes #6105.

cc @danielhanchen @Datta0 @mmathew23

This branch merges them on top of current main with all conflicts resolved.

In Progress Testing

  • Intel Arc Pro B50 (Battlemage, 16GB, Windows 11) - community Discord user

Co-authored-by: @leizhenyuan
Co-authored-by: @cheehook

danielhanchen and others added 24 commits April 11, 2026 12:49
Adds first-class Intel XPU support to unsloth studio across hardware
detection, GPU selection, visibility reporting, telemetry, cache
clearing, OOM messaging, and the DAC autocast path, matching feature
parity with the existing CUDA and ROCm backends. This is a rebased
linear version of the leizhenyuan/zhenyuan_enable_studio branch with
three rounds of review-loop fixes layered in.

Originally authored by leizhenyuan. Rebased and review fixes by
danielhanchen.

Highlights:

hardware.py
  - detect_hardware: promote DeviceType.XPU when torch.xpu is
    available and CUDA is not, preserving CPU / MLX / ROCm paths.
  - New helpers: _parse_ze_mask_roots, _resolve_xpu_smi_device_id,
    _get_xpu_utilization (xpu-smi -m 0,1,3 with cached binary path
    resolution and robust N/A parsing).
  - _get_parent_visible_gpu_spec adds an XPU branch that honors
    ZE_AFFINITY_MASK, including subdevice syntax (0.0,0.1) and
    wildcard / unparseable masks, and mirrors the CUDA UUID / MIG
    path by returning numeric_ids=None for subdevice masks so the
    rest of the stack enumerates torch-visible ordinals and can
    still shard.
  - _backend_visible_devices_env routes XPU through ZE_AFFINITY_MASK
    so /system-info and get_backend_visible_gpu_info report the
    active mask instead of a stale or None CUDA_VISIBLE_DEVICES.
  - get_visible_gpu_utilization and get_backend_visible_gpu_info
    short-circuit on "no devices visible" masks (ZE_AFFINITY_MASK=""
    or CUDA_VISIBLE_DEVICES="" / "-1") so the torch-ordinal
    enumeration fallback does not report hidden devices.
  - get_visible_gpu_count and get_device_map add XPU branches that
    prefer torch.xpu.device_count() and fall back to ZE mask roots
    (str.isdecimal() to reject Unicode superscripts).
  - apply_gpu_ids routes XPU through ZE_AFFINITY_MASK but leaves
    inherited CUDA_VISIBLE_DEVICES alone so hybrid NVIDIA+Intel
    hosts that hid CUDA via CUDA_VISIBLE_DEVICES="" stay on XPU.
  - auto_select_gpu_ids and prepare_gpu_selection allow XPU in the
    accelerator guard and use "non_accelerator" as the selection
    mode label for CPU / MLX.
  - resolve_requested_gpu_ids error message covers "CUDA and Intel
    XPU devices" and mentions ZE_AFFINITY_MASK for XPU.
  - clear_gpu_cache XPU branch guards synchronize and empty_cache
    with hasattr so older torch-xpu builds do not propagate
    AttributeError.
  - dataset_map_num_proc returns None only when
    torch.xpu.is_initialized exists and returns True, so older
    torch-xpu builds still get pre-init CPU dataset parallelism.
  - get_package_versions reports torch.version.xpu alongside cuda.

utils/utils.py
  - format_error_message recognises Intel XPU OOM strings including
    out_of_device_memory, out_of_host_memory, memory allocation
    failed, and bare MemoryError instances, labelling messages as
    "Intel GPU" when DeviceType is XPU.

core/inference/inference.py
  - _generate_dac derives autocast device from model.device.type
    instead of the global backend, clamps to cuda/xpu/mps/cpu, and
    falls through to nullcontext on cpu/xpu float32 (unsupported by
    torch.amp.autocast).

core/inference/llama_cpp.py
  - _start_process pins the llama-server subprocess via
    ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere
    so SYCL builds of llama-server respect the selected GPUs.
  - unload_model clears the XPU cache via the backend-neutral
    clear_gpu_cache helper.
  - init_audio_codec / generate_audio_response retain the pre-PR
    "cuda if torch.cuda.is_available() else cpu" fallback for the
    SNAC / BiCodec / DAC codecs because those upstream codecs are
    not yet validated on Intel XPU.

core/training/trainer.py
  - clear_gpu_cache is hoisted to the module-level import block.
  - _preprocess_snac_dataset / _preprocess_bicodec_dataset /
    _preprocess_dac_dataset keep the pre-PR CPU fallback on
    non-CUDA hosts for Spark-TTS BiCodec, SNAC, and OuteTTS DAC /
    Whisper preprocessing.

tests/test_gpu_selection.py
  - Replace TestXpuRejection with TestXpuSelection (positive tests
    for auto_select_gpu_ids and prepare_gpu_selection on XPU).
  - Update non-accelerator rejection assertion to the new
    "CUDA and Intel XPU" wording.
  - Update test_explicit_ids_are_rejected_for_uuid_parent_visibility
    regex to match the broadened "uses non-numeric or subdevice"
    error message.

tests/test_gpu_selection_sandbox.py
  - Rename test_non_cuda_returns_none to
    test_non_accelerator_returns_none and assert
    selection_mode="non_accelerator".
Replace device-specific CUDA calls in GemmaFixedRotaryEmbedding with
device-agnostic helpers to fix crashes when loading Gemma v1 models on
Intel XPU systems.

Changes:
- torch.cuda.current_device() → get_current_device()
- torch.cuda.empty_cache() → clean_gpu_cache()
- torch.device(device) → torch.device(DEVICE_TYPE_TORCH, device)

Fixes "Torch not compiled with CUDA enabled" error on non-CUDA platforms.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…vation, FLAT hierarchy, backend-aware GGUF discovery

Six targeted changes following the 10-reviewer pass on this PR:

1. studio/backend/core/inference/llama_cpp.py
   Remove env.pop("CUDA_VISIBLE_DEVICES", None) in the XPU llama-server
   spawn branch. The pop contradicted the documented preservation in
   apply_gpu_ids() for hybrid NVIDIA+Intel hosts where the parent sets
   CUDA_VISIBLE_DEVICES="" to force Studio onto Intel/SYCL. Added a
   comment referencing that design note.

2. studio/backend/core/inference/llama_cpp.py
   Make _get_gpu_free_memory backend-aware. NVIDIA keeps the cheap
   nvidia-smi fast path; AMD ROCm and Intel XPU fall through to
   get_visible_gpu_utilization so GGUF offload selection has per-GPU
   free memory on all backends, not just NVIDIA. Returns an empty list
   only when no telemetry is available.

3. studio/backend/utils/hardware/hardware.py
   detect_hardware now honors a non-empty ZE_AFFINITY_MASK as an
   explicit XPU hint. On a hybrid host with both torch.cuda and
   torch.xpu available, ZE_AFFINITY_MASK=... picks XPU before CUDA so
   users can pin Intel without also setting CUDA_VISIBLE_DEVICES="".

4. studio/backend/utils/hardware/hardware.py
   Add _xpu_hierarchy_is_composite() and gate the Level Zero mask
   parsing on it. In the FLAT hierarchy (oneAPI 2024+ default)
   numeric ZE_AFFINITY_MASK entries are tile or device handles, not
   root GPU IDs. _get_parent_visible_gpu_spec now returns
   supports_explicit_gpu_ids=False in FLAT so explicit gpu_ids are
   rejected; _resolve_xpu_smi_device_id returns None so callers skip
   "xpu-smi -d <wrong_tile>" and fall back to torch.xpu VRAM
   telemetry. COMPOSITE retains the original root-ID mapping.

5. studio/backend/utils/hardware/hardware.py
   Conservative XPU used_gb fallback. When torch.xpu lacks
   mem_get_info, _torch_get_per_device_info now returns used_gb=None
   instead of torch.xpu.memory_allocated (process-local and
   misleading for multi-tenant placement). get_visible_gpu_utilization
   guards against None during percent computation; downstream
   auto_select_gpu_ids already filters None entries.

6. studio/backend/models/inference.py, studio/backend/models/training.py
   gpu_ids Field descriptions now mention the Intel XPU
   ZE_AFFINITY_MASK subdevice and FLAT-hierarchy restrictions
   alongside the existing CUDA UUID/MIG note.

Testing:

- 96 PR-authored tests still pass (test_gpu_selection,
  test_gpu_selection_sandbox, test_utils).
- 234 additional simulation tests across two isolated uv venvs
  cover fix-fix interactions, malformed inputs, cross-platform
  behavior (Linux, macOS, Windows), real subprocess env propagation,
  JSON/Pydantic schema roundtrips, concurrency, and fuzz.
- 6/6 CUDA scenarios on 8xB200 hardware are byte-identical to the
  pre-fix baseline. Zero CUDA regression.
- ROCm, MLX, and CPU regression tests all pass. AMD HIP_VISIBLE_DEVICES
  and ROCR_VISIBLE_DEVICES propagation is unchanged.

Backend support after these changes: NVIDIA CUDA, AMD ROCm, Intel XPU
(both COMPOSITE and FLAT hierarchies), CPU, and Apple Silicon MLX.
…s, FLAT numerics

Follow-up to 6c55664 addressing the four findings from the second
10-reviewer pass. Each fix is narrow and additive; the CUDA, ROCm,
MLX, and CPU paths remain byte-identical to the pre-round-2 baseline.

A. studio/backend/core/inference/llama_cpp.py
   Replace `dict.get(a) or dict.get(b)` with explicit `is None`
   fallbacks when reading vram_total_gb / vram_used_gb in the generic
   free-memory path. An idle GPU with vram_used_gb=0.0 was being
   treated as missing telemetry and dropped, which pushed GGUF
   placement into the non-placement fallback whenever the best free
   card was fully idle. 9/10 reviewers in the round-2 pass flagged
   this; every suggestion was the same patch.

B. studio/backend/utils/hardware/hardware.py
   Tighten the ZE_AFFINITY_MASK -> XPU detection hint from
   "any non-empty mask wins" to "non-empty mask plus one of:
   CUDA_VISIBLE_DEVICES explicitly hidden (empty / -1), or
   UNSLOTH_FORCE_XPU=1, or CUDA simply not available". Prevents a
   stray ZE_AFFINITY_MASK=0 inherited from unrelated Intel tooling
   from silently flipping existing hybrid CUDA deployments onto XPU.
   Intel-only hosts are unaffected (CUDA unavailable -> hint is
   honoured).

C. studio/backend/core/inference/llama_cpp.py
   Refuse to use device indices returned from
   get_visible_gpu_utilization when index_kind != "physical".
   Relative ordinals (reported for subdevice / wildcard / UUID
   parent masks) are not safe to round-trip back into
   ZE_AFFINITY_MASK / CUDA_VISIBLE_DEVICES; returning [] lets the
   launcher skip placement and inherit the parent's visibility
   unchanged.

D. studio/backend/utils/hardware/hardware.py
   Relax _get_parent_visible_gpu_spec in FLAT hierarchy. Per Intel's
   Level Zero device-hierarchy docs, numeric ZE_AFFINITY_MASK entries
   in FLAT mode ARE stable flat ordinals that torch.xpu honours
   1-to-1 with the child process. Previously we blanket-rejected all
   FLAT numeric masks, which meant Studio could not do XPU
   auto-selection under the default oneAPI configuration. Now:
     - FLAT + numeric mask -> accept as numeric_ids
     - FLAT + unset mask -> expose range(device_count)
     - FLAT + subdevice "N.M" -> reject (invalid syntax in FLAT)
     - FLAT + wildcard / non-numeric -> reject
     - COMPOSITE semantics unchanged (numeric -> root IDs)
   _resolve_xpu_smi_device_id continues to return None in FLAT
   because xpu-smi -d addresses root IDs, not flat ordinals; the
   telemetry path falls back to torch.xpu VRAM.

Testing:

- 96 PR-authored tests still pass
  (test_gpu_selection_sandbox, test_gpu_selection, test_utils).
- 264 simulation tests pass across two isolated uv venvs. The
  round-2 tests include explicit coverage for:
    Fix A: idle GPU preserved, None still skipped, 0.0 total ok
    Fix B: bare mask on hybrid -> CUDA wins; CVD="" or "-1" hint ->
           XPU; UNSLOTH_FORCE_XPU=1 -> XPU; CUDA unavailable -> XPU;
           UNSLOTH_FORCE_XPU=0 (literal) not activated
    Fix C: relative ordinals not returned; physical ordinals
           returned; missing index_kind treated as physical
    Fix D: FLAT numeric accepted; FLAT unset auto-enumerated;
           FLAT subdevice rejected; FLAT wildcard rejected;
           xpu-smi resolver still None in FLAT; COMPOSITE unchanged
- 6/6 CUDA scenarios on 8xB200 remain byte-identical to baseline.
- ROCm, MLX, CPU regression tests all pass.
…rd fallback

Follow-up to eebf077 addressing three findings from the third
10-reviewer pass and Gemini review 4119296322. All changes are in
studio/backend/utils/hardware/hardware.py; CUDA/ROCm/MLX/CPU paths
remain byte-identical to the pre-round-3 baseline.

E. _get_parent_visible_gpu_spec in FLAT hierarchy now refuses explicit
   gpu_ids. Per Intel's "flattening-gpu-tile-hierarchy" doc and the
   Level Zero spec, numeric ZE_AFFINITY_MASK entries in FLAT mode are
   tile / device-handle ordinals, not physical GPU IDs -- accepting
   them as "Physical GPU indices" breaks the API contract documented
   in models/inference.py and models/training.py. numeric_ids is still
   populated so display and auto-selection enumerate devices, but
   supports_explicit_gpu_ids is False in FLAT regardless of mask
   presence. Users who need explicit tile-level selection can opt in
   with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE. COMPOSITE semantics are
   unchanged -- numeric masks still resolve to root GPU IDs. Flagged
   by 6/10 reviewers in the third pass.

F. detect_hardware restructured so UNSLOTH_FORCE_XPU=1 is a standalone
   trigger. Previously the outer "if ze_mask:" gate swallowed the
   force knob unless ZE_AFFINITY_MASK was also set, which meant the
   documented override did not work on a common hybrid NVIDIA+Intel
   scenario. Now the preference check is:
     prefer_xpu = force_xpu or (ze_mask and (cuda_hidden or cuda_unavailable))
   The "stray ZE_AFFINITY_MASK flips CUDA to XPU" protection from
   round 2 is preserved: a bare mask without CUDA-hidden still keeps
   CUDA. Flagged by 3/10 reviewers in the third pass.

G. get_visible_gpu_count torch-fallback path: only "*" wildcard counts
   as "all physical XPUs visible". Other non-parseable masks (",,,",
   "GPU-uuid", random garbage) now return 0 visible devices rather
   than silently exposing the full fleet. The happy path (torch.xpu
   .device_count() succeeds) is unchanged. Suggested by the Gemini
   review bot.

Testing:

- 96 PR-authored tests still pass
  (test_gpu_selection_sandbox, test_gpu_selection, test_utils).
- 284 simulation tests pass across two isolated uv venvs. New
  test_round3_fixes.py covers each round-3 finding explicitly:
    Fix E: FLAT rejects gpu_ids via prepare_gpu_selection; FLAT unset
           also rejects; COMPOSITE still accepts; FLAT auto-select
           gracefully falls to inherit_parent_visible mode.
    Fix F: force_xpu alone picks XPU; force_xpu=0 does not;
           no-xpu-torch falls back to CUDA; force_xpu + mask works;
           CVD="" + mask path unchanged; bare mask still keeps CUDA.
    Fix G: "*" -> all_physical; ",,," -> 0; "GPU-abc" -> 0;
           "0,1,*" -> 2 (partial wildcard); subdevice -> 1 root;
           no mask unchanged; torch happy path bypasses fallback.
- 6/6 CUDA scenarios on 8xB200 byte-identical to pre-fix baseline.
- ROCm, MLX, CPU regression tests all pass.
Gemini's fourth review pass (pullrequestreview-4119484559) flagged six
bare `except Exception: pass` blocks in hardware.py where silent failures
could hide real driver or runtime problems. Each site now logs at DEBUG
level with the exception message, so operators have a diagnostic trail
when something goes wrong, while the control flow stays unchanged and the
same fallback path is taken:

1. clear_gpu_cache() XPU branch: log "Failed to clear XPU cache"
2. _resolve_xpu_smi_device_id(): log "torch.xpu.current_device() probe failed"
3. _get_xpu_utilization() xpu-smi call: log "xpu-smi query failed"
4. _get_xpu_utilization() torch.xpu VRAM query: log "torch.xpu VRAM query failed"
5. get_visible_gpu_count() torch.xpu fallback: log and still count mask roots
6. get_visible_gpu_count() torch.cuda fallback: log and still use physical count
7. dataset_map_num_proc() is_initialized probe: log "torch.xpu.is_initialized() probe failed"

These are logging-only edits; every test in the PR's own suite, the two
simulation harnesses (284 tests), and the 8x B200 CUDA behavior matrix
(6/6 scenarios byte-identical) pass unchanged.
Round-4 review (10 parallel reviewers on PR unslothai#4724) produced a unanimous
consensus finding plus two isolated but real regressions. This commit
addresses all three.

[10/10] studio/backend/utils/hardware/hardware.py
_get_parent_visible_gpu_spec() on Intel XPU with ZE_FLAT_DEVICE_HIERARCHY
unset (the oneAPI default FLAT hierarchy) and no ZE_AFFINITY_MASK was
internally inconsistent: it returned numeric_ids=[0..N) with
supports_explicit_gpu_ids=False. Downstream, get_visible_gpu_utilization()
treated those ordinals as numeric_ids!=None, labeled them
index_kind="physical", and llama.cpp's _get_gpu_free_memory() would
round-trip tile/device-handle ordinals back into ZE_AFFINITY_MASK as if
they were stable root-GPU IDs. Collapse the FLAT+no-mask case to
numeric_ids=None so the telemetry path falls into its relative-ordinal
branch and auto-selection uniformly returns inherit_parent_visible. Users
who need explicit Intel selection opt in via ZE_FLAT_DEVICE_HIERARCHY
=COMPOSITE.

[1/10] studio/backend/core/inference/llama_cpp.py
LlamaCppBackend._get_gpu_free_memory() always probed nvidia-smi before
falling back to the generic telemetry path. On a hybrid NVIDIA+Intel host
running Studio in XPU mode, nvidia-smi returned NVIDIA indices and those
ordinals were then used to build a ZE_AFFINITY_MASK for llama-server,
pinning the server to the wrong device. Gate the nvidia-smi fast path on
get_device() == CUDA and IS_ROCM=False so other backends skip straight
to the backend-aware telemetry path.

[1/10] studio/backend/utils/utils.py
format_error_message() dropped the broad "memory"/"cuda" match in favor
of specific substrings. That stopped matching common CUDA allocation
failures such as "CUDA error: CUBLAS_STATUS_ALLOC_FAILED", so users saw
the raw backend exception instead of the friendly "Not enough GPU memory"
hint. Re-add the CUBLAS, cuda-error+alloc, and xpu-alloc patterns while
keeping the matcher narrow enough that non-memory CUDA errors (e.g.
"device-side assert triggered") still pass through unchanged.

Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174), sim
suite v2 (128 incl. 14 new round-5 coverage), 8x B200 CUDA matrix
(6/6 scenarios byte-identical to pre-fix baseline).
Round-5 review (10 parallel reviewers on PR unslothai#4724) narrowed to two real
correctness issues in the XPU path introduced by round-5's numeric_ids
handling. This commit addresses both.

[4/10] studio/backend/utils/hardware/hardware.py -- get_device_map()
The "numeric_ids is None and visible_count > 1 -> balanced" heuristic
was meant for CUDA UUID/MIG masks, but after round 5 collapsed FLAT
no-mask XPU into numeric_ids=None, the heuristic started firing on the
default Intel FLAT hierarchy and forced balanced sharding across tile
handles even though the same module had already classified those
ordinals as non-physical. Restrict the heuristic to CUDA so XPU stays
on sequential loading unless the caller passes explicit gpu_ids (via
ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE or prepare_gpu_selection).

[3/10] studio/backend/utils/hardware/hardware.py -- get_visible_gpu_utilization,
get_backend_visible_gpu_info
FLAT numeric masks like ZE_AFFINITY_MASK="0,1" populate numeric_ids
for telemetry display but have supports_explicit_gpu_ids=False because
the tokens are tile/device-handle ordinals per Intel's Level Zero
docs. Both helpers still labeled those ordinals index_kind="physical",
which let LlamaCppBackend._get_gpu_free_memory() round-trip them back
into ZE_AFFINITY_MASK as if they were stable root GPU IDs. Gate the
"physical" label on parent_visible_spec["supports_explicit_gpu_ids"]
and clear parent_visible_gpu_ids in the payload when it's False, so
API consumers and llama.cpp skip the placement path.

Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174, one
updated), sim suite v2 (141 incl. 13 new round-6 tests), 8x B200 CUDA
matrix (6/6 scenarios byte-identical to round-5 baseline).
Round-7 review (10 parallel reviewers on PR unslothai#4724) produced a strong
convergent finding (7/10 independent reviewers on the same underlying
issue): the PR is titled "enable Studio for Intel GPU" but on the
default Intel oneAPI runtime (ZE_FLAT_DEVICE_HIERARCHY=FLAT, no
ZE_AFFINITY_MASK set), Studio still couldn't auto-select or shard
across multiple visible XPUs. Users had to manually opt into COMPOSITE
hierarchy for the PR's advertised feature to actually work.

The reviewers correctly noted that this is fixable without relaxing the
"physical IDs only" contract of the public prepare_gpu_selection() API.
torch.xpu ordinals are stable within a single worker process's inherited
ZE_AFFINITY_MASK scope: Studio can safely generate them internally for
auto-selection, narrow the mask via apply_gpu_ids(), and let the child
process inherit the same torch.xpu device set. Round-trip stays 1:1
within the process boundary the ordinals were observed in.

[7/10] studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
When get_device()==XPU and supports_explicit_gpu_ids=False (FLAT
hierarchy cases), instead of immediately bailing out with
selection_mode="inherit_parent_visible" and selected=None, fall back to
list(range(get_visible_gpu_count())) as the candidate set and continue
into the VRAM-based selection logic. This path is flagged via a new
metadata key xpu_relative_auto_select=True so telemetry stays
introspectable. The public prepare_gpu_selection() API is unchanged
and still refuses user-supplied explicit gpu_ids on these masks,
because end users can't reason about FLAT tile handles; only Studio's
own auto-selection (which controls the worker-process env) can use
these ordinals safely.

[3/10] studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
The index_kind="relative" guard was rejecting XPU ordinals alongside
CUDA UUID/MIG fallbacks. CUDA relative indices stay unsafe (the parent
has hidden the physical ID mapping, so re-exporting them into
CUDA_VISIBLE_DEVICES would re-expose hidden GPUs). XPU relative
ordinals are stable for the current worker scope, so accept them and
let VRAM-based GGUF placement work on default Intel FLAT hosts.

Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2
(153 incl. 11 new round-7 tests, 3 updated to new contract), 8x B200
CUDA matrix (6/6 scenarios byte-identical to round-5 baseline).
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts
could use multiple visible XPUs via auto-selection and VRAM-based
llama.cpp placement. Round-8 review (6/10 convergent finding on 10
parallel reviewers of PR unslothai#4724) showed that fallback triggered even
when the parent process had already narrowed visibility via
ZE_AFFINITY_MASK, which silently retargeted child processes onto
different Level Zero handles than the parent exposed.

Example: parent sets ZE_AFFINITY_MASK="3,5" to expose tiles 3 and 5.
torch.xpu in the parent sees ordinals 0 (=tile 3) and 1 (=tile 5).
Round-7 auto_select returned [0, 1]; apply_gpu_ids() rewrote
ZE_AFFINITY_MASK="0,1"; the child inherited that and saw tile 0 and
tile 1, not the originally-exposed 3 and 5. Same regression on
subdevice masks like "0.0,0.1".

studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
Gate xpu_relative_auto_select on parent_visible_spec["raw"] is None.
When a parent mask is already set, defer to inherit_parent_visible so
the child inherits the exact visibility the parent intended.

studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
Gate the relative-XPU acceptance on os.environ.get("ZE_AFFINITY_MASK")
is None. Inherited masks (numeric "3,5", subdevice "0.0,0.1",
wildcard "*") are preserved via the existing "return []" fall-through.

Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2
(162 incl. 9 new round-8 tests + 2 updated to the new inherited-mask
contract), 8x B200 CUDA matrix (6/6 scenarios byte-identical).
Gemini review 4120214370 caught that LlamaCppBackend._get_gpu_free_memory()
built its CUDA_VISIBLE_DEVICES filter with int(x.strip()) for x in
cvd.split(","), which raises ValueError on an empty token from a shell-
exported value like "0,1," (trailing comma) or "0,,1" (doubled comma).
The surrounding try/except swallowed the crash but left ``allowed=None``,
which silently disables the filter so every nvidia-smi GPU becomes a
placement candidate even though the user clearly narrowed visibility.

Skip empty tokens during the parse so trailing/doubled/leading commas
still produce a valid filter, matching the "if token.strip()" pattern
already used in utils/hardware/hardware.py.
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts
could auto-select multiple visible XPUs. Round-9 review (4/10 convergent
plus related findings on 10 parallel reviewers of PR unslothai#4724) showed that
on multi-tile Intel devices (Data Center GPU Max, 2 tiles per root),
the synthesized torch.xpu ordinals enumerate tiles rather than distinct
root GPUs. Writing them back via apply_gpu_ids() then rewrote
ZE_AFFINITY_MASK with tile handles -- narrowing the worker onto a
subset of one card instead of spreading across the parent-visible set.

This revert + replacement keeps multi-XPU sharding working on default
Intel FLAT without any environment-visible ordinal round-trip:

studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
Reverted R7-A. When the parent-visible spec cannot expose stable
physical GPU IDs (FLAT no-mask, FLAT numeric, wildcard, subdevice),
defer to inherit_parent_visible with selected_gpu_ids=None. No more
synthesized worker-local ordinals written back into ZE_AFFINITY_MASK.

studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
Reverted R7-B. XPU relative telemetry is rejected for placement just
like CUDA UUID/MIG. llama-server inherits the parent's mask unchanged
and uses its own multi-device machinery.

studio/backend/utils/hardware/hardware.py -- get_device_map()
New path: when device==XPU and caller did not pass gpu_ids, return
"balanced" whenever more than one torch.xpu device is visible (either
via unresolved mask or FLAT numeric mask that populates numeric_ids
but does not support explicit selection). HF Transformers uses the
torch ordinals directly -- scope-local within the worker process, no
ZE_AFFINITY_MASK rewrite, so tile-vs-root ambiguity cannot leak out.
Explicit gpu_ids=[0] continues to produce sequential so the caller's
deliberate single-device choice is respected.

Verified: PR tests (96 + 4 skipped), sim_pr4724 (174, one contract
update), sim_fixes_v2 (183 incl. 15 new round-10 tests + 10 updated
for the reverted contracts), 8x B200 CUDA matrix (6/6 scenarios
byte-identical).
Shorten multi-line design-rationale comments to 1-3 lines each.
The surrounding code and commit history carry the full context;
inline comments should state the rule, not re-derive the reasoning.

Net reduction: ~107 comment lines removed across 2 files.
No behavior change.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for Intel XPU (Intel GPUs) to the backend, updating hardware detection, GPU selection, memory telemetry, and error formatting to handle Level Zero and ZE_AFFINITY_MASK environment variables. It also refactors GPU cache clearing and device string resolution across the codebase. However, a critical copy-paste/merge error was introduced in llama_cpp.py where a large block of code starting at line 2740 was duplicated inside the Hugging Face repository block, causing a syntax error and potentially starting llama-server twice.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread studio/backend/core/inference/llama_cpp.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0cd40c611

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

f"context: {effective_ctx}, "
f"GPUs free: {gpus}, selected: {gpu_indices}, fit: {use_fit}"
)
except Exception as e:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep GPU-selection exception inside a try block

With this except at the same indentation level as the preceding if hf_repo: and no matching try, the module no longer parses; python -m py_compile studio/backend/core/inference/llama_cpp.py fails with SyntaxError: invalid syntax. This blocks any environment that imports or starts the llama.cpp backend, not just the new XPU path.

Useful? React with 👍 / 👎.

Comment on lines +1288 to +1289
cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
if cvd is not None:
if cvd is not None and cvd.strip():

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve empty CUDA masks when filtering nvidia-smi

When CUDA_VISIBLE_DEVICES is explicitly set to an empty string while the Studio process is still on the CUDA backend, this cvd.strip() guard skips filtering entirely, so the nvidia-smi fast path reports every physical GPU as available. That can make GGUF GPU selection choose and launch on GPUs the parent mask intentionally hid; parse the empty mask as an empty allowed set instead of treating it like an unset mask.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22e7f72e19

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

# Intel XPU uses Level Zero and honors ZE_AFFINITY_MASK, not
# CUDA_VISIBLE_DEVICES. Route XPU pinning through the correct env var
# so worker subprocesses are actually restricted to the intended GPU.
if get_device() == DeviceType.XPU:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Apply the CUDA mask before probing the backend

When a CUDA worker receives non-empty resolved_gpu_ids, this new get_device() check runs before CUDA_VISIBLE_DEVICES is set. get_device() calls detect_hardware(), which imports torch and queries CUDA device properties, so the CUDA runtime has already enumerated the original visible devices; setting CUDA_VISIBLE_DEVICES afterward in the same process will not reliably restrict the model load to the selected GPUs. This regresses explicit/auto GPU selection for CUDA jobs in the training and inference workers; decide the XPU branch from already-known environment/state or set the CUDA mask before any backend probe.

Useful? React with 👍 / 👎.

continue

free_mib = max(int((float(total_gb) - float(used_gb)) * 1024), 0)
gpus.append((int(index), free_mib))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pin GGUF XPU selections with ZE_AFFINITY_MASK

When this generic path runs on Intel XPU with physical indices, it now returns a non-empty GPU list that drives gpu_indices selection, but the llama-server launch later only writes CUDA_VISIBLE_DEVICES (and HIP/ROCR for ROCm), not ZE_AFFINITY_MASK. On multi-XPU COMPOSITE setups, Studio can therefore decide to pin a specific Intel GPU while the child process still sees the parent XPU visibility and may run on the wrong GPU or all XPUs; either avoid returning XPU placement here or propagate the selection through ZE_AFFINITY_MASK before spawning llama-server.

Useful? React with 👍 / 👎.

Comment on lines +2304 to +2305
# Leave inherited CUDA_VISIBLE_DEVICES alone -- removing it could
# let the worker flip back to CUDA on hybrid hosts.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Hide CUDA when forcing XPU workers

When UNSLOTH_FORCE_XPU=1 is used on a hybrid CUDA+Intel host without an already-hidden CUDA_VISIBLE_DEVICES, Studio's hardware detector selects XPU, but this worker path leaves CUDA visible before importing Unsloth. The Unsloth package's own device detection still prefers torch.cuda.is_available() over torch.xpu.is_available(), so the model load can run on NVIDIA CUDA despite the Studio request being routed through XPU; the XPU branch needs to hide CUDA (or otherwise propagate the force signal into Unsloth's detector) before the training/inference worker imports the model stack.

Useful? React with 👍 / 👎.

LeoBorcherding and others added 3 commits June 8, 2026 15:03
Add back test_prepare_gpu_selection_rejects_gpu_ids_on_non_cuda_backend
to TestRouteErrors (CPU backend still rejects; message now covers XPU
too, so assertion checks the common prefix).

Add TestXpuRejection with test_auto_select_returns_non_cuda_for_xpu
covering the no-visible-devices XPU path where auto-select falls back
to non_cuda mode. Companion positive case (explicit IDs accepted) lives
in TestXpuSelection.test_prepare_gpu_selection_accepts_explicit_ids_on_xpu.

Fixes lint CI "TARGET-MISSING" blockers.
pre-commit.ci reformatted llama_cpp.py and reverted two logic fixes:
- nvidia_eligible: restore XPU-aware guard (_detected != XPU) instead of
  equality check (== CUDA) that breaks CI when device is undetected
- CVD filter: restore plain `if cvd is not None:` so empty string masks
  ("") still suppress all GPUs (revert .strip() regression)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf4b3bcfec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1320 to +1323
return_value = {
"raw": None,
"numeric_ids": [],
"supports_explicit_gpu_ids": False,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Update stale XPU auto-selection expectation

When this test runs with get_device patched to XPU, auto_select_gpu_ids() no longer takes the old non-CUDA early return; with this added supports_explicit_gpu_ids == False setup it now sets metadata["selection_mode"] to "inherit_parent_visible". The unchanged assertion for "non_cuda" below will fail in the backend GPU-selection test suite, so the expected mode or scenario needs to be updated.

Useful? React with 👍 / 👎.

@LeoBorcherding LeoBorcherding marked this pull request as draft June 8, 2026 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intel Arc GPU (XPU) support tracking issue

4 participants