Skip to content

enable studio for intel GPU#4724

Open
leizhenyuan wants to merge 20 commits into
unslothai:mainfrom
leizhenyuan:zhenyuan_enable_studio
Open

enable studio for intel GPU#4724
leizhenyuan wants to merge 20 commits into
unslothai:mainfrom
leizhenyuan:zhenyuan_enable_studio

Conversation

@leizhenyuan

Copy link
Copy Markdown
Contributor

No description provided.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements support for Intel XPU (Intel GPUs) across the backend by replacing hardcoded CUDA logic with hardware-agnostic utility functions. It adds XPU detection, utilization tracking via xpu-smi, and environment variable handling for device visibility. Feedback highlights a logic error in parsing xpu-smi output where memory utilization is incorrectly assigned to temperature, a lack of multi-GPU support in the monitoring command, and an inconsistency in the version reporting format for XPU.

Comment on lines +390 to +454
def _get_xpu_utilization() -> Dict[str, Any]:
"""Return a live snapshot of Intel XPU GPU utilization via ``xpu-smi`` or torch.xpu."""
try:
import subprocess

result = subprocess.run(
["xpu-smi", "dump", "-d", "0", "-m", "0,1,2,18"],
capture_output = True,
text = True,
timeout = 5,
)
if result.returncode == 0 and result.stdout.strip():
# xpu-smi dump outputs CSV: Timestamp, DeviceId, GPU Utilization (%), ...
lines = result.stdout.strip().splitlines()
for line in reversed(lines):
if line.startswith("Timestamp") or line.startswith("#"):
continue
parts = [p.strip() for p in line.split(",")]
if len(parts) >= 4:
gpu_util = float(parts[2]) if parts[2] not in ("", "N/A") else None
temp = float(parts[3]) if parts[3] not in ("", "N/A") else None
break
else:
gpu_util = None
temp = None
else:
gpu_util = None
temp = None
except Exception:
gpu_util = None
temp = None

# Get VRAM from torch.xpu
vram_used_gb = None
vram_total_gb = None
try:
import torch

idx = torch.xpu.current_device()
props = torch.xpu.get_device_properties(idx)
vram_total_gb = round(props.total_memory / (1024**3), 2)
vram_used_gb = round(torch.xpu.memory_allocated(idx) / (1024**3), 2)
except Exception:
pass

vram_pct = (
round((vram_used_gb / vram_total_gb) * 100, 1)
if vram_used_gb is not None and vram_total_gb and vram_total_gb > 0
else None
)

has_any = any(v is not None for v in [gpu_util, temp, vram_used_gb])
if not has_any:
return {"available": False, "backend": "xpu"}

return {
"available": True,
"backend": "xpu",
"gpu_utilization_pct": gpu_util,
"temperature_c": temp,
"vram_used_gb": vram_used_gb,
"vram_total_gb": vram_total_gb,
"vram_utilization_pct": vram_pct,
"power_draw_w": None,
"power_limit_w": None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _get_xpu_utilization function contains a logic error in metric parsing and lacks support for multi-GPU configurations:

  1. Metric Index Bug: With the requested metrics -m 0,1,2,18, the CSV output indices are: 2 (GPU Util), 3 (GPU Mem Util), 4 (GPU Temp), and 5 (GPU Power). The current code incorrectly assigns parts[3] (Memory Util) to temp.
  2. Hardcoded Device: It always queries physical device 0 via xpu-smi -d 0. This should be mapped from the current torch device index, especially when ZE_AFFINITY_MASK is active to restrict visibility.
  3. Power Data: Metric 18 (Power) is requested in the command but not parsed into the return dictionary.

Comment on lines +321 to +322
if hasattr(torch, "xpu") and torch.xpu.is_available():
versions["xpu"] = True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Setting versions["xpu"] = True is inconsistent with other entries in the versions dictionary, which typically contain version strings (e.g., from importlib.metadata). It would be more useful to provide a string indicating support or the version of the Intel Extension for PyTorch (IPEX) if applicable.

@danielhanchen danielhanchen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! The goal of this PR is to enable Unsloth Studio on Intel XPU GPUs. As a summary, this PR replaces hardcoded CUDA-only code paths in Studio audio inference, training, and cache clearing with device-agnostic helpers (get_torch_device_str(), clear_gpu_cache()), adds XPU detection/telemetry in hardware.py, and updates error messaging to mention Intel GPUs.

I pushed a commit fixing the issues below. Here is a summary of findings after reviewing:

Reviewers Severity Finding
10/10 High _get_xpu_utilization() used -m 0,1,2,18 which mapped parts[3] to temperature incorrectly -- it was GPU Memory Utilization, not temperature. Fixed to use -m 0,2,3 (Utilization, Power, Core Temperature) with correct column parsing.
10/10 High xpu-smi dump without -n 1 runs indefinitely until the timeout kills it. Added -n 1 to request exactly one sample.
10/10 Medium Hardcoded -d 0 in xpu-smi call ignores which device is active. Fixed to use torch.xpu.current_device().
10/10 Medium Dead code in get_visible_gpu_count() -- XPU re-check at line 1357 was unreachable because the XPU block above always returns first. Removed the dead branch.
8/10 Medium versions["xpu"] = True was a bool in a Dict[str, Optional[str]]. Changed to getattr(torch.version, "xpu", "available") to return the actual XPU version string.
7/10 Low power_draw_w was always None in the returned dict even though power data was being requested from xpu-smi. Now correctly populated.

All fixes are in the latest commit. Concrete suggestions for each finding above.

@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Intel XPU (Intel GPUs) across the inference and training modules by abstracting device-specific logic into utility functions like get_torch_device_str and clear_gpu_cache. Key changes include updated hardware detection, XPU utilization monitoring via xpu-smi, and safety measures to disable multiprocessing on XPU devices to prevent memory context corruption. Feedback is provided regarding the use of broad, silent exception handling in the XPU utilization logic, which should be refined to include explicit error handling and logging.

Comment on lines +413 to +418
result = subprocess.run(
["xpu-smi", "dump", "-d", str(dev_idx), "-m", "0,2,3", "-n", "1"],
capture_output = True,
text = True,
timeout = 10,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The subprocess.run call for xpu-smi might raise a FileNotFoundError. While the broad except Exception block at line 430 catches this, it is better to explicitly handle FileNotFoundError. Additionally, ensure the exception is logged (even at a debug level) rather than silently ignored, as per repository rules to aid in future debugging.

References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@savvadesogle

Copy link
Copy Markdown

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Intel XPU hardware by abstracting device-specific logic into hardware-agnostic utility functions like get_torch_device_str and clear_gpu_cache. Key updates include enhanced hardware detection, XPU utilization tracking via xpu-smi, and improved handling of device visibility and multiprocessing on Intel GPUs. Review feedback focuses on improving code organization by moving local imports to the module level and enhancing observability by logging exceptions in broad try-except blocks instead of using silent pass statements.

Comment on lines +1646 to +1647
from utils.hardware import get_torch_device_str

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better code organization and to avoid repeated imports within a function, please move the import of get_torch_device_str to the top of the file. You can add it to the existing from utils.hardware import ... block at line 21.

Comment on lines +1441 to +1442
from utils.hardware import clear_gpu_cache

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve code organization and adhere to Python's best practices, please move this import to the top of the file. This will make dependencies clearer and avoid re-importing within methods.

This also applies to the local imports of get_torch_device_str on lines 3020 and 3096.

Comment thread studio/backend/core/training/trainer.py Outdated
Comment on lines +1535 to +1536
from utils.hardware import get_torch_device_str

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve code organization and avoid redundant imports, please move the import of get_torch_device_str to the top of the file. It can be added to the existing from utils.hardware import ... block at line 36.

Additionally, clear_gpu_cache is already imported at the top, so its local imports (e.g., at lines 1713, 1945, and 2162) can be removed. This change should be applied to all local imports from utils.hardware throughout this file.

References
  1. Avoid redundant inline imports if the module is already imported at the top of the file.

Comment on lines +437 to +438
except Exception:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This broad except Exception: pass block silently ignores all errors that might occur when running xpu-smi, including timeouts or parsing errors. This can make debugging difficult. Please log the exception to aid in future troubleshooting, even if it's at a debug level. This aligns with the general rule to avoid silent exception handlers.

Suggested change
except Exception:
pass
except Exception as e:
logger.debug("Failed to get XPU utilization from xpu-smi: %s", e)
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines +450 to +451
except Exception:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the xpu-smi call, this except Exception: pass block silently ignores errors when querying VRAM from torch.xpu. Logging these errors would be beneficial for debugging potential issues with the Intel GPU tooling or PyTorch's XPU support.

Suggested change
except Exception:
pass
except Exception as e:
logger.debug("Failed to get XPU VRAM info from torch.xpu: %s", e)
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

import torch

_visible_gpu_count = torch.xpu.device_count()
except Exception:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This except Exception: block silently catches any error from torch.xpu.device_count() and proceeds to a fallback. To improve debuggability, please log the exception that was caught. This will help diagnose issues if torch.xpu is available but device_count() fails unexpectedly.

For example:

        except Exception as e:
            logger.debug("torch.xpu.device_count() failed: %s", e)
            if xpu_visible:
                # ...
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Adds first-class Intel XPU support to unsloth studio across hardware
detection, GPU selection, visibility reporting, telemetry, cache
clearing, OOM messaging, and the DAC autocast path, matching feature
parity with the existing CUDA and ROCm backends. This is a rebased
linear version of the leizhenyuan/zhenyuan_enable_studio branch with
three rounds of review-loop fixes layered in.

Originally authored by leizhenyuan. Rebased and review fixes by
danielhanchen.

Highlights:

hardware.py
  - detect_hardware: promote DeviceType.XPU when torch.xpu is
    available and CUDA is not, preserving CPU / MLX / ROCm paths.
  - New helpers: _parse_ze_mask_roots, _resolve_xpu_smi_device_id,
    _get_xpu_utilization (xpu-smi -m 0,1,3 with cached binary path
    resolution and robust N/A parsing).
  - _get_parent_visible_gpu_spec adds an XPU branch that honors
    ZE_AFFINITY_MASK, including subdevice syntax (0.0,0.1) and
    wildcard / unparseable masks, and mirrors the CUDA UUID / MIG
    path by returning numeric_ids=None for subdevice masks so the
    rest of the stack enumerates torch-visible ordinals and can
    still shard.
  - _backend_visible_devices_env routes XPU through ZE_AFFINITY_MASK
    so /system-info and get_backend_visible_gpu_info report the
    active mask instead of a stale or None CUDA_VISIBLE_DEVICES.
  - get_visible_gpu_utilization and get_backend_visible_gpu_info
    short-circuit on "no devices visible" masks (ZE_AFFINITY_MASK=""
    or CUDA_VISIBLE_DEVICES="" / "-1") so the torch-ordinal
    enumeration fallback does not report hidden devices.
  - get_visible_gpu_count and get_device_map add XPU branches that
    prefer torch.xpu.device_count() and fall back to ZE mask roots
    (str.isdecimal() to reject Unicode superscripts).
  - apply_gpu_ids routes XPU through ZE_AFFINITY_MASK but leaves
    inherited CUDA_VISIBLE_DEVICES alone so hybrid NVIDIA+Intel
    hosts that hid CUDA via CUDA_VISIBLE_DEVICES="" stay on XPU.
  - auto_select_gpu_ids and prepare_gpu_selection allow XPU in the
    accelerator guard and use "non_accelerator" as the selection
    mode label for CPU / MLX.
  - resolve_requested_gpu_ids error message covers "CUDA and Intel
    XPU devices" and mentions ZE_AFFINITY_MASK for XPU.
  - clear_gpu_cache XPU branch guards synchronize and empty_cache
    with hasattr so older torch-xpu builds do not propagate
    AttributeError.
  - dataset_map_num_proc returns None only when
    torch.xpu.is_initialized exists and returns True, so older
    torch-xpu builds still get pre-init CPU dataset parallelism.
  - get_package_versions reports torch.version.xpu alongside cuda.

utils/utils.py
  - format_error_message recognises Intel XPU OOM strings including
    out_of_device_memory, out_of_host_memory, memory allocation
    failed, and bare MemoryError instances, labelling messages as
    "Intel GPU" when DeviceType is XPU.

core/inference/inference.py
  - _generate_dac derives autocast device from model.device.type
    instead of the global backend, clamps to cuda/xpu/mps/cpu, and
    falls through to nullcontext on cpu/xpu float32 (unsupported by
    torch.amp.autocast).

core/inference/llama_cpp.py
  - _start_process pins the llama-server subprocess via
    ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere
    so SYCL builds of llama-server respect the selected GPUs.
  - unload_model clears the XPU cache via the backend-neutral
    clear_gpu_cache helper.
  - init_audio_codec / generate_audio_response retain the pre-PR
    "cuda if torch.cuda.is_available() else cpu" fallback for the
    SNAC / BiCodec / DAC codecs because those upstream codecs are
    not yet validated on Intel XPU.

core/training/trainer.py
  - clear_gpu_cache is hoisted to the module-level import block.
  - _preprocess_snac_dataset / _preprocess_bicodec_dataset /
    _preprocess_dac_dataset keep the pre-PR CPU fallback on
    non-CUDA hosts for Spark-TTS BiCodec, SNAC, and OuteTTS DAC /
    Whisper preprocessing.

tests/test_gpu_selection.py
  - Replace TestXpuRejection with TestXpuSelection (positive tests
    for auto_select_gpu_ids and prepare_gpu_selection on XPU).
  - Update non-accelerator rejection assertion to the new
    "CUDA and Intel XPU" wording.
  - Update test_explicit_ids_are_rejected_for_uuid_parent_visibility
    regex to match the broadened "uses non-numeric or subdevice"
    error message.

tests/test_gpu_selection_sandbox.py
  - Rename test_non_cuda_returns_none to
    test_non_accelerator_returns_none and assert
    selection_mode="non_accelerator".
@danielhanchen danielhanchen force-pushed the zhenyuan_enable_studio branch from 9ce735d to b759a03 Compare April 11, 2026 12:50
@danielhanchen danielhanchen added auto-approved Auto-review approved the PR and removed auto-reviewing Auto-review in progress auto-approved Auto-review approved the PR labels Apr 11, 2026
danielhanchen and others added 4 commits April 16, 2026 07:24
…vation, FLAT hierarchy, backend-aware GGUF discovery

Six targeted changes following the 10-reviewer pass on this PR:

1. studio/backend/core/inference/llama_cpp.py
   Remove env.pop("CUDA_VISIBLE_DEVICES", None) in the XPU llama-server
   spawn branch. The pop contradicted the documented preservation in
   apply_gpu_ids() for hybrid NVIDIA+Intel hosts where the parent sets
   CUDA_VISIBLE_DEVICES="" to force Studio onto Intel/SYCL. Added a
   comment referencing that design note.

2. studio/backend/core/inference/llama_cpp.py
   Make _get_gpu_free_memory backend-aware. NVIDIA keeps the cheap
   nvidia-smi fast path; AMD ROCm and Intel XPU fall through to
   get_visible_gpu_utilization so GGUF offload selection has per-GPU
   free memory on all backends, not just NVIDIA. Returns an empty list
   only when no telemetry is available.

3. studio/backend/utils/hardware/hardware.py
   detect_hardware now honors a non-empty ZE_AFFINITY_MASK as an
   explicit XPU hint. On a hybrid host with both torch.cuda and
   torch.xpu available, ZE_AFFINITY_MASK=... picks XPU before CUDA so
   users can pin Intel without also setting CUDA_VISIBLE_DEVICES="".

4. studio/backend/utils/hardware/hardware.py
   Add _xpu_hierarchy_is_composite() and gate the Level Zero mask
   parsing on it. In the FLAT hierarchy (oneAPI 2024+ default)
   numeric ZE_AFFINITY_MASK entries are tile or device handles, not
   root GPU IDs. _get_parent_visible_gpu_spec now returns
   supports_explicit_gpu_ids=False in FLAT so explicit gpu_ids are
   rejected; _resolve_xpu_smi_device_id returns None so callers skip
   "xpu-smi -d <wrong_tile>" and fall back to torch.xpu VRAM
   telemetry. COMPOSITE retains the original root-ID mapping.

5. studio/backend/utils/hardware/hardware.py
   Conservative XPU used_gb fallback. When torch.xpu lacks
   mem_get_info, _torch_get_per_device_info now returns used_gb=None
   instead of torch.xpu.memory_allocated (process-local and
   misleading for multi-tenant placement). get_visible_gpu_utilization
   guards against None during percent computation; downstream
   auto_select_gpu_ids already filters None entries.

6. studio/backend/models/inference.py, studio/backend/models/training.py
   gpu_ids Field descriptions now mention the Intel XPU
   ZE_AFFINITY_MASK subdevice and FLAT-hierarchy restrictions
   alongside the existing CUDA UUID/MIG note.

Testing:

- 96 PR-authored tests still pass (test_gpu_selection,
  test_gpu_selection_sandbox, test_utils).
- 234 additional simulation tests across two isolated uv venvs
  cover fix-fix interactions, malformed inputs, cross-platform
  behavior (Linux, macOS, Windows), real subprocess env propagation,
  JSON/Pydantic schema roundtrips, concurrency, and fuzz.
- 6/6 CUDA scenarios on 8xB200 hardware are byte-identical to the
  pre-fix baseline. Zero CUDA regression.
- ROCm, MLX, and CPU regression tests all pass. AMD HIP_VISIBLE_DEVICES
  and ROCR_VISIBLE_DEVICES propagation is unchanged.

Backend support after these changes: NVIDIA CUDA, AMD ROCm, Intel XPU
(both COMPOSITE and FLAT hierarchies), CPU, and Apple Silicon MLX.
…s, FLAT numerics

Follow-up to 6c55664 addressing the four findings from the second
10-reviewer pass. Each fix is narrow and additive; the CUDA, ROCm,
MLX, and CPU paths remain byte-identical to the pre-round-2 baseline.

A. studio/backend/core/inference/llama_cpp.py
   Replace `dict.get(a) or dict.get(b)` with explicit `is None`
   fallbacks when reading vram_total_gb / vram_used_gb in the generic
   free-memory path. An idle GPU with vram_used_gb=0.0 was being
   treated as missing telemetry and dropped, which pushed GGUF
   placement into the non-placement fallback whenever the best free
   card was fully idle. 9/10 reviewers in the round-2 pass flagged
   this; every suggestion was the same patch.

B. studio/backend/utils/hardware/hardware.py
   Tighten the ZE_AFFINITY_MASK -> XPU detection hint from
   "any non-empty mask wins" to "non-empty mask plus one of:
   CUDA_VISIBLE_DEVICES explicitly hidden (empty / -1), or
   UNSLOTH_FORCE_XPU=1, or CUDA simply not available". Prevents a
   stray ZE_AFFINITY_MASK=0 inherited from unrelated Intel tooling
   from silently flipping existing hybrid CUDA deployments onto XPU.
   Intel-only hosts are unaffected (CUDA unavailable -> hint is
   honoured).

C. studio/backend/core/inference/llama_cpp.py
   Refuse to use device indices returned from
   get_visible_gpu_utilization when index_kind != "physical".
   Relative ordinals (reported for subdevice / wildcard / UUID
   parent masks) are not safe to round-trip back into
   ZE_AFFINITY_MASK / CUDA_VISIBLE_DEVICES; returning [] lets the
   launcher skip placement and inherit the parent's visibility
   unchanged.

D. studio/backend/utils/hardware/hardware.py
   Relax _get_parent_visible_gpu_spec in FLAT hierarchy. Per Intel's
   Level Zero device-hierarchy docs, numeric ZE_AFFINITY_MASK entries
   in FLAT mode ARE stable flat ordinals that torch.xpu honours
   1-to-1 with the child process. Previously we blanket-rejected all
   FLAT numeric masks, which meant Studio could not do XPU
   auto-selection under the default oneAPI configuration. Now:
     - FLAT + numeric mask -> accept as numeric_ids
     - FLAT + unset mask -> expose range(device_count)
     - FLAT + subdevice "N.M" -> reject (invalid syntax in FLAT)
     - FLAT + wildcard / non-numeric -> reject
     - COMPOSITE semantics unchanged (numeric -> root IDs)
   _resolve_xpu_smi_device_id continues to return None in FLAT
   because xpu-smi -d addresses root IDs, not flat ordinals; the
   telemetry path falls back to torch.xpu VRAM.

Testing:

- 96 PR-authored tests still pass
  (test_gpu_selection_sandbox, test_gpu_selection, test_utils).
- 264 simulation tests pass across two isolated uv venvs. The
  round-2 tests include explicit coverage for:
    Fix A: idle GPU preserved, None still skipped, 0.0 total ok
    Fix B: bare mask on hybrid -> CUDA wins; CVD="" or "-1" hint ->
           XPU; UNSLOTH_FORCE_XPU=1 -> XPU; CUDA unavailable -> XPU;
           UNSLOTH_FORCE_XPU=0 (literal) not activated
    Fix C: relative ordinals not returned; physical ordinals
           returned; missing index_kind treated as physical
    Fix D: FLAT numeric accepted; FLAT unset auto-enumerated;
           FLAT subdevice rejected; FLAT wildcard rejected;
           xpu-smi resolver still None in FLAT; COMPOSITE unchanged
- 6/6 CUDA scenarios on 8xB200 remain byte-identical to baseline.
- ROCm, MLX, CPU regression tests all pass.
@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the backend, including inference, training, and hardware utility modules. Key updates include backend-aware GPU memory querying, support for the ZE_AFFINITY_MASK environment variable for device visibility, and integrated telemetry via xpu-smi. Additionally, the changes address XPU-specific constraints in multiprocessing and improve error reporting for device memory issues. Feedback was provided regarding the logic for calculating visible GPU counts when parsing ZE_AFFINITY_MASK, specifically to better distinguish between wildcard and invalid masks.

Comment on lines +1802 to +1807
roots = _parse_ze_mask_roots(xpu_visible)
# Non-digit wildcards (e.g. "*") yield an empty roots list;
# treat those as "all physical XPUs visible".
_visible_gpu_count = (
len(set(roots)) if roots else get_physical_gpu_count()
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The fallback logic for _visible_gpu_count when roots is empty is incorrect. It treats any mask that results in an empty roots list (like invalid masks ",," or non-numeric wildcards *) as a signal to use all physical GPUs. This is only correct for a wildcard like "*". For invalid or empty-like masks, the count should be 0.

This can lead to an incorrect number of visible GPUs being reported, potentially causing issues with resource allocation.

I suggest refining the logic to explicitly handle the wildcard case and treat other non-parsing masks as yielding zero devices.

Suggested change
roots = _parse_ze_mask_roots(xpu_visible)
# Non-digit wildcards (e.g. "*") yield an empty roots list;
# treat those as "all physical XPUs visible".
_visible_gpu_count = (
len(set(roots)) if roots else get_physical_gpu_count()
)
# Non-digit wildcards (e.g. "*") should be treated as "all physical XPUs visible",
# while other invalid or empty masks should result in 0.
if xpu_visible.strip() == "*":
_visible_gpu_count = get_physical_gpu_count()
else:
roots = _parse_ze_mask_roots(xpu_visible)
_visible_gpu_count = len(set(roots))

danielhanchen and others added 2 commits April 16, 2026 08:34
…rd fallback

Follow-up to eebf077 addressing three findings from the third
10-reviewer pass and Gemini review 4119296322. All changes are in
studio/backend/utils/hardware/hardware.py; CUDA/ROCm/MLX/CPU paths
remain byte-identical to the pre-round-3 baseline.

E. _get_parent_visible_gpu_spec in FLAT hierarchy now refuses explicit
   gpu_ids. Per Intel's "flattening-gpu-tile-hierarchy" doc and the
   Level Zero spec, numeric ZE_AFFINITY_MASK entries in FLAT mode are
   tile / device-handle ordinals, not physical GPU IDs -- accepting
   them as "Physical GPU indices" breaks the API contract documented
   in models/inference.py and models/training.py. numeric_ids is still
   populated so display and auto-selection enumerate devices, but
   supports_explicit_gpu_ids is False in FLAT regardless of mask
   presence. Users who need explicit tile-level selection can opt in
   with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE. COMPOSITE semantics are
   unchanged -- numeric masks still resolve to root GPU IDs. Flagged
   by 6/10 reviewers in the third pass.

F. detect_hardware restructured so UNSLOTH_FORCE_XPU=1 is a standalone
   trigger. Previously the outer "if ze_mask:" gate swallowed the
   force knob unless ZE_AFFINITY_MASK was also set, which meant the
   documented override did not work on a common hybrid NVIDIA+Intel
   scenario. Now the preference check is:
     prefer_xpu = force_xpu or (ze_mask and (cuda_hidden or cuda_unavailable))
   The "stray ZE_AFFINITY_MASK flips CUDA to XPU" protection from
   round 2 is preserved: a bare mask without CUDA-hidden still keeps
   CUDA. Flagged by 3/10 reviewers in the third pass.

G. get_visible_gpu_count torch-fallback path: only "*" wildcard counts
   as "all physical XPUs visible". Other non-parseable masks (",,,",
   "GPU-uuid", random garbage) now return 0 visible devices rather
   than silently exposing the full fleet. The happy path (torch.xpu
   .device_count() succeeds) is unchanged. Suggested by the Gemini
   review bot.

Testing:

- 96 PR-authored tests still pass
  (test_gpu_selection_sandbox, test_gpu_selection, test_utils).
- 284 simulation tests pass across two isolated uv venvs. New
  test_round3_fixes.py covers each round-3 finding explicitly:
    Fix E: FLAT rejects gpu_ids via prepare_gpu_selection; FLAT unset
           also rejects; COMPOSITE still accepts; FLAT auto-select
           gracefully falls to inherit_parent_visible mode.
    Fix F: force_xpu alone picks XPU; force_xpu=0 does not;
           no-xpu-torch falls back to CUDA; force_xpu + mask works;
           CVD="" + mask path unchanged; bare mask still keeps CUDA.
    Fix G: "*" -> all_physical; ",,," -> 0; "GPU-abc" -> 0;
           "0,1,*" -> 2 (partial wildcard); subdevice -> 1 root;
           no mask unchanged; torch happy path bypasses fallback.
- 6/6 CUDA scenarios on 8xB200 byte-identical to pre-fix baseline.
- ROCm, MLX, CPU regression tests all pass.
@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements comprehensive support for Intel XPU (Intel GPUs) across the inference and training backends. Key updates include backend-aware hardware detection, GPU utilization monitoring via xpu-smi, and the correct application of ZE_AFFINITY_MASK for device pinning. The changes also refine autocast logic for diverse hardware and update error handling to recognize XPU-specific memory issues. The review feedback focuses on improving observability by replacing silent exception handlers with debug logging in the hardware utility module.

Comment on lines +247 to +248
except Exception:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This broad except Exception: pass could hide unexpected errors during the cache clearing process. It's better to log the exception, even at a debug level, to aid in future debugging. This aligns with the general rule of avoiding silent exception handlers.

Suggested change
except Exception:
pass
except Exception as e:
logger.debug("Failed to clear XPU cache: %s", e)
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines +680 to +681
except Exception:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Silently passing on all exceptions here can obscure issues with xpu-smi. It would be beneficial to log the exception at a debug level to help with troubleshooting, in line with the general rule to avoid silent exception handlers.

Suggested change
except Exception:
pass
except Exception as e:
logger.debug("xpu-smi query failed: %s", e)
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines +696 to +697
except Exception:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This except Exception: pass block could hide errors when querying VRAM information from torch.xpu. Logging the exception at a debug level would be helpful for diagnostics, following the general rule to avoid silent exception handlers.

Suggested change
except Exception:
pass
except Exception as e:
logger.debug("torch.xpu VRAM query failed: %s", e)
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines +1798 to +1817
except Exception:
if xpu_visible:
# Fallback: count unique root device IDs from the mask.
# ZE_AFFINITY_MASK can use "device.subdevice" notation,
# so "0.0,0.1" is 1 root device, not 2. Without torch we
# cannot know which hierarchy mode is active, so fall back
# to root-device counting (the more conservative choice).
if xpu_visible == "*":
# Documented wildcard: all physical XPUs visible.
_visible_gpu_count = get_physical_gpu_count()
else:
roots = _parse_ze_mask_roots(xpu_visible)
# Non-parseable masks (",,,", "GPU-abc", etc.) yield an
# empty roots list and are treated as 0 visible devices,
# not "all visible" -- we have no evidence the user
# intended to expose the whole fleet.
_visible_gpu_count = len(set(roots))
else:
_visible_gpu_count = get_physical_gpu_count()
return _visible_gpu_count

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The except Exception block here correctly implements fallback logic, but it doesn't log the exception that occurred. It would be valuable for debugging to log the error from torch.xpu.device_count() at a debug level before proceeding with the fallback.

Suggested change
except Exception:
if xpu_visible:
# Fallback: count unique root device IDs from the mask.
# ZE_AFFINITY_MASK can use "device.subdevice" notation,
# so "0.0,0.1" is 1 root device, not 2. Without torch we
# cannot know which hierarchy mode is active, so fall back
# to root-device counting (the more conservative choice).
if xpu_visible == "*":
# Documented wildcard: all physical XPUs visible.
_visible_gpu_count = get_physical_gpu_count()
else:
roots = _parse_ze_mask_roots(xpu_visible)
# Non-parseable masks (",,,", "GPU-abc", etc.) yield an
# empty roots list and are treated as 0 visible devices,
# not "all visible" -- we have no evidence the user
# intended to expose the whole fleet.
_visible_gpu_count = len(set(roots))
else:
_visible_gpu_count = get_physical_gpu_count()
return _visible_gpu_count
except Exception as e:
logger.debug("torch.xpu.device_count() failed, falling back to mask parsing: %s", e)
if xpu_visible:
# Fallback: count unique root device IDs from the mask.
# ZE_AFFINITY_MASK can use "device.subdevice" notation,
# so "0.0,0.1" is 1 root device, not 2. Without torch we
# cannot know which hierarchy mode is active, so fall back
# to root-device counting (the more conservative choice).
if xpu_visible == "*":
# Documented wildcard: all physical XPUs visible.
_visible_gpu_count = get_physical_gpu_count()
else:
roots = _parse_ze_mask_roots(xpu_visible)
# Non-parseable masks (",,,", "GPU-abc", etc.) yield an
# empty roots list and are treated as 0 visible devices,
# not "all visible" -- we have no evidence the user
# intended to expose the whole fleet.
_visible_gpu_count = len(set(roots))
else:
_visible_gpu_count = get_physical_gpu_count()
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines 1838 to 1839
except Exception:
_visible_gpu_count = get_physical_gpu_count()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to other fallbacks in this file, it would be beneficial to log the exception from torch.cuda.device_count() at a debug level before falling back to get_physical_gpu_count(). This aids in debugging scenarios where the PyTorch call fails unexpectedly.

Suggested change
except Exception:
_visible_gpu_count = get_physical_gpu_count()
except Exception as e:
logger.debug("torch.cuda.device_count() failed, falling back to physical count: %s", e)
_visible_gpu_count = get_physical_gpu_count()
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines +2077 to +2080
except Exception:
# Treat a failing probe as "runtime not touched yet" so
# pre-init CPU preprocessing can still parallelize.
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment explains the fallback behavior, but silently passing on the exception could hide underlying issues with the torch.xpu.is_initialized() call. Logging the exception at a debug level would provide more insight during troubleshooting, while still allowing the safe fallback to proceed.

Suggested change
except Exception:
# Treat a failing probe as "runtime not touched yet" so
# pre-init CPU preprocessing can still parallelize.
pass
except Exception as e:
# Treat a failing probe as "runtime not touched yet" so
# pre-init CPU preprocessing can still parallelize.
logger.debug("torch.xpu.is_initialized() check failed, assuming not initialized: %s", e)
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

danielhanchen and others added 8 commits April 16, 2026 09:00
Gemini's fourth review pass (pullrequestreview-4119484559) flagged six
bare `except Exception: pass` blocks in hardware.py where silent failures
could hide real driver or runtime problems. Each site now logs at DEBUG
level with the exception message, so operators have a diagnostic trail
when something goes wrong, while the control flow stays unchanged and the
same fallback path is taken:

1. clear_gpu_cache() XPU branch: log "Failed to clear XPU cache"
2. _resolve_xpu_smi_device_id(): log "torch.xpu.current_device() probe failed"
3. _get_xpu_utilization() xpu-smi call: log "xpu-smi query failed"
4. _get_xpu_utilization() torch.xpu VRAM query: log "torch.xpu VRAM query failed"
5. get_visible_gpu_count() torch.xpu fallback: log and still count mask roots
6. get_visible_gpu_count() torch.cuda fallback: log and still use physical count
7. dataset_map_num_proc() is_initialized probe: log "torch.xpu.is_initialized() probe failed"

These are logging-only edits; every test in the PR's own suite, the two
simulation harnesses (284 tests), and the 8x B200 CUDA behavior matrix
(6/6 scenarios byte-identical) pass unchanged.
Round-4 review (10 parallel reviewers on PR unslothai#4724) produced a unanimous
consensus finding plus two isolated but real regressions. This commit
addresses all three.

[10/10] studio/backend/utils/hardware/hardware.py
_get_parent_visible_gpu_spec() on Intel XPU with ZE_FLAT_DEVICE_HIERARCHY
unset (the oneAPI default FLAT hierarchy) and no ZE_AFFINITY_MASK was
internally inconsistent: it returned numeric_ids=[0..N) with
supports_explicit_gpu_ids=False. Downstream, get_visible_gpu_utilization()
treated those ordinals as numeric_ids!=None, labeled them
index_kind="physical", and llama.cpp's _get_gpu_free_memory() would
round-trip tile/device-handle ordinals back into ZE_AFFINITY_MASK as if
they were stable root-GPU IDs. Collapse the FLAT+no-mask case to
numeric_ids=None so the telemetry path falls into its relative-ordinal
branch and auto-selection uniformly returns inherit_parent_visible. Users
who need explicit Intel selection opt in via ZE_FLAT_DEVICE_HIERARCHY
=COMPOSITE.

[1/10] studio/backend/core/inference/llama_cpp.py
LlamaCppBackend._get_gpu_free_memory() always probed nvidia-smi before
falling back to the generic telemetry path. On a hybrid NVIDIA+Intel host
running Studio in XPU mode, nvidia-smi returned NVIDIA indices and those
ordinals were then used to build a ZE_AFFINITY_MASK for llama-server,
pinning the server to the wrong device. Gate the nvidia-smi fast path on
get_device() == CUDA and IS_ROCM=False so other backends skip straight
to the backend-aware telemetry path.

[1/10] studio/backend/utils/utils.py
format_error_message() dropped the broad "memory"/"cuda" match in favor
of specific substrings. That stopped matching common CUDA allocation
failures such as "CUDA error: CUBLAS_STATUS_ALLOC_FAILED", so users saw
the raw backend exception instead of the friendly "Not enough GPU memory"
hint. Re-add the CUBLAS, cuda-error+alloc, and xpu-alloc patterns while
keeping the matcher narrow enough that non-memory CUDA errors (e.g.
"device-side assert triggered") still pass through unchanged.

Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174), sim
suite v2 (128 incl. 14 new round-5 coverage), 8x B200 CUDA matrix
(6/6 scenarios byte-identical to pre-fix baseline).
Round-5 review (10 parallel reviewers on PR unslothai#4724) narrowed to two real
correctness issues in the XPU path introduced by round-5's numeric_ids
handling. This commit addresses both.

[4/10] studio/backend/utils/hardware/hardware.py -- get_device_map()
The "numeric_ids is None and visible_count > 1 -> balanced" heuristic
was meant for CUDA UUID/MIG masks, but after round 5 collapsed FLAT
no-mask XPU into numeric_ids=None, the heuristic started firing on the
default Intel FLAT hierarchy and forced balanced sharding across tile
handles even though the same module had already classified those
ordinals as non-physical. Restrict the heuristic to CUDA so XPU stays
on sequential loading unless the caller passes explicit gpu_ids (via
ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE or prepare_gpu_selection).

[3/10] studio/backend/utils/hardware/hardware.py -- get_visible_gpu_utilization,
get_backend_visible_gpu_info
FLAT numeric masks like ZE_AFFINITY_MASK="0,1" populate numeric_ids
for telemetry display but have supports_explicit_gpu_ids=False because
the tokens are tile/device-handle ordinals per Intel's Level Zero
docs. Both helpers still labeled those ordinals index_kind="physical",
which let LlamaCppBackend._get_gpu_free_memory() round-trip them back
into ZE_AFFINITY_MASK as if they were stable root GPU IDs. Gate the
"physical" label on parent_visible_spec["supports_explicit_gpu_ids"]
and clear parent_visible_gpu_ids in the payload when it's False, so
API consumers and llama.cpp skip the placement path.

Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174, one
updated), sim suite v2 (141 incl. 13 new round-6 tests), 8x B200 CUDA
matrix (6/6 scenarios byte-identical to round-5 baseline).
Round-7 review (10 parallel reviewers on PR unslothai#4724) produced a strong
convergent finding (7/10 independent reviewers on the same underlying
issue): the PR is titled "enable Studio for Intel GPU" but on the
default Intel oneAPI runtime (ZE_FLAT_DEVICE_HIERARCHY=FLAT, no
ZE_AFFINITY_MASK set), Studio still couldn't auto-select or shard
across multiple visible XPUs. Users had to manually opt into COMPOSITE
hierarchy for the PR's advertised feature to actually work.

The reviewers correctly noted that this is fixable without relaxing the
"physical IDs only" contract of the public prepare_gpu_selection() API.
torch.xpu ordinals are stable within a single worker process's inherited
ZE_AFFINITY_MASK scope: Studio can safely generate them internally for
auto-selection, narrow the mask via apply_gpu_ids(), and let the child
process inherit the same torch.xpu device set. Round-trip stays 1:1
within the process boundary the ordinals were observed in.

[7/10] studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
When get_device()==XPU and supports_explicit_gpu_ids=False (FLAT
hierarchy cases), instead of immediately bailing out with
selection_mode="inherit_parent_visible" and selected=None, fall back to
list(range(get_visible_gpu_count())) as the candidate set and continue
into the VRAM-based selection logic. This path is flagged via a new
metadata key xpu_relative_auto_select=True so telemetry stays
introspectable. The public prepare_gpu_selection() API is unchanged
and still refuses user-supplied explicit gpu_ids on these masks,
because end users can't reason about FLAT tile handles; only Studio's
own auto-selection (which controls the worker-process env) can use
these ordinals safely.

[3/10] studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
The index_kind="relative" guard was rejecting XPU ordinals alongside
CUDA UUID/MIG fallbacks. CUDA relative indices stay unsafe (the parent
has hidden the physical ID mapping, so re-exporting them into
CUDA_VISIBLE_DEVICES would re-expose hidden GPUs). XPU relative
ordinals are stable for the current worker scope, so accept them and
let VRAM-based GGUF placement work on default Intel FLAT hosts.

Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2
(153 incl. 11 new round-7 tests, 3 updated to new contract), 8x B200
CUDA matrix (6/6 scenarios byte-identical to round-5 baseline).
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts
could use multiple visible XPUs via auto-selection and VRAM-based
llama.cpp placement. Round-8 review (6/10 convergent finding on 10
parallel reviewers of PR unslothai#4724) showed that fallback triggered even
when the parent process had already narrowed visibility via
ZE_AFFINITY_MASK, which silently retargeted child processes onto
different Level Zero handles than the parent exposed.

Example: parent sets ZE_AFFINITY_MASK="3,5" to expose tiles 3 and 5.
torch.xpu in the parent sees ordinals 0 (=tile 3) and 1 (=tile 5).
Round-7 auto_select returned [0, 1]; apply_gpu_ids() rewrote
ZE_AFFINITY_MASK="0,1"; the child inherited that and saw tile 0 and
tile 1, not the originally-exposed 3 and 5. Same regression on
subdevice masks like "0.0,0.1".

studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
Gate xpu_relative_auto_select on parent_visible_spec["raw"] is None.
When a parent mask is already set, defer to inherit_parent_visible so
the child inherits the exact visibility the parent intended.

studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
Gate the relative-XPU acceptance on os.environ.get("ZE_AFFINITY_MASK")
is None. Inherited masks (numeric "3,5", subdevice "0.0,0.1",
wildcard "*") are preserved via the existing "return []" fall-through.

Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2
(162 incl. 9 new round-8 tests + 2 updated to the new inherited-mask
contract), 8x B200 CUDA matrix (6/6 scenarios byte-identical).
@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the inference, training, and hardware utility modules. Key changes include backend-aware hardware detection, telemetry integration via xpu-smi and torch.xpu, and the implementation of ZE_AFFINITY_MASK for device visibility management. The PR also refactors GPU cache clearing and error formatting to be device-agnostic and addresses potential multiprocessing issues on XPU by disabling fork-based parallelism once the runtime is initialized. Feedback was provided to improve the robustness of environment variable parsing for GPU selection by filtering out empty tokens to prevent potential ValueErrors.

cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
if cvd is not None and cvd.strip():
try:
allowed = set(int(x.strip()) for x in cvd.split(","))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The list comprehension for parsing CUDA_VISIBLE_DEVICES should filter out empty strings to avoid a ValueError when calling int() on a trailing or double comma (e.g., "0,1,"). While the surrounding try...except block prevents a crash, it causes the entire filter to be ignored (allowed becomes None), which might lead to unexpected GPU selection. Filtering empty tokens ensures that numeric masks with trailing commas are still parsed correctly, consistent with the logic used in utils/hardware/hardware.py.

Suggested change
allowed = set(int(x.strip()) for x in cvd.split(","))
allowed = set(int(x.strip()) for x in cvd.split(",") if x.strip())
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Gemini review 4120214370 caught that LlamaCppBackend._get_gpu_free_memory()
built its CUDA_VISIBLE_DEVICES filter with int(x.strip()) for x in
cvd.split(","), which raises ValueError on an empty token from a shell-
exported value like "0,1," (trailing comma) or "0,,1" (doubled comma).
The surrounding try/except swallowed the crash but left ``allowed=None``,
which silently disables the filter so every nvidia-smi GPU becomes a
placement candidate even though the user clearly narrowed visibility.

Skip empty tokens during the parse so trailing/doubled/leading commas
still produce a valid filter, matching the "if token.strip()" pattern
already used in utils/hardware/hardware.py.
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts
could auto-select multiple visible XPUs. Round-9 review (4/10 convergent
plus related findings on 10 parallel reviewers of PR unslothai#4724) showed that
on multi-tile Intel devices (Data Center GPU Max, 2 tiles per root),
the synthesized torch.xpu ordinals enumerate tiles rather than distinct
root GPUs. Writing them back via apply_gpu_ids() then rewrote
ZE_AFFINITY_MASK with tile handles -- narrowing the worker onto a
subset of one card instead of spreading across the parent-visible set.

This revert + replacement keeps multi-XPU sharding working on default
Intel FLAT without any environment-visible ordinal round-trip:

studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids()
Reverted R7-A. When the parent-visible spec cannot expose stable
physical GPU IDs (FLAT no-mask, FLAT numeric, wildcard, subdevice),
defer to inherit_parent_visible with selected_gpu_ids=None. No more
synthesized worker-local ordinals written back into ZE_AFFINITY_MASK.

studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory()
Reverted R7-B. XPU relative telemetry is rejected for placement just
like CUDA UUID/MIG. llama-server inherits the parent's mask unchanged
and uses its own multi-device machinery.

studio/backend/utils/hardware/hardware.py -- get_device_map()
New path: when device==XPU and caller did not pass gpu_ids, return
"balanced" whenever more than one torch.xpu device is visible (either
via unresolved mask or FLAT numeric mask that populates numeric_ids
but does not support explicit selection). HF Transformers uses the
torch ordinals directly -- scope-local within the worker process, no
ZE_AFFINITY_MASK rewrite, so tile-vs-root ambiguity cannot leak out.
Explicit gpu_ids=[0] continues to produce sequential so the caller's
deliberate single-device choice is respected.

Verified: PR tests (96 + 4 skipped), sim_pr4724 (174, one contract
update), sim_fixes_v2 (183 incl. 15 new round-10 tests + 10 updated
for the reverted contracts), 8x B200 CUDA matrix (6/6 scenarios
byte-identical).
@danielhanchen

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the backend. Key changes include updating hardware detection to recognize XPU devices via ZE_AFFINITY_MASK and UNSLOTH_FORCE_XPU environment variables, implementing backend-aware GPU memory and utilization tracking using xpu-smi and torch.xpu, and ensuring proper device pinning using ZE_AFFINITY_MASK. Additionally, the PR updates audio codec handling to maintain CPU fallbacks for non-validated backends, refactors GPU cache clearing to be device-agnostic, and adjusts multiprocessing logic to prevent XPU runtime corruption on fork. I have no feedback to provide as there are no review comments.

danielhanchen and others added 2 commits April 16, 2026 12:06
Shorten multi-line design-rationale comments to 1-3 lines each.
The surrounding code and commit history carry the full context;
inline comments should state the rule, not re-derive the reasoning.

Net reduction: ~107 comment lines removed across 2 files.
No behavior change.
@nobody5050

Copy link
Copy Markdown

Any likelihood of this getting merged soon?

@rolandtannous

Copy link
Copy Markdown

@nobody5050 yes it's being reviewed and tessted with the help of community users before we merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants