enable studio for intel GPU#4724
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Code Review
This pull request implements support for Intel XPU (Intel GPUs) across the backend by replacing hardcoded CUDA logic with hardware-agnostic utility functions. It adds XPU detection, utilization tracking via xpu-smi, and environment variable handling for device visibility. Feedback highlights a logic error in parsing xpu-smi output where memory utilization is incorrectly assigned to temperature, a lack of multi-GPU support in the monitoring command, and an inconsistency in the version reporting format for XPU.
| def _get_xpu_utilization() -> Dict[str, Any]: | ||
| """Return a live snapshot of Intel XPU GPU utilization via ``xpu-smi`` or torch.xpu.""" | ||
| try: | ||
| import subprocess | ||
|
|
||
| result = subprocess.run( | ||
| ["xpu-smi", "dump", "-d", "0", "-m", "0,1,2,18"], | ||
| capture_output = True, | ||
| text = True, | ||
| timeout = 5, | ||
| ) | ||
| if result.returncode == 0 and result.stdout.strip(): | ||
| # xpu-smi dump outputs CSV: Timestamp, DeviceId, GPU Utilization (%), ... | ||
| lines = result.stdout.strip().splitlines() | ||
| for line in reversed(lines): | ||
| if line.startswith("Timestamp") or line.startswith("#"): | ||
| continue | ||
| parts = [p.strip() for p in line.split(",")] | ||
| if len(parts) >= 4: | ||
| gpu_util = float(parts[2]) if parts[2] not in ("", "N/A") else None | ||
| temp = float(parts[3]) if parts[3] not in ("", "N/A") else None | ||
| break | ||
| else: | ||
| gpu_util = None | ||
| temp = None | ||
| else: | ||
| gpu_util = None | ||
| temp = None | ||
| except Exception: | ||
| gpu_util = None | ||
| temp = None | ||
|
|
||
| # Get VRAM from torch.xpu | ||
| vram_used_gb = None | ||
| vram_total_gb = None | ||
| try: | ||
| import torch | ||
|
|
||
| idx = torch.xpu.current_device() | ||
| props = torch.xpu.get_device_properties(idx) | ||
| vram_total_gb = round(props.total_memory / (1024**3), 2) | ||
| vram_used_gb = round(torch.xpu.memory_allocated(idx) / (1024**3), 2) | ||
| except Exception: | ||
| pass | ||
|
|
||
| vram_pct = ( | ||
| round((vram_used_gb / vram_total_gb) * 100, 1) | ||
| if vram_used_gb is not None and vram_total_gb and vram_total_gb > 0 | ||
| else None | ||
| ) | ||
|
|
||
| has_any = any(v is not None for v in [gpu_util, temp, vram_used_gb]) | ||
| if not has_any: | ||
| return {"available": False, "backend": "xpu"} | ||
|
|
||
| return { | ||
| "available": True, | ||
| "backend": "xpu", | ||
| "gpu_utilization_pct": gpu_util, | ||
| "temperature_c": temp, | ||
| "vram_used_gb": vram_used_gb, | ||
| "vram_total_gb": vram_total_gb, | ||
| "vram_utilization_pct": vram_pct, | ||
| "power_draw_w": None, | ||
| "power_limit_w": None, |
There was a problem hiding this comment.
The _get_xpu_utilization function contains a logic error in metric parsing and lacks support for multi-GPU configurations:
- Metric Index Bug: With the requested metrics
-m 0,1,2,18, the CSV output indices are: 2 (GPU Util), 3 (GPU Mem Util), 4 (GPU Temp), and 5 (GPU Power). The current code incorrectly assignsparts[3](Memory Util) totemp. - Hardcoded Device: It always queries physical device
0viaxpu-smi -d 0. This should be mapped from the current torch device index, especially whenZE_AFFINITY_MASKis active to restrict visibility. - Power Data: Metric 18 (Power) is requested in the command but not parsed into the return dictionary.
| if hasattr(torch, "xpu") and torch.xpu.is_available(): | ||
| versions["xpu"] = True |
There was a problem hiding this comment.
Setting versions["xpu"] = True is inconsistent with other entries in the versions dictionary, which typically contain version strings (e.g., from importlib.metadata). It would be more useful to provide a string indicating support or the version of the Intel Extension for PyTorch (IPEX) if applicable.
danielhanchen
left a comment
There was a problem hiding this comment.
Thank you for the PR! The goal of this PR is to enable Unsloth Studio on Intel XPU GPUs. As a summary, this PR replaces hardcoded CUDA-only code paths in Studio audio inference, training, and cache clearing with device-agnostic helpers (get_torch_device_str(), clear_gpu_cache()), adds XPU detection/telemetry in hardware.py, and updates error messaging to mention Intel GPUs.
I pushed a commit fixing the issues below. Here is a summary of findings after reviewing:
| Reviewers | Severity | Finding |
|---|---|---|
| 10/10 | High | _get_xpu_utilization() used -m 0,1,2,18 which mapped parts[3] to temperature incorrectly -- it was GPU Memory Utilization, not temperature. Fixed to use -m 0,2,3 (Utilization, Power, Core Temperature) with correct column parsing. |
| 10/10 | High | xpu-smi dump without -n 1 runs indefinitely until the timeout kills it. Added -n 1 to request exactly one sample. |
| 10/10 | Medium | Hardcoded -d 0 in xpu-smi call ignores which device is active. Fixed to use torch.xpu.current_device(). |
| 10/10 | Medium | Dead code in get_visible_gpu_count() -- XPU re-check at line 1357 was unreachable because the XPU block above always returns first. Removed the dead branch. |
| 8/10 | Medium | versions["xpu"] = True was a bool in a Dict[str, Optional[str]]. Changed to getattr(torch.version, "xpu", "available") to return the actual XPU version string. |
| 7/10 | Low | power_draw_w was always None in the returned dict even though power data was being requested from xpu-smi. Now correctly populated. |
All fixes are in the latest commit. Concrete suggestions for each finding above.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Intel XPU (Intel GPUs) across the inference and training modules by abstracting device-specific logic into utility functions like get_torch_device_str and clear_gpu_cache. Key changes include updated hardware detection, XPU utilization monitoring via xpu-smi, and safety measures to disable multiprocessing on XPU devices to prevent memory context corruption. Feedback is provided regarding the use of broad, silent exception handling in the XPU utilization logic, which should be refined to include explicit error handling and logging.
| result = subprocess.run( | ||
| ["xpu-smi", "dump", "-d", str(dev_idx), "-m", "0,2,3", "-n", "1"], | ||
| capture_output = True, | ||
| text = True, | ||
| timeout = 10, | ||
| ) |
There was a problem hiding this comment.
The subprocess.run call for xpu-smi might raise a FileNotFoundError. While the broad except Exception block at line 430 catches this, it is better to explicitly handle FileNotFoundError. Additionally, ensure the exception is logged (even at a debug level) rather than silently ignored, as per repository rules to aid in future debugging.
References
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Intel XPU hardware by abstracting device-specific logic into hardware-agnostic utility functions like get_torch_device_str and clear_gpu_cache. Key updates include enhanced hardware detection, XPU utilization tracking via xpu-smi, and improved handling of device visibility and multiprocessing on Intel GPUs. Review feedback focuses on improving code organization by moving local imports to the module level and enhancing observability by logging exceptions in broad try-except blocks instead of using silent pass statements.
| from utils.hardware import get_torch_device_str | ||
|
|
| from utils.hardware import clear_gpu_cache | ||
|
|
There was a problem hiding this comment.
| from utils.hardware import get_torch_device_str | ||
|
|
There was a problem hiding this comment.
To improve code organization and avoid redundant imports, please move the import of get_torch_device_str to the top of the file. It can be added to the existing from utils.hardware import ... block at line 36.
Additionally, clear_gpu_cache is already imported at the top, so its local imports (e.g., at lines 1713, 1945, and 2162) can be removed. This change should be applied to all local imports from utils.hardware throughout this file.
References
- Avoid redundant inline imports if the module is already imported at the top of the file.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
This broad except Exception: pass block silently ignores all errors that might occur when running xpu-smi, including timeouts or parsing errors. This can make debugging difficult. Please log the exception to aid in future troubleshooting, even if it's at a debug level. This aligns with the general rule to avoid silent exception handlers.
| except Exception: | |
| pass | |
| except Exception as e: | |
| logger.debug("Failed to get XPU utilization from xpu-smi: %s", e) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Similar to the xpu-smi call, this except Exception: pass block silently ignores errors when querying VRAM from torch.xpu. Logging these errors would be beneficial for debugging potential issues with the Intel GPU tooling or PyTorch's XPU support.
| except Exception: | |
| pass | |
| except Exception as e: | |
| logger.debug("Failed to get XPU VRAM info from torch.xpu: %s", e) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| import torch | ||
|
|
||
| _visible_gpu_count = torch.xpu.device_count() | ||
| except Exception: |
There was a problem hiding this comment.
This except Exception: block silently catches any error from torch.xpu.device_count() and proceeds to a fallback. To improve debuggability, please log the exception that was caught. This will help diagnose issues if torch.xpu is available but device_count() fails unexpectedly.
For example:
except Exception as e:
logger.debug("torch.xpu.device_count() failed: %s", e)
if xpu_visible:
# ...References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
Adds first-class Intel XPU support to unsloth studio across hardware
detection, GPU selection, visibility reporting, telemetry, cache
clearing, OOM messaging, and the DAC autocast path, matching feature
parity with the existing CUDA and ROCm backends. This is a rebased
linear version of the leizhenyuan/zhenyuan_enable_studio branch with
three rounds of review-loop fixes layered in.
Originally authored by leizhenyuan. Rebased and review fixes by
danielhanchen.
Highlights:
hardware.py
- detect_hardware: promote DeviceType.XPU when torch.xpu is
available and CUDA is not, preserving CPU / MLX / ROCm paths.
- New helpers: _parse_ze_mask_roots, _resolve_xpu_smi_device_id,
_get_xpu_utilization (xpu-smi -m 0,1,3 with cached binary path
resolution and robust N/A parsing).
- _get_parent_visible_gpu_spec adds an XPU branch that honors
ZE_AFFINITY_MASK, including subdevice syntax (0.0,0.1) and
wildcard / unparseable masks, and mirrors the CUDA UUID / MIG
path by returning numeric_ids=None for subdevice masks so the
rest of the stack enumerates torch-visible ordinals and can
still shard.
- _backend_visible_devices_env routes XPU through ZE_AFFINITY_MASK
so /system-info and get_backend_visible_gpu_info report the
active mask instead of a stale or None CUDA_VISIBLE_DEVICES.
- get_visible_gpu_utilization and get_backend_visible_gpu_info
short-circuit on "no devices visible" masks (ZE_AFFINITY_MASK=""
or CUDA_VISIBLE_DEVICES="" / "-1") so the torch-ordinal
enumeration fallback does not report hidden devices.
- get_visible_gpu_count and get_device_map add XPU branches that
prefer torch.xpu.device_count() and fall back to ZE mask roots
(str.isdecimal() to reject Unicode superscripts).
- apply_gpu_ids routes XPU through ZE_AFFINITY_MASK but leaves
inherited CUDA_VISIBLE_DEVICES alone so hybrid NVIDIA+Intel
hosts that hid CUDA via CUDA_VISIBLE_DEVICES="" stay on XPU.
- auto_select_gpu_ids and prepare_gpu_selection allow XPU in the
accelerator guard and use "non_accelerator" as the selection
mode label for CPU / MLX.
- resolve_requested_gpu_ids error message covers "CUDA and Intel
XPU devices" and mentions ZE_AFFINITY_MASK for XPU.
- clear_gpu_cache XPU branch guards synchronize and empty_cache
with hasattr so older torch-xpu builds do not propagate
AttributeError.
- dataset_map_num_proc returns None only when
torch.xpu.is_initialized exists and returns True, so older
torch-xpu builds still get pre-init CPU dataset parallelism.
- get_package_versions reports torch.version.xpu alongside cuda.
utils/utils.py
- format_error_message recognises Intel XPU OOM strings including
out_of_device_memory, out_of_host_memory, memory allocation
failed, and bare MemoryError instances, labelling messages as
"Intel GPU" when DeviceType is XPU.
core/inference/inference.py
- _generate_dac derives autocast device from model.device.type
instead of the global backend, clamps to cuda/xpu/mps/cpu, and
falls through to nullcontext on cpu/xpu float32 (unsupported by
torch.amp.autocast).
core/inference/llama_cpp.py
- _start_process pins the llama-server subprocess via
ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere
so SYCL builds of llama-server respect the selected GPUs.
- unload_model clears the XPU cache via the backend-neutral
clear_gpu_cache helper.
- init_audio_codec / generate_audio_response retain the pre-PR
"cuda if torch.cuda.is_available() else cpu" fallback for the
SNAC / BiCodec / DAC codecs because those upstream codecs are
not yet validated on Intel XPU.
core/training/trainer.py
- clear_gpu_cache is hoisted to the module-level import block.
- _preprocess_snac_dataset / _preprocess_bicodec_dataset /
_preprocess_dac_dataset keep the pre-PR CPU fallback on
non-CUDA hosts for Spark-TTS BiCodec, SNAC, and OuteTTS DAC /
Whisper preprocessing.
tests/test_gpu_selection.py
- Replace TestXpuRejection with TestXpuSelection (positive tests
for auto_select_gpu_ids and prepare_gpu_selection on XPU).
- Update non-accelerator rejection assertion to the new
"CUDA and Intel XPU" wording.
- Update test_explicit_ids_are_rejected_for_uuid_parent_visibility
regex to match the broadened "uses non-numeric or subdevice"
error message.
tests/test_gpu_selection_sandbox.py
- Rename test_non_cuda_returns_none to
test_non_accelerator_returns_none and assert
selection_mode="non_accelerator".
9ce735d to
b759a03
Compare
for more information, see https://pre-commit.ci
…vation, FLAT hierarchy, backend-aware GGUF discovery
Six targeted changes following the 10-reviewer pass on this PR:
1. studio/backend/core/inference/llama_cpp.py
Remove env.pop("CUDA_VISIBLE_DEVICES", None) in the XPU llama-server
spawn branch. The pop contradicted the documented preservation in
apply_gpu_ids() for hybrid NVIDIA+Intel hosts where the parent sets
CUDA_VISIBLE_DEVICES="" to force Studio onto Intel/SYCL. Added a
comment referencing that design note.
2. studio/backend/core/inference/llama_cpp.py
Make _get_gpu_free_memory backend-aware. NVIDIA keeps the cheap
nvidia-smi fast path; AMD ROCm and Intel XPU fall through to
get_visible_gpu_utilization so GGUF offload selection has per-GPU
free memory on all backends, not just NVIDIA. Returns an empty list
only when no telemetry is available.
3. studio/backend/utils/hardware/hardware.py
detect_hardware now honors a non-empty ZE_AFFINITY_MASK as an
explicit XPU hint. On a hybrid host with both torch.cuda and
torch.xpu available, ZE_AFFINITY_MASK=... picks XPU before CUDA so
users can pin Intel without also setting CUDA_VISIBLE_DEVICES="".
4. studio/backend/utils/hardware/hardware.py
Add _xpu_hierarchy_is_composite() and gate the Level Zero mask
parsing on it. In the FLAT hierarchy (oneAPI 2024+ default)
numeric ZE_AFFINITY_MASK entries are tile or device handles, not
root GPU IDs. _get_parent_visible_gpu_spec now returns
supports_explicit_gpu_ids=False in FLAT so explicit gpu_ids are
rejected; _resolve_xpu_smi_device_id returns None so callers skip
"xpu-smi -d <wrong_tile>" and fall back to torch.xpu VRAM
telemetry. COMPOSITE retains the original root-ID mapping.
5. studio/backend/utils/hardware/hardware.py
Conservative XPU used_gb fallback. When torch.xpu lacks
mem_get_info, _torch_get_per_device_info now returns used_gb=None
instead of torch.xpu.memory_allocated (process-local and
misleading for multi-tenant placement). get_visible_gpu_utilization
guards against None during percent computation; downstream
auto_select_gpu_ids already filters None entries.
6. studio/backend/models/inference.py, studio/backend/models/training.py
gpu_ids Field descriptions now mention the Intel XPU
ZE_AFFINITY_MASK subdevice and FLAT-hierarchy restrictions
alongside the existing CUDA UUID/MIG note.
Testing:
- 96 PR-authored tests still pass (test_gpu_selection,
test_gpu_selection_sandbox, test_utils).
- 234 additional simulation tests across two isolated uv venvs
cover fix-fix interactions, malformed inputs, cross-platform
behavior (Linux, macOS, Windows), real subprocess env propagation,
JSON/Pydantic schema roundtrips, concurrency, and fuzz.
- 6/6 CUDA scenarios on 8xB200 hardware are byte-identical to the
pre-fix baseline. Zero CUDA regression.
- ROCm, MLX, and CPU regression tests all pass. AMD HIP_VISIBLE_DEVICES
and ROCR_VISIBLE_DEVICES propagation is unchanged.
Backend support after these changes: NVIDIA CUDA, AMD ROCm, Intel XPU
(both COMPOSITE and FLAT hierarchies), CPU, and Apple Silicon MLX.
for more information, see https://pre-commit.ci
…s, FLAT numerics Follow-up to 6c55664 addressing the four findings from the second 10-reviewer pass. Each fix is narrow and additive; the CUDA, ROCm, MLX, and CPU paths remain byte-identical to the pre-round-2 baseline. A. studio/backend/core/inference/llama_cpp.py Replace `dict.get(a) or dict.get(b)` with explicit `is None` fallbacks when reading vram_total_gb / vram_used_gb in the generic free-memory path. An idle GPU with vram_used_gb=0.0 was being treated as missing telemetry and dropped, which pushed GGUF placement into the non-placement fallback whenever the best free card was fully idle. 9/10 reviewers in the round-2 pass flagged this; every suggestion was the same patch. B. studio/backend/utils/hardware/hardware.py Tighten the ZE_AFFINITY_MASK -> XPU detection hint from "any non-empty mask wins" to "non-empty mask plus one of: CUDA_VISIBLE_DEVICES explicitly hidden (empty / -1), or UNSLOTH_FORCE_XPU=1, or CUDA simply not available". Prevents a stray ZE_AFFINITY_MASK=0 inherited from unrelated Intel tooling from silently flipping existing hybrid CUDA deployments onto XPU. Intel-only hosts are unaffected (CUDA unavailable -> hint is honoured). C. studio/backend/core/inference/llama_cpp.py Refuse to use device indices returned from get_visible_gpu_utilization when index_kind != "physical". Relative ordinals (reported for subdevice / wildcard / UUID parent masks) are not safe to round-trip back into ZE_AFFINITY_MASK / CUDA_VISIBLE_DEVICES; returning [] lets the launcher skip placement and inherit the parent's visibility unchanged. D. studio/backend/utils/hardware/hardware.py Relax _get_parent_visible_gpu_spec in FLAT hierarchy. Per Intel's Level Zero device-hierarchy docs, numeric ZE_AFFINITY_MASK entries in FLAT mode ARE stable flat ordinals that torch.xpu honours 1-to-1 with the child process. Previously we blanket-rejected all FLAT numeric masks, which meant Studio could not do XPU auto-selection under the default oneAPI configuration. Now: - FLAT + numeric mask -> accept as numeric_ids - FLAT + unset mask -> expose range(device_count) - FLAT + subdevice "N.M" -> reject (invalid syntax in FLAT) - FLAT + wildcard / non-numeric -> reject - COMPOSITE semantics unchanged (numeric -> root IDs) _resolve_xpu_smi_device_id continues to return None in FLAT because xpu-smi -d addresses root IDs, not flat ordinals; the telemetry path falls back to torch.xpu VRAM. Testing: - 96 PR-authored tests still pass (test_gpu_selection_sandbox, test_gpu_selection, test_utils). - 264 simulation tests pass across two isolated uv venvs. The round-2 tests include explicit coverage for: Fix A: idle GPU preserved, None still skipped, 0.0 total ok Fix B: bare mask on hybrid -> CUDA wins; CVD="" or "-1" hint -> XPU; UNSLOTH_FORCE_XPU=1 -> XPU; CUDA unavailable -> XPU; UNSLOTH_FORCE_XPU=0 (literal) not activated Fix C: relative ordinals not returned; physical ordinals returned; missing index_kind treated as physical Fix D: FLAT numeric accepted; FLAT unset auto-enumerated; FLAT subdevice rejected; FLAT wildcard rejected; xpu-smi resolver still None in FLAT; COMPOSITE unchanged - 6/6 CUDA scenarios on 8xB200 remain byte-identical to baseline. - ROCm, MLX, CPU regression tests all pass.
for more information, see https://pre-commit.ci
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the backend, including inference, training, and hardware utility modules. Key updates include backend-aware GPU memory querying, support for the ZE_AFFINITY_MASK environment variable for device visibility, and integrated telemetry via xpu-smi. Additionally, the changes address XPU-specific constraints in multiprocessing and improve error reporting for device memory issues. Feedback was provided regarding the logic for calculating visible GPU counts when parsing ZE_AFFINITY_MASK, specifically to better distinguish between wildcard and invalid masks.
| roots = _parse_ze_mask_roots(xpu_visible) | ||
| # Non-digit wildcards (e.g. "*") yield an empty roots list; | ||
| # treat those as "all physical XPUs visible". | ||
| _visible_gpu_count = ( | ||
| len(set(roots)) if roots else get_physical_gpu_count() | ||
| ) |
There was a problem hiding this comment.
The fallback logic for _visible_gpu_count when roots is empty is incorrect. It treats any mask that results in an empty roots list (like invalid masks ",," or non-numeric wildcards *) as a signal to use all physical GPUs. This is only correct for a wildcard like "*". For invalid or empty-like masks, the count should be 0.
This can lead to an incorrect number of visible GPUs being reported, potentially causing issues with resource allocation.
I suggest refining the logic to explicitly handle the wildcard case and treat other non-parsing masks as yielding zero devices.
| roots = _parse_ze_mask_roots(xpu_visible) | |
| # Non-digit wildcards (e.g. "*") yield an empty roots list; | |
| # treat those as "all physical XPUs visible". | |
| _visible_gpu_count = ( | |
| len(set(roots)) if roots else get_physical_gpu_count() | |
| ) | |
| # Non-digit wildcards (e.g. "*") should be treated as "all physical XPUs visible", | |
| # while other invalid or empty masks should result in 0. | |
| if xpu_visible.strip() == "*": | |
| _visible_gpu_count = get_physical_gpu_count() | |
| else: | |
| roots = _parse_ze_mask_roots(xpu_visible) | |
| _visible_gpu_count = len(set(roots)) |
…rd fallback Follow-up to eebf077 addressing three findings from the third 10-reviewer pass and Gemini review 4119296322. All changes are in studio/backend/utils/hardware/hardware.py; CUDA/ROCm/MLX/CPU paths remain byte-identical to the pre-round-3 baseline. E. _get_parent_visible_gpu_spec in FLAT hierarchy now refuses explicit gpu_ids. Per Intel's "flattening-gpu-tile-hierarchy" doc and the Level Zero spec, numeric ZE_AFFINITY_MASK entries in FLAT mode are tile / device-handle ordinals, not physical GPU IDs -- accepting them as "Physical GPU indices" breaks the API contract documented in models/inference.py and models/training.py. numeric_ids is still populated so display and auto-selection enumerate devices, but supports_explicit_gpu_ids is False in FLAT regardless of mask presence. Users who need explicit tile-level selection can opt in with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE. COMPOSITE semantics are unchanged -- numeric masks still resolve to root GPU IDs. Flagged by 6/10 reviewers in the third pass. F. detect_hardware restructured so UNSLOTH_FORCE_XPU=1 is a standalone trigger. Previously the outer "if ze_mask:" gate swallowed the force knob unless ZE_AFFINITY_MASK was also set, which meant the documented override did not work on a common hybrid NVIDIA+Intel scenario. Now the preference check is: prefer_xpu = force_xpu or (ze_mask and (cuda_hidden or cuda_unavailable)) The "stray ZE_AFFINITY_MASK flips CUDA to XPU" protection from round 2 is preserved: a bare mask without CUDA-hidden still keeps CUDA. Flagged by 3/10 reviewers in the third pass. G. get_visible_gpu_count torch-fallback path: only "*" wildcard counts as "all physical XPUs visible". Other non-parseable masks (",,,", "GPU-uuid", random garbage) now return 0 visible devices rather than silently exposing the full fleet. The happy path (torch.xpu .device_count() succeeds) is unchanged. Suggested by the Gemini review bot. Testing: - 96 PR-authored tests still pass (test_gpu_selection_sandbox, test_gpu_selection, test_utils). - 284 simulation tests pass across two isolated uv venvs. New test_round3_fixes.py covers each round-3 finding explicitly: Fix E: FLAT rejects gpu_ids via prepare_gpu_selection; FLAT unset also rejects; COMPOSITE still accepts; FLAT auto-select gracefully falls to inherit_parent_visible mode. Fix F: force_xpu alone picks XPU; force_xpu=0 does not; no-xpu-torch falls back to CUDA; force_xpu + mask works; CVD="" + mask path unchanged; bare mask still keeps CUDA. Fix G: "*" -> all_physical; ",,," -> 0; "GPU-abc" -> 0; "0,1,*" -> 2 (partial wildcard); subdevice -> 1 root; no mask unchanged; torch happy path bypasses fallback. - 6/6 CUDA scenarios on 8xB200 byte-identical to pre-fix baseline. - ROCm, MLX, CPU regression tests all pass.
for more information, see https://pre-commit.ci
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements comprehensive support for Intel XPU (Intel GPUs) across the inference and training backends. Key updates include backend-aware hardware detection, GPU utilization monitoring via xpu-smi, and the correct application of ZE_AFFINITY_MASK for device pinning. The changes also refine autocast logic for diverse hardware and update error handling to recognize XPU-specific memory issues. The review feedback focuses on improving observability by replacing silent exception handlers with debug logging in the hardware utility module.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
This broad except Exception: pass could hide unexpected errors during the cache clearing process. It's better to log the exception, even at a debug level, to aid in future debugging. This aligns with the general rule of avoiding silent exception handlers.
| except Exception: | |
| pass | |
| except Exception as e: | |
| logger.debug("Failed to clear XPU cache: %s", e) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Silently passing on all exceptions here can obscure issues with xpu-smi. It would be beneficial to log the exception at a debug level to help with troubleshooting, in line with the general rule to avoid silent exception handlers.
| except Exception: | |
| pass | |
| except Exception as e: | |
| logger.debug("xpu-smi query failed: %s", e) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
This except Exception: pass block could hide errors when querying VRAM information from torch.xpu. Logging the exception at a debug level would be helpful for diagnostics, following the general rule to avoid silent exception handlers.
| except Exception: | |
| pass | |
| except Exception as e: | |
| logger.debug("torch.xpu VRAM query failed: %s", e) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| except Exception: | ||
| if xpu_visible: | ||
| # Fallback: count unique root device IDs from the mask. | ||
| # ZE_AFFINITY_MASK can use "device.subdevice" notation, | ||
| # so "0.0,0.1" is 1 root device, not 2. Without torch we | ||
| # cannot know which hierarchy mode is active, so fall back | ||
| # to root-device counting (the more conservative choice). | ||
| if xpu_visible == "*": | ||
| # Documented wildcard: all physical XPUs visible. | ||
| _visible_gpu_count = get_physical_gpu_count() | ||
| else: | ||
| roots = _parse_ze_mask_roots(xpu_visible) | ||
| # Non-parseable masks (",,,", "GPU-abc", etc.) yield an | ||
| # empty roots list and are treated as 0 visible devices, | ||
| # not "all visible" -- we have no evidence the user | ||
| # intended to expose the whole fleet. | ||
| _visible_gpu_count = len(set(roots)) | ||
| else: | ||
| _visible_gpu_count = get_physical_gpu_count() | ||
| return _visible_gpu_count |
There was a problem hiding this comment.
The except Exception block here correctly implements fallback logic, but it doesn't log the exception that occurred. It would be valuable for debugging to log the error from torch.xpu.device_count() at a debug level before proceeding with the fallback.
| except Exception: | |
| if xpu_visible: | |
| # Fallback: count unique root device IDs from the mask. | |
| # ZE_AFFINITY_MASK can use "device.subdevice" notation, | |
| # so "0.0,0.1" is 1 root device, not 2. Without torch we | |
| # cannot know which hierarchy mode is active, so fall back | |
| # to root-device counting (the more conservative choice). | |
| if xpu_visible == "*": | |
| # Documented wildcard: all physical XPUs visible. | |
| _visible_gpu_count = get_physical_gpu_count() | |
| else: | |
| roots = _parse_ze_mask_roots(xpu_visible) | |
| # Non-parseable masks (",,,", "GPU-abc", etc.) yield an | |
| # empty roots list and are treated as 0 visible devices, | |
| # not "all visible" -- we have no evidence the user | |
| # intended to expose the whole fleet. | |
| _visible_gpu_count = len(set(roots)) | |
| else: | |
| _visible_gpu_count = get_physical_gpu_count() | |
| return _visible_gpu_count | |
| except Exception as e: | |
| logger.debug("torch.xpu.device_count() failed, falling back to mask parsing: %s", e) | |
| if xpu_visible: | |
| # Fallback: count unique root device IDs from the mask. | |
| # ZE_AFFINITY_MASK can use "device.subdevice" notation, | |
| # so "0.0,0.1" is 1 root device, not 2. Without torch we | |
| # cannot know which hierarchy mode is active, so fall back | |
| # to root-device counting (the more conservative choice). | |
| if xpu_visible == "*": | |
| # Documented wildcard: all physical XPUs visible. | |
| _visible_gpu_count = get_physical_gpu_count() | |
| else: | |
| roots = _parse_ze_mask_roots(xpu_visible) | |
| # Non-parseable masks (",,,", "GPU-abc", etc.) yield an | |
| # empty roots list and are treated as 0 visible devices, | |
| # not "all visible" -- we have no evidence the user | |
| # intended to expose the whole fleet. | |
| _visible_gpu_count = len(set(roots)) | |
| else: | |
| _visible_gpu_count = get_physical_gpu_count() |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| except Exception: | ||
| _visible_gpu_count = get_physical_gpu_count() |
There was a problem hiding this comment.
Similar to other fallbacks in this file, it would be beneficial to log the exception from torch.cuda.device_count() at a debug level before falling back to get_physical_gpu_count(). This aids in debugging scenarios where the PyTorch call fails unexpectedly.
| except Exception: | |
| _visible_gpu_count = get_physical_gpu_count() | |
| except Exception as e: | |
| logger.debug("torch.cuda.device_count() failed, falling back to physical count: %s", e) | |
| _visible_gpu_count = get_physical_gpu_count() |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| except Exception: | ||
| # Treat a failing probe as "runtime not touched yet" so | ||
| # pre-init CPU preprocessing can still parallelize. | ||
| pass |
There was a problem hiding this comment.
The comment explains the fallback behavior, but silently passing on the exception could hide underlying issues with the torch.xpu.is_initialized() call. Logging the exception at a debug level would provide more insight during troubleshooting, while still allowing the safe fallback to proceed.
| except Exception: | |
| # Treat a failing probe as "runtime not touched yet" so | |
| # pre-init CPU preprocessing can still parallelize. | |
| pass | |
| except Exception as e: | |
| # Treat a failing probe as "runtime not touched yet" so | |
| # pre-init CPU preprocessing can still parallelize. | |
| logger.debug("torch.xpu.is_initialized() check failed, assuming not initialized: %s", e) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
Gemini's fourth review pass (pullrequestreview-4119484559) flagged six bare `except Exception: pass` blocks in hardware.py where silent failures could hide real driver or runtime problems. Each site now logs at DEBUG level with the exception message, so operators have a diagnostic trail when something goes wrong, while the control flow stays unchanged and the same fallback path is taken: 1. clear_gpu_cache() XPU branch: log "Failed to clear XPU cache" 2. _resolve_xpu_smi_device_id(): log "torch.xpu.current_device() probe failed" 3. _get_xpu_utilization() xpu-smi call: log "xpu-smi query failed" 4. _get_xpu_utilization() torch.xpu VRAM query: log "torch.xpu VRAM query failed" 5. get_visible_gpu_count() torch.xpu fallback: log and still count mask roots 6. get_visible_gpu_count() torch.cuda fallback: log and still use physical count 7. dataset_map_num_proc() is_initialized probe: log "torch.xpu.is_initialized() probe failed" These are logging-only edits; every test in the PR's own suite, the two simulation harnesses (284 tests), and the 8x B200 CUDA behavior matrix (6/6 scenarios byte-identical) pass unchanged.
Round-4 review (10 parallel reviewers on PR unslothai#4724) produced a unanimous consensus finding plus two isolated but real regressions. This commit addresses all three. [10/10] studio/backend/utils/hardware/hardware.py _get_parent_visible_gpu_spec() on Intel XPU with ZE_FLAT_DEVICE_HIERARCHY unset (the oneAPI default FLAT hierarchy) and no ZE_AFFINITY_MASK was internally inconsistent: it returned numeric_ids=[0..N) with supports_explicit_gpu_ids=False. Downstream, get_visible_gpu_utilization() treated those ordinals as numeric_ids!=None, labeled them index_kind="physical", and llama.cpp's _get_gpu_free_memory() would round-trip tile/device-handle ordinals back into ZE_AFFINITY_MASK as if they were stable root-GPU IDs. Collapse the FLAT+no-mask case to numeric_ids=None so the telemetry path falls into its relative-ordinal branch and auto-selection uniformly returns inherit_parent_visible. Users who need explicit Intel selection opt in via ZE_FLAT_DEVICE_HIERARCHY =COMPOSITE. [1/10] studio/backend/core/inference/llama_cpp.py LlamaCppBackend._get_gpu_free_memory() always probed nvidia-smi before falling back to the generic telemetry path. On a hybrid NVIDIA+Intel host running Studio in XPU mode, nvidia-smi returned NVIDIA indices and those ordinals were then used to build a ZE_AFFINITY_MASK for llama-server, pinning the server to the wrong device. Gate the nvidia-smi fast path on get_device() == CUDA and IS_ROCM=False so other backends skip straight to the backend-aware telemetry path. [1/10] studio/backend/utils/utils.py format_error_message() dropped the broad "memory"/"cuda" match in favor of specific substrings. That stopped matching common CUDA allocation failures such as "CUDA error: CUBLAS_STATUS_ALLOC_FAILED", so users saw the raw backend exception instead of the friendly "Not enough GPU memory" hint. Re-add the CUBLAS, cuda-error+alloc, and xpu-alloc patterns while keeping the matcher narrow enough that non-memory CUDA errors (e.g. "device-side assert triggered") still pass through unchanged. Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174), sim suite v2 (128 incl. 14 new round-5 coverage), 8x B200 CUDA matrix (6/6 scenarios byte-identical to pre-fix baseline).
for more information, see https://pre-commit.ci
Round-5 review (10 parallel reviewers on PR unslothai#4724) narrowed to two real correctness issues in the XPU path introduced by round-5's numeric_ids handling. This commit addresses both. [4/10] studio/backend/utils/hardware/hardware.py -- get_device_map() The "numeric_ids is None and visible_count > 1 -> balanced" heuristic was meant for CUDA UUID/MIG masks, but after round 5 collapsed FLAT no-mask XPU into numeric_ids=None, the heuristic started firing on the default Intel FLAT hierarchy and forced balanced sharding across tile handles even though the same module had already classified those ordinals as non-physical. Restrict the heuristic to CUDA so XPU stays on sequential loading unless the caller passes explicit gpu_ids (via ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE or prepare_gpu_selection). [3/10] studio/backend/utils/hardware/hardware.py -- get_visible_gpu_utilization, get_backend_visible_gpu_info FLAT numeric masks like ZE_AFFINITY_MASK="0,1" populate numeric_ids for telemetry display but have supports_explicit_gpu_ids=False because the tokens are tile/device-handle ordinals per Intel's Level Zero docs. Both helpers still labeled those ordinals index_kind="physical", which let LlamaCppBackend._get_gpu_free_memory() round-trip them back into ZE_AFFINITY_MASK as if they were stable root GPU IDs. Gate the "physical" label on parent_visible_spec["supports_explicit_gpu_ids"] and clear parent_visible_gpu_ids in the payload when it's False, so API consumers and llama.cpp skip the placement path. Verified: PR-authored tests (96 + 4 skipped), sim suite v1 (174, one updated), sim suite v2 (141 incl. 13 new round-6 tests), 8x B200 CUDA matrix (6/6 scenarios byte-identical to round-5 baseline).
for more information, see https://pre-commit.ci
Round-7 review (10 parallel reviewers on PR unslothai#4724) produced a strong convergent finding (7/10 independent reviewers on the same underlying issue): the PR is titled "enable Studio for Intel GPU" but on the default Intel oneAPI runtime (ZE_FLAT_DEVICE_HIERARCHY=FLAT, no ZE_AFFINITY_MASK set), Studio still couldn't auto-select or shard across multiple visible XPUs. Users had to manually opt into COMPOSITE hierarchy for the PR's advertised feature to actually work. The reviewers correctly noted that this is fixable without relaxing the "physical IDs only" contract of the public prepare_gpu_selection() API. torch.xpu ordinals are stable within a single worker process's inherited ZE_AFFINITY_MASK scope: Studio can safely generate them internally for auto-selection, narrow the mask via apply_gpu_ids(), and let the child process inherit the same torch.xpu device set. Round-trip stays 1:1 within the process boundary the ordinals were observed in. [7/10] studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids() When get_device()==XPU and supports_explicit_gpu_ids=False (FLAT hierarchy cases), instead of immediately bailing out with selection_mode="inherit_parent_visible" and selected=None, fall back to list(range(get_visible_gpu_count())) as the candidate set and continue into the VRAM-based selection logic. This path is flagged via a new metadata key xpu_relative_auto_select=True so telemetry stays introspectable. The public prepare_gpu_selection() API is unchanged and still refuses user-supplied explicit gpu_ids on these masks, because end users can't reason about FLAT tile handles; only Studio's own auto-selection (which controls the worker-process env) can use these ordinals safely. [3/10] studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory() The index_kind="relative" guard was rejecting XPU ordinals alongside CUDA UUID/MIG fallbacks. CUDA relative indices stay unsafe (the parent has hidden the physical ID mapping, so re-exporting them into CUDA_VISIBLE_DEVICES would re-expose hidden GPUs). XPU relative ordinals are stable for the current worker scope, so accept them and let VRAM-based GGUF placement work on default Intel FLAT hosts. Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2 (153 incl. 11 new round-7 tests, 3 updated to new contract), 8x B200 CUDA matrix (6/6 scenarios byte-identical to round-5 baseline).
for more information, see https://pre-commit.ci
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts could use multiple visible XPUs via auto-selection and VRAM-based llama.cpp placement. Round-8 review (6/10 convergent finding on 10 parallel reviewers of PR unslothai#4724) showed that fallback triggered even when the parent process had already narrowed visibility via ZE_AFFINITY_MASK, which silently retargeted child processes onto different Level Zero handles than the parent exposed. Example: parent sets ZE_AFFINITY_MASK="3,5" to expose tiles 3 and 5. torch.xpu in the parent sees ordinals 0 (=tile 3) and 1 (=tile 5). Round-7 auto_select returned [0, 1]; apply_gpu_ids() rewrote ZE_AFFINITY_MASK="0,1"; the child inherited that and saw tile 0 and tile 1, not the originally-exposed 3 and 5. Same regression on subdevice masks like "0.0,0.1". studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids() Gate xpu_relative_auto_select on parent_visible_spec["raw"] is None. When a parent mask is already set, defer to inherit_parent_visible so the child inherits the exact visibility the parent intended. studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory() Gate the relative-XPU acceptance on os.environ.get("ZE_AFFINITY_MASK") is None. Inherited masks (numeric "3,5", subdevice "0.0,0.1", wildcard "*") are preserved via the existing "return []" fall-through. Verified: PR tests (96 + 4 skipped), sim_pr4724 (174), sim_fixes_v2 (162 incl. 9 new round-8 tests + 2 updated to the new inherited-mask contract), 8x B200 CUDA matrix (6/6 scenarios byte-identical).
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the inference, training, and hardware utility modules. Key changes include backend-aware hardware detection, telemetry integration via xpu-smi and torch.xpu, and the implementation of ZE_AFFINITY_MASK for device visibility management. The PR also refactors GPU cache clearing and error formatting to be device-agnostic and addresses potential multiprocessing issues on XPU by disabling fork-based parallelism once the runtime is initialized. Feedback was provided to improve the robustness of environment variable parsing for GPU selection by filtering out empty tokens to prevent potential ValueErrors.
| cvd = os.environ.get("CUDA_VISIBLE_DEVICES") | ||
| if cvd is not None and cvd.strip(): | ||
| try: | ||
| allowed = set(int(x.strip()) for x in cvd.split(",")) |
There was a problem hiding this comment.
The list comprehension for parsing CUDA_VISIBLE_DEVICES should filter out empty strings to avoid a ValueError when calling int() on a trailing or double comma (e.g., "0,1,"). While the surrounding try...except block prevents a crash, it causes the entire filter to be ignored (allowed becomes None), which might lead to unexpected GPU selection. Filtering empty tokens ensures that numeric masks with trailing commas are still parsed correctly, consistent with the logic used in utils/hardware/hardware.py.
| allowed = set(int(x.strip()) for x in cvd.split(",")) | |
| allowed = set(int(x.strip()) for x in cvd.split(",") if x.strip()) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
Gemini review 4120214370 caught that LlamaCppBackend._get_gpu_free_memory()
built its CUDA_VISIBLE_DEVICES filter with int(x.strip()) for x in
cvd.split(","), which raises ValueError on an empty token from a shell-
exported value like "0,1," (trailing comma) or "0,,1" (doubled comma).
The surrounding try/except swallowed the crash but left ``allowed=None``,
which silently disables the filter so every nvidia-smi GPU becomes a
placement candidate even though the user clearly narrowed visibility.
Skip empty tokens during the parse so trailing/doubled/leading commas
still produce a valid filter, matching the "if token.strip()" pattern
already used in utils/hardware/hardware.py.
Round-7 added a relative-ordinal fallback so default Intel FLAT hosts could auto-select multiple visible XPUs. Round-9 review (4/10 convergent plus related findings on 10 parallel reviewers of PR unslothai#4724) showed that on multi-tile Intel devices (Data Center GPU Max, 2 tiles per root), the synthesized torch.xpu ordinals enumerate tiles rather than distinct root GPUs. Writing them back via apply_gpu_ids() then rewrote ZE_AFFINITY_MASK with tile handles -- narrowing the worker onto a subset of one card instead of spreading across the parent-visible set. This revert + replacement keeps multi-XPU sharding working on default Intel FLAT without any environment-visible ordinal round-trip: studio/backend/utils/hardware/hardware.py -- auto_select_gpu_ids() Reverted R7-A. When the parent-visible spec cannot expose stable physical GPU IDs (FLAT no-mask, FLAT numeric, wildcard, subdevice), defer to inherit_parent_visible with selected_gpu_ids=None. No more synthesized worker-local ordinals written back into ZE_AFFINITY_MASK. studio/backend/core/inference/llama_cpp.py -- _get_gpu_free_memory() Reverted R7-B. XPU relative telemetry is rejected for placement just like CUDA UUID/MIG. llama-server inherits the parent's mask unchanged and uses its own multi-device machinery. studio/backend/utils/hardware/hardware.py -- get_device_map() New path: when device==XPU and caller did not pass gpu_ids, return "balanced" whenever more than one torch.xpu device is visible (either via unresolved mask or FLAT numeric mask that populates numeric_ids but does not support explicit selection). HF Transformers uses the torch ordinals directly -- scope-local within the worker process, no ZE_AFFINITY_MASK rewrite, so tile-vs-root ambiguity cannot leak out. Explicit gpu_ids=[0] continues to produce sequential so the caller's deliberate single-device choice is respected. Verified: PR tests (96 + 4 skipped), sim_pr4724 (174, one contract update), sim_fixes_v2 (183 incl. 15 new round-10 tests + 10 updated for the reverted contracts), 8x B200 CUDA matrix (6/6 scenarios byte-identical).
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the backend. Key changes include updating hardware detection to recognize XPU devices via ZE_AFFINITY_MASK and UNSLOTH_FORCE_XPU environment variables, implementing backend-aware GPU memory and utilization tracking using xpu-smi and torch.xpu, and ensuring proper device pinning using ZE_AFFINITY_MASK. Additionally, the PR updates audio codec handling to maintain CPU fallbacks for non-validated backends, refactors GPU cache clearing to be device-agnostic, and adjusts multiprocessing logic to prevent XPU runtime corruption on fork. I have no feedback to provide as there are no review comments.
Shorten multi-line design-rationale comments to 1-3 lines each. The surrounding code and commit history carry the full context; inline comments should state the rule, not re-derive the reasoning. Net reduction: ~107 comment lines removed across 2 files. No behavior change.
|
Any likelihood of this getting merged soon? |
|
@nobody5050 yes it's being reviewed and tessted with the help of community users before we merge. |
No description provided.