Enable studio for Intel GPU #43
for more information, see https://pre-commit.ci
- Fix _get_xpu_utilization() metric indices: use -m 0,2,3 (GPU Util, Power, Core Temp) instead of -m 0,1,2,18 which mapped parts[3] to temperature incorrectly (it was actually GPU Memory Utilization). Now correctly parses utilization, power draw, and temperature.
- Add -n 1 flag so xpu-smi dump exits after one sample instead of running indefinitely until the 5s timeout kills it.
- Use torch.xpu.current_device() for the -d flag instead of hardcoding device 0, so multi-GPU XPU setups query the correct device.
- Populate power_draw_w in the returned dict instead of always None.
- Fix versions["xpu"] = True (bool) to use the actual XPU version string from torch.version.xpu, falling back to "available". This keeps the dict type-consistent (all str or None).
- Remove dead code in get_visible_gpu_count() where the XPU branch at line 1357 was unreachable because the XPU early-return block above always returns before that point.
Skip the xpu-smi subprocess entirely when the binary is not on PATH. This avoids a multi-second timeout on Intel GPU systems that have PyTorch XPU support but no xpu-smi tooling installed. The function still falls back to torch.xpu for VRAM metrics.
Prefer torch.xpu.device_count() over manual mask parsing since the runtime correctly interprets all ZE_AFFINITY_MASK syntax including subdevice notation (e.g. "0.0,0.1" is 1 root device, not 2). The manual parsing fallback now counts unique root device IDs from the mask, handling "device.subdevice" notation correctly.
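As a rough illustration of the manual fallback described above, the sketch below counts unique root device IDs in a mask string. The function name and the `physical_count` parameter are hypothetical and not taken from the PR; wildcard masks are deliberately not handled here.

```python
from typing import Optional


def count_visible_roots(mask: Optional[str], physical_count: int) -> int:
    """Count root devices exposed by a ZE_AFFINITY_MASK-style string.

    physical_count is a hypothetical stand-in for the number of GPUs
    physically present; it is only used when the mask is unset.
    """
    if mask is None:
        return physical_count  # variable unset: nothing is hidden
    roots = set()
    for token in mask.split(","):
        # "0.1" means root device 0, subdevice 1; keep only the root part.
        root = token.strip().split(".", 1)[0]
        if root.isdecimal():
            roots.add(int(root))
    return len(roots)


print(count_visible_roots("0.0,0.1", physical_count=4))
print(count_visible_roots("0,1", physical_count=4))
```

With this approach, "0.0,0.1" counts as one root device and "0,1" as two, which matches the behavior the commit message describes.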
- _get_xpu_utilization: request metrics -m 0,1,3 (Util, Power, Temp) rather than 0,2,3 so the power column no longer reports MHz as watts.
- _resolve_xpu_smi_device_id: map torch.xpu.current_device() (logical ordinal under ZE_AFFINITY_MASK) to the physical root device id that xpu-smi -d expects, so telemetry targets the active GPU.
- Merge the duplicated torch blocks in _get_xpu_utilization so the VRAM lookup is guarded and the device index is computed once.
- format_error_message: only rewrite true OOM errors (out of memory substrings) as memory errors, so non-OOM XPU/CUDA failures surface their real cause instead of a misleading memory message.
- inference.py DAC generation: derive autocast device from model.device.type, not the global backend, so CPU-fallback models on an XPU host do not open a GPU autocast context.
- dataset_map_num_proc: only disable XPU multiprocessing after the XPU runtime is actually initialized in this process, so pure CPU-side dataset preprocessing can still parallelize on Intel hosts.
- get_package_versions: preserve the "available" fallback for xpu when torch.version.xpu exists as None.
- get_visible_gpu_count: normalize ZE_AFFINITY_MASK parsing so the None and empty-string branches do not rely on implicit scoping.
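A minimal sketch of the telemetry path, assuming xpu-smi dump emits one CSV row per sample laid out as timestamp, device id, then one column per requested metric (which is what the parts[3]/parts[4] indexing in this PR implies). The helper names are hypothetical, and the sample row in the usage lines is made up for illustration, not captured from real hardware.

```python
from typing import Dict, List, Optional

# Values treated as "no reading" (compared case-insensitively).
_MISSING = {"", "n/a", "na", "-"}


def parse_metric(raw: str) -> Optional[float]:
    """Convert one CSV cell to float, or None when it is missing or malformed."""
    text = raw.strip()
    if text.lower() in _MISSING:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def build_dump_command(device_id: int) -> List[str]:
    """One-shot dump of metrics 0,1,3 (utilization, power, core temperature)."""
    return ["xpu-smi", "dump", "-d", str(device_id), "-m", "0,1,3", "-n", "1"]


def parse_dump_output(output: str) -> Optional[Dict[str, Optional[float]]]:
    """Return the first data row as a dict, skipping header or malformed lines."""
    for line in output.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 5 or not parts[1].isdecimal():
            continue
        return {
            "gpu_util": parse_metric(parts[2]),
            "power_w": parse_metric(parts[3]),
            "temp_c": parse_metric(parts[4]),
        }
    return None


# Illustrative output only; real column headers may differ.
sample = "Timestamp, DeviceId, Util, Power, Temp\n06:14:46.000,    0, 12.50, N/A, 41.00\n"
print(build_dump_command(0))
print(parse_dump_output(sample))
```

Parsing each column independently means one unreadable value becomes None rather than discarding the whole row.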
…tion
Round 2 fixes addressing reviewer feedback:
- format_error_message: tightening "out of memory" coverage in round 1
dropped CPU allocator failures like "not enough memory to allocate"
and "cannot allocate memory", and Level Zero
ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY. Restore those patterns while
still excluding non-memory XPU/CUDA exceptions.
- apply_gpu_ids: route Intel XPU through ZE_AFFINITY_MASK instead of
CUDA_VISIBLE_DEVICES so worker subprocesses are actually pinned to
the requested GPUs on multi-XPU hosts.
- _get_parent_visible_gpu_spec: add an XPU branch that reads
ZE_AFFINITY_MASK and returns physical root device IDs, so the
visibility/selection stack reports the correct devices on Intel
hosts. Honors subdevice syntax and wildcards.
- Extract _parse_ze_mask_roots helper for the ZE_AFFINITY_MASK
parsing previously duplicated between _resolve_xpu_smi_device_id
and get_visible_gpu_count. Single source of truth for the mask
semantics.
- get_visible_gpu_count: treat non-digit wildcard masks (e.g. "*")
as "all physical XPUs visible" rather than zero.
- get_package_versions: also set versions["xpu"] = None in the
except block so a failing XPU probe does not leave the key missing.
- inference.py DAC autocast: clamp the resolved device_type to
("cuda", "xpu", "cpu") so exotic devices like "meta" during
accelerate offloaded loading do not raise.
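The apply_gpu_ids routing above can be sketched as follows. This is an illustrative stand-in, not the PR's implementation: `build_worker_env` is a hypothetical name, the backend is passed as a plain string, and the function returns a copy of the environment instead of mutating os.environ.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def build_worker_env(
    gpu_ids: Iterable[int],
    backend: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return an environment that pins a worker to the requested GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    spec = ",".join(str(i) for i in gpu_ids)
    # Intel XPU (Level Zero) reads ZE_AFFINITY_MASK; CUDA reads CUDA_VISIBLE_DEVICES.
    var = "ZE_AFFINITY_MASK" if backend == "xpu" else "CUDA_VISIBLE_DEVICES"
    env[var] = spec
    return env


env = build_worker_env([0, 2], "xpu", base_env={"PATH": "/usr/bin"})
print(env["ZE_AFFINITY_MASK"])
```

The resulting dict could be passed as the `env` argument of `subprocess.Popen` when launching a worker.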
Round 3 fixes targeting the remaining gaps reviewers flagged:
- prepare_gpu_selection: allow explicit gpu_ids on Intel XPU so the
apply_gpu_ids() XPU branch (and _get_parent_visible_gpu_spec XPU
branch) are actually reachable from the normal request path.
- _parse_ze_mask_roots: stop deduplicating. Keep one root ID per mask
token so the logical-ordinal-to-physical-root mapping used by
_resolve_xpu_smi_device_id() stays 1-to-1 even for mixed subdevice
masks like "2.0,0.1,0.2". Update the docstring to document the
new shape.
- _get_parent_visible_gpu_spec: dedupe roots only at the visibility
layer, and flag subdevice masks as supports_explicit_gpu_ids=False
so resolve_requested_gpu_ids() does not try to match duplicate IDs.
Treat wildcard masks as "all physical XPUs visible".
- format_error_message: also match the literal Level Zero enum names
ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY / _HOST_MEMORY which use
underscores and were not caught by the "out of device memory"
substring.
- inference.py DAC autocast: accept "mps" in the clamp list (it has
been an autocast-supported backend since torch 2.3) and skip
autocast entirely when the model is on CPU with an unsupported
dtype like float32, since torch.amp.autocast("cpu", dtype=float32)
raises.
- resolve_requested_gpu_ids: tailor the "unsupported explicit ids"
error message to the current backend so XPU users see a
ZE_AFFINITY_MASK reference instead of a CUDA one.
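To make the ordinal-preserving behavior concrete, here is a small sketch assuming the mapping works as the round 3 notes describe: one root ID per mask token, indexed by the logical ordinal. Both function names are hypothetical stand-ins for _parse_ze_mask_roots and _resolve_xpu_smi_device_id.

```python
from typing import List, Optional


def parse_mask_roots(mask: str) -> List[int]:
    """Return one root ID per mask token, in order, without deduplicating."""
    roots = []
    for token in mask.split(","):
        root = token.strip().split(".", 1)[0]
        if root.isdecimal():
            roots.append(int(root))
    return roots


def resolve_physical_root(mask: Optional[str], logical_ordinal: int) -> int:
    """Map a logical ordinal to the physical root ID listed in the mask."""
    if not mask:
        return logical_ordinal  # no mask: logical and physical IDs coincide
    roots = parse_mask_roots(mask)
    if 0 <= logical_ordinal < len(roots):
        return roots[logical_ordinal]
    return logical_ordinal  # out of range: fall back to the ordinal itself


roots = parse_mask_roots("2.0,0.1,0.2")
print(roots)
print(resolve_physical_root("2.0,0.1,0.2", 0))
# Deduplicate only where a visibility list is needed, keeping first-seen order.
print(list(dict.fromkeys(roots)))
```

Keeping duplicates in the parsed list is what lets the logical ordinal index directly into it; deduplication is applied separately at the visibility layer.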
Round 4 fixes completing the multi-XPU story unlocked in round 3:
- get_device_map: include DeviceType.XPU in the multi-GPU branch so explicit XPU gpu_ids=[0, 1] (or a wildcard-masked multi-XPU host) loads with device_map="balanced" instead of falling back to "sequential" and pinning the model to a single device.
- auto_select_gpu_ids: allow XPU auto mode. The function relies on get_visible_gpu_utilization() for per-device free-VRAM telemetry, which already has an XPU path via _get_xpu_utilization. XPU hosts omitting gpu_ids now benefit from VRAM-aware selection.
- get_visible_gpu_count torch-less fallback: count unique mask roots via len(set(roots)) so subdevice masks like "0.0,0.1" report the intended 1 root GPU, not 2. The ordinal-preserving semantics of _parse_ze_mask_roots are kept so _resolve_xpu_smi_device_id still maps logical ordinals to physical roots correctly.
- xpu-smi subprocess timeout lowered from 10s to 3s so a hung driver does not block status polls / UI refreshes.
- DAC autocast nullcontext fallback now covers XPU+float32 as well as CPU+float32, since XPU autocast only accepts bfloat16/float16 and otherwise warns on every generate call.
- _get_parent_visible_gpu_spec subdevice dedup uses list(dict.fromkeys(...)) instead of an O(n^2) manual loop.
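The DAC autocast decision can be sketched as a pure function. This is a simplified illustration with stated assumptions: dtypes are passed as plain strings so the snippet runs without torch, the function name is hypothetical, and devices outside the clamp list (such as "meta") are assumed to fall back to no autocast, which the PR text does not spell out.

```python
from typing import Optional

# Device types the PR's clamp list accepts for autocast.
_AUTOCAST_DEVICES = ("cuda", "xpu", "cpu", "mps")
# Dtypes treated here as valid autocast targets on CPU and XPU.
_LOW_PRECISION = ("bfloat16", "float16")


def autocast_device_or_none(device_type: str, dtype_name: str) -> Optional[str]:
    """Return the device type to autocast on, or None to use a null context."""
    if device_type not in _AUTOCAST_DEVICES:
        return None  # assumption: exotic devices such as "meta" skip autocast
    if device_type in ("cpu", "xpu") and dtype_name not in _LOW_PRECISION:
        return None  # e.g. float32 on CPU or XPU
    return device_type


print(autocast_device_or_none("xpu", "float32"))
print(autocast_device_or_none("xpu", "bfloat16"))
```

A caller would then choose between torch.amp.autocast(device_type, dtype=...) and contextlib.nullcontext() based on whether the result is None.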
…ggers _get_parent_visible_gpu_spec returned numeric_ids=list(range(physical)) for wildcard ZE_AFFINITY_MASK=*, which blocked get_device_map from reaching its "unresolved multi-visible" fallback. Mirror the CUDA UUID/MIG behavior by returning numeric_ids=None with supports_explicit_gpu_ids=False, so explicit ids are still rejected and get_device_map falls back to sharding across visible devices when more than one is present.
/gemini review
Code Review
This pull request implements comprehensive support for Intel XPU hardware, covering device detection, telemetry (utilization and VRAM), and visibility management via ZE_AFFINITY_MASK. It refactors inference and training modules to use centralized hardware utility functions for device string retrieval and cache clearing, and enhances error reporting for memory-related issues. Review feedback focuses on improving code organization by moving several local imports to the top-level of their respective files in accordance with PEP 8.
+ text
+ "<|text_end|>\n<|audio_start|><|global_features_start|>\n"
)
import contextlib

if torch.cuda.is_available():
    torch.cuda.empty_cache()
from utils.hardware import clear_gpu_cache

SNAC_MODEL_NAME = "hubertsiuzdak/snac_24khz"
SNAC_SAMPLE_RATE = 24000
device = "cuda" if torch.cuda.is_available() else "cpu"
from utils.hardware import get_torch_device_str

or "cannot allocate memory" in error_str
or ("mlx" in error_str and ("memory" in error_str or "allocate" in error_str))
):
from utils.hardware import get_device
Thank you for the PR! The goal of this PR is to enable unsloth studio on Intel XPU (Level Zero) hosts with feature parity to the existing CUDA and ROCm paths: hardware detection, visibility reporting via This review merges 11 independent reviewers (7 workers from
Concrete suggestions for each finding below. Fix commit will follow in this PR. 1. [9/11]
- _backend_visible_devices_env: return ZE_AFFINITY_MASK on XPU so get_backend_visible_gpu_info reports the active mask instead of a stale or None CUDA_VISIBLE_DEVICES after apply_gpu_ids runs.
- _get_parent_visible_gpu_spec: return numeric_ids=None for subdevice masks like 0.0,0.1 so get_visible_gpu_utilization, get_backend_visible_gpu_info and get_device_map enumerate torch-visible ordinals and can still shard across logical XPUs instead of collapsing to a single root.
- _parse_ze_mask_roots: use str.isdecimal() so Unicode superscripts do not crash int() via str.isdigit() admitting them.
- _get_xpu_utilization xpu-smi parsing: accept n/a, NA, - and lowercase variants as missing, and wrap the float parse so one bad column does not drop the whole telemetry row.
- clear_gpu_cache XPU branch: guard synchronize/empty_cache with hasattr + try/except so older torch-xpu builds do not propagate AttributeError.
- apply_gpu_ids XPU branch: pop stale CUDA_VISIBLE_DEVICES so environment-inspection tools do not show conflicting pinning state.
- format_error_message: add memory allocation failed pattern and isinstance(error, MemoryError) so CPU hosts still classify OOMs that the tightened substring list dropped.
- test_gpu_selection/test_gpu_selection_sandbox: rename TestXpuRejection to TestXpuSelection and update non_cuda -> non_accelerator and CUDA-only error substring to CUDA and Intel XPU so the suite matches the new behavior.
- inference.py/llama_cpp.py/trainer.py/utils.py: hoist contextlib, clear_gpu_cache, get_torch_device_str and get_device imports to module top per PEP 8 feedback from the hosted gemini bot.
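The memory-error classification that format_error_message converges on across these rounds might look roughly like the sketch below. The pattern list is assembled from the substrings named in the commit messages; the function name `is_memory_error` is hypothetical and the real matching logic may differ.

```python
# Lowercase substrings named across the commit messages in this PR.
_MEMORY_PATTERNS = (
    "out of memory",
    "out of device memory",
    "not enough memory to allocate",
    "cannot allocate memory",
    "memory allocation failed",
    "ze_result_error_out_of_device_memory",
    "ze_result_error_out_of_host_memory",
)


def is_memory_error(error: BaseException) -> bool:
    """Return True only for failures that look like genuine out-of-memory errors."""
    if isinstance(error, MemoryError):
        return True
    text = str(error).lower()
    return any(pattern in text for pattern in _MEMORY_PATTERNS)


print(is_memory_error(RuntimeError("CUDA out of memory. Tried to allocate 2 GiB")))
print(is_memory_error(RuntimeError("invalid device ordinal")))
```

Errors that match none of the patterns keep their original message, so non-OOM XPU/CUDA failures are not relabeled as memory problems.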
/gemini review
torch.xpu.synchronize()
if hasattr(torch.xpu, "empty_cache"):
    torch.xpu.empty_cache()
except Exception:

xpu_ok = hasattr(torch, "xpu") and torch.xpu.is_available()
if xpu_ok:
    ordinal = int(torch.xpu.current_device())
except Exception:
Code Review
This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the backend, including hardware detection, telemetry via xpu-smi, and device visibility management using ZE_AFFINITY_MASK. It refactors device-specific logic for autocasting, cache clearing, and device string generation into shared utility functions to ensure consistency between CUDA and XPU environments. Feedback suggests refining exception handling in XPU-specific blocks to avoid overly broad catch statements and adding debug logging to aid in troubleshooting hardware-specific initialization issues.
try:
    import torch

    torch.xpu.synchronize()
    torch.xpu.empty_cache()
    if hasattr(torch, "xpu"):
        if hasattr(torch.xpu, "synchronize"):
            torch.xpu.synchronize()
        if hasattr(torch.xpu, "empty_cache"):
            torch.xpu.empty_cache()
except Exception:
    pass
The try...except Exception: pass block is overly broad. While the intent is to guard against missing attributes in older IPEX builds, silently swallowing every Exception subclass can hide unrelated failures and is generally discouraged (KeyboardInterrupt and SystemExit are not affected, since they derive from BaseException rather than Exception). It would be safer to catch only AttributeError and ImportError specifically.
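One way to apply this suggestion is sketched below. To keep the snippet runnable without torch, the function accepts any object shaped like torch.xpu; the name `clear_xpu_cache` and the boolean return value are illustrative assumptions rather than the PR's API.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def clear_xpu_cache(xpu) -> bool:
    """Synchronize and empty the cache on an object shaped like torch.xpu.

    Returns True only when empty_cache() was actually called.
    """
    try:
        if hasattr(xpu, "synchronize"):
            xpu.synchronize()
        if hasattr(xpu, "empty_cache"):
            xpu.empty_cache()
            return True
    except AttributeError as exc:
        # Narrow catch: older builds may expose partially implemented modules.
        logger.debug("XPU cache clear skipped: %s", exc)
    return False


calls = []
fake_xpu = SimpleNamespace(empty_cache=lambda: calls.append("empty_cache"))
print(clear_xpu_cache(fake_xpu), calls)
print(clear_xpu_cache(SimpleNamespace()))
```

In real code the argument would be torch.xpu, obtained after a guarded import of torch.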
if get_device() == DeviceType.XPU:
    try:
        import torch

        if hasattr(torch, "xpu") and torch.xpu.is_initialized():
            return None
    except Exception:
        return None
The except Exception: return None block in dataset_map_num_proc could potentially mask issues other than a missing is_initialized attribute. If an unexpected error occurs during the check, returning None correctly forces single-process execution (the safe choice for XPU), but it might be worth logging the exception at a debug level to aid in troubleshooting hardware-specific issues.
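A sketch of the logging variant suggested here, again using a stand-in object instead of torch.xpu so it runs anywhere. The function name and the `default` parameter are hypothetical; the fallback still forces single-process execution on any probe failure, as the reviewed code does.

```python
import logging
from types import SimpleNamespace
from typing import Optional

logger = logging.getLogger(__name__)


def xpu_num_proc(xpu, default: int) -> Optional[int]:
    """Return None (single process) once the XPU runtime is live, else default."""
    try:
        if xpu.is_initialized():
            return None  # runtime is initialized: avoid fork-based workers
    except Exception as exc:
        # Keep the safe fallback, but leave a trace for troubleshooting.
        logger.debug("XPU init probe failed, forcing single process: %s", exc)
        return None
    return default


print(xpu_num_proc(SimpleNamespace(is_initialized=lambda: False), default=4))
print(xpu_num_proc(SimpleNamespace(is_initialized=lambda: True), default=4))
```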
Review loop 2. Aggregated from 11 independent reviewers (7 reviewer.py --slow workers, 3 Sonnet subagents, and the hosted
1/7 reviewer in the reviewer.py pool returned APPROVE without findings; the remaining 6/7 requested changes. Fix commit will address the test failures (stale regex, test_cpu_oom regression),
- test_gpu_selection.py:105 regex: update stale assertion from "uses UUID/MIG" to "uses non-numeric or subdevice" after the PR broadened resolve_requested_gpu_ids' error message to cover XPU subdevice masks. Three reviewers independently reproduced the suite failure.
- utils/utils.py: revert the module-top `from utils.hardware import get_device` hoist that broke test_utils.py::TestFormatErrorMessage::test_cpu_oom -- the test patches utils.hardware.get_device at call time, so the import must stay function-local. Keep the comment explaining why.
- hardware.py _get_xpu_utilization: lift _NA and _parse_metric out of the hot path to module scope (renamed _XPU_SMI_NA / _parse_xpu_smi_metric); re-instantiating them on every successful xpu-smi call is wasteful.
- hardware.py has_any check: include power_w alongside gpu_util, temp and vram_used_gb so a row that only exposes power is not silently discarded.
- hardware.py get_visible_gpu_utilization + get_backend_visible_gpu_info: honor explicit "no devices visible" masks (ZE_AFFINITY_MASK="" or CUDA_VISIBLE_DEVICES="" / "-1") by short-circuiting before the enumerate-visible-ordinals fallback. Previously get_visible_gpu_count returned 0 correctly but the telemetry helpers still enumerated torch devices, letting auto_select_gpu_ids pick a GPU the process explicitly hid.
- trainer.py: collapse the two consecutive blank lines left after removing inline `from utils.hardware import get_torch_device_str` imports at lines 1749 and 1985.
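The "no devices visible" short-circuit could be expressed as a small predicate like the one below. The function name and the string-valued `backend` parameter are hypothetical; only the mask values themselves come from the commit message.

```python
import os
from typing import Mapping, Optional


def devices_explicitly_hidden(
    backend: str, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True when the visibility variable is set and hides every device."""
    env = os.environ if env is None else env
    if backend == "xpu":
        mask = env.get("ZE_AFFINITY_MASK")
        return mask is not None and mask.strip() == ""
    visible = env.get("CUDA_VISIBLE_DEVICES")
    return visible is not None and visible.strip() in ("", "-1")


print(devices_explicitly_hidden("xpu", {"ZE_AFFINITY_MASK": ""}))
print(devices_explicitly_hidden("cuda", {"CUDA_VISIBLE_DEVICES": "-1"}))
print(devices_explicitly_hidden("xpu", {}))
```

Telemetry helpers can check this predicate first and return an empty result before enumerating torch-visible ordinals.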
/gemini review
Code Review
This pull request introduces comprehensive support for Intel XPU (Level Zero) across the backend, including hardware detection, memory management, and telemetry. Key changes include the implementation of XPU-specific device visibility using ZE_AFFINITY_MASK, the addition of xpu-smi based utilization monitoring, and the introduction of utility functions like get_torch_device_str and clear_gpu_cache to unify device handling. Additionally, the PR addresses potential os.fork() issues on XPU by disabling multiprocessing in dataset mapping and updates error handling to provide clearer feedback for XPU-related memory issues. Feedback suggests optimizing the xpu-smi path resolution, narrowing exception handling in cache clearing, and ensuring multiprocessing is safely disabled for XPU.
if get_device() == DeviceType.XPU:
    try:
        import torch

        if hasattr(torch, "xpu") and torch.xpu.is_initialized():
            return None
    except Exception:
        return None
For Intel XPU on Linux, os.fork() is generally unsafe once the Level Zero runtime has been touched. Since get_device() calls torch.xpu.is_available(), the runtime is likely already initialized by the time this check runs. To avoid "Pointer argument doesn't reference XPU device memory" errors, it is safer to always return None (disabling multiprocessing) for XPU on Linux, rather than relying on torch.xpu.is_initialized() which might be checked too late.
if get_device() == DeviceType.XPU:
    # os.fork() is unsafe on Linux once the XPU runtime (Level Zero)
    # has been touched. get_device() already calls is_available(),
    # which is often enough to trigger initialization issues.
    return None

except Exception:
    pass
Using a broad except Exception: pass can hide unexpected errors. Since the intent is to guard against missing attributes in older torch-xpu builds, it is better to catch AttributeError specifically, or at least log other exceptions at a debug level to aid in troubleshooting.
- except Exception:
-     pass
+ except AttributeError:
+     pass
# Skip subprocess entirely when xpu-smi is not on PATH, avoiding
# a multi-second timeout on systems without the Intel tooling.
xpu_smi = shutil.which("xpu-smi")
Review loop 3. Aggregated from 11 independent reviewers (7 Split of verdicts from
Fixes landing in this commit:
Deferred (not blocking merge, out-of-scope or architectural):
- apply_gpu_ids XPU: revert the CUDA_VISIBLE_DEVICES pop from loop 1. Popping it re-enabled CUDA detection on hybrid NVIDIA+Intel hosts where the parent had set CUDA_VISIBLE_DEVICES="" to force Studio onto XPU; the worker's follow-up detect_hardware() then flipped back to CUDA. torch.xpu only reads ZE_AFFINITY_MASK so the stale CUDA_VISIBLE_DEVICES is cosmetically redundant but functionally harmless, and leaving it alone preserves hybrid-host detection.
- llama_cpp._start_process: pin the llama-server subprocess via ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere. llama-server's SYCL build reads ZE_AFFINITY_MASK, not CUDA_VISIBLE_DEVICES, so previous pinning was silently ignored on Intel.
- llama_cpp init_audio_codec / generate_audio_response: revert the promotion from get_torch_device_str() to "xpu" on Intel hosts. SNAC / BiCodec / DAC codecs are not yet validated on Intel XPU and the old CPU fallback was the known-working non-CUDA path. Drop the now-unused get_torch_device_str import from llama_cpp.py.
- trainer.py _preprocess_snac_dataset / _preprocess_bicodec_dataset / _preprocess_dac_dataset: revert the same unconditional XPU routing for audio dataset preprocessing back to the pre-PR CPU fallback on non-CUDA hosts. Spark-TTS BiCodec, SNAC, and OuteTTS DAC / Whisper paths were all CPU-backed on every non-CUDA host before this PR; promoting them to XPU without capability probes regressed the previously working CPU path. Drop the now-unused get_torch_device_str import from trainer.py.
- dataset_map_num_proc: only disable multiprocessing when torch.xpu.is_initialized exists and returns True. Older torch-xpu builds without is_initialized() were previously falling through the broad except and returning None, silently disabling pre-init CPU dataset parallelism the docstring explicitly says should still work.
- _get_xpu_utilization: cache the resolved xpu-smi binary path in a module-level sentinel via _resolve_xpu_smi_binary() so repeated telemetry polls do not re-scan PATH on every tick.
- get_backend_visible_gpu_info: move the parent_visible_ids lookup below the empty-mask short-circuit so the spec is not computed twice on the fast exit path.
power_w = _parse_xpu_smi_metric(parts[3])
temp = _parse_xpu_smi_metric(parts[4])
break
except Exception:

props = torch.xpu.get_device_properties(idx)
vram_total_gb = round(props.total_memory / (1024**3), 2)
vram_used_gb = round(torch.xpu.memory_allocated(idx) / (1024**3), 2)
except Exception:
Fixes pushed to original PR unslothai#4724. Closing staging copy.
Staging mirror of unslothai#4724
Original PR: unslothai#4724
Author: leizhenyuan
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description