enable studio for intel GPU by danielhanchen · Pull Request #43 · unslothai/unsloth-staging-1

danielhanchen · 2026-04-11T11:38:47Z

Staging mirror of unslothai#4724

Original PR: unslothai#4724
Author: leizhenyuan

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

for more information, see https://pre-commit.ci

- Fix _get_xpu_utilization() metric indices: use -m 0,2,3 (GPU Util, Power, Core Temp) instead of -m 0,1,2,18 which mapped parts[3] to temperature incorrectly (it was actually GPU Memory Utilization). Now correctly parses utilization, power draw, and temperature. - Add -n 1 flag so xpu-smi dump exits after one sample instead of running indefinitely until the 5s timeout kills it. - Use torch.xpu.current_device() for the -d flag instead of hardcoding device 0, so multi-GPU XPU setups query the correct device. - Populate power_draw_w in the returned dict instead of always None. - Fix versions["xpu"] = True (bool) to use the actual XPU version string from torch.version.xpu, falling back to "available". This keeps the dict type-consistent (all str or None). - Remove dead code in get_visible_gpu_count() where the XPU branch at line 1357 was unreachable because the XPU early-return block above always returns before that point.

Skip the xpu-smi subprocess entirely when the binary is not on PATH. This avoids a multi-second timeout on Intel GPU systems that have PyTorch XPU support but no xpu-smi tooling installed. The function still falls back to torch.xpu for VRAM metrics.

Prefer torch.xpu.device_count() over manual mask parsing since the runtime correctly interprets all ZE_AFFINITY_MASK syntax including subdevice notation (e.g. "0.0,0.1" is 1 root device, not 2). The manual parsing fallback now counts unique root device IDs from the mask, handling "device.subdevice" notation correctly.

- _get_xpu_utilization: request metrics -m 0,1,3 (Util, Power, Temp) rather than 0,2,3 so the power column no longer reports MHz as watts. - _resolve_xpu_smi_device_id: map torch.xpu.current_device() (logical ordinal under ZE_AFFINITY_MASK) to the physical root device id that xpu-smi -d expects, so telemetry targets the active GPU. - Merge the duplicated torch blocks in _get_xpu_utilization so the VRAM lookup is guarded and the device index is computed once. - format_error_message: only rewrite true OOM errors (out of memory substrings) as memory errors, so non-OOM XPU/CUDA failures surface their real cause instead of a misleading memory message. - inference.py DAC generation: derive autocast device from model.device.type, not the global backend, so CPU-fallback models on an XPU host do not open a GPU autocast context. - dataset_map_num_proc: only disable XPU multiprocessing after the XPU runtime is actually initialized in this process, so pure CPU-side dataset preprocessing can still parallelize on Intel hosts. - get_package_versions: preserve the "available" fallback for xpu when torch.version.xpu exists as None. - get_visible_gpu_count: normalize ZE_AFFINITY_MASK parsing so the None and empty-string branches do not rely on implicit scoping.

…tion Round 2 fixes addressing reviewer feedback: - format_error_message: tightening "out of memory" coverage in round 1 dropped CPU allocator failures like "not enough memory to allocate" and "cannot allocate memory", and Level Zero ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY. Restore those patterns while still excluding non-memory XPU/CUDA exceptions. - apply_gpu_ids: route Intel XPU through ZE_AFFINITY_MASK instead of CUDA_VISIBLE_DEVICES so worker subprocesses are actually pinned to the requested GPUs on multi-XPU hosts. - _get_parent_visible_gpu_spec: add an XPU branch that reads ZE_AFFINITY_MASK and returns physical root device IDs, so the visibility/selection stack reports the correct devices on Intel hosts. Honors subdevice syntax and wildcards. - Extract _parse_ze_mask_roots helper for the ZE_AFFINITY_MASK parsing previously duplicated between _resolve_xpu_smi_device_id and get_visible_gpu_count. Single source of truth for the mask semantics. - get_visible_gpu_count: treat non-digit wildcard masks (e.g. "*") as "all physical XPUs visible" rather than zero. - get_package_versions: also set versions["xpu"] = None in the except block so a failing XPU probe does not leave the key missing. - inference.py DAC autocast: clamp the resolved device_type to ("cuda", "xpu", "cpu") so exotic devices like "meta" during accelerate offloaded loading do not raise.

Round 3 fixes targeting the remaining gaps reviewers flagged: - prepare_gpu_selection: allow explicit gpu_ids on Intel XPU so the apply_gpu_ids() XPU branch (and _get_parent_visible_gpu_spec XPU branch) are actually reachable from the normal request path. - _parse_ze_mask_roots: stop deduplicating. Keep one root ID per mask token so the logical-ordinal-to-physical-root mapping used by _resolve_xpu_smi_device_id() stays 1-to-1 even for mixed subdevice masks like "2.0,0.1,0.2". Update the docstring to document the new shape. - _get_parent_visible_gpu_spec: dedupe roots only at the visibility layer, and flag subdevice masks as supports_explicit_gpu_ids=False so resolve_requested_gpu_ids() does not try to match duplicate IDs. Treat wildcard masks as "all physical XPUs visible". - format_error_message: also match the literal Level Zero enum names ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY / _HOST_MEMORY which use underscores and were not caught by the "out of device memory" substring. - inference.py DAC autocast: accept "mps" in the clamp list (it has been an autocast-supported backend since torch 2.3) and skip autocast entirely when the model is on CPU with an unsupported dtype like float32, since torch.amp.autocast("cpu", dtype=float32) raises. - resolve_requested_gpu_ids: tailor the "unsupported explicit ids" error message to the current backend so XPU users see a ZE_AFFINITY_MASK reference instead of a CUDA one.

Round 4 fixes completing the multi-XPU story unlocked in round 3: - get_device_map: include DeviceType.XPU in the multi-GPU branch so explicit XPU gpu_ids=[0, 1] (or a wildcard-masked multi-XPU host) loads with device_map="balanced" instead of falling back to "sequential" and pinning the model to a single device. - auto_select_gpu_ids: allow XPU auto mode. The function relies on get_visible_gpu_utilization() for per-device free-VRAM telemetry, which already has an XPU path via _get_xpu_utilization. XPU hosts omitting gpu_ids now benefit from VRAM-aware selection. - get_visible_gpu_count torch-less fallback: count unique mask roots via len(set(roots)) so subdevice masks like "0.0,0.1" report the intended 1 root GPU, not 2. The ordinal-preserving semantics of _parse_ze_mask_roots are kept so _resolve_xpu_smi_device_id still maps logical ordinals to physical roots correctly. - xpu-smi subprocess timeout lowered from 10s to 3s so a hung driver does not block status polls / UI refreshes. - DAC autocast nullcontext fallback now covers XPU+float32 as well as CPU+float32, since XPU autocast only accepts bfloat16/float16 and otherwise warns on every generate call. - _get_parent_visible_gpu_spec subdevice dedup uses list(dict.fromkeys(...)) instead of an O(n^2) manual loop.

…ggers _get_parent_visible_gpu_spec returned numeric_ids=list(range(physical)) for wildcard ZE_AFFINITY_MASK=*, which blocked get_device_map from reaching its "unresolved multi-visible" fallback. Mirror the CUDA UUID/MIG behavior by returning numeric_ids=None with supports_explicit_gpu_ids=False, so explicit ids are still rejected and get_device_map falls back to sharding across visible devices when more than one is present.

…ardware.py

danielhanchen · 2026-04-11T11:43:13Z

/gemini review

gemini-code-assist

Code Review

This pull request implements comprehensive support for Intel XPU hardware, covering device detection, telemetry (utilization and VRAM), and visibility management via ZE_AFFINITY_MASK. It refactors inference and training modules to use centralized hardware utility functions for device string retrieval and cache clearing, and enhances error reporting for memory-related issues. Review feedback focuses on improving code organization by moving several local imports to the top-level of their respective files in accordance with PEP 8.

gemini-code-assist · 2026-04-11T11:50:25Z

            + text
            + "<|text_end|>\n<|audio_start|><|global_features_start|>\n"
        )
+        import contextlib


For better code organization and adherence to PEP 8, it's recommended to move imports to the top of the file. Please move import contextlib to the top-level imports of this module.

gemini-code-assist · 2026-04-11T11:50:25Z


-                if torch.cuda.is_available():
-                    torch.cuda.empty_cache()
+                from utils.hardware import clear_gpu_cache


To improve code readability and follow PEP 8 guidelines, imports should be at the top of the file. Please move this import, and other similar local imports in this file (e.g., lines 3265, 3341), to the top-level imports of the module.

gemini-code-assist · 2026-04-11T11:50:25Z

        SNAC_MODEL_NAME = "hubertsiuzdak/snac_24khz"
        SNAC_SAMPLE_RATE = 24000
-        device = "cuda" if torch.cuda.is_available() else "cpu"
+        from utils.hardware import get_torch_device_str


To adhere to PEP 8 and improve maintainability, please move this import to the top of the file. This applies to other local imports added in this pull request for this file as well (e.g., lines 1721, 1751, 1953, 1990, 2170).

gemini-code-assist · 2026-04-11T11:50:25Z

+        or "cannot allocate memory" in error_str
+        or ("mlx" in error_str and ("memory" in error_str or "allocate" in error_str))
    ):
        from utils.hardware import get_device


According to PEP 8, imports should be at the top of the file. Please move this import to the top of the module for better code organization.

danielhanchen · 2026-04-11T12:00:18Z

Thank you for the PR! The goal of this PR is to enable unsloth studio on Intel XPU (Level Zero) hosts with feature parity to the existing CUDA and ROCm paths: hardware detection, visibility reporting via ZE_AFFINITY_MASK, multi-GPU sharding, VRAM-aware auto-selection, xpu-smi telemetry, XPU-aware OOM detection, and device-correct DAC autocast. As a summary, this PR wires DeviceType.XPU through hardware detection, visibility/selection, telemetry (_get_xpu_utilization, _resolve_xpu_smi_device_id, _parse_ze_mask_roots), get_device_map, apply_gpu_ids, and several audio/TTS device-string call sites; it also fixes the DAC autocast bug by deriving the autocast device from model.device.type rather than the global backend.

This review merges 11 independent reviewers (7 workers from reviewer.py --slow, 3 Sonnet subagents, and the hosted /gemini review bot).

Reviewers	Severity	Finding
9/11	High	`_backend_visible_devices_env()` still returns `CUDA_VISIBLE_DEVICES` on XPU hosts; the `/system-info` endpoint and `get_backend_visible_gpu_info()` drop the active `ZE_AFFINITY_MASK` after `apply_gpu_ids()`, so debug output lies about which XPUs are pinned
3/11	High	Subdevice `ZE_AFFINITY_MASK` values (e.g. `0.0,0.1`) are collapsed to unique root IDs in `_get_parent_visible_gpu_spec()`; `get_visible_gpu_utilization()` / `get_backend_visible_gpu_info()` emit only one device and `get_device_map(None)` stays `"sequential"`, silently breaking inherited multi-logical-XPU sharding
2/11	High	`TestXpuRejection` in `test_gpu_selection.py`, the `non_cuda` assertion in `test_gpu_selection_sandbox.py`, and `test_prepare_gpu_selection_rejects_gpu_ids_on_non_cuda_backend` still encode pre-PR behavior; CI fails as soon as these tests run
2/11	Medium	`format_error_message` tightening lost the `"memory allocation failed"` pattern and does not catch bare `MemoryError()` instances whose `str()` is empty
4/11	Medium	Several new `import` statements are placed inside functions rather than at module top (`contextlib`, `utils.hardware`, etc.), flagged by the hosted Gemini bot as PEP 8 violations
1/11	Medium	`_get_codec_device()` in `llama_cpp.py` now returns `"xpu"` on Intel hosts, but the llama.cpp GGUF runtime still executes on CPU; the codec decoder should follow the runtime, not the global Studio backend
1/11	Low	`_parse_ze_mask_roots` uses `str.isdigit()` which admits Unicode superscripts that `int()` rejects; `str.isdecimal()` is the correct predicate
1/11	Low	`xpu-smi` CSV float guard only covers `""` and `"N/A"`, not lowercase `n/a` or `-`
1/11	Low	`clear_gpu_cache()` XPU branch is unguarded; on older torch-xpu builds `torch.xpu.empty_cache` may be absent
1/11	Low	`apply_gpu_ids()` XPU path leaves a stale `CUDA_VISIBLE_DEVICES` in the environment
1/11	Low	`get_physical_gpu_count()` on XPU uses `torch.xpu.device_count()` which is filtered by `ZE_AFFINITY_MASK` (CUDA uses `nvidia-smi -L` which is unfiltered)

Concrete suggestions for each finding below. Fix commit will follow in this PR.

1. [9/11] `_backend_visible_devices_env()` missing XPU branch

studio/backend/utils/hardware/hardware.py:1428-1438

def _backend_visible_devices_env() -> Optional[str]:
    """Return the raw visibility env string that applies to this backend."""
    device = get_device()
    if device == DeviceType.XPU:
        return os.environ.get("ZE_AFFINITY_MASK")
    if IS_ROCM:
        return _get_parent_visible_gpu_spec().get("raw")
    return os.environ.get("CUDA_VISIBLE_DEVICES")

2. [3/11] Subdevice ZE_AFFINITY_MASK collapses multi-logical-XPU

studio/backend/utils/hardware/hardware.py in _get_parent_visible_gpu_spec():

if has_subdevice:
    # Subdevice masks expose logical devices that do not map cleanly back
    # to stable physical root IDs for explicit selection. Mirror the
    # wildcard/UUID path so downstream code enumerates visible ordinals
    # and can still shard across the logical devices.
    return {
        "raw": xpu_mask,
        "numeric_ids": None,
        "supports_explicit_gpu_ids": False,
    }

3. [2/11] Stale XPU rejection tests

studio/backend/tests/test_gpu_selection.py:714-719:

def test_prepare_gpu_selection_rejects_gpu_ids_on_non_cuda_backend(self):
    with patch("utils.hardware.hardware.get_device", return_value = DeviceType.CPU):
        with self.assertRaises(ValueError) as exc_info:
            prepare_gpu_selection([0], model_name = "unsloth/test")

    self.assertIn("only supported on CUDA and Intel XPU", str(exc_info.exception))

studio/backend/tests/test_gpu_selection.py:1092-1103 (replace TestXpuRejection with TestXpuSelection):

class TestXpuSelection(_GpuCacheResetMixin, unittest.TestCase):
    def test_auto_select_supports_xpu(self):
        with (
            patch("utils.hardware.hardware.get_device", return_value = DeviceType.XPU),
            patch(
                "utils.hardware.hardware.estimate_required_model_memory_gb",
                return_value = (1.0, {}),
            ),
            patch(
                "utils.hardware.hardware.get_visible_gpu_utilization",
                return_value = {
                    "devices": [{"index": 0, "vram_total_gb": 8, "vram_used_gb": 1}]
                },
            ),
            patch(
                "utils.hardware.hardware._get_parent_visible_gpu_spec",
                return_value = {
                    "raw": None,
                    "numeric_ids": [0],
                    "supports_explicit_gpu_ids": True,
                },
            ),
            patch(
                "utils.hardware.hardware.get_parent_visible_gpu_ids",
                return_value = [0],
            ),
        ):
            selected, metadata = auto_select_gpu_ids("unsloth/test")

        self.assertEqual(selected, [0])
        self.assertEqual(metadata["selection_mode"], "auto")

    def test_prepare_gpu_selection_accepts_explicit_ids_on_xpu(self):
        with (
            patch("utils.hardware.hardware.get_device", return_value = DeviceType.XPU),
            patch(
                "utils.hardware.hardware._get_parent_visible_gpu_spec",
                return_value = {
                    "raw": "0",
                    "numeric_ids": [0],
                    "supports_explicit_gpu_ids": True,
                },
            ),
            patch(
                "utils.hardware.hardware.get_parent_visible_gpu_ids",
                return_value = [0],
            ),
            patch(
                "utils.hardware.hardware.get_physical_gpu_count", return_value = 1
            ),
        ):
            selected, metadata = prepare_gpu_selection([0], model_name = "unsloth/test")

        self.assertEqual(selected, [0])
        self.assertEqual(metadata["selection_mode"], "explicit")

studio/backend/tests/test_gpu_selection_sandbox.py:305-312:

def test_non_accelerator_returns_none(self):
    from utils.hardware.hardware import auto_select_gpu_ids
    import utils.hardware.hardware as hw

    with patch.object(hw, "get_device", return_value = hw.DeviceType.CPU):
        selected, meta = auto_select_gpu_ids("test/model")
        self.assertIsNone(selected)
        self.assertEqual(meta["selection_mode"], "non_accelerator")

4. [2/11] `format_error_message` OOM patterns

studio/backend/utils/utils.py in format_error_message:

if (
    "out of memory" in error_str
    or "out of device memory" in error_str
    or "out_of_device_memory" in error_str
    or "out_of_host_memory" in error_str
    or "not enough memory" in error_str
    or "cannot allocate memory" in error_str
    or "memory allocation failed" in error_str
    or isinstance(error, MemoryError)
    or ("mlx" in error_str and ("memory" in error_str or "allocate" in error_str))
):

5. [4/11] PEP 8 imports at module top

The hosted Gemini bot flagged import contextlib in inference.py, from utils.hardware import clear_gpu_cache / get_torch_device_str in llama_cpp.py, trainer.py, and from utils.hardware import get_device in utils.py. These should all move to the module-level imports block.

6. [1/11] GGUF codec device on Intel hosts

studio/backend/core/inference/llama_cpp.py codec init/generate paths should not force xpu from the global backend while the llama.cpp runtime is still CPU-only. Lower confidence (1/11) so this will be left as a followup discussion rather than auto-fixed.

7. [1/11] `str.isdigit()` -> `str.isdecimal()`

studio/backend/utils/hardware/hardware.py _parse_ze_mask_roots:

if root.isdecimal():
    roots.append(int(root))

8. [1/11] `xpu-smi` N/A variants

_NA = frozenset(("", "n/a", "N/A", "-", "na", "NA"))
gpu_util = float(parts[2]) if parts[2].strip() not in _NA else None
power_w  = float(parts[3]) if parts[3].strip() not in _NA else None
temp     = float(parts[4]) if parts[4].strip() not in _NA else None

9. [1/11] `clear_gpu_cache()` XPU try/except

elif device == DeviceType.XPU:
    try:
        import torch
        if hasattr(torch, "xpu"):
            torch.xpu.synchronize()
            if hasattr(torch.xpu, "empty_cache"):
                torch.xpu.empty_cache()
    except Exception:
        pass

10. [1/11] Stale `CUDA_VISIBLE_DEVICES` after `apply_gpu_ids` on XPU

if get_device() == DeviceType.XPU:
    os.environ["ZE_AFFINITY_MASK"] = value
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    _visible_gpu_count = None
    logger.info("Applied gpu_ids: ZE_AFFINITY_MASK='%s'", value)
    return

11. [1/11] `get_physical_gpu_count` XPU reports visible count

Document the limitation or query xpu-smi discovery for physical GPU count. Low real-world impact; noted for a followup.

- _backend_visible_devices_env: return ZE_AFFINITY_MASK on XPU so get_backend_visible_gpu_info reports the active mask instead of a stale or None CUDA_VISIBLE_DEVICES after apply_gpu_ids runs. - _get_parent_visible_gpu_spec: return numeric_ids=None for subdevice masks like 0.0,0.1 so get_visible_gpu_utilization, get_backend_visible_gpu_info and get_device_map enumerate torch-visible ordinals and can still shard across logical XPUs instead of collapsing to a single root. - _parse_ze_mask_roots: use str.isdecimal() so Unicode superscripts do not crash int() via str.isdigit() admitting them. - _get_xpu_utilization xpu-smi parsing: accept n/a, NA, - and lowercase variants as missing, and wrap the float parse so one bad column does not drop the whole telemetry row. - clear_gpu_cache XPU branch: guard synchronize/empty_cache with hasattr + try/except so older torch-xpu builds do not propagate AttributeError. - apply_gpu_ids XPU branch: pop stale CUDA_VISIBLE_DEVICES so environment-inspection tools do not show conflicting pinning state. - format_error_message: add memory allocation failed pattern and isinstance(error, MemoryError) so CPU hosts still classify OOMs that the tightened substring list dropped. - test_gpu_selection/test_gpu_selection_sandbox: rename TestXpuRejection to TestXpuSelection and update non_cuda -> non_accelerator and CUDA-only error substring to CUDA and Intel XPU so the suite matches the new behavior. - inference.py/llama_cpp.py/trainer.py/utils.py: hoist contextlib, clear_gpu_cache, get_torch_device_str and get_device imports to module top per PEP 8 feedback from the hosted gemini bot.

danielhanchen · 2026-04-11T12:05:30Z

/gemini review

+                    torch.xpu.synchronize()
+                if hasattr(torch.xpu, "empty_cache"):
+                    torch.xpu.empty_cache()
+        except Exception:


+        xpu_ok = hasattr(torch, "xpu") and torch.xpu.is_available()
+        if xpu_ok:
+            ordinal = int(torch.xpu.current_device())
+    except Exception:


gemini-code-assist

Code Review

This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the backend, including hardware detection, telemetry via xpu-smi, and device visibility management using ZE_AFFINITY_MASK. It refactors device-specific logic for autocasting, cache clearing, and device string generation into shared utility functions to ensure consistency between CUDA and XPU environments. Feedback suggests refining exception handling in XPU-specific blocks to avoid overly broad catch statements and adding debug logging to aid in troubleshooting hardware-specific initialization issues.

gemini-code-assist · 2026-04-11T12:10:24Z

+        try:
+            import torch

-        torch.xpu.synchronize()
-        torch.xpu.empty_cache()
+            if hasattr(torch, "xpu"):
+                if hasattr(torch.xpu, "synchronize"):
+                    torch.xpu.synchronize()
+                if hasattr(torch.xpu, "empty_cache"):
+                    torch.xpu.empty_cache()
+        except Exception:
+            pass


The try...except Exception: pass block is overly broad. While the intent is to guard against missing attributes in older IPEX builds, catching all exceptions (including KeyboardInterrupt or SystemExit if they were to occur here, though unlikely in this context) is generally discouraged. It would be safer to catch only AttributeError and ImportError specifically.

gemini-code-assist · 2026-04-11T12:10:24Z

+    if get_device() == DeviceType.XPU:
+        try:
+            import torch
+
+            if hasattr(torch, "xpu") and torch.xpu.is_initialized():
+                return None
+        except Exception:
+            return None


The except Exception: return None block in dataset_map_num_proc could potentially mask issues other than a missing is_initialized attribute. If an unexpected error occurs during the check, returning None correctly forces single-process execution (the safe choice for XPU), but it might be worth logging the exception at a debug level to aid in troubleshooting hardware-specific issues.

danielhanchen · 2026-04-11T12:21:11Z

Review loop 2. Aggregated from 11 independent reviewers (7 reviewer.py --slow workers, 3 Sonnet subagents, and the hosted /gemini review bot) against commit d6f1ea8.

Reviewers	Severity	Finding
3/11	High	`test_explicit_ids_are_rejected_for_uuid_parent_visibility` (and 2 similar assertions at test_gpu_selection.py:105, :819, :921) still pin the pre-PR "unsupported when CUDA_VISIBLE_DEVICES uses UUID/MIG entries" wording; the code now emits "non-numeric or subdevice entries" so the suite is red
2/11	High	Hoisting `from utils.hardware import get_device` to module top in `utils/utils.py` broke `test_utils.py::TestFormatErrorMessage::test_cpu_oom` because the patch target is now resolved at import time and no longer reflects later `patch("utils.hardware.get_device", ...)` calls
1/11	High	`ZE_AFFINITY_MASK=""` is honored by `get_visible_gpu_count()` (returns 0) but `get_visible_gpu_utilization()` and `get_backend_visible_gpu_info()` still enumerate devices, so `auto_select_gpu_ids()` can pick GPU 0 on a host that explicitly hid all XPUs
1/11	High	`ZE_AFFINITY_MASK="*"` currently disables both explicit IDs and auto-select; XPU visibility flow should treat the wildcard as "all physical XPUs visible" so the selection code can rank and choose a subset
1/11	High	Explicit XPU gpu_ids are validated as physical IDs; under Intel's default `ZE_FLAT_DEVICE_HIERARCHY=FLAT` hierarchy each stack is a root, so `[1]` may target stack 1 of card 0 rather than a second physical GPU. Correct fix is a follow-up mapping; for now the PR should reject explicit XPU ids with a clear message or accept them only for verified composite hosts
2/11	Medium	Subdevice masks (`0.0,0.1`) bypass VRAM-aware auto-selection and fall back to balanced sharding; a subsequent fix can carry mask tokens into `auto_select_gpu_ids`
2/11	Medium	`clear_gpu_cache` / `dataset_map_num_proc` `except Exception` is technically correct (BaseException subclasses are not caught) but the hosted Gemini bot flagged the breadth as a style concern; optional narrowing to `(AttributeError, ImportError)`
1/11	Medium	`_get_xpu_utilization` defines `_NA` and `_parse_metric` inside the hot path; lift to module scope for efficiency
1/11	Medium	`llama_cpp.py` `load_model()` sets `env["CUDA_VISIBLE_DEVICES"]` even on XPU hosts; llama-server SYCL path would want `ZE_AFFINITY_MASK` (pre-existing line, scope expansion)
1/11	Medium	`format_error_message`'s `isinstance(error, MemoryError)` branch is uncovered by tests (correct code but dead-coverage)
1/11	Medium	GGUF TTS codec device is now derived from the global backend rather than the actual llama.cpp session placement; a CPU GGUF session on an Intel host pushes the codec to XPU
1/11	Low	`has_any` in `_get_xpu_utilization` excludes `power_w`; if xpu-smi only returns a power reading the whole row is silently discarded
1/11	Low	`_torch_get_per_device_info` XPU fallback to `torch.xpu.memory_allocated()` reports process-local only and can undercount used memory on shared hosts
1/11	Low	`ZE_AFFINITY_MASK=-1` is not handled: on Level Zero `-1` is the wildcard meaning "all devices" (opposite of CUDA `-1` which means "no GPUs")
1/11	Low	`get_visible_gpu_count` torchless fallback uses `len(set(roots))` which undercounts subdevice masks
1/11	Low	`trainer.py:1749, 1985` left two consecutive blank lines after removing the inline `get_torch_device_str` imports
1/11	Low	Test coverage: no tests for the subdevice mask path, Unicode superscripts in `_parse_ze_mask_roots`, or bare `MemoryError`

1/7 reviewer in the reviewer.py pool returned APPROVE without findings; the remaining 6/7 requested changes.

Fix commit will address the test failures (stale regex, test_cpu_oom regression), has_any power_w, _NA / _parse_metric hoist, empty-mask telemetry guard, and trivial trainer.py whitespace. The FLAT-hierarchy, GGUF codec, subdevice auto-select refinement, and VRAM fallback items are noted for a follow-up because they either require architectural rework or are pre-existing lines outside this PR's scope.

- test_gpu_selection.py:105 regex: update stale assertion from "uses UUID/MIG" to "uses non-numeric or subdevice" after the PR broadened resolve_requested_gpu_ids' error message to cover XPU subdevice masks. Three reviewers independently reproduced the suite failure. - utils/utils.py: revert the module-top `from utils.hardware import get_device` hoist that broke test_utils.py::TestFormatErrorMessage::test_cpu_oom -- the test patches utils.hardware.get_device at call time, so the import must stay function-local. Keep the comment explaining why. - hardware.py _get_xpu_utilization: lift _NA and _parse_metric out of the hot path to module scope (renamed _XPU_SMI_NA / _parse_xpu_smi_metric); re-instantiating them on every successful xpu-smi call is wasteful. - hardware.py has_any check: include power_w alongside gpu_util, temp and vram_used_gb so a row that only exposes power is not silently discarded. - hardware.py get_visible_gpu_utilization + get_backend_visible_gpu_info: honor explicit "no devices visible" masks (ZE_AFFINITY_MASK="" or CUDA_VISIBLE_DEVICES="" / "-1") by short-circuiting before the enumerate-visible-ordinals fallback. Previously get_visible_gpu_count returned 0 correctly but the telemetry helpers still enumerated torch devices, letting auto_select_gpu_ids pick a GPU the process explicitly hid. - trainer.py: collapse the two consecutive blank lines left after removing inline `from utils.hardware import get_torch_device_str` imports at lines 1749 and 1985.

danielhanchen · 2026-04-11T12:23:43Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces comprehensive support for Intel XPU (Level Zero) across the backend, including hardware detection, memory management, and telemetry. Key changes include the implementation of XPU-specific device visibility using ZE_AFFINITY_MASK, the addition of xpu-smi based utilization monitoring, and the introduction of utility functions like get_torch_device_str and clear_gpu_cache to unify device handling. Additionally, the PR addresses potential os.fork() issues on XPU by disabling multiprocessing in dataset mapping and updates error handling to provide clearer feedback for XPU-related memory issues. Feedback suggests optimizing the xpu-smi path resolution, narrowing exception handling in cache clearing, and ensuring multiprocessing is safely disabled for XPU.

gemini-code-assist · 2026-04-11T12:26:52Z

+    if get_device() == DeviceType.XPU:
+        try:
+            import torch
+
+            if hasattr(torch, "xpu") and torch.xpu.is_initialized():
+                return None
+        except Exception:
+            return None


For Intel XPU on Linux, os.fork() is generally unsafe once the Level Zero runtime has been touched. Since get_device() calls torch.xpu.is_available(), the runtime is likely already initialized by the time this check runs. To avoid "Pointer argument doesn't reference XPU device memory" errors, it is safer to always return None (disabling multiprocessing) for XPU on Linux, rather than relying on torch.xpu.is_initialized() which might be checked too late.

if get_device() == DeviceType.XPU: # os.fork() is unsafe on Linux once the XPU runtime (Level Zero) # has been touched. get_device() already calls is_available(), # which is often enough to trigger initialization issues. return None

gemini-code-assist · 2026-04-11T12:26:53Z

+        except Exception:
+            pass


Using a broad except Exception: pass can hide unexpected errors. Since the intent is to guard against missing attributes in older torch-xpu builds, it is better to catch AttributeError specifically, or at least log other exceptions at a debug level to aid in troubleshooting.

Suggested change

except Exception:

pass

except AttributeError:

pass

gemini-code-assist · 2026-04-11T12:26:53Z

+
+        # Skip subprocess entirely when xpu-smi is not on PATH, avoiding
+        # a multi-second timeout on systems without the Intel tooling.
+        xpu_smi = shutil.which("xpu-smi")


shutil.which("xpu-smi") is called every time _get_xpu_utilization is executed, which can be frequent during telemetry updates. Since scanning the PATH is relatively expensive, it is better to cache the resolved path to the xpu-smi binary at the module level or using a function-level cache.

danielhanchen · 2026-04-11T12:40:12Z

Review loop 3. Aggregated from 11 independent reviewers (7 reviewer.py --slow workers, 3 Sonnet subagents, and the hosted /gemini review bot) against commit e22db8f.

Split of verdicts from reviewer.py loop 3: 3/7 APPROVE, 4/7 REQUEST_CHANGES. Subagent B ran the touched test suites locally with structlog pre-mocked and reports 43/43 passed in test_gpu_selection.py, 24/24 passed in test_gpu_selection_sandbox.py, and 8/8 passed in test_utils.py::TestFormatErrorMessage. Subagent C gives an APPROVE WITH FOLLOW-UP verdict.

Reviewers	Severity	Finding
2/11	High	`apply_gpu_ids` XPU path pops `CUDA_VISIBLE_DEVICES`, which regresses hybrid NVIDIA+Intel hosts: a parent that sets `CUDA_VISIBLE_DEVICES=""` to disable NVIDIA and force Studio onto XPU is silently unblocked when the worker re-runs `detect_hardware()` and flips back to CUDA
2/11	High	`llama_cpp._start_process` subprocess env still hard-codes `CUDA_VISIBLE_DEVICES` for GPU pinning on XPU hosts; llama-server's SYCL build needs `ZE_AFFINITY_MASK`
2/11	High	GGUF TTS codec load/decode (`init_audio_codec`, `generate_audio_response`) was promoted from `cpu` to `get_torch_device_str()` on non-CUDA hosts, forcing SNAC/BiCodec/DAC onto XPU without a capability probe; upstream codecs are not yet validated on Intel XPU
2/11	High	Trainer audio preprocessing (`_preprocess_snac_dataset`, `_preprocess_bicodec_dataset`, `_preprocess_dac_dataset`) was promoted from `cpu` to `get_torch_device_str()` with the same capability-probe concern
2/11	Medium	`dataset_map_num_proc` returns `None` on older torch-xpu builds where `torch.xpu.is_initialized` is absent, disabling pre-init dataset parallelism the PR intentionally wanted to preserve
1/11	High	Hosted Gemini: on Linux XPU, `os.fork()` is unsafe once Level Zero is touched and `get_device()` may have already touched it; safer to always return `None`
1/11	Medium	Hosted Gemini: `shutil.which("xpu-smi")` is called on every telemetry poll; cache the result
1/11	Medium	Hosted Gemini: `clear_gpu_cache` / `dataset_map_num_proc` `except Exception` too broad; narrow to `AttributeError` / log at debug
1/11	Medium	`get_visible_gpu_utilization` enumerates XPU under subdevice masks from a physical-root count that can under-report when a mock has `device_count() == 1`; in practice `torch.xpu.device_count()` returns the logical count so this only reproduces against a particular mock fixture
1/11	Low	`get_backend_visible_gpu_info` computes `parent_visible_ids` before the empty-mask guard; inefficient but not a bug
1/11	Low	`test_gpu_selection.py:106` regex match could be tightened to `re.escape("uses non-numeric or subdevice entries")`

Fixes landing in this commit:

Revert the CUDA_VISIBLE_DEVICES pop in apply_gpu_ids XPU branch (hybrid-host regression).
llama_cpp._start_process now sets ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere.
Revert GGUF TTS codec device in init_audio_codec / generate_audio_response to the pre-PR "cuda" if torch.cuda.is_available() else "cpu" fallback; drop the now-unused get_torch_device_str import from llama_cpp.py.
Revert trainer audio preprocessing (_preprocess_snac_dataset, _preprocess_bicodec_dataset, _preprocess_dac_dataset) to the pre-PR CPU fallback on non-CUDA hosts; drop the now-unused get_torch_device_str import from trainer.py.
Harden dataset_map_num_proc to only disable parallelism when torch.xpu.is_initialized exists and returns True, matching the docstring's "pre-init CPU preprocessing can still parallelize" contract.
Cache the resolved xpu-smi binary path once in _resolve_xpu_smi_binary() so repeated telemetry polls do not re-scan PATH.
Move the parent_visible_ids = get_parent_visible_gpu_ids() call below the empty-mask guard in get_backend_visible_gpu_info to avoid an unnecessary _get_parent_visible_gpu_spec() recomputation.

Deferred (not blocking merge, out-of-scope or architectural):

Intel FLAT device hierarchy root-vs-tile mapping for explicit XPU gpu_ids.
Mixed CUDA+XPU detection ordering (detect_hardware always picks CUDA first).
GGUF _get_gpu_free_memory nvidia-smi-only path.
Narrowing except Exception in clear_gpu_cache to AttributeError (broad except is technically correct since BaseException subclasses are not caught, and narrowing would miss ImportError from environments without torch.xpu).

- apply_gpu_ids XPU: revert the CUDA_VISIBLE_DEVICES pop from loop 1. Popping it re-enabled CUDA detection on hybrid NVIDIA+Intel hosts where the parent had set CUDA_VISIBLE_DEVICES="" to force Studio onto XPU; the worker's follow-up detect_hardware() then flipped back to CUDA. torch.xpu only reads ZE_AFFINITY_MASK so the stale CUDA_VISIBLE_DEVICES is cosmetically redundant but functionally harmless, and leaving it alone preserves hybrid-host detection. - llama_cpp._start_process: pin the llama-server subprocess via ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere. llama-server's SYCL build reads ZE_AFFINITY_MASK, not CUDA_VISIBLE_DEVICES, so previous pinning was silently ignored on Intel. - llama_cpp init_audio_codec / generate_audio_response: revert the promotion from get_torch_device_str() to "xpu" on Intel hosts. SNAC / BiCodec / DAC codecs are not yet validated on Intel XPU and the old CPU fallback was the known-working non-CUDA path. Drop the now-unused get_torch_device_str import from llama_cpp.py. - trainer.py _preprocess_snac_dataset / _preprocess_bicodec_dataset / _preprocess_dac_dataset: revert the same unconditional XPU routing for audio dataset preprocessing back to the pre-PR CPU fallback on non-CUDA hosts. Spark-TTS BiCodec, SNAC, and OuteTTS DAC / Whisper paths were all CPU-backed on every non-CUDA host before this PR; promoting them to XPU without capability probes regressed the previously working CPU path. Drop the now-unused get_torch_device_str import from trainer.py. - dataset_map_num_proc: only disable multiprocessing when torch.xpu.is_initialized exists and returns True. Older torch-xpu builds without is_initialized() were previously falling through the broad except and returning None, silently disabling pre-init CPU dataset parallelism the docstring explicitly says should still work. - _get_xpu_utilization: cache the resolved xpu-smi binary path in a module-level sentinel via _resolve_xpu_smi_binary() so repeated telemetry polls do not re-scan PATH on every tick. - get_backend_visible_gpu_info: move the parent_visible_ids lookup below the empty-mask short-circuit so the spec is not computed twice on the fast exit path.

+                    power_w = _parse_xpu_smi_metric(parts[3])
+                    temp = _parse_xpu_smi_metric(parts[4])
+                    break
+    except Exception:


+            props = torch.xpu.get_device_properties(idx)
+            vram_total_gb = round(props.total_memory / (1024**3), 2)
+            vram_used_gb = round(torch.xpu.memory_allocated(idx) / (1024**3), 2)
+    except Exception:


danielhanchen · 2026-04-11T12:50:18Z

Fixes pushed to original PR unslothai#4724. Closing staging copy.

leizhenyuan and others added 15 commits March 31, 2026 11:03

add intel GPU for unsloth studio

9db527a

clean stash

f1c426b

remove unuse code

dc55c95

remove rl changes

5487a1b

remove unuse code

72eced4

remove unuse code

f679dca

[pre-commit.ci] auto fixes from pre-commit.com hooks

8dd4290

for more information, see https://pre-commit.ci