Skip to content

enable studio for intel GPU#43

Closed
danielhanchen wants to merge 19 commits into
mainfrom
pr-4724-head
Closed

enable studio for intel GPU#43
danielhanchen wants to merge 19 commits into
mainfrom
pr-4724-head

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Staging mirror of unslothai#4724

Original PR: unslothai#4724
Author: leizhenyuan

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

leizhenyuan and others added 15 commits March 31, 2026 11:03
- Fix _get_xpu_utilization() metric indices: use -m 0,2,3 (GPU Util,
  Power, Core Temp) instead of -m 0,1,2,18 which mapped parts[3] to
  temperature incorrectly (it was actually GPU Memory Utilization).
  Now correctly parses utilization, power draw, and temperature.
- Add -n 1 flag so xpu-smi dump exits after one sample instead of
  running indefinitely until the 5s timeout kills it.
- Use torch.xpu.current_device() for the -d flag instead of hardcoding
  device 0, so multi-GPU XPU setups query the correct device.
- Populate power_draw_w in the returned dict instead of always None.
- Fix versions["xpu"] = True (bool) to use the actual XPU version
  string from torch.version.xpu, falling back to "available". This
  keeps the dict type-consistent (all str or None).
- Remove dead code in get_visible_gpu_count() where the XPU branch
  at line 1357 was unreachable because the XPU early-return block
  above always returns before that point.
Skip the xpu-smi subprocess entirely when the binary is not on
PATH. This avoids a multi-second timeout on Intel GPU systems
that have PyTorch XPU support but no xpu-smi tooling installed.
The function still falls back to torch.xpu for VRAM metrics.
Prefer torch.xpu.device_count() over manual mask parsing since the
runtime correctly interprets all ZE_AFFINITY_MASK syntax including
subdevice notation (e.g. "0.0,0.1" is 1 root device, not 2).

The manual parsing fallback now counts unique root device IDs from
the mask, handling "device.subdevice" notation correctly.
- _get_xpu_utilization: request metrics -m 0,1,3 (Util, Power, Temp)
  rather than 0,2,3 so the power column no longer reports MHz as watts.
- _resolve_xpu_smi_device_id: map torch.xpu.current_device() (logical
  ordinal under ZE_AFFINITY_MASK) to the physical root device id that
  xpu-smi -d expects, so telemetry targets the active GPU.
- Merge the duplicated torch blocks in _get_xpu_utilization so the
  VRAM lookup is guarded and the device index is computed once.
- format_error_message: only rewrite true OOM errors (out of memory
  substrings) as memory errors, so non-OOM XPU/CUDA failures surface
  their real cause instead of a misleading memory message.
- inference.py DAC generation: derive autocast device from
  model.device.type, not the global backend, so CPU-fallback models
  on an XPU host do not open a GPU autocast context.
- dataset_map_num_proc: only disable XPU multiprocessing after the
  XPU runtime is actually initialized in this process, so pure
  CPU-side dataset preprocessing can still parallelize on Intel hosts.
- get_package_versions: preserve the "available" fallback for xpu
  when torch.version.xpu exists as None.
- get_visible_gpu_count: normalize ZE_AFFINITY_MASK parsing so the
  None and empty-string branches do not rely on implicit scoping.
…tion

Round 2 fixes addressing reviewer feedback:

- format_error_message: tightening "out of memory" coverage in round 1
  dropped CPU allocator failures like "not enough memory to allocate"
  and "cannot allocate memory", and Level Zero
  ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY. Restore those patterns while
  still excluding non-memory XPU/CUDA exceptions.
- apply_gpu_ids: route Intel XPU through ZE_AFFINITY_MASK instead of
  CUDA_VISIBLE_DEVICES so worker subprocesses are actually pinned to
  the requested GPUs on multi-XPU hosts.
- _get_parent_visible_gpu_spec: add an XPU branch that reads
  ZE_AFFINITY_MASK and returns physical root device IDs, so the
  visibility/selection stack reports the correct devices on Intel
  hosts. Honors subdevice syntax and wildcards.
- Extract _parse_ze_mask_roots helper for the ZE_AFFINITY_MASK
  parsing previously duplicated between _resolve_xpu_smi_device_id
  and get_visible_gpu_count. Single source of truth for the mask
  semantics.
- get_visible_gpu_count: treat non-digit wildcard masks (e.g. "*")
  as "all physical XPUs visible" rather than zero.
- get_package_versions: also set versions["xpu"] = None in the
  except block so a failing XPU probe does not leave the key missing.
- inference.py DAC autocast: clamp the resolved device_type to
  ("cuda", "xpu", "cpu") so exotic devices like "meta" during
  accelerate offloaded loading do not raise.
Round 3 fixes targeting the remaining gaps reviewers flagged:

- prepare_gpu_selection: allow explicit gpu_ids on Intel XPU so the
  apply_gpu_ids() XPU branch (and _get_parent_visible_gpu_spec XPU
  branch) are actually reachable from the normal request path.
- _parse_ze_mask_roots: stop deduplicating. Keep one root ID per mask
  token so the logical-ordinal-to-physical-root mapping used by
  _resolve_xpu_smi_device_id() stays 1-to-1 even for mixed subdevice
  masks like "2.0,0.1,0.2". Update the docstring to document the
  new shape.
- _get_parent_visible_gpu_spec: dedupe roots only at the visibility
  layer, and flag subdevice masks as supports_explicit_gpu_ids=False
  so resolve_requested_gpu_ids() does not try to match duplicate IDs.
  Treat wildcard masks as "all physical XPUs visible".
- format_error_message: also match the literal Level Zero enum names
  ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY / _HOST_MEMORY which use
  underscores and were not caught by the "out of device memory"
  substring.
- inference.py DAC autocast: accept "mps" in the clamp list (it has
  been an autocast-supported backend since torch 2.3) and skip
  autocast entirely when the model is on CPU with an unsupported
  dtype like float32, since torch.amp.autocast("cpu", dtype=float32)
  raises.
- resolve_requested_gpu_ids: tailor the "unsupported explicit ids"
  error message to the current backend so XPU users see a
  ZE_AFFINITY_MASK reference instead of a CUDA one.
Round 4 fixes completing the multi-XPU story unlocked in round 3:

- get_device_map: include DeviceType.XPU in the multi-GPU branch so
  explicit XPU gpu_ids=[0, 1] (or a wildcard-masked multi-XPU host)
  loads with device_map="balanced" instead of falling back to
  "sequential" and pinning the model to a single device.
- auto_select_gpu_ids: allow XPU auto mode. The function relies on
  get_visible_gpu_utilization() for per-device free-VRAM telemetry,
  which already has an XPU path via _get_xpu_utilization. XPU hosts
  omitting gpu_ids now benefit from VRAM-aware selection.
- get_visible_gpu_count torch-less fallback: count unique mask roots
  via len(set(roots)) so subdevice masks like "0.0,0.1" report the
  intended 1 root GPU, not 2. The ordinal-preserving semantics of
  _parse_ze_mask_roots are kept so _resolve_xpu_smi_device_id still
  maps logical ordinals to physical roots correctly.
- xpu-smi subprocess timeout lowered from 10s to 3s so a hung driver
  does not block status polls / UI refreshes.
- DAC autocast nullcontext fallback now covers XPU+float32 as well
  as CPU+float32, since XPU autocast only accepts bfloat16/float16
  and otherwise warns on every generate call.
- _get_parent_visible_gpu_spec subdevice dedup uses
  list(dict.fromkeys(...)) instead of an O(n^2) manual loop.
…ggers

_get_parent_visible_gpu_spec returned numeric_ids=list(range(physical))
for wildcard ZE_AFFINITY_MASK=*, which blocked get_device_map from
reaching its "unresolved multi-visible" fallback. Mirror the CUDA
UUID/MIG behavior by returning numeric_ids=None with
supports_explicit_gpu_ids=False, so explicit ids are still rejected
and get_device_map falls back to sharding across visible devices
when more than one is present.
Comment thread studio/backend/utils/hardware/hardware.py Fixed
Comment thread studio/backend/utils/hardware/hardware.py Fixed
Comment thread studio/backend/utils/hardware/hardware.py Fixed
@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

Comment thread studio/backend/utils/hardware/hardware.py Fixed
Comment thread studio/backend/utils/hardware/hardware.py Fixed
Comment thread studio/backend/utils/hardware/hardware.py Fixed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements comprehensive support for Intel XPU hardware, covering device detection, telemetry (utilization and VRAM), and visibility management via ZE_AFFINITY_MASK. It refactors inference and training modules to use centralized hardware utility functions for device string retrieval and cache clearing, and enhances error reporting for memory-related issues. Review feedback focuses on improving code organization by moving several local imports to the top-level of their respective files in accordance with PEP 8.

+ text
+ "<|text_end|>\n<|audio_start|><|global_features_start|>\n"
)
import contextlib

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better code organization and adherence to PEP 8, it's recommended to move imports to the top of the file. Please move import contextlib to the top-level imports of this module.


if torch.cuda.is_available():
torch.cuda.empty_cache()
from utils.hardware import clear_gpu_cache

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve code readability and follow PEP 8 guidelines, imports should be at the top of the file. Please move this import, and other similar local imports in this file (e.g., lines 3265, 3341), to the top-level imports of the module.

Comment thread studio/backend/core/training/trainer.py Outdated
SNAC_MODEL_NAME = "hubertsiuzdak/snac_24khz"
SNAC_SAMPLE_RATE = 24000
device = "cuda" if torch.cuda.is_available() else "cpu"
from utils.hardware import get_torch_device_str

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To adhere to PEP 8 and improve maintainability, please move this import to the top of the file. This applies to other local imports added in this pull request for this file as well (e.g., lines 1721, 1751, 1953, 1990, 2170).

or "cannot allocate memory" in error_str
or ("mlx" in error_str and ("memory" in error_str or "allocate" in error_str))
):
from utils.hardware import get_device

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to PEP 8, imports should be at the top of the file. Please move this import to the top of the module for better code organization.

@danielhanchen

Copy link
Copy Markdown
Member Author

Thank you for the PR! The goal of this PR is to enable unsloth studio on Intel XPU (Level Zero) hosts with feature parity to the existing CUDA and ROCm paths: hardware detection, visibility reporting via ZE_AFFINITY_MASK, multi-GPU sharding, VRAM-aware auto-selection, xpu-smi telemetry, XPU-aware OOM detection, and device-correct DAC autocast. As a summary, this PR wires DeviceType.XPU through hardware detection, visibility/selection, telemetry (_get_xpu_utilization, _resolve_xpu_smi_device_id, _parse_ze_mask_roots), get_device_map, apply_gpu_ids, and several audio/TTS device-string call sites; it also fixes the DAC autocast bug by deriving the autocast device from model.device.type rather than the global backend.

This review merges 11 independent reviewers (7 workers from reviewer.py --slow, 3 Sonnet subagents, and the hosted /gemini review bot).

Reviewers Severity Finding
9/11 High _backend_visible_devices_env() still returns CUDA_VISIBLE_DEVICES on XPU hosts; the /system-info endpoint and get_backend_visible_gpu_info() drop the active ZE_AFFINITY_MASK after apply_gpu_ids(), so debug output lies about which XPUs are pinned
3/11 High Subdevice ZE_AFFINITY_MASK values (e.g. 0.0,0.1) are collapsed to unique root IDs in _get_parent_visible_gpu_spec(); get_visible_gpu_utilization() / get_backend_visible_gpu_info() emit only one device and get_device_map(None) stays "sequential", silently breaking inherited multi-logical-XPU sharding
2/11 High TestXpuRejection in test_gpu_selection.py, the non_cuda assertion in test_gpu_selection_sandbox.py, and test_prepare_gpu_selection_rejects_gpu_ids_on_non_cuda_backend still encode pre-PR behavior; CI fails as soon as these tests run
2/11 Medium format_error_message tightening lost the "memory allocation failed" pattern and does not catch bare MemoryError() instances whose str() is empty
4/11 Medium Several new import statements are placed inside functions rather than at module top (contextlib, utils.hardware, etc.), flagged by the hosted Gemini bot as PEP 8 violations
1/11 Medium _get_codec_device() in llama_cpp.py now returns "xpu" on Intel hosts, but the llama.cpp GGUF runtime still executes on CPU; the codec decoder should follow the runtime, not the global Studio backend
1/11 Low _parse_ze_mask_roots uses str.isdigit() which admits Unicode superscripts that int() rejects; str.isdecimal() is the correct predicate
1/11 Low xpu-smi CSV float guard only covers "" and "N/A", not lowercase n/a or -
1/11 Low clear_gpu_cache() XPU branch is unguarded; on older torch-xpu builds torch.xpu.empty_cache may be absent
1/11 Low apply_gpu_ids() XPU path leaves a stale CUDA_VISIBLE_DEVICES in the environment
1/11 Low get_physical_gpu_count() on XPU uses torch.xpu.device_count() which is filtered by ZE_AFFINITY_MASK (CUDA uses nvidia-smi -L which is unfiltered)

Concrete suggestions for each finding below. Fix commit will follow in this PR.


1. [9/11] _backend_visible_devices_env() missing XPU branch

studio/backend/utils/hardware/hardware.py:1428-1438

def _backend_visible_devices_env() -> Optional[str]:
    """Return the raw visibility env string that applies to this backend."""
    device = get_device()
    if device == DeviceType.XPU:
        return os.environ.get("ZE_AFFINITY_MASK")
    if IS_ROCM:
        return _get_parent_visible_gpu_spec().get("raw")
    return os.environ.get("CUDA_VISIBLE_DEVICES")

2. [3/11] Subdevice ZE_AFFINITY_MASK collapses multi-logical-XPU

studio/backend/utils/hardware/hardware.py in _get_parent_visible_gpu_spec():

if has_subdevice:
    # Subdevice masks expose logical devices that do not map cleanly back
    # to stable physical root IDs for explicit selection. Mirror the
    # wildcard/UUID path so downstream code enumerates visible ordinals
    # and can still shard across the logical devices.
    return {
        "raw": xpu_mask,
        "numeric_ids": None,
        "supports_explicit_gpu_ids": False,
    }

3. [2/11] Stale XPU rejection tests

studio/backend/tests/test_gpu_selection.py:714-719:

def test_prepare_gpu_selection_rejects_gpu_ids_on_non_cuda_backend(self):
    with patch("utils.hardware.hardware.get_device", return_value = DeviceType.CPU):
        with self.assertRaises(ValueError) as exc_info:
            prepare_gpu_selection([0], model_name = "unsloth/test")

    self.assertIn("only supported on CUDA and Intel XPU", str(exc_info.exception))

studio/backend/tests/test_gpu_selection.py:1092-1103 (replace TestXpuRejection with TestXpuSelection):

class TestXpuSelection(_GpuCacheResetMixin, unittest.TestCase):
    def test_auto_select_supports_xpu(self):
        with (
            patch("utils.hardware.hardware.get_device", return_value = DeviceType.XPU),
            patch(
                "utils.hardware.hardware.estimate_required_model_memory_gb",
                return_value = (1.0, {}),
            ),
            patch(
                "utils.hardware.hardware.get_visible_gpu_utilization",
                return_value = {
                    "devices": [{"index": 0, "vram_total_gb": 8, "vram_used_gb": 1}]
                },
            ),
            patch(
                "utils.hardware.hardware._get_parent_visible_gpu_spec",
                return_value = {
                    "raw": None,
                    "numeric_ids": [0],
                    "supports_explicit_gpu_ids": True,
                },
            ),
            patch(
                "utils.hardware.hardware.get_parent_visible_gpu_ids",
                return_value = [0],
            ),
        ):
            selected, metadata = auto_select_gpu_ids("unsloth/test")

        self.assertEqual(selected, [0])
        self.assertEqual(metadata["selection_mode"], "auto")

    def test_prepare_gpu_selection_accepts_explicit_ids_on_xpu(self):
        with (
            patch("utils.hardware.hardware.get_device", return_value = DeviceType.XPU),
            patch(
                "utils.hardware.hardware._get_parent_visible_gpu_spec",
                return_value = {
                    "raw": "0",
                    "numeric_ids": [0],
                    "supports_explicit_gpu_ids": True,
                },
            ),
            patch(
                "utils.hardware.hardware.get_parent_visible_gpu_ids",
                return_value = [0],
            ),
            patch(
                "utils.hardware.hardware.get_physical_gpu_count", return_value = 1
            ),
        ):
            selected, metadata = prepare_gpu_selection([0], model_name = "unsloth/test")

        self.assertEqual(selected, [0])
        self.assertEqual(metadata["selection_mode"], "explicit")

studio/backend/tests/test_gpu_selection_sandbox.py:305-312:

def test_non_accelerator_returns_none(self):
    from utils.hardware.hardware import auto_select_gpu_ids
    import utils.hardware.hardware as hw

    with patch.object(hw, "get_device", return_value = hw.DeviceType.CPU):
        selected, meta = auto_select_gpu_ids("test/model")
        self.assertIsNone(selected)
        self.assertEqual(meta["selection_mode"], "non_accelerator")

4. [2/11] format_error_message OOM patterns

studio/backend/utils/utils.py in format_error_message:

if (
    "out of memory" in error_str
    or "out of device memory" in error_str
    or "out_of_device_memory" in error_str
    or "out_of_host_memory" in error_str
    or "not enough memory" in error_str
    or "cannot allocate memory" in error_str
    or "memory allocation failed" in error_str
    or isinstance(error, MemoryError)
    or ("mlx" in error_str and ("memory" in error_str or "allocate" in error_str))
):

5. [4/11] PEP 8 imports at module top

The hosted Gemini bot flagged import contextlib in inference.py, from utils.hardware import clear_gpu_cache / get_torch_device_str in llama_cpp.py, trainer.py, and from utils.hardware import get_device in utils.py. These should all move to the module-level imports block.

6. [1/11] GGUF codec device on Intel hosts

studio/backend/core/inference/llama_cpp.py codec init/generate paths should not force xpu from the global backend while the llama.cpp runtime is still CPU-only. Lower confidence (1/11) so this will be left as a followup discussion rather than auto-fixed.

7. [1/11] str.isdigit() -> str.isdecimal()

studio/backend/utils/hardware/hardware.py _parse_ze_mask_roots:

if root.isdecimal():
    roots.append(int(root))

8. [1/11] xpu-smi N/A variants

_NA = frozenset(("", "n/a", "N/A", "-", "na", "NA"))
gpu_util = float(parts[2]) if parts[2].strip() not in _NA else None
power_w  = float(parts[3]) if parts[3].strip() not in _NA else None
temp     = float(parts[4]) if parts[4].strip() not in _NA else None

9. [1/11] clear_gpu_cache() XPU try/except

elif device == DeviceType.XPU:
    try:
        import torch
        if hasattr(torch, "xpu"):
            torch.xpu.synchronize()
            if hasattr(torch.xpu, "empty_cache"):
                torch.xpu.empty_cache()
    except Exception:
        pass

10. [1/11] Stale CUDA_VISIBLE_DEVICES after apply_gpu_ids on XPU

if get_device() == DeviceType.XPU:
    os.environ["ZE_AFFINITY_MASK"] = value
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    _visible_gpu_count = None
    logger.info("Applied gpu_ids: ZE_AFFINITY_MASK='%s'", value)
    return

11. [1/11] get_physical_gpu_count XPU reports visible count

Document the limitation or query xpu-smi discovery for physical GPU count. Low real-world impact; noted for a followup.

- _backend_visible_devices_env: return ZE_AFFINITY_MASK on XPU so
  get_backend_visible_gpu_info reports the active mask instead of a stale
  or None CUDA_VISIBLE_DEVICES after apply_gpu_ids runs.
- _get_parent_visible_gpu_spec: return numeric_ids=None for subdevice
  masks like 0.0,0.1 so get_visible_gpu_utilization, get_backend_visible_gpu_info
  and get_device_map enumerate torch-visible ordinals and can still shard
  across logical XPUs instead of collapsing to a single root.
- _parse_ze_mask_roots: use str.isdecimal() so Unicode superscripts do
  not crash int() via str.isdigit() admitting them.
- _get_xpu_utilization xpu-smi parsing: accept n/a, NA, - and lowercase
  variants as missing, and wrap the float parse so one bad column does
  not drop the whole telemetry row.
- clear_gpu_cache XPU branch: guard synchronize/empty_cache with hasattr
  + try/except so older torch-xpu builds do not propagate AttributeError.
- apply_gpu_ids XPU branch: pop stale CUDA_VISIBLE_DEVICES so
  environment-inspection tools do not show conflicting pinning state.
- format_error_message: add memory allocation failed pattern and
  isinstance(error, MemoryError) so CPU hosts still classify OOMs that
  the tightened substring list dropped.
- test_gpu_selection/test_gpu_selection_sandbox: rename TestXpuRejection
  to TestXpuSelection and update non_cuda -> non_accelerator and CUDA-only
  error substring to CUDA and Intel XPU so the suite matches the new
  behavior.
- inference.py/llama_cpp.py/trainer.py/utils.py: hoist contextlib,
  clear_gpu_cache, get_torch_device_str and get_device imports to module
  top per PEP 8 feedback from the hosted gemini bot.
@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

torch.xpu.synchronize()
if hasattr(torch.xpu, "empty_cache"):
torch.xpu.empty_cache()
except Exception:
xpu_ok = hasattr(torch, "xpu") and torch.xpu.is_available()
if xpu_ok:
ordinal = int(torch.xpu.current_device())
except Exception:
Comment thread studio/backend/utils/hardware/hardware.py Fixed
Comment thread studio/backend/utils/hardware/hardware.py Fixed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for Intel XPU (Intel GPUs) across the backend, including hardware detection, telemetry via xpu-smi, and device visibility management using ZE_AFFINITY_MASK. It refactors device-specific logic for autocasting, cache clearing, and device string generation into shared utility functions to ensure consistency between CUDA and XPU environments. Feedback suggests refining exception handling in XPU-specific blocks to avoid overly broad catch statements and adding debug logging to aid in troubleshooting hardware-specific initialization issues.

Comment on lines +191 to +200
try:
import torch

torch.xpu.synchronize()
torch.xpu.empty_cache()
if hasattr(torch, "xpu"):
if hasattr(torch.xpu, "synchronize"):
torch.xpu.synchronize()
if hasattr(torch.xpu, "empty_cache"):
torch.xpu.empty_cache()
except Exception:
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The try...except Exception: pass block is overly broad. While the intent is to guard against missing attributes in older IPEX builds, catching all exceptions (including KeyboardInterrupt or SystemExit if they were to occur here, though unlikely in this context) is generally discouraged. It would be safer to catch only AttributeError and ImportError specifically.

Comment on lines +1861 to +1868
if get_device() == DeviceType.XPU:
try:
import torch

if hasattr(torch, "xpu") and torch.xpu.is_initialized():
return None
except Exception:
return None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The except Exception: return None block in dataset_map_num_proc could potentially mask issues other than a missing is_initialized attribute. If an unexpected error occurs during the check, returning None correctly forces single-process execution (the safe choice for XPU), but it might be worth logging the exception at a debug level to aid in troubleshooting hardware-specific issues.

@danielhanchen

Copy link
Copy Markdown
Member Author

Review loop 2. Aggregated from 11 independent reviewers (7 reviewer.py --slow workers, 3 Sonnet subagents, and the hosted /gemini review bot) against commit d6f1ea8.

Reviewers Severity Finding
3/11 High test_explicit_ids_are_rejected_for_uuid_parent_visibility (and 2 similar assertions at test_gpu_selection.py:105, :819, :921) still pin the pre-PR "unsupported when CUDA_VISIBLE_DEVICES uses UUID/MIG entries" wording; the code now emits "non-numeric or subdevice entries" so the suite is red
2/11 High Hoisting from utils.hardware import get_device to module top in utils/utils.py broke test_utils.py::TestFormatErrorMessage::test_cpu_oom because the patch target is now resolved at import time and no longer reflects later patch("utils.hardware.get_device", ...) calls
1/11 High ZE_AFFINITY_MASK="" is honored by get_visible_gpu_count() (returns 0) but get_visible_gpu_utilization() and get_backend_visible_gpu_info() still enumerate devices, so auto_select_gpu_ids() can pick GPU 0 on a host that explicitly hid all XPUs
1/11 High ZE_AFFINITY_MASK="*" currently disables both explicit IDs and auto-select; XPU visibility flow should treat the wildcard as "all physical XPUs visible" so the selection code can rank and choose a subset
1/11 High Explicit XPU gpu_ids are validated as physical IDs; under Intel's default ZE_FLAT_DEVICE_HIERARCHY=FLAT hierarchy each stack is a root, so [1] may target stack 1 of card 0 rather than a second physical GPU. Correct fix is a follow-up mapping; for now the PR should reject explicit XPU ids with a clear message or accept them only for verified composite hosts
2/11 Medium Subdevice masks (0.0,0.1) bypass VRAM-aware auto-selection and fall back to balanced sharding; a subsequent fix can carry mask tokens into auto_select_gpu_ids
2/11 Medium clear_gpu_cache / dataset_map_num_proc except Exception is technically correct (BaseException subclasses are not caught) but the hosted Gemini bot flagged the breadth as a style concern; optional narrowing to (AttributeError, ImportError)
1/11 Medium _get_xpu_utilization defines _NA and _parse_metric inside the hot path; lift to module scope for efficiency
1/11 Medium llama_cpp.py load_model() sets env["CUDA_VISIBLE_DEVICES"] even on XPU hosts; llama-server SYCL path would want ZE_AFFINITY_MASK (pre-existing line, scope expansion)
1/11 Medium format_error_message's isinstance(error, MemoryError) branch is uncovered by tests (correct code but dead-coverage)
1/11 Medium GGUF TTS codec device is now derived from the global backend rather than the actual llama.cpp session placement; a CPU GGUF session on an Intel host pushes the codec to XPU
1/11 Low has_any in _get_xpu_utilization excludes power_w; if xpu-smi only returns a power reading the whole row is silently discarded
1/11 Low _torch_get_per_device_info XPU fallback to torch.xpu.memory_allocated() reports process-local only and can undercount used memory on shared hosts
1/11 Low ZE_AFFINITY_MASK=-1 is not handled: on Level Zero -1 is the wildcard meaning "all devices" (opposite of CUDA -1 which means "no GPUs")
1/11 Low get_visible_gpu_count torchless fallback uses len(set(roots)) which undercounts subdevice masks
1/11 Low trainer.py:1749, 1985 left two consecutive blank lines after removing the inline get_torch_device_str imports
1/11 Low Test coverage: no tests for the subdevice mask path, Unicode superscripts in _parse_ze_mask_roots, or bare MemoryError

1/7 reviewer in the reviewer.py pool returned APPROVE without findings; the remaining 6/7 requested changes.

Fix commit will address the test failures (stale regex, test_cpu_oom regression), has_any power_w, _NA / _parse_metric hoist, empty-mask telemetry guard, and trivial trainer.py whitespace. The FLAT-hierarchy, GGUF codec, subdevice auto-select refinement, and VRAM fallback items are noted for a follow-up because they either require architectural rework or are pre-existing lines outside this PR's scope.

- test_gpu_selection.py:105 regex: update stale assertion from "uses
  UUID/MIG" to "uses non-numeric or subdevice" after the PR broadened
  resolve_requested_gpu_ids' error message to cover XPU subdevice masks.
  Three reviewers independently reproduced the suite failure.
- utils/utils.py: revert the module-top `from utils.hardware import
  get_device` hoist that broke test_utils.py::TestFormatErrorMessage::test_cpu_oom
  -- the test patches utils.hardware.get_device at call time, so the
  import must stay function-local. Keep the comment explaining why.
- hardware.py _get_xpu_utilization: lift _NA and _parse_metric out of
  the hot path to module scope (renamed _XPU_SMI_NA /
  _parse_xpu_smi_metric); re-instantiating them on every successful
  xpu-smi call is wasteful.
- hardware.py has_any check: include power_w alongside gpu_util, temp
  and vram_used_gb so a row that only exposes power is not silently
  discarded.
- hardware.py get_visible_gpu_utilization + get_backend_visible_gpu_info:
  honor explicit "no devices visible" masks (ZE_AFFINITY_MASK="" or
  CUDA_VISIBLE_DEVICES="" / "-1") by short-circuiting before the
  enumerate-visible-ordinals fallback. Previously get_visible_gpu_count
  returned 0 correctly but the telemetry helpers still enumerated torch
  devices, letting auto_select_gpu_ids pick a GPU the process explicitly
  hid.
- trainer.py: collapse the two consecutive blank lines left after
  removing inline `from utils.hardware import get_torch_device_str`
  imports at lines 1749 and 1985.
@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

Comment thread studio/backend/utils/hardware/hardware.py Fixed
Comment thread studio/backend/utils/hardware/hardware.py Fixed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for Intel XPU (Level Zero) across the backend, including hardware detection, memory management, and telemetry. Key changes include the implementation of XPU-specific device visibility using ZE_AFFINITY_MASK, the addition of xpu-smi based utilization monitoring, and the introduction of utility functions like get_torch_device_str and clear_gpu_cache to unify device handling. Additionally, the PR addresses potential os.fork() issues on XPU by disabling multiprocessing in dataset mapping and updates error handling to provide clearer feedback for XPU-related memory issues. Feedback suggests optimizing the xpu-smi path resolution, narrowing exception handling in cache clearing, and ensuring multiprocessing is safely disabled for XPU.

Comment on lines +1905 to +1912
if get_device() == DeviceType.XPU:
try:
import torch

if hasattr(torch, "xpu") and torch.xpu.is_initialized():
return None
except Exception:
return None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

For Intel XPU on Linux, os.fork() is generally unsafe once the Level Zero runtime has been touched. Since get_device() calls torch.xpu.is_available(), the runtime is likely already initialized by the time this check runs. To avoid "Pointer argument doesn't reference XPU device memory" errors, it is safer to always return None (disabling multiprocessing) for XPU on Linux, rather than relying on torch.xpu.is_initialized() which might be checked too late.

    if get_device() == DeviceType.XPU:
        # os.fork() is unsafe on Linux once the XPU runtime (Level Zero)
        # has been touched. get_device() already calls is_available(),
        # which is often enough to trigger initialization issues.
        return None

Comment on lines +199 to +200
except Exception:
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a broad except Exception: pass can hide unexpected errors. Since the intent is to guard against missing attributes in older torch-xpu builds, it is better to catch AttributeError specifically, or at least log other exceptions at a debug level to aid in troubleshooting.

Suggested change
except Exception:
pass
except AttributeError:
pass


# Skip subprocess entirely when xpu-smi is not on PATH, avoiding
# a multi-second timeout on systems without the Intel tooling.
xpu_smi = shutil.which("xpu-smi")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

shutil.which("xpu-smi") is called every time _get_xpu_utilization is executed, which can be frequent during telemetry updates. Since scanning the PATH is relatively expensive, it is better to cache the resolved path to the xpu-smi binary at the module level or using a function-level cache.

@danielhanchen

Copy link
Copy Markdown
Member Author

Review loop 3. Aggregated from 11 independent reviewers (7 reviewer.py --slow workers, 3 Sonnet subagents, and the hosted /gemini review bot) against commit e22db8f.

Split of verdicts from reviewer.py loop 3: 3/7 APPROVE, 4/7 REQUEST_CHANGES. Subagent B ran the touched test suites locally with structlog pre-mocked and reports 43/43 passed in test_gpu_selection.py, 24/24 passed in test_gpu_selection_sandbox.py, and 8/8 passed in test_utils.py::TestFormatErrorMessage. Subagent C gives an APPROVE WITH FOLLOW-UP verdict.

Reviewers Severity Finding
2/11 High apply_gpu_ids XPU path pops CUDA_VISIBLE_DEVICES, which regresses hybrid NVIDIA+Intel hosts: a parent that sets CUDA_VISIBLE_DEVICES="" to disable NVIDIA and force Studio onto XPU is silently unblocked when the worker re-runs detect_hardware() and flips back to CUDA
2/11 High llama_cpp._start_process subprocess env still hard-codes CUDA_VISIBLE_DEVICES for GPU pinning on XPU hosts; llama-server's SYCL build needs ZE_AFFINITY_MASK
2/11 High GGUF TTS codec load/decode (init_audio_codec, generate_audio_response) was promoted from cpu to get_torch_device_str() on non-CUDA hosts, forcing SNAC/BiCodec/DAC onto XPU without a capability probe; upstream codecs are not yet validated on Intel XPU
2/11 High Trainer audio preprocessing (_preprocess_snac_dataset, _preprocess_bicodec_dataset, _preprocess_dac_dataset) was promoted from cpu to get_torch_device_str() with the same capability-probe concern
2/11 Medium dataset_map_num_proc returns None on older torch-xpu builds where torch.xpu.is_initialized is absent, disabling pre-init dataset parallelism the PR intentionally wanted to preserve
1/11 High Hosted Gemini: on Linux XPU, os.fork() is unsafe once Level Zero is touched and get_device() may have already touched it; safer to always return None
1/11 Medium Hosted Gemini: shutil.which("xpu-smi") is called on every telemetry poll; cache the result
1/11 Medium Hosted Gemini: clear_gpu_cache / dataset_map_num_proc except Exception too broad; narrow to AttributeError / log at debug
1/11 Medium get_visible_gpu_utilization enumerates XPU under subdevice masks from a physical-root count that can under-report when a mock has device_count() == 1; in practice torch.xpu.device_count() returns the logical count so this only reproduces against a particular mock fixture
1/11 Low get_backend_visible_gpu_info computes parent_visible_ids before the empty-mask guard; inefficient but not a bug
1/11 Low test_gpu_selection.py:106 regex match could be tightened to re.escape("uses non-numeric or subdevice entries")

Fixes landing in this commit:

  1. Revert the CUDA_VISIBLE_DEVICES pop in apply_gpu_ids XPU branch (hybrid-host regression).
  2. llama_cpp._start_process now sets ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere.
  3. Revert GGUF TTS codec device in init_audio_codec / generate_audio_response to the pre-PR "cuda" if torch.cuda.is_available() else "cpu" fallback; drop the now-unused get_torch_device_str import from llama_cpp.py.
  4. Revert trainer audio preprocessing (_preprocess_snac_dataset, _preprocess_bicodec_dataset, _preprocess_dac_dataset) to the pre-PR CPU fallback on non-CUDA hosts; drop the now-unused get_torch_device_str import from trainer.py.
  5. Harden dataset_map_num_proc to only disable parallelism when torch.xpu.is_initialized exists and returns True, matching the docstring's "pre-init CPU preprocessing can still parallelize" contract.
  6. Cache the resolved xpu-smi binary path once in _resolve_xpu_smi_binary() so repeated telemetry polls do not re-scan PATH.
  7. Move the parent_visible_ids = get_parent_visible_gpu_ids() call below the empty-mask guard in get_backend_visible_gpu_info to avoid an unnecessary _get_parent_visible_gpu_spec() recomputation.

Deferred (not blocking merge, out-of-scope or architectural):

  • Intel FLAT device hierarchy root-vs-tile mapping for explicit XPU gpu_ids.
  • Mixed CUDA+XPU detection ordering (detect_hardware always picks CUDA first).
  • GGUF _get_gpu_free_memory nvidia-smi-only path.
  • Narrowing except Exception in clear_gpu_cache to AttributeError (broad except is technically correct since BaseException subclasses are not caught, and narrowing would miss ImportError from environments without torch.xpu).

- apply_gpu_ids XPU: revert the CUDA_VISIBLE_DEVICES pop from loop 1.
  Popping it re-enabled CUDA detection on hybrid NVIDIA+Intel hosts
  where the parent had set CUDA_VISIBLE_DEVICES="" to force Studio
  onto XPU; the worker's follow-up detect_hardware() then flipped
  back to CUDA. torch.xpu only reads ZE_AFFINITY_MASK so the stale
  CUDA_VISIBLE_DEVICES is cosmetically redundant but functionally
  harmless, and leaving it alone preserves hybrid-host detection.

- llama_cpp._start_process: pin the llama-server subprocess via
  ZE_AFFINITY_MASK on XPU hosts and CUDA_VISIBLE_DEVICES elsewhere.
  llama-server's SYCL build reads ZE_AFFINITY_MASK, not
  CUDA_VISIBLE_DEVICES, so previous pinning was silently ignored
  on Intel.

- llama_cpp init_audio_codec / generate_audio_response: revert the
  promotion from get_torch_device_str() to "xpu" on Intel hosts.
  SNAC / BiCodec / DAC codecs are not yet validated on Intel XPU
  and the old CPU fallback was the known-working non-CUDA path.
  Drop the now-unused get_torch_device_str import from llama_cpp.py.

- trainer.py _preprocess_snac_dataset / _preprocess_bicodec_dataset /
  _preprocess_dac_dataset: revert the same unconditional XPU routing
  for audio dataset preprocessing back to the pre-PR CPU fallback on
  non-CUDA hosts. Spark-TTS BiCodec, SNAC, and OuteTTS DAC / Whisper
  paths were all CPU-backed on every non-CUDA host before this PR;
  promoting them to XPU without capability probes regressed the
  previously working CPU path. Drop the now-unused get_torch_device_str
  import from trainer.py.

- dataset_map_num_proc: only disable multiprocessing when
  torch.xpu.is_initialized exists and returns True. Older torch-xpu
  builds without is_initialized() were previously falling through the
  broad except and returning None, silently disabling pre-init CPU
  dataset parallelism the docstring explicitly says should still work.

- _get_xpu_utilization: cache the resolved xpu-smi binary path in a
  module-level sentinel via _resolve_xpu_smi_binary() so repeated
  telemetry polls do not re-scan PATH on every tick.

- get_backend_visible_gpu_info: move the parent_visible_ids lookup
  below the empty-mask short-circuit so the spec is not computed twice
  on the fast exit path.
power_w = _parse_xpu_smi_metric(parts[3])
temp = _parse_xpu_smi_metric(parts[4])
break
except Exception:
props = torch.xpu.get_device_properties(idx)
vram_total_gb = round(props.total_memory / (1024**3), 2)
vram_used_gb = round(torch.xpu.memory_allocated(idx) / (1024**3), 2)
except Exception:
@danielhanchen

Copy link
Copy Markdown
Member Author

Fixes pushed to original PR unslothai#4724. Closing staging copy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants