fix: use backend device type in GGUF merge path by danielhanchen · Pull Request #14 · shimmyshimmer/unsloth-zoo-staging-2

danielhanchen · 2026-05-04T01:27:43Z

Staging mirror of unslothai#615

Original PR: unslothai#615
Author: andomeder

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

Use DEVICE_TYPE / DEVICE_TYPE_TORCH in unsloth_zoo/saving_utils.py instead of assuming CUDA directly in the GGUF merge/export path.

This updates:

_merge_lora to move tensors with torch.device(DEVICE_TYPE_TORCH, index)
post-merge cache clearing to use the active backend
Tesla T4 detection to run only on DEVICE_TYPE == "cuda"

This was failing on Intel XPU during export with:

Torch not compiled with CUDA enabled

danielhanchen · 2026-05-04T01:28:49Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors unsloth_zoo/saving_utils.py to support non-CUDA backends by replacing hardcoded device strings and cache clearing calls with dynamic logic. Feedback highlights that several instances of torch.cuda.empty_cache() were missed and should be updated, ideally through a new helper function to avoid repetition. Additionally, the device selection logic in _merge_lora can be simplified into a single line using the optional index parameter of the torch.device constructor.

gemini-code-assist · 2026-05-04T01:30:32Z

+                if DEVICE_TYPE_TORCH == "cuda":
+                    torch.cuda.empty_cache()
+                elif DEVICE_TYPE_TORCH == "xpu":
+                    torch.xpu.empty_cache()


While this change correctly introduces backend-aware cache clearing for this specific line, the PR appears to be incomplete as it misses several other hardcoded torch.cuda.empty_cache() calls in this file that will still cause failures on non-CUDA backends like XPU. Please update the following locations as well: Line 632 (inside the same function), Line 812, Line 837, Line 889, Line 1558 (inside _merge_and_overwrite_lora_mxfp4), and Line 2276 (inside merge_and_overwrite_lora). Consider adding a device_empty_cache() helper in unsloth_zoo/device_type.py (similar to device_synchronize()) to avoid repeating this logic.

gemini-code-assist · 2026-05-04T01:30:32Z

+    if W.device.index is None:
+        device = torch.device(DEVICE_TYPE_TORCH)
+    else:
+        device = torch.device(DEVICE_TYPE_TORCH, W.device.index)


The device selection logic can be simplified into a single line. The torch.device constructor accepts an optional index, and W.device.index will correctly provide the current index (or None if the tensor is on CPU), which matches the intended behavior of the current if/else block.

device = torch.device(DEVICE_TYPE_TORCH, index = W.device.index)

…ackend Add device_empty_cache() helper in device_type.py alongside the existing device_synchronize(), and route every torch.cuda.empty_cache() / .synchronize() call in saving_utils.py through these helpers so XPU and HIP builds no longer crash or silently no-op during GGUF export. Concretely, this fixes: - Unguarded torch.cuda.empty_cache() in the outer shard loop of merge_and_overwrite_lora and inside _merge_and_overwrite_lora_mxfp4, both of which raise "Torch not compiled with CUDA enabled" on XPU after the first shard / mxfp4 tensor is processed. - Six guarded torch.cuda.empty_cache() / .synchronize() sites inside _merge_and_overwrite_lora and _merge_and_overwrite_lora_mxfp4 that silently no-op on XPU, leaving XPU VRAM unflushed mid-export. Add a private _active_merge_device(W) helper that returns W.device when W is already on the active backend, otherwise constructs torch.device( DEVICE_TYPE_TORCH[, index]). Route _merge_lora and the five MoE expert merge helpers (_merge_moe_gate_expert, _merge_moe_up_expert, _merge_moe_down_proj_expert, _merge_moe_fused_gate_up_expert, _merge_moe_fused_down_proj_expert) through it so MoE LoRA merges run on the active accelerator instead of silently falling back to CPU on XPU. CUDA/HIP behavior is unchanged because DEVICE_TYPE_TORCH equals "cuda" for both backends and device_empty_cache() preserves the existing torch.cuda.is_available() guard.

danielhanchen · 2026-05-04T02:03:25Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces cross-platform support for device cache management and synchronization by adding the device_empty_cache function and refactoring saving_utils.py to use it alongside device_synchronize. It also implements an _active_merge_device helper to dynamically determine the appropriate device for LoRA merging and adds a safety check for CUDA-specific calls to prevent runtime errors on other backends. I have no feedback to provide.

danielhanchen · 2026-05-04T02:19:12Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces cross-platform support for device-specific operations by abstracting CUDA-specific calls into generic functions like device_empty_cache and device_synchronize. It also adds a helper function, _active_merge_device, to dynamically determine the correct torch device for tensor operations, replacing hardcoded "cuda" strings. Feedback was provided regarding the robustness of the device type comparison in _active_merge_device, specifically noting that ROCm/HIP systems might report device types as "hip", which could bypass the current optimization check.

gemini-code-assist · 2026-05-04T02:24:08Z

+def _active_merge_device(W):
+    if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH:
+        return W.device
+    if W.device.index is None:
+        return torch.device(DEVICE_TYPE_TORCH)
+    return torch.device(DEVICE_TYPE_TORCH, W.device.index)
+pass


The _active_merge_device helper is a great addition for cross-platform support. However, consider that W.device.type on ROCm/HIP systems might sometimes be reported as "hip" depending on the PyTorch version, while DEVICE_TYPE_TORCH is explicitly set to "cuda" for HIP in device_type.py. While torch.device("cuda") works on HIP, the direct string comparison in line 161 might fail to recognize an existing HIP device as the target, causing an unnecessary torch.device object creation and potential move (though to() is usually a no-op if the device matches). Given the current DEVICE_TYPE_TORCH logic, this is likely safe, but worth noting if "hip" device types start appearing in W.device.type.

Add device_is_bf16_supported() to device_type.py alongside the existing device_synchronize() and device_empty_cache() helpers, and route the three torch.cuda.is_bf16_supported() callsites in llama_cpp.py's convert_to_gguf mmproj/outtype branches through it. On XPU torch builds these calls would otherwise raise "Torch not compiled with CUDA enabled" during VLM GGUF export, mirroring the same crash class fixed in saving_utils.py. CUDA and HIP behavior is unchanged (DEVICE_TYPE in ("cuda","hip") -> the helper returns torch.cuda.is_bf16_supported() exactly as before).

danielhanchen · 2026-05-04T02:33:40Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces cross-platform device management utilities to support non-CUDA backends such as XPU and HIP. It adds device_empty_cache and device_is_bf16_supported functions in device_type.py and refactors llama_cpp.py and saving_utils.py to use these abstractions instead of direct torch.cuda calls. Additionally, a helper function _active_merge_device was introduced in saving_utils.py to dynamically determine the appropriate device for LoRA merging operations. I have no feedback to provide as there were no review comments.

Mirror the defensive hasattr pattern from device_is_bf16_supported in device_empty_cache so that a torch.xpu module that exposes is_available but not empty_cache (custom or partial XPU build) does not raise AttributeError when the active backend cache is flushed.

danielhanchen · 2026-05-04T02:47:58Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces cross-platform device utilities, including device_empty_cache and device_is_bf16_supported, to support hardware backends like Intel XPU alongside CUDA and HIP. These utilities are integrated into llama_cpp.py and saving_utils.py to replace hardcoded CUDA calls. Additionally, a new _active_merge_device helper was added to manage device selection during LoRA merging. Feedback suggests simplifying the _active_merge_device function for better conciseness.

gemini-code-assist · 2026-05-04T02:49:54Z

+    if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH:
+        return W.device
+    if W.device.index is None:
+        return torch.device(DEVICE_TYPE_TORCH)
+    return torch.device(DEVICE_TYPE_TORCH, W.device.index)


The _active_merge_device helper can be simplified. Since torch.device objects always have type and index attributes, and the constructor handles index=None gracefully (defaulting to the current device), the logic can be more concise while remaining safe.

Suggested change

if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH:

return W.device

if W.device.index is None:

return torch.device(DEVICE_TYPE_TORCH)

return torch.device(DEVICE_TYPE_TORCH, W.device.index)

def _active_merge_device(W):

if W.device.type == DEVICE_TYPE_TORCH:

return W.device

return torch.device(DEVICE_TYPE_TORCH, index = W.device.index)

pass

Mirror the defensive hasattr pattern already applied to device_empty_cache and device_is_bf16_supported so a torch.xpu module that exposes is_available but not synchronize (custom or partial XPU build) does not raise AttributeError when device_synchronize is invoked from the GGUF merge path.

danielhanchen · 2026-05-04T02:59:59Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces cross-platform hardware abstraction for device synchronization, cache clearing, and BF16 support detection, extending compatibility to XPU and HIP devices alongside CUDA. Key changes include the addition of device_empty_cache and device_is_bf16_supported wrappers in device_type.py, and the replacement of hardcoded 'cuda' references with a dynamic device selection helper in saving_utils.py. I have no feedback to provide.

danielhanchen · 2026-05-04T04:06:54Z

Fixes pushed to unslothai#615.

andomeder and others added 3 commits April 28, 2026 21:31

fix: use backend device type in GGUF merge path

91cde98

fix: preserve active device index in GGUF merge path

1b90f79

Merge remote-tracking branch 'origin/main' into staging branch

93f1a8c

gemini-code-assist Bot reviewed May 4, 2026

View reviewed changes

danielhanchen force-pushed the pr-615-head branch from 37c0b4f to e08c1df Compare May 4, 2026 03:21

danielhanchen closed this May 4, 2026

Conversation

danielhanchen commented May 4, 2026

Staging mirror of unslothai#615

Original description

Uh oh!

danielhanchen commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

danielhanchen commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

danielhanchen commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

danielhanchen commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants