fix: use backend device type in GGUF merge path#14
Conversation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors unsloth_zoo/saving_utils.py to support non-CUDA backends by replacing hardcoded device strings and cache clearing calls with dynamic logic. Feedback highlights that several instances of torch.cuda.empty_cache() were missed and should be updated, ideally through a new helper function to avoid repetition. Additionally, the device selection logic in _merge_lora can be simplified into a single line using the optional index parameter of the torch.device constructor.
| if DEVICE_TYPE_TORCH == "cuda": | ||
| torch.cuda.empty_cache() | ||
| elif DEVICE_TYPE_TORCH == "xpu": | ||
| torch.xpu.empty_cache() |
There was a problem hiding this comment.
While this change correctly introduces backend-aware cache clearing for this specific line, the PR appears to be incomplete as it misses several other hardcoded torch.cuda.empty_cache() calls in this file that will still cause failures on non-CUDA backends like XPU. Please update the following locations as well: Line 632 (inside the same function), Line 812, Line 837, Line 889, Line 1558 (inside _merge_and_overwrite_lora_mxfp4), and Line 2276 (inside merge_and_overwrite_lora). Consider adding a device_empty_cache() helper in unsloth_zoo/device_type.py (similar to device_synchronize()) to avoid repeating this logic.
| if W.device.index is None: | ||
| device = torch.device(DEVICE_TYPE_TORCH) | ||
| else: | ||
| device = torch.device(DEVICE_TYPE_TORCH, W.device.index) |
There was a problem hiding this comment.
The device selection logic can be simplified into a single line. The torch.device constructor accepts an optional index, and W.device.index will correctly provide the current index (or None if the tensor is on CPU), which matches the intended behavior of the current if/else block.
device = torch.device(DEVICE_TYPE_TORCH, index = W.device.index)…ackend Add device_empty_cache() helper in device_type.py alongside the existing device_synchronize(), and route every torch.cuda.empty_cache() / .synchronize() call in saving_utils.py through these helpers so XPU and HIP builds no longer crash or silently no-op during GGUF export. Concretely, this fixes: - Unguarded torch.cuda.empty_cache() in the outer shard loop of merge_and_overwrite_lora and inside _merge_and_overwrite_lora_mxfp4, both of which raise "Torch not compiled with CUDA enabled" on XPU after the first shard / mxfp4 tensor is processed. - Six guarded torch.cuda.empty_cache() / .synchronize() sites inside _merge_and_overwrite_lora and _merge_and_overwrite_lora_mxfp4 that silently no-op on XPU, leaving XPU VRAM unflushed mid-export. Add a private _active_merge_device(W) helper that returns W.device when W is already on the active backend, otherwise constructs torch.device( DEVICE_TYPE_TORCH[, index]). Route _merge_lora and the five MoE expert merge helpers (_merge_moe_gate_expert, _merge_moe_up_expert, _merge_moe_down_proj_expert, _merge_moe_fused_gate_up_expert, _merge_moe_fused_down_proj_expert) through it so MoE LoRA merges run on the active accelerator instead of silently falling back to CPU on XPU. CUDA/HIP behavior is unchanged because DEVICE_TYPE_TORCH equals "cuda" for both backends and device_empty_cache() preserves the existing torch.cuda.is_available() guard.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces cross-platform support for device cache management and synchronization by adding the device_empty_cache function and refactoring saving_utils.py to use it alongside device_synchronize. It also implements an _active_merge_device helper to dynamically determine the appropriate device for LoRA merging and adds a safety check for CUDA-specific calls to prevent runtime errors on other backends. I have no feedback to provide.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces cross-platform support for device-specific operations by abstracting CUDA-specific calls into generic functions like device_empty_cache and device_synchronize. It also adds a helper function, _active_merge_device, to dynamically determine the correct torch device for tensor operations, replacing hardcoded "cuda" strings. Feedback was provided regarding the robustness of the device type comparison in _active_merge_device, specifically noting that ROCm/HIP systems might report device types as "hip", which could bypass the current optimization check.
| def _active_merge_device(W): | ||
| if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH: | ||
| return W.device | ||
| if W.device.index is None: | ||
| return torch.device(DEVICE_TYPE_TORCH) | ||
| return torch.device(DEVICE_TYPE_TORCH, W.device.index) | ||
| pass |
There was a problem hiding this comment.
The _active_merge_device helper is a great addition for cross-platform support. However, consider that W.device.type on ROCm/HIP systems might sometimes be reported as "hip" depending on the PyTorch version, while DEVICE_TYPE_TORCH is explicitly set to "cuda" for HIP in device_type.py. While torch.device("cuda") works on HIP, the direct string comparison in line 161 might fail to recognize an existing HIP device as the target, causing an unnecessary torch.device object creation and potential move (though to() is usually a no-op if the device matches). Given the current DEVICE_TYPE_TORCH logic, this is likely safe, but worth noting if "hip" device types start appearing in W.device.type.
Add device_is_bf16_supported() to device_type.py alongside the existing
device_synchronize() and device_empty_cache() helpers, and route the three
torch.cuda.is_bf16_supported() callsites in llama_cpp.py's convert_to_gguf
mmproj/outtype branches through it. On XPU torch builds these calls would
otherwise raise "Torch not compiled with CUDA enabled" during VLM GGUF
export, mirroring the same crash class fixed in saving_utils.py.
CUDA and HIP behavior is unchanged (DEVICE_TYPE in ("cuda","hip") -> the
helper returns torch.cuda.is_bf16_supported() exactly as before).
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces cross-platform device management utilities to support non-CUDA backends such as XPU and HIP. It adds device_empty_cache and device_is_bf16_supported functions in device_type.py and refactors llama_cpp.py and saving_utils.py to use these abstractions instead of direct torch.cuda calls. Additionally, a helper function _active_merge_device was introduced in saving_utils.py to dynamically determine the appropriate device for LoRA merging operations. I have no feedback to provide as there were no review comments.
Mirror the defensive hasattr pattern from device_is_bf16_supported in device_empty_cache so that a torch.xpu module that exposes is_available but not empty_cache (custom or partial XPU build) does not raise AttributeError when the active backend cache is flushed.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces cross-platform device utilities, including device_empty_cache and device_is_bf16_supported, to support hardware backends like Intel XPU alongside CUDA and HIP. These utilities are integrated into llama_cpp.py and saving_utils.py to replace hardcoded CUDA calls. Additionally, a new _active_merge_device helper was added to manage device selection during LoRA merging. Feedback suggests simplifying the _active_merge_device function for better conciseness.
| if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH: | ||
| return W.device | ||
| if W.device.index is None: | ||
| return torch.device(DEVICE_TYPE_TORCH) | ||
| return torch.device(DEVICE_TYPE_TORCH, W.device.index) |
There was a problem hiding this comment.
The _active_merge_device helper can be simplified. Since torch.device objects always have type and index attributes, and the constructor handles index=None gracefully (defaulting to the current device), the logic can be more concise while remaining safe.
| if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH: | |
| return W.device | |
| if W.device.index is None: | |
| return torch.device(DEVICE_TYPE_TORCH) | |
| return torch.device(DEVICE_TYPE_TORCH, W.device.index) | |
| def _active_merge_device(W): | |
| if W.device.type == DEVICE_TYPE_TORCH: | |
| return W.device | |
| return torch.device(DEVICE_TYPE_TORCH, index = W.device.index) | |
| pass |
Mirror the defensive hasattr pattern already applied to device_empty_cache and device_is_bf16_supported so a torch.xpu module that exposes is_available but not synchronize (custom or partial XPU build) does not raise AttributeError when device_synchronize is invoked from the GGUF merge path.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces cross-platform hardware abstraction for device synchronization, cache clearing, and BF16 support detection, extending compatibility to XPU and HIP devices alongside CUDA. Key changes include the addition of device_empty_cache and device_is_bf16_supported wrappers in device_type.py, and the replacement of hardcoded 'cuda' references with a dynamic device selection helper in saving_utils.py. I have no feedback to provide.
|
Fixes pushed to unslothai#615. |
Staging mirror of unslothai#615
Original PR: unslothai#615
Author: andomeder
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description
Use
DEVICE_TYPE/DEVICE_TYPE_TORCHinunsloth_zoo/saving_utils.pyinstead of assuming CUDA directly in the GGUF merge/export path.This updates:
_merge_lorato move tensors withtorch.device(DEVICE_TYPE_TORCH, index)DEVICE_TYPE == "cuda"This was failing on Intel XPU during export with: