Skip to content

fix: use backend device type in GGUF merge path#14

Closed
danielhanchen wants to merge 7 commits into
mainfrom
pr-615-head
Closed

fix: use backend device type in GGUF merge path#14
danielhanchen wants to merge 7 commits into
mainfrom
pr-615-head

Conversation

@danielhanchen

Copy link
Copy Markdown
Collaborator

Staging mirror of unslothai#615

Original PR: unslothai#615
Author: andomeder

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Use DEVICE_TYPE / DEVICE_TYPE_TORCH in unsloth_zoo/saving_utils.py instead of assuming CUDA directly in the GGUF merge/export path.

This updates:

  • _merge_lora to move tensors with torch.device(DEVICE_TYPE_TORCH, index)
  • post-merge cache clearing to use the active backend
  • Tesla T4 detection to run only on DEVICE_TYPE == "cuda"

This was failing on Intel XPU during export with:

Torch not compiled with CUDA enabled

@danielhanchen

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors unsloth_zoo/saving_utils.py to support non-CUDA backends by replacing hardcoded device strings and cache clearing calls with dynamic logic. Feedback highlights that several instances of torch.cuda.empty_cache() were missed and should be updated, ideally through a new helper function to avoid repetition. Additionally, the device selection logic in _merge_lora can be simplified into a single line using the optional index parameter of the torch.device constructor.

Comment thread unsloth_zoo/saving_utils.py Outdated
Comment on lines +780 to +783
if DEVICE_TYPE_TORCH == "cuda":
torch.cuda.empty_cache()
elif DEVICE_TYPE_TORCH == "xpu":
torch.xpu.empty_cache()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While this change correctly introduces backend-aware cache clearing for this specific line, the PR appears to be incomplete as it misses several other hardcoded torch.cuda.empty_cache() calls in this file that will still cause failures on non-CUDA backends like XPU. Please update the following locations as well: Line 632 (inside the same function), Line 812, Line 837, Line 889, Line 1558 (inside _merge_and_overwrite_lora_mxfp4), and Line 2276 (inside merge_and_overwrite_lora). Consider adding a device_empty_cache() helper in unsloth_zoo/device_type.py (similar to device_synchronize()) to avoid repeating this logic.

Comment thread unsloth_zoo/saving_utils.py Outdated
Comment on lines +162 to +165
if W.device.index is None:
device = torch.device(DEVICE_TYPE_TORCH)
else:
device = torch.device(DEVICE_TYPE_TORCH, W.device.index)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The device selection logic can be simplified into a single line. The torch.device constructor accepts an optional index, and W.device.index will correctly provide the current index (or None if the tensor is on CPU), which matches the intended behavior of the current if/else block.

    device = torch.device(DEVICE_TYPE_TORCH, index = W.device.index)

…ackend

Add device_empty_cache() helper in device_type.py alongside the existing
device_synchronize(), and route every torch.cuda.empty_cache() / .synchronize()
call in saving_utils.py through these helpers so XPU and HIP builds no longer
crash or silently no-op during GGUF export. Concretely, this fixes:

- Unguarded torch.cuda.empty_cache() in the outer shard loop of
  merge_and_overwrite_lora and inside _merge_and_overwrite_lora_mxfp4, both
  of which raise "Torch not compiled with CUDA enabled" on XPU after the
  first shard / mxfp4 tensor is processed.
- Six guarded torch.cuda.empty_cache() / .synchronize() sites inside
  _merge_and_overwrite_lora and _merge_and_overwrite_lora_mxfp4 that
  silently no-op on XPU, leaving XPU VRAM unflushed mid-export.

Add a private _active_merge_device(W) helper that returns W.device when W is
already on the active backend, otherwise constructs torch.device(
DEVICE_TYPE_TORCH[, index]). Route _merge_lora and the five MoE expert merge
helpers (_merge_moe_gate_expert, _merge_moe_up_expert,
_merge_moe_down_proj_expert, _merge_moe_fused_gate_up_expert,
_merge_moe_fused_down_proj_expert) through it so MoE LoRA merges run on the
active accelerator instead of silently falling back to CPU on XPU.

CUDA/HIP behavior is unchanged because DEVICE_TYPE_TORCH equals "cuda" for
both backends and device_empty_cache() preserves the existing
torch.cuda.is_available() guard.
@danielhanchen

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cross-platform support for device cache management and synchronization by adding the device_empty_cache function and refactoring saving_utils.py to use it alongside device_synchronize. It also implements an _active_merge_device helper to dynamically determine the appropriate device for LoRA merging and adds a safety check for CUDA-specific calls to prevent runtime errors on other backends. I have no feedback to provide.

@danielhanchen

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cross-platform support for device-specific operations by abstracting CUDA-specific calls into generic functions like device_empty_cache and device_synchronize. It also adds a helper function, _active_merge_device, to dynamically determine the correct torch device for tensor operations, replacing hardcoded "cuda" strings. Feedback was provided regarding the robustness of the device type comparison in _active_merge_device, specifically noting that ROCm/HIP systems might report device types as "hip", which could bypass the current optimization check.

Comment on lines +160 to +166
def _active_merge_device(W):
if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH:
return W.device
if W.device.index is None:
return torch.device(DEVICE_TYPE_TORCH)
return torch.device(DEVICE_TYPE_TORCH, W.device.index)
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The _active_merge_device helper is a great addition for cross-platform support. However, consider that W.device.type on ROCm/HIP systems might sometimes be reported as "hip" depending on the PyTorch version, while DEVICE_TYPE_TORCH is explicitly set to "cuda" for HIP in device_type.py. While torch.device("cuda") works on HIP, the direct string comparison in line 161 might fail to recognize an existing HIP device as the target, causing an unnecessary torch.device object creation and potential move (though to() is usually a no-op if the device matches). Given the current DEVICE_TYPE_TORCH logic, this is likely safe, but worth noting if "hip" device types start appearing in W.device.type.

Add device_is_bf16_supported() to device_type.py alongside the existing
device_synchronize() and device_empty_cache() helpers, and route the three
torch.cuda.is_bf16_supported() callsites in llama_cpp.py's convert_to_gguf
mmproj/outtype branches through it. On XPU torch builds these calls would
otherwise raise "Torch not compiled with CUDA enabled" during VLM GGUF
export, mirroring the same crash class fixed in saving_utils.py.

CUDA and HIP behavior is unchanged (DEVICE_TYPE in ("cuda","hip") -> the
helper returns torch.cuda.is_bf16_supported() exactly as before).
@danielhanchen

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cross-platform device management utilities to support non-CUDA backends such as XPU and HIP. It adds device_empty_cache and device_is_bf16_supported functions in device_type.py and refactors llama_cpp.py and saving_utils.py to use these abstractions instead of direct torch.cuda calls. Additionally, a helper function _active_merge_device was introduced in saving_utils.py to dynamically determine the appropriate device for LoRA merging operations. I have no feedback to provide as there were no review comments.

Mirror the defensive hasattr pattern from device_is_bf16_supported in
device_empty_cache so that a torch.xpu module that exposes is_available
but not empty_cache (custom or partial XPU build) does not raise
AttributeError when the active backend cache is flushed.
@danielhanchen

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cross-platform device utilities, including device_empty_cache and device_is_bf16_supported, to support hardware backends like Intel XPU alongside CUDA and HIP. These utilities are integrated into llama_cpp.py and saving_utils.py to replace hardcoded CUDA calls. Additionally, a new _active_merge_device helper was added to manage device selection during LoRA merging. Feedback suggests simplifying the _active_merge_device function for better conciseness.

Comment on lines +161 to +165
if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH:
return W.device
if W.device.index is None:
return torch.device(DEVICE_TYPE_TORCH)
return torch.device(DEVICE_TYPE_TORCH, W.device.index)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The _active_merge_device helper can be simplified. Since torch.device objects always have type and index attributes, and the constructor handles index=None gracefully (defaulting to the current device), the logic can be more concise while remaining safe.

Suggested change
if getattr(W.device, "type", None) == DEVICE_TYPE_TORCH:
return W.device
if W.device.index is None:
return torch.device(DEVICE_TYPE_TORCH)
return torch.device(DEVICE_TYPE_TORCH, W.device.index)
def _active_merge_device(W):
if W.device.type == DEVICE_TYPE_TORCH:
return W.device
return torch.device(DEVICE_TYPE_TORCH, index = W.device.index)
pass

Mirror the defensive hasattr pattern already applied to device_empty_cache
and device_is_bf16_supported so a torch.xpu module that exposes is_available
but not synchronize (custom or partial XPU build) does not raise
AttributeError when device_synchronize is invoked from the GGUF merge path.
@danielhanchen

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cross-platform hardware abstraction for device synchronization, cache clearing, and BF16 support detection, extending compatibility to XPU and HIP devices alongside CUDA. Key changes include the addition of device_empty_cache and device_is_bf16_supported wrappers in device_type.py, and the replacement of hardcoded 'cuda' references with a dynamic device selection helper in saving_utils.py. I have no feedback to provide.

@danielhanchen

Copy link
Copy Markdown
Collaborator Author

Fixes pushed to unslothai#615.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants