[FIX] Move loss and n_items to logits device in fast_cross_entropy_loss loss for multi-GPU support#4063

Merged: danielhanchen merged 2 commits into unslothai:main from nole69:fix/cross-entropy-device-mismatch-contd on Feb 15, 2026

Conversation

@nole69 nole69 commented Feb 15, 2026

Continuation of @devchilll's PR #4059. PR #4059 resolves the RuntimeError at the masked_fill_ call in the chunked cross-entropy forward path, but a similar device-mismatch error still occurs downstream at the return loss.sum() / n_items call.

Fix: explicitly move loss and n_items to the same device as logits at the end of fast_cross_entropy_loss.

Completes the fix for #4041.
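
For readers who want to see the idea in isolation, here is a minimal sketch of the final reduction with both operands aligned to the logits device. `reduce_loss` and its `per_token_loss` argument are hypothetical names used only for illustration. This is not the body of fast_cross_entropy_loss. `torch.as_tensor` is used so that a plain Python number is also accepted, which is a choice made for this sketch rather than something the PR states.

```python
import torch

def reduce_loss(per_token_loss, logits, labels, n_items=None):
    # Hypothetical helper for illustration; not the unsloth implementation.
    device = logits.device
    if n_items is None:
        # Count the tokens that are not masked out with the ignore index -100.
        n_items = torch.count_nonzero(labels != -100)
    # Align both operands with the logits device before dividing.
    per_token_loss = per_token_loss.to(device)
    # as_tensor accepts tensors and Python numbers alike (assumption of this sketch).
    n_items = torch.as_tensor(n_items, device=device)
    return per_token_loss.sum() / n_items

logits = torch.randn(4, 8)
labels = torch.tensor([5, 7, 2, -100])
per_token_loss = torch.tensor([1.0, 2.0, 3.0, 0.0])
result = reduce_loss(per_token_loss, logits, labels)
print(result)
```

When a tensor already lives on the target device with the same dtype, `Tensor.to` returns the tensor itself, so the extra transfer should cost nothing in the single-GPU case.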

@gemini-code-assist

Summary of Changes

Hello @nole69, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical multi-GPU runtime error in the fast_cross_entropy_loss function. By explicitly moving the calculated loss and n_items to the correct device, it resolves an issue where these tensors could reside on different devices, leading to failures during the final summation and division. This enhances the robustness of the loss calculation for distributed training setups.

Changelog
  • unsloth/kernels/cross_entropy_loss.py
    • Captured the device of the logits tensor.
    • Explicitly moved the loss and n_items tensors to the logits device before returning the final calculated loss.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The pull request correctly identifies and addresses a device mismatch issue in multi-GPU environments when calculating the final cross-entropy loss. However, the current implementation will crash if n_items is passed as a Python integer (which occurs in recent Transformers versions). Additionally, moving loss to the device is redundant as it is already allocated on the correct device within the kernel. The suggested change ensures the code is robust for both tensor and scalar inputs.

Comment thread unsloth/kernels/cross_entropy_loss.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74a349e12f

Comment thread unsloth/kernels/cross_entropy_loss.py Outdated
    if n_items is None:
        n_items = torch.count_nonzero(labels != -100)
    loss = loss.to(device)
    n_items = n_items.to(device)

P1: Handle scalar n_items before device transfer

fast_cross_entropy_loss now unconditionally calls n_items.to(device), but callers pass n_items straight from kwargs (unsloth/models/llama.py:1507-1515, unsloth/models/mistral.py:366-373) and that value can be a plain Python scalar from trainer plumbing; in that case this line raises AttributeError: 'int' object has no attribute 'to' and training fails before computing loss. This is a regression from the previous behavior, which accepted numeric n_items values without requiring tensor methods.
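
One way to address this, sketched under the assumption that n_items may arrive either as a tensor or as a plain Python number, is to guard the transfer with a type check. `to_device_if_tensor` is a hypothetical helper name and is not taken from the merged change. The snippet also reproduces the AttributeError described above.

```python
import torch

def to_device_if_tensor(value, device):
    # Hypothetical helper: only tensors have .to(); Python numbers pass through unchanged.
    if isinstance(value, torch.Tensor):
        return value.to(device)
    return value

loss = torch.tensor([1.0, 2.0, 3.0])
device = loss.device

# An unconditional transfer fails for a plain int, as the review points out.
try:
    (3).to(device)
    failed = False
except AttributeError:
    failed = True

# With the guard, both input kinds produce the same mean loss.
scalar_result = loss.sum() / to_device_if_tensor(3, device)
tensor_result = loss.sum() / to_device_if_tensor(torch.tensor(3), device)
print(failed, scalar_result, tensor_result)
```

Dividing a tensor by a Python number needs no device handling at all, which is why the scalar branch can simply return the value untouched.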

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@danielhanchen

Thanks

@danielhanchen danielhanchen merged commit 56c8f96 into unslothai:main Feb 15, 2026
1 check passed
abiswas-realadvice pushed a commit to abiswas-realadvice/unsloth that referenced this pull request May 14, 2026
…ss loss for multi-GPU support (unslothai#4063)

* bug fix for multi-GPU

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>