Fix multi-node distributed training with single GPU per node #4143

Merged: Datta0 merged 2 commits into unslothai:main from Maxusmusti:fix/multinode-single-gpu-distributed on Mar 3, 2026

Conversation

Contributor

@Maxusmusti Maxusmusti commented Mar 2, 2026

Fixes #4142

When each node has only one visible GPU (DEVICE_COUNT == 1), the _prepare_backend patch incorrectly disables distributed training — even in multi-node setups where WORLD_SIZE > 1.

This adds a WORLD_SIZE check so the patch is only applied when genuinely on a single device in a single-node setup.

Changes

One-line change in unsloth/models/_utils.py:

 -if DEVICE_COUNT == 1:
 +if DEVICE_COUNT == 1 and int(os.environ.get("WORLD_SIZE", "1")) <= 1:

Context

DEVICE_COUNT comes from torch.cuda.device_count(), which only counts locally visible GPUs. In a multi-node setup with one GPU per node (or CUDA_VISIBLE_DEVICES=0), DEVICE_COUNT is 1, but WORLD_SIZE is > 1. Without this fix, accelerate.state.PartialState._prepare_backend is patched to return DistributedType.NO, which prevents distributed communication from initializing.
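
The guard described above can be sketched in isolation. This is a minimal illustration, not the actual unsloth code: `should_disable_distributed`, `StandInState`, and `maybe_patch` are hypothetical names, and the stand-in class only mimics the idea of a backend-selection method being replaced. It does not use accelerate's real API.

```python
import os
from typing import Mapping, Optional


def should_disable_distributed(
    device_count: int,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True only for one visible GPU in a single-process job.

    Mirrors the merged condition:
        DEVICE_COUNT == 1 and int(os.environ.get("WORLD_SIZE", "1")) <= 1
    """
    env = os.environ if environ is None else environ
    world_size = int(env.get("WORLD_SIZE", "1"))
    return device_count == 1 and world_size <= 1


class StandInState:
    """Hypothetical stand-in for the patched class (not accelerate's API)."""

    def prepare_backend(self) -> str:
        return "MULTI_GPU"


def maybe_patch(
    state_cls: type,
    device_count: int,
    environ: Mapping[str, str],
) -> bool:
    """Swap in a no-distributed stub only when the guard allows it."""
    if not should_disable_distributed(device_count, environ):
        return False
    state_cls.prepare_backend = lambda self: "NO"
    return True
```

With one visible GPU and `WORLD_SIZE=2` (one GPU per node across two nodes), the guard returns False and the stand-in method is left untouched. With one visible GPU and no `WORLD_SIZE` set, the stub is applied.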

@gemini-code-assist
Contributor

Summary of Changes

This pull request addresses a critical bug affecting multi-node distributed training configurations, specifically when each node utilizes a single GPU. The change ensures that the system correctly identifies and initializes distributed communication, preventing unintended disabling of distributed training and allowing models to leverage multi-node resources as intended.

Highlights

  • Distributed Training Fix: Corrected an issue where accelerate.state.PartialState._prepare_backend was incorrectly patched to disable distributed training in multi-node setups where each node had only one visible GPU.
  • Conditional Patching Logic: Modified the condition for applying the _prepare_backend patch to include a check for WORLD_SIZE, ensuring it only applies when genuinely on a single device in a single-node environment.

Changelog
  • unsloth/models/_utils.py
    • Imported the os module to access environment variables.
    • Modified the conditional statement for patching _prepare_backend to check both DEVICE_COUNT and WORLD_SIZE.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request correctly fixes an issue in multi-node distributed training setups with a single GPU per node. The change introduces a check for the WORLD_SIZE environment variable, preventing distributed training from being incorrectly disabled. The logic is sound. I've added one comment to suggest removing a redundant import statement for better code quality.

Comment thread unsloth/models/_utils.py Outdated
exec(BitsAndBytesConfig__init__, globals())

if DEVICE_COUNT == 1:
import os
Contributor

medium

The os module is already imported at the top of this file (line 97). This import os statement is redundant and can be removed.

Contributor Author

Good catch, fixed!

Contributor Author

Maxusmusti commented Mar 2, 2026

The fix is tested: in a multi-node, single-GPU setting, training now launches correctly with Data Parallel GPUs = 2 and runs as expected.

@Maxusmusti Maxusmusti force-pushed the fix/multinode-single-gpu-distributed branch from e774993 to 44ccf4d Compare March 2, 2026 21:57
Collaborator

@Datta0 Datta0 left a comment

This does make sense to me and seems quite a simple change

Collaborator

@mmathew23 mmathew23 left a comment

There may be some edge cases where a config sets the default value to an empty string, but we can patch it if that becomes an issue.

@Datta0 Datta0 merged commit 8c7d93d into unslothai:main Mar 3, 2026
1 check passed
@Maxusmusti
Contributor Author

@mmathew23 yeah, that's a fair point, I was also thinking about that earlier. I figured since it's a torchrun override it's very rarely set manually, and when it is, throwing an error on a non-integer value would be good to let the user know their env has a bad value set for world size.

I didn't test the empty-string case though 😅, for some reason I was thinking since it's an env var empty = unset and it'd pick the default, but thinking now I don't think that's true lol. If that becomes an issue someone hits in the future, I am happy to throw up a quick PR to patch that case!
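
For reference, the empty-string case behaves as suspected: when `WORLD_SIZE` is set but empty, `os.environ.get("WORLD_SIZE", "1")` returns `""` rather than the default, and `int("")` raises `ValueError`. A possible follow-up could look like the sketch below. `read_world_size` is a hypothetical helper, not something in unsloth, and treating a blank value as unset is an assumption about the desired behavior.

```python
import os
from typing import Mapping, Optional


def read_world_size(
    environ: Optional[Mapping[str, str]] = None,
    default: int = 1,
) -> int:
    """Parse WORLD_SIZE, treating an unset or blank value as `default`.

    Hypothetical helper. Non-blank, non-integer values still raise
    ValueError so that a bad environment is reported to the user.
    """
    env = os.environ if environ is None else environ
    raw = env.get("WORLD_SIZE", "").strip()
    if raw == "":
        return default
    return int(raw)
```

This keeps the loud failure on values such as `"two"` while letting a blank `WORLD_SIZE` fall back to a single-process default.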

Linked issue: [Bug] Multi-node distributed training broken only in one-GPU-per-node setting (one-line fix)
