Ensure device mesh patching is applied by djsaunde · Pull Request #2842 · axolotl-ai-cloud/axolotl

djsaunde · 2025-06-27T21:22:32Z

Description

Our accelerate patching to enable sequence parallelism (SP) seems to have broken recently; this fixes it. I am also more explicit about using FSDP when it is enabled.

How has this been tested?

Confirmed working with a 2 x 2 FSDP x SP device mesh on 4x H100 SXM. Also works with various optimizations (Liger optimizations, CCE).
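For readers unfamiliar with 2D meshes, here is a minimal sketch of how 4 ranks could be arranged into a 2 x 2 FSDP x SP grid. It is pure Python and does not call PyTorch or axolotl. The row-major layout, the choice of FSDP as the outer dimension, and the helper name `mesh_coords` are assumptions for illustration, not something this PR confirms.

```python
from itertools import product


def mesh_coords(world_size: int, fsdp_size: int, sp_size: int) -> dict:
    """Map each global rank to (fsdp_index, sp_index).

    Assumes a row-major layout with FSDP as the outer dimension and SP as
    the inner dimension (an assumption made for this illustration).
    """
    if fsdp_size * sp_size != world_size:
        raise ValueError("mesh dimensions must multiply to the world size")
    return {
        fsdp_idx * sp_size + sp_idx: (fsdp_idx, sp_idx)
        for fsdp_idx, sp_idx in product(range(fsdp_size), range(sp_size))
    }


coords = mesh_coords(world_size=4, fsdp_size=2, sp_size=2)

# Ranks that share an FSDP index vary along the SP dimension: one SP group each.
sp_groups = [
    [rank for rank, (f, _) in coords.items() if f == i] for i in range(2)
]
# Ranks that share an SP index vary along the FSDP dimension: one FSDP group each.
fsdp_groups = [
    [rank for rank, (_, s) in coords.items() if s == j] for j in range(2)
]

print("SP groups:", sp_groups)
print("FSDP groups:", fsdp_groups)
```

Under these assumptions, each GPU belongs to exactly one SP group and one FSDP group, which is the structure the device mesh patch has to set up.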

Example config (modified from user's): https://gist.github.com/djsaunde/aca6285273cf9d476e69baa1cdcab6c7. This uses ~29GB VRAM per GPU.

I'll repro training losses and post here as well.

Summary by CodeRabbit

New Features
- Improved support for sequence parallelism by automatically applying relevant patches when enabled in the configuration.
Bug Fixes
- Corrected a typo in documentation for data loader patching.
Refactor
- Simplified and updated the handling of device mesh dimensions for better alignment with PyTorch naming conventions.
- Relocated sequence parallel patch application logic to ensure it is consistently applied during model loading.
- Removed redundant patching during ring attention registration for cleaner integration.
Tests
- Updated tests to load configuration from disk files, enhancing test reliability and realism.

coderabbitai · 2025-06-27T21:22:57Z

Walkthrough

The sequence parallel patching logic has been moved from the SequenceParallelContextManager to a new private method within the PatchManager class. The patching functions are now invoked during the pre-model load patch phase. Additionally, the patching utilities were updated to support an FSDP flag and improve dynamic patching, while redundant patch calls were removed from the context manager. Tests were updated to load configurations from YAML files instead of using in-memory dictionaries.

Changes

| File(s) | Change Summary |
| --- | --- |
| src/axolotl/loaders/patch_manager.py | Added `_apply_sequence_parallel_patches` to handle sequence parallel patching in `PatchManager`. |
| src/axolotl/monkeypatch/ring_attn/patch.py | Improved patching logic, added `fsdp` parameter, refined dynamic code execution and patching. |
| src/axolotl/utils/ctx_managers/sequence_parallel.py | Removed imports and calls to patching functions from `SequenceParallelContextManager`. |
| tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py | Updated tests to load configuration from YAML files using `temp_dir` fixture instead of in-memory. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PatchManager
    participant PatchUtils

    User->>PatchManager: apply_pre_model_load_patches(config)
    PatchManager->>PatchManager: _apply_sequence_parallel_patches(config)
    alt sequence_parallel_degree > 1
        PatchManager->>PatchUtils: patch_prepare_data_loader()
        PatchManager->>PatchUtils: patch_prepare_device_mesh(sequence_parallel_degree, fsdp)
    end
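The flow in the diagram can be sketched roughly as follows. This is a simplified stand-in, not the code from this PR: `Config`, the boolean `fsdp` field, and the `applied` list are hypothetical, and the two patch functions are only recorded by name rather than called.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Config:
    """Hypothetical stand-in for the training config."""

    sequence_parallel_degree: Optional[int] = None
    fsdp: bool = False  # assumption: reduced to a simple flag for the sketch


@dataclass
class PatchManager:
    """Simplified stand-in that records which patches would be applied."""

    cfg: Config
    applied: list = field(default_factory=list)

    def apply_pre_model_load_patches(self) -> None:
        # Runs before the model is loaded, so the patches are in place
        # regardless of whether the SP context manager is entered later.
        self._apply_sequence_parallel_patches()

    def _apply_sequence_parallel_patches(self) -> None:
        degree = self.cfg.sequence_parallel_degree
        # Only patch when SP is configured with a degree greater than 1.
        if degree is not None and degree > 1:
            self.applied.append("patch_prepare_data_loader")
            self.applied.append(
                f"patch_prepare_device_mesh({degree}, fsdp={self.cfg.fsdp})"
            )


manager = PatchManager(Config(sequence_parallel_degree=2, fsdp=True))
manager.apply_pre_model_load_patches()
print(manager.applied)

skipped = PatchManager(Config())  # SP not configured: nothing is patched
skipped.apply_pre_model_load_patches()
print(skipped.applied)
```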

Possibly related PRs

axolotl-ai-cloud/axolotl#2699: Refactors and enhances SequenceParallelContextManager by adding ring attention registration and patching, indicating both PRs modify how and where sequence parallel patches and ring attention are applied.

Suggested reviewers

djsaunde

Poem

In the warren of code, a patch hops anew,
Sequence parallel magic, now cleanly in view.
No more double dipping in the context’s old den,
PatchManager’s the hero, again and again.
With FSDP flags waving, the rabbits all cheer—
For tidier patches, and logic more clear!
🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f1f8d1 and f71a1c7.

📒 Files selected for processing (1)

src/axolotl/loaders/patch_manager.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/axolotl/loaders/patch_manager.py

⏰ Context from checks skipped due to timeout of 90000ms (6)

GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.7.1)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.5.1)


codecov · 2025-06-27T21:30:30Z

Codecov Report

Attention: Patch coverage is 30.76923% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/monkeypatch/ring_attn/patch.py | 14.28% | 6 Missing ⚠️ |
| src/axolotl/loaders/patch_manager.py | 50.00% | 3 Missing ⚠️ |
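As a sanity check on the numbers above, the per-file figures can be combined to reproduce the overall patch coverage. The changed-line totals per file are not stated in the report; they are inferred here from the reported percentages and missing-line counts, and the helper `total_lines` is hypothetical.

```python
def total_lines(missing: int, patch_pct: float) -> int:
    """Infer the number of changed lines from missing lines and patch coverage."""
    return round(missing / (1 - patch_pct / 100))


# (missing lines, reported patch coverage in percent), copied from the report
files = {
    "src/axolotl/monkeypatch/ring_attn/patch.py": (6, 14.28),
    "src/axolotl/loaders/patch_manager.py": (3, 50.00),
}

totals = {path: total_lines(m, pct) for path, (m, pct) in files.items()}
missing = sum(m for m, _ in files.values())
changed = sum(totals.values())
overall = 100 * (changed - missing) / changed

print(totals)
print(f"{changed - missing} of {changed} changed lines covered: {overall:.5f}%")
```

Under that inference, 4 of 13 changed lines are covered, which matches the reported 30.76923% with 9 lines missing.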

winglian · 2025-06-27T22:44:12Z

Looks like you need a guard in the `_apply_sequence_parallel_patches` function in case `sequence_parallel_degree` is `None`.
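A minimal illustration of why the guard matters: in Python 3, comparing `None` to an integer raises a `TypeError`. The helper name `needs_sp_patches` is hypothetical and is not the code added in this PR.

```python
from typing import Optional


def needs_sp_patches(sequence_parallel_degree: Optional[int]) -> bool:
    """Return True only when SP is configured with a degree greater than 1."""
    return sequence_parallel_degree is not None and sequence_parallel_degree > 1


# Without a guard, an unset degree breaks the comparison itself.
degree = None
try:
    degree > 1
    unguarded_error = None
except TypeError as exc:
    unguarded_error = type(exc).__name__

print("unguarded comparison raised:", unguarded_error)
print("guarded results:", [needs_sp_patches(d) for d in (None, 1, 2)])
```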

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (1)
429-435: Consider the implications of introducing config loading pipeline into kernel patching tests.

While this change aligns with testing the actual usage pattern, it significantly changes the test scope. The load_cfg function performs extensive validation, normalization, plugin preparation, and environment setup that wasn't part of the original test. This makes the test less isolated and introduces dependencies on the entire configuration loading pipeline.

Consider whether these tests should remain focused on kernel patching functionality specifically, or if the broader config loading integration should be tested separately.

Additionally, add error handling for the file operations:
 # Write cfg to yaml file
 path = Path(temp_dir) / "config.yaml"
-with open(path, "w", encoding="utf-8") as fout:
-    fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+try:
+    with open(path, "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+except Exception as e:
+    pytest.fail(f"Failed to write test config: {e}")

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f14858 and 4f1f8d1.

📒 Files selected for processing (4)

src/axolotl/loaders/patch_manager.py (2 hunks)
src/axolotl/monkeypatch/ring_attn/patch.py (3 hunks)
src/axolotl/utils/ctx_managers/sequence_parallel.py (0 hunks)
tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (4 hunks)

💤 Files with no reviewable changes (1)

src/axolotl/utils/ctx_managers/sequence_parallel.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/axolotl/monkeypatch/ring_attn/patch.py
src/axolotl/loaders/patch_manager.py

🧰 Additional context used

🧬 Code Graph Analysis (1)

tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (1)

src/axolotl/cli/config.py (1)

load_cfg (164-249)

🔇 Additional comments (2)

tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (2)

399-399: Function signature change looks appropriate.

The addition of the temp_dir fixture parameter aligns with the new file-based configuration approach.

516-516: Function signature change looks appropriate.

The addition of the temp_dir fixture parameter is consistent with the first modified function.

djsaunde requested review from salmanmohammadi and winglian June 27, 2025 21:22

djsaunde self-assigned this Jun 27, 2025

winglian approved these changes Jun 27, 2025

salmanmohammadi approved these changes Jun 27, 2025

djsaunde added 2 commits June 28, 2025 09:52

move patches; make patch stronger

eecd258

fix broken tests

4f1f8d1

djsaunde force-pushed the model-load-fix branch from 4f14858 to 4f1f8d1 Compare June 28, 2025 14:01

coderabbitai Bot reviewed Jun 28, 2025

Comment thread tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py

guard sequence_parallel_degree comparison against none

f71a1c7

djsaunde commented Jun 28, 2025

Comment thread src/axolotl/loaders/patch_manager.py

Merge branch 'main' into model-load-fix

3fa9516

djsaunde merged commit 35fdbce into main Jun 30, 2025
8 of 9 checks passed

djsaunde deleted the model-load-fix branch June 30, 2025 02:16

coderabbitai Bot mentioned this pull request Jul 30, 2025

Distributed/ND-Parallel #2977

Merged

coderabbitai Bot mentioned this pull request Aug 22, 2025

make multipack sampler patch explicit #3096

Merged

coderabbitai Bot mentioned this pull request May 8, 2026

fix: make prepare_context_parallel_inputs no-op #3520

Merged

djsaunde merged 4 commits into main from model-load-fix

djsaunde commented Jun 27, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 27, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

codecov Bot commented Jun 27, 2025

Uh oh!

winglian left a comment

Uh oh!

winglian commented Jun 27, 2025

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

djsaunde commented Jun 27, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

codecov Bot commented Jun 27, 2025

Codecov Report

Uh oh!

winglian left a comment

Choose a reason for hiding this comment

Uh oh!

winglian commented Jun 27, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

djsaunde commented Jun 27, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 27, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)