Ensure device mesh patching is applied #2842

Merged
djsaunde merged 4 commits into main from model-load-fix
Jun 30, 2025
Conversation

@djsaunde

@djsaunde djsaunde commented Jun 27, 2025

Collaborator

Description

Our accelerate patching to enable sequence parallelism (SP) seems to have broken recently; this fixes it. I am also more explicit about using FSDP when it is enabled.

How has this been tested?

Confirmed working with a 2 x 2 FSDP x SP device mesh on 4x H100 SXM. It also works with various optimizations (Liger optimizations, CCE).
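
For readers unfamiliar with a 2 x 2 mesh, here is a minimal sketch of how 4 ranks could be arranged into FSDP and SP groups. It assumes a row-major layout with SP as the inner dimension; that ordering and the helper name `mesh_coordinates` are illustrative assumptions, not something this PR confirms.

```python
def mesh_coordinates(world_size: int, sp_degree: int) -> dict:
    """Map each rank to (fsdp_index, sp_index), assuming a row-major 2D mesh.

    Hypothetical helper for illustration only; not part of axolotl.
    """
    if world_size % sp_degree != 0:
        raise ValueError("world_size must be divisible by sp_degree")
    # divmod(rank, sp_degree) == (rank // sp_degree, rank % sp_degree)
    return {rank: divmod(rank, sp_degree) for rank in range(world_size)}


coords = mesh_coordinates(world_size=4, sp_degree=2)
fsdp_degree = 4 // 2

# Ranks that share an FSDP index form one sequence-parallel group.
sp_groups = [
    [rank for rank, (fsdp_idx, _) in coords.items() if fsdp_idx == i]
    for i in range(fsdp_degree)
]
# Ranks that share an SP index form one FSDP (sharding) group.
fsdp_groups = [
    [rank for rank, (_, sp_idx) in coords.items() if sp_idx == j]
    for j in range(2)
]
```

Under this assumed layout, each sequence is split across the two ranks of an SP group, while model shards are spread across the two ranks of an FSDP group.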

Example config (modified from user's): https://gist.github.com/djsaunde/aca6285273cf9d476e69baa1cdcab6c7. This uses ~29GB VRAM per GPU.

I'll reproduce the training losses and post them here as well.

Summary by CodeRabbit

  • New Features

    • Improved support for sequence parallelism by automatically applying relevant patches when enabled in the configuration.
  • Bug Fixes

    • Corrected a typo in documentation for data loader patching.
  • Refactor

    • Simplified and updated the handling of device mesh dimensions for better alignment with PyTorch naming conventions.
    • Relocated sequence parallel patch application logic to ensure it is consistently applied during model loading.
    • Removed redundant patching during ring attention registration for cleaner integration.
  • Tests

    • Updated tests to load configuration from disk files, enhancing test reliability and realism.

@djsaunde djsaunde self-assigned this Jun 27, 2025
@coderabbitai

coderabbitai Bot commented Jun 27, 2025

Contributor

Walkthrough

The sequence parallel patching logic has been moved from the SequenceParallelContextManager to a new private method within the PatchManager class. The patching functions are now invoked during the pre-model load patch phase. Additionally, the patching utilities were updated to support an FSDP flag and improve dynamic patching, while redundant patch calls were removed from the context manager. Tests were updated to load configurations from YAML files instead of using in-memory dictionaries.

Changes

| File(s) | Change Summary |
| --- | --- |
| src/axolotl/loaders/patch_manager.py | Added _apply_sequence_parallel_patches to handle sequence parallel patching in PatchManager. |
| src/axolotl/monkeypatch/ring_attn/patch.py | Improved patching logic, added fsdp parameter, refined dynamic code execution and patching. |
| src/axolotl/utils/ctx_managers/sequence_parallel.py | Removed imports and calls to patching functions from SequenceParallelContextManager. |
| tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py | Updated tests to load configuration from YAML files using temp_dir fixture instead of in-memory. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PatchManager
    participant PatchUtils

    User->>PatchManager: apply_pre_model_load_patches(config)
    PatchManager->>PatchManager: _apply_sequence_parallel_patches(config)
    alt sequence_parallel_degree > 1
        PatchManager->>PatchUtils: patch_prepare_data_loader()
        PatchManager->>PatchUtils: patch_prepare_device_mesh(sequence_parallel_degree, fsdp)
    end
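
The flow in the diagram can be sketched in Python roughly as follows. This is a simplified illustration, not the code from the PR: the `Config` dataclass, the stubbed patch functions, and the `calls` list are hypothetical stand-ins, and only the method and function names are taken from the walkthrough above.

```python
from dataclasses import dataclass

calls = []  # records which stubbed patches ran (illustration only)


def patch_prepare_data_loader() -> None:
    # Stub standing in for the real data loader patch.
    calls.append("patch_prepare_data_loader")


def patch_prepare_device_mesh(sequence_parallel_degree: int, fsdp: bool) -> None:
    # Stub standing in for the real device mesh patch.
    calls.append(("patch_prepare_device_mesh", sequence_parallel_degree, fsdp))


@dataclass
class Config:
    # Hypothetical, minimal config; the real config has many more fields.
    sequence_parallel_degree: int = 1
    fsdp: bool = False


class PatchManager:
    def __init__(self, cfg: Config) -> None:
        self.cfg = cfg

    def apply_pre_model_load_patches(self) -> None:
        # SP patches run before the model is loaded, alongside other patches.
        self._apply_sequence_parallel_patches()

    def _apply_sequence_parallel_patches(self) -> None:
        # Only patch when sequence parallelism is actually enabled.
        if self.cfg.sequence_parallel_degree > 1:
            patch_prepare_data_loader()
            patch_prepare_device_mesh(
                self.cfg.sequence_parallel_degree, self.cfg.fsdp
            )


PatchManager(Config(sequence_parallel_degree=2, fsdp=True)).apply_pre_model_load_patches()
```

The point of the relocation is that the patches are applied once during model loading rather than inside the context manager.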

Possibly related PRs

  • axolotl-ai-cloud/axolotl#2699: Refactors and enhances SequenceParallelContextManager by adding ring attention registration and patching, indicating both PRs modify how and where sequence parallel patches and ring attention are applied.

Suggested reviewers

  • djsaunde

Poem

In the warren of code, a patch hops anew,
Sequence parallel magic, now cleanly in view.
No more double dipping in the context’s old den,
PatchManager’s the hero, again and again.
With FSDP flags waving, the rabbits all cheer—
For tidier patches, and logic more clear!
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f1f8d1 and f71a1c7.

📒 Files selected for processing (1)
  • src/axolotl/loaders/patch_manager.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/axolotl/loaders/patch_manager.py
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.7.1)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.5.1)

@codecov

codecov Bot commented Jun 27, 2025


Codecov Report

Attention: Patch coverage is 30.76923% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/monkeypatch/ring_attn/patch.py | 14.28% | 6 Missing ⚠️ |
| src/axolotl/loaders/patch_manager.py | 50.00% | 3 Missing ⚠️ |


@winglian winglian left a comment

Collaborator

:shipit:

@winglian

Collaborator

Looks like you need a guard in the _apply_sequence_parallel_patches function in case sequence_parallel_degree is None.
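
A minimal sketch of such a guard, assuming the degree arrives as an optional integer. The helper name `sequence_parallel_enabled` is hypothetical, and the actual fix in the PR may look different.

```python
from typing import Optional


def sequence_parallel_enabled(sequence_parallel_degree: Optional[int]) -> bool:
    """Return True only when SP is configured with a degree greater than 1.

    In Python 3, comparing None with an int (`None > 1`) raises TypeError,
    so the None check has to come first.
    """
    return sequence_parallel_degree is not None and sequence_parallel_degree > 1
```

With a check like this, an unset degree is treated the same as a degree of 1, and the patches are skipped.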

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (1)

429-435: Consider the implications of introducing the config loading pipeline into kernel patching tests.

While this change aligns with testing the actual usage pattern, it significantly changes the test scope. The load_cfg function performs extensive validation, normalization, plugin preparation, and environment setup that wasn't part of the original test. This makes the test less isolated and introduces dependencies on the entire configuration loading pipeline.

Consider whether these tests should remain focused on kernel patching functionality specifically, or if the broader config loading integration should be tested separately.

Additionally, add error handling for the file operations:

 # Write cfg to yaml file
 path = Path(temp_dir) / "config.yaml"
-with open(path, "w", encoding="utf-8") as fout:
-    fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+try:
+    with open(path, "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+except Exception as e:
+    pytest.fail(f"Failed to write test config: {e}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f14858 and 4f1f8d1.

📒 Files selected for processing (4)
  • src/axolotl/loaders/patch_manager.py (2 hunks)
  • src/axolotl/monkeypatch/ring_attn/patch.py (3 hunks)
  • src/axolotl/utils/ctx_managers/sequence_parallel.py (0 hunks)
  • tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (4 hunks)
💤 Files with no reviewable changes (1)
  • src/axolotl/utils/ctx_managers/sequence_parallel.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/axolotl/monkeypatch/ring_attn/patch.py
  • src/axolotl/loaders/patch_manager.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (1)
src/axolotl/cli/config.py (1)
  • load_cfg (164-249)
🔇 Additional comments (2)
tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py (2)

399-399: Function signature change looks appropriate.

The addition of the temp_dir fixture parameter aligns with the new file-based configuration approach.


516-516: Function signature change looks appropriate.

The addition of the temp_dir fixture parameter is consistent with the first modified function.

Comment thread tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py
Comment thread src/axolotl/loaders/patch_manager.py
@djsaunde djsaunde merged commit 35fdbce into main Jun 30, 2025
8 of 9 checks passed
@djsaunde djsaunde deleted the model-load-fix branch June 30, 2025 02:16
@coderabbitai coderabbitai Bot mentioned this pull request Jul 30, 2025

Labels

None yet

Projects

None yet

3 participants