[Train] Support custom attention with context parallel #910
base: main
Conversation
Code Review
This pull request introduces support for custom attention mechanisms, specifically FlexAttention and RingAttention, to enable context parallelism with dynamic attention masks. The changes include new modules for these attention mechanisms, modifications to the GPT model builder to incorporate them, and new command-line arguments for configuration.
My review identifies a critical issue that would cause a runtime error, a high-risk dependency on internal PyTorch APIs, and a maintainability concern regarding a magic number. Addressing these points will improve the robustness and maintainability of the new features.
```diff
+        assert HAVE_FA3 or is_fa_min_version(
+            "2.7.3"
+        ), "flash attn verion v2.7.3 and above is required for dynamic batching."
```
The variable HAVE_FA3 is used here but it's not defined or imported, which will lead to a NameError at runtime. It seems you intended to check for the availability of flash-attention v3.
You should define HAVE_FA3 at the top of this file, similar to how it's done in megatron/core/transformer/attention.py, by adding the following try-except block:
```python
try:
    from flash_attn.flash_attn_interface import flash_attn_varlen_func as flash_attn_varlen_func_fa3
    from flash_attn.flash_attn_interface import flash_attn_func as flash_attn_func_fa3
    HAVE_FA3 = True
except ImportError:
    HAVE_FA3 = False
```

```diff
+from torch._higher_order_ops.flex_attention import (
+    sdpa_dense_backward,
+    create_fw_bw_graph,
+)
```
This implementation relies on internal PyTorch APIs (sdpa_dense_backward, create_fw_bw_graph from torch._higher_order_ops.flex_attention). These APIs are not part of the public contract, are undocumented, and can change or be removed without notice in future PyTorch releases. This poses a significant maintenance risk and could break the functionality unexpectedly upon a PyTorch upgrade.
It is highly recommended to either:
- Find an alternative implementation that uses public PyTorch APIs.
- If no alternative exists, add a prominent warning in the documentation and code about this dependency, and ideally add version checks for PyTorch to catch breaking changes early.
```diff
+        assert not args.multi_latent_attention, "FlexAttention with Multi Latent Attention is not supported"
+        assert args.transformer_impl == 'transformer_engine', \
+            'FlexAttention is only supported with transformer_engine implementation'
+        assert args.seq_length % (args.context_parallel_size * 128) == 0, "seq_length must be divisible by context_parallel_size*flex_attn_block_size(128)"
```
The value 128 is a magic number. Although the comment in the assertion message mentions it's flex_attn_block_size, it's better to define it as a constant to improve readability and maintainability. This avoids having to search for this hardcoded value if the block size ever needs to be updated.
Consider defining it as a constant at the top of the file or function and using the constant here.
```python
FLEX_ATTN_BLOCK_SIZE = 128

assert args.seq_length % (args.context_parallel_size * FLEX_ATTN_BLOCK_SIZE) == 0, \
    f"seq_length must be divisible by context_parallel_size*flex_attn_block_size({FLEX_ATTN_BLOCK_SIZE})"
```
zhaoyinglia left a comment:
Please add memory and performance results to the PR description.
```diff
+        assert not args.multi_latent_attention, "FlexAttention with Multi Latent Attention is not supported"
+        assert args.transformer_impl == 'transformer_engine', \
+            'FlexAttention is only supported with transformer_engine implementation'
+        assert args.seq_length % (args.context_parallel_size * 128) == 0, "seq_length must be divisible by context_parallel_size*flex_attn_block_size(128)"
```
Does this assert also need to account for tensor parallelism?
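For context on the question: if sequence parallelism is enabled, each tensor-parallel rank additionally holds a 1/TP slice of the sequence, so the divisibility check might need to fold in the TP size as well. A hypothetical combined check (the constant and function names are illustrative, not from the PR) could look like:

```python
FLEX_ATTN_BLOCK_SIZE = 128  # mirrors the hardcoded 128 in the PR


def seq_length_is_valid(seq_length, context_parallel_size,
                        tensor_model_parallel_size=1, sequence_parallel=False):
    # CP * block-size constraint, as in the PR's assert.
    divisor = context_parallel_size * FLEX_ATTN_BLOCK_SIZE
    # Hypothetical extension: with sequence parallelism, the sequence is
    # also split across TP ranks, so fold TP size into the divisor.
    if sequence_parallel:
        divisor *= tensor_model_parallel_size
    return seq_length % divisor == 0
```

Whether the TP factor is actually required depends on where in the model the FlexAttention block sizes are enforced relative to the sequence-parallel scatter.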
```diff
+    return mask_mod
+
+
+def _flex_forward(
```
Please add a sketch showing how you chunk q/k/v and how the communication works.
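For reference, the ring pattern can be summarized in a toy, single-process illustration (this is not the PR's code, just a sketch of the idea): each CP rank keeps its own query block while the key/value blocks rotate around the ring, so after `world_size` steps every rank has attended to every KV block.

```python
import math


def ring_attention_toy(q, k, v):
    """Toy, single-process sketch of ring attention (one token per 'rank').

    Real implementations overlap send/recv with compute and use an online
    softmax to merge partial results per step; here we simply collect all
    scores first for clarity.
    """
    world = len(q)
    out = []
    for rank in range(world):
        # KV "rotates" around the ring: at each step this rank sees the
        # block originating from rank (rank + step) % world.
        scores = []
        for step in range(world):
            src = (rank + step) % world
            scores.append(sum(a * b for a, b in zip(q[rank], k[src])))
        # Softmax over all collected scores, then weighted sum of values.
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([
            sum(w[step] * v[(rank + step) % world][d] for step in range(world)) / z
            for d in range(len(v[0]))
        ])
    return out
```

The key property the sketch demonstrates: each rank's output equals full attention over the whole sequence, even though at any point in time each rank only holds one KV block.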
mslv seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
PR Category: Train
PR Types: New Features
PR Description