Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis pull request introduces Expert Parallelism (EP) support via DeepEP integration, enabling efficient distributed training for Mixture-of-Experts models. The implementation adds a complete integration package with plugin architecture, expert sharding, buffer management, kernel registration, and mesh composition with FSDP. Changes span documentation, configuration examples, core integration modules, parallelism infrastructure modifications, and comprehensive tests. Changes
Sequence DiagramsequenceDiagram
participant User
participant Plugin as ExpertParallelPlugin
participant DeepEP as DeepEP<br/>(Buffer)
participant Model as MoE Model<br/>(Experts)
participant Output
User->>Plugin: configure expert_parallel_size > 1
Plugin->>Plugin: pre_model_load: register kernels
Plugin->>Model: post_model_build: shard experts
Plugin->>DeepEP: configure_buffer(ep_group, nvl_bytes)
activate Model
Model->>DeepEP: forward: dispatch(hidden_states, routing)
Note over DeepEP: Send tokens to<br/>correct expert ranks
DeepEP->>Model: receive dispatched tokens
Model->>Model: run local experts<br/>(grouped_mm, scattermoe, etc.)
Model->>DeepEP: combine(expert_outputs)
Note over DeepEP: Reconstruct per-token<br/>outputs across ranks
DeepEP->>Output: per-token expert output
deactivate Model
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes This PR introduces substantial logic across multiple domains: distributed expert sharding (166 lines), high-density kernel registration and dispatch (277 lines), complex plugin lifecycle management (361 lines), and deep Accelerate monkeypatching (279 lines). The heterogeneous nature—spanning buffer management, autograd integration, mesh composition, and DDP/FSDP interop—requires careful reasoning for each subsystem and understanding of distributed training semantics. Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
📖 Documentation Preview: https://6a0eb2f0a659e418c102248d--resonant-treacle-0fd729.netlify.app Deployed on Netlify from commit 7e2fbde |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (7)
DEEPEP_SETUP.md (1)
87-90: 💤 Low valueAdd language specifier to fenced code block.
The error message code block is missing a language identifier. While this is a minor linting issue, adding a specifier improves rendering consistency.
📝 Suggested fix
-``` +```text ImportError: deep_ep_cpp.cpython-...-x86_64-linux-gnu.so: undefined symbol: _ZN7deep_ep12internode_ll17query_mask_bufferEPiiS1_P11CUstream_st</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@DEEPEP_SETUP.mdaround lines 87 - 90, The fenced code block in
DEEPEP_SETUP.md containing the ImportError message lacks a language specifier;
update that block to include a language identifier (e.g., "text") so the block
readstext ...to improve linting and rendering consistency—locate the
ImportError snippet in DEEPEP_SETUP.md and add the "text" specifier to the
opening backticks around the block containing the
_ZN7deep_ep12internode_ll17query_mask_buffer... symbol.</details> </blockquote></details> <details> <summary>src/axolotl/integrations/expert_parallel/buffer.py (1)</summary><blockquote> `59-62`: _💤 Low value_ **`reset_buffer()` doesn't reset configuration parameters.** When `reset_buffer()` is called (e.g., between tests), only `_BUFFER` is cleared but `_EP_GROUP`, `_NUM_NVL_BYTES`, and `_NUM_RDMA_BYTES` retain their previous values. If a subsequent test calls `get_buffer()` without first calling `configure_buffer()`, it will use stale configuration from the previous test. <details> <summary>💡 Consider resetting all state for better test isolation</summary> ```diff def reset_buffer() -> None: """Drop the cached Buffer. Used in tests.""" - global _BUFFER + global _BUFFER, _EP_GROUP, _NUM_NVL_BYTES, _NUM_RDMA_BYTES _BUFFER = None + _EP_GROUP = None + _NUM_NVL_BYTES = 256 << 20 + _NUM_RDMA_BYTES = 0 ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@src/axolotl/integrations/expert_parallel/buffer.py` around lines 59 - 62, reset_buffer() currently only clears the cached _BUFFER leaving _EP_GROUP, _NUM_NVL_BYTES, and _NUM_RDMA_BYTES with stale values; update reset_buffer to also reset those globals to their initial/default state (e.g., None or zero as used by configure_buffer/get_buffer) so tests calling get_buffer() without reconfiguring won't see previous configuration; reference the symbols _EP_GROUP, _NUM_NVL_BYTES, _NUM_RDMA_BYTES, reset_buffer(), configure_buffer(), and get_buffer() when making the change. ``` </details> </blockquote></details> <details> <summary>src/axolotl/utils/distributed.py (1)</summary><blockquote> `332-340`: _💤 Low value_ **EP world size handling is correct but relies on separate env var/monkeypatch.** The EP size is correctly validated and divided from `remaining_world_size`, but unlike TP/CP/DP, it's not added to `pc_kwargs`. This is intentional since EP mesh handling is done via `PARALLELISM_CONFIG_EP_SIZE` env var and `patch_parallelism_config()` in `trainer.py`. This two-path approach (standard config for TP/CP/DP, monkeypatch for EP) works but may be worth documenting in a code comment for future maintainers. <details> <summary>💡 Consider adding a clarifying comment</summary> ```diff # EP consumes part of world_size; subtract it up front so the auto-fill # below doesn't put EP ranks into `dp_replicate_size`. + # Note: ep_size is NOT added to pc_kwargs; it's handled separately via + # PARALLELISM_CONFIG_EP_SIZE env var and patch_parallelism_config(). if expert_parallel_size and expert_parallel_size > 1: ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@src/axolotl/utils/distributed.py` around lines 332 - 340, The EP handling block validates and divides remaining_world_size using expert_parallel_size but does not add EP size to pc_kwargs because EP is configured via the PARALLELISM_CONFIG_EP_SIZE env var and patch_parallelism_config() in trainer.py; add a brief clarifying comment next to the expert_parallel_size/remaining_world_size block that states that EP is intentionally excluded from pc_kwargs, references PARALLELISM_CONFIG_EP_SIZE and patch_parallelism_config(), and explains that EP mesh is applied via environment/monkeypatch elsewhere so future maintainers won't try to add it to pc_kwargs. ``` </details> </blockquote></details> <details> <summary>docs/nd_parallelism.qmd (1)</summary><blockquote> `104-106`: _💤 Low value_ **Example 4 is helpful but could clarify plugin configuration.** The example mentions adding `ExpertParallelPlugin` to `plugins:` but doesn't show the exact YAML syntax. Consider adding the full plugin path for clarity. <details> <summary>📝 Consider expanding the example</summary> ```diff 4. FSDP + EP on a 4-GPU MoE training run: - You want EP to shard the experts and FSDP to shard non-expert params on orthogonal mesh axes. - - Set `expert_parallel_size: 2` and `dp_shard_size: 2` (`ep × dp_shard == world_size`). Add the `ExpertParallelPlugin` to `plugins:`. + - Set `expert_parallel_size: 2` and `dp_shard_size: 2` (`ep × dp_shard == world_size`). Add the plugin: + ```yaml + plugins: + - axolotl.integrations.expert_parallel.ExpertParallelPlugin + ``` ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@docs/nd_parallelism.qmd` around lines 104 - 106, Clarify the plugin configuration by showing the fully-qualified plugin path in the example: when using expert_parallel_size and dp_shard_size with ExpertParallelPlugin, update the plugins block to include the full class path (e.g., axolotl.integrations.expert_parallel.ExpertParallelPlugin) so readers can copy-paste the exact YAML; ensure the example references ExpertParallelPlugin, expert_parallel_size, dp_shard_size, and plugins together for clarity. ``` </details> </blockquote></details> <details> <summary>src/axolotl/integrations/expert_parallel/README.md (1)</summary><blockquote> `113-113`: _💤 Low value_ **Fix heading level increment.** The heading jumps from `##` (h2) to `####` (h4), skipping h3. This breaks accessibility and consistent document structure per markdown best practices. <details> <summary>Proposed fix</summary> ```diff -#### Implementation notes +### Implementation notes ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@src/axolotl/integrations/expert_parallel/README.md` at line 113, The "Implementation notes" heading currently uses h4 (#### Implementation notes) which skips h3 and breaks document structure; change that heading to h3 (### Implementation notes) so the sequence follows the preceding h2 and maintains proper accessibility and semantic hierarchy in the README. ``` </details> </blockquote></details> <details> <summary>tests/integrations/test_expert_parallel.py (1)</summary><blockquote> `148-164`: _💤 Low value_ **Good test coverage for expert module detection.** Tests correctly verify that 3D expert tensors are detected while non-3D modules are skipped. Minor: the unpacked `name` variable on line 153 is unused. <details> <summary>Silence unused variable warning</summary> ```diff - name, module = found[0] + _name, module = found[0] ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@tests/integrations/test_expert_parallel.py` around lines 148 - 164, The test unpacks a (name, module) tuple but never uses name in TestExpertModuleDetection.test_detects_qwen3moe_experts; update the unpacking to avoid an unused-variable warning by replacing "name, module = found[0]" with a discard (e.g., "_, module = found[0]") or by directly accessing the module (e.g., "module = found[0][1]") where the tuple comes from _detect_experts_modules; keep the rest of the assertions unchanged. ``` </details> </blockquote></details> <details> <summary>src/axolotl/integrations/expert_parallel/plugin.py (1)</summary><blockquote> `120-128`: _💤 Low value_ **Suffix matching may over-match parameter names.** The condition `full.endswith(short_name)` without a dot prefix could match unrelated parameters. For example, if `short_name="proj"`, it would match both `gate_up_proj` and `down_proj`, which may be intended but could also match `some_other_proj` unexpectedly. However, since `ignore_list` comes from sharded expert modules and typically contains full submodule paths like `experts.gate_up_proj`, this is likely safe in practice. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@src/axolotl/integrations/expert_parallel/plugin.py` around lines 120 - 128, Suffix-only matching can over-match unrelated parameter names; in the loop over ignore_list/all_names in plugin.py (variables: ignore_list, all_names, resolved, short_name, full) replace the broad full.endswith(short_name) check with a safer check that compares the final path segment (e.g., full.rsplit(".", 1)[-1] == short_name) or requires a dot prefix (full.endswith("." + short_name)); update the conditional so matches are only when full equals short_name, ends with "." + short_name, or the final segment equals short_name, then append to resolved as before. ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Inline comments:
In@examples/expert_parallel/ep_30b_fullft_4gpu.yaml:
- Line 26: The config uses warmup_steps: 0.1 which is invalid because
warmup_steps expects an integer; replace the key warmup_steps with warmup_ratio
(i.e., change warmup_steps -> warmup_ratio) so the float 0.1 is treated as a
ratio, or alternatively set warmup_steps to an integer if you really want a step
count; update any references expecting warmup_steps accordingly (look for
warmup_steps and warmup_ratio in the config or loader).In
@examples/expert_parallel/ep_fsdp_30b_fullft_4gpu.yaml:
- Line 28: The YAML sets warmup_steps: 0.1 which is a misconfiguration because
warmup_steps expects an integer; change the key to warmup_ratio and keep the 0.1
value (i.e., replace warmup_steps with warmup_ratio) so the scheduler reads a
ratio instead of an integer—update the entry for warmup_steps to warmup_ratio in
the config (referencing the warmup_steps/warmup_ratio keys).In
@src/axolotl/monkeypatch/accelerate/parallelism_config.py:
- Around line 85-92: The _patched_get_mesh function can raise ValueError when
mesh_order.index(x[0]) is called for an unknown dimension name; update
_patched_get_mesh to validate or handle unexpected names by filtering or mapping
active_mesh_dims against the allowed mesh_order before sorting (e.g., build
mesh_dims from self.active_mesh_dims but only include keys present in
mesh_order, or assign a fallback index for unknown keys), and ensure the
returned sorted_items (used to produce the tuple) only contains known dimensions
so zip(*sorted_items, strict=True) stays safe; reference the symbols
_patched_get_mesh, mesh_order, mesh_dims, active_mesh_dims, and sorted_items
when making the change.
Nitpick comments:
In@DEEPEP_SETUP.md:
- Around line 87-90: The fenced code block in DEEPEP_SETUP.md containing the
ImportError message lacks a language specifier; update that block to include a
language identifier (e.g., "text") so the block readstext ...to improve
linting and rendering consistency—locate the ImportError snippet in
DEEPEP_SETUP.md and add the "text" specifier to the opening backticks around the
block containing the _ZN7deep_ep12internode_ll17query_mask_buffer... symbol.In
@docs/nd_parallelism.qmd:
- Around line 104-106: Clarify the plugin configuration by showing the
fully-qualified plugin path in the example: when using expert_parallel_size and
dp_shard_size with ExpertParallelPlugin, update the plugins block to include the
full class path (e.g.,
axolotl.integrations.expert_parallel.ExpertParallelPlugin) so readers can
copy-paste the exact YAML; ensure the example references ExpertParallelPlugin,
expert_parallel_size, dp_shard_size, and plugins together for clarity.In
@src/axolotl/integrations/expert_parallel/buffer.py:
- Around line 59-62: reset_buffer() currently only clears the cached _BUFFER
leaving _EP_GROUP, _NUM_NVL_BYTES, and _NUM_RDMA_BYTES with stale values; update
reset_buffer to also reset those globals to their initial/default state (e.g.,
None or zero as used by configure_buffer/get_buffer) so tests calling
get_buffer() without reconfiguring won't see previous configuration; reference
the symbols _EP_GROUP, _NUM_NVL_BYTES, _NUM_RDMA_BYTES, reset_buffer(),
configure_buffer(), and get_buffer() when making the change.In
@src/axolotl/integrations/expert_parallel/plugin.py:
- Around line 120-128: Suffix-only matching can over-match unrelated parameter
names; in the loop over ignore_list/all_names in plugin.py (variables:
ignore_list, all_names, resolved, short_name, full) replace the broad
full.endswith(short_name) check with a safer check that compares the final path
segment (e.g., full.rsplit(".", 1)[-1] == short_name) or requires a dot prefix
(full.endswith("." + short_name)); update the conditional so matches are only
when full equals short_name, ends with "." + short_name, or the final segment
equals short_name, then append to resolved as before.In
@src/axolotl/integrations/expert_parallel/README.md:
- Line 113: The "Implementation notes" heading currently uses h4 (####
Implementation notes) which skips h3 and breaks document structure; change that
heading to h3 (### Implementation notes) so the sequence follows the preceding
h2 and maintains proper accessibility and semantic hierarchy in the README.In
@src/axolotl/utils/distributed.py:
- Around line 332-340: The EP handling block validates and divides
remaining_world_size using expert_parallel_size but does not add EP size to
pc_kwargs because EP is configured via the PARALLELISM_CONFIG_EP_SIZE env var
and patch_parallelism_config() in trainer.py; add a brief clarifying comment
next to the expert_parallel_size/remaining_world_size block that states that EP
is intentionally excluded from pc_kwargs, references PARALLELISM_CONFIG_EP_SIZE
and patch_parallelism_config(), and explains that EP mesh is applied via
environment/monkeypatch elsewhere so future maintainers won't try to add it to
pc_kwargs.In
@tests/integrations/test_expert_parallel.py:
- Around line 148-164: The test unpacks a (name, module) tuple but never uses
name in TestExpertModuleDetection.test_detects_qwen3moe_experts; update the
unpacking to avoid an unused-variable warning by replacing "name, module =
found[0]" with a discard (e.g., "_, module = found[0]") or by directly accessing
the module (e.g., "module = found[0][1]") where the tuple comes from
_detect_experts_modules; keep the rest of the assertions unchanged.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Pro **Run ID**: `4d41e62d-ddb9-4683-8e93-7d88aae13904` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between ac77da96daf074475fc8f207b179510ed7ba3f17 and ca82a1a028e607ce4f1c1adecd67b28b93da9688. </details> <details> <summary>📒 Files selected for processing (16)</summary> * `DEEPEP_SETUP.md` * `docs/nd_parallelism.qmd` * `examples/expert_parallel/ep_30b_fullft_4gpu.yaml` * `examples/expert_parallel/ep_fsdp_30b_fullft_4gpu.yaml` * `src/axolotl/integrations/expert_parallel/README.md` * `src/axolotl/integrations/expert_parallel/__init__.py` * `src/axolotl/integrations/expert_parallel/args.py` * `src/axolotl/integrations/expert_parallel/buffer.py` * `src/axolotl/integrations/expert_parallel/experts_fn.py` * `src/axolotl/integrations/expert_parallel/plugin.py` * `src/axolotl/integrations/expert_parallel/shard.py` * `src/axolotl/monkeypatch/accelerate/fsdp2.py` * `src/axolotl/monkeypatch/accelerate/parallelism_config.py` * `src/axolotl/utils/distributed.py` * `src/axolotl/utils/trainer.py` * `tests/integrations/test_expert_parallel.py` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
| optimizer: adamw_torch_8bit | ||
| lr_scheduler: cosine | ||
| learning_rate: 0.003 | ||
| warmup_steps: 0.1 |
There was a problem hiding this comment.
Likely misconfiguration: warmup_steps should be warmup_ratio.
warmup_steps expects an integer (number of steps), but the value 0.1 suggests a ratio was intended. This will likely cause unexpected behavior or an error.
Proposed fix
-warmup_steps: 0.1
+warmup_ratio: 0.1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| warmup_steps: 0.1 | |
| warmup_ratio: 0.1 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/expert_parallel/ep_30b_fullft_4gpu.yaml` at line 26, The config uses
warmup_steps: 0.1 which is invalid because warmup_steps expects an integer;
replace the key warmup_steps with warmup_ratio (i.e., change warmup_steps ->
warmup_ratio) so the float 0.1 is treated as a ratio, or alternatively set
warmup_steps to an integer if you really want a step count; update any
references expecting warmup_steps accordingly (look for warmup_steps and
warmup_ratio in the config or loader).
| optimizer: adamw_torch_8bit | ||
| lr_scheduler: cosine | ||
| learning_rate: 0.003 | ||
| warmup_steps: 0.1 |
There was a problem hiding this comment.
Same misconfiguration: warmup_steps should be warmup_ratio.
Same issue as in ep_30b_fullft_4gpu.yaml — warmup_steps expects an integer, not a float ratio.
Proposed fix
-warmup_steps: 0.1
+warmup_ratio: 0.1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| warmup_steps: 0.1 | |
| warmup_ratio: 0.1 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/expert_parallel/ep_fsdp_30b_fullft_4gpu.yaml` at line 28, The YAML
sets warmup_steps: 0.1 which is a misconfiguration because warmup_steps expects
an integer; change the key to warmup_ratio and keep the 0.1 value (i.e., replace
warmup_steps with warmup_ratio) so the scheduler reads a ratio instead of an
integer—update the entry for warmup_steps to warmup_ratio in the config
(referencing the warmup_steps/warmup_ratio keys).
| def _patched_get_mesh(self): | ||
| """Build (dim_names, shape) for `init_device_mesh`. Order keeps the dp | ||
| block (ep, dp_replicate, dp_shard) contiguous so `_flatten("dp")` works. | ||
| """ | ||
| mesh_dims = {p: self._sizes[p] for p in self.active_mesh_dims} | ||
| mesh_order = ["ep", "dp_replicate", "dp_shard", "cp", "sp", "tp"] | ||
| sorted_items = sorted(mesh_dims.items(), key=lambda x: mesh_order.index(x[0])) | ||
| return tuple(zip(*sorted_items, strict=True)) |
There was a problem hiding this comment.
Potential ValueError if mesh contains unexpected dimension names.
mesh_order.index(x[0]) will raise ValueError if active_mesh_dims contains a dimension not in mesh_order. Consider adding a fallback or validation.
Proposed defensive fix
def _patched_get_mesh(self):
"""Build (dim_names, shape) for `init_device_mesh`. Order keeps the dp
block (ep, dp_replicate, dp_shard) contiguous so `_flatten("dp")` works.
"""
mesh_dims = {p: self._sizes[p] for p in self.active_mesh_dims}
mesh_order = ["ep", "dp_replicate", "dp_shard", "cp", "sp", "tp"]
- sorted_items = sorted(mesh_dims.items(), key=lambda x: mesh_order.index(x[0]))
+ def order_key(item):
+ try:
+ return mesh_order.index(item[0])
+ except ValueError:
+ return len(mesh_order) # unknown dims go last
+ sorted_items = sorted(mesh_dims.items(), key=order_key)
return tuple(zip(*sorted_items, strict=True))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/monkeypatch/accelerate/parallelism_config.py` around lines 85 -
92, The _patched_get_mesh function can raise ValueError when
mesh_order.index(x[0]) is called for an unknown dimension name; update
_patched_get_mesh to validate or handle unexpected names by filtering or mapping
active_mesh_dims against the allowed mesh_order before sorting (e.g., build
mesh_dims from self.active_mesh_dims but only include keys present in
mesh_order, or assign a fallback index for unknown keys), and ensure the
returned sorted_items (used to produce the tuple) only contains known dimensions
so zip(*sorted_items, strict=True) stays safe; reference the symbols
_patched_get_mesh, mesh_order, mesh_dims, active_mesh_dims, and sorted_items
when making the change.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/axolotl/integrations/expert_parallel/plugin.py (2)
229-229: 💤 Low valueUse ASCII
xinstead of Unicode×in error messages.The multiplication sign
×(Unicode) can cause display issues in some terminals and logging systems.♻️ Suggested fix
- f"ep={ep_size}, dp_shard={dp_shard_size}, tp={tp_size}, cp={cp_size}. " - "v1 supports only EP-only or EP × dp_shard." + f"ep={ep_size}, dp_shard={dp_shard_size}, tp={tp_size}, cp={cp_size}. " + "v1 supports only EP-only or EP x dp_shard."Also on line 252:
- "Set dp_shard_size such that ep × dp_shard == world_size, or set " + "Set dp_shard_size such that ep x dp_shard == world_size, or set "🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/axolotl/integrations/expert_parallel/plugin.py` at line 229, Replace the Unicode multiplication sign '×' with the ASCII letter 'x' in the error message string "v1 supports only EP-only or EP × dp_shard." (and the other similar message referenced at the other location) so it reads "v1 supports only EP-only or EP x dp_shard."; update both occurrences to avoid terminals/logging display issues.
118-126: 💤 Low valueSuffix matching logic could match unintended parameters.
The condition on line 124 (
full.endswith(short_name)) matches without requiring a dot separator, which could incorrectly match parameters that share a suffix. For example, ifshort_name = "proj", it would match"gate_up_proj".In practice,
short_namevalues fromshard.pyare fully qualified paths (e.g.,"layers.0.experts.gate_up_proj"), so false positives are unlikely, but the matching could be tightened:♻️ Suggested fix to tighten matching
for short_name in ignore_list: # Match either an exact suffix or with PEFT's `base_model.model.` prefix. for full in all_names: - if ( - full == short_name - or full.endswith("." + short_name) - or full.endswith(short_name) - ): + if full == short_name or full.endswith("." + short_name): resolved.append(full)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/axolotl/integrations/expert_parallel/plugin.py` around lines 118 - 126, The matching in the loop using full, short_name, ignore_list, all_names, and resolved is too permissive because full.endswith(short_name) can match unintended suffixes; tighten it so you only accept exact matches or suffixes preceded by a dot (or the PEFT prefix case already handled), e.g., replace the loose full.endswith(short_name) check with a check that requires a dot separator before short_name (or use a regex like (^|\\.)short_name$) so only true path-segment suffixes are resolved into resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/axolotl/integrations/expert_parallel/plugin.py`:
- Line 229: Replace the Unicode multiplication sign '×' with the ASCII letter
'x' in the error message string "v1 supports only EP-only or EP × dp_shard."
(and the other similar message referenced at the other location) so it reads "v1
supports only EP-only or EP x dp_shard."; update both occurrences to avoid
terminals/logging display issues.
- Around line 118-126: The matching in the loop using full, short_name,
ignore_list, all_names, and resolved is too permissive because
full.endswith(short_name) can match unintended suffixes; tighten it so you only
accept exact matches or suffixes preceded by a dot (or the PEFT prefix case
already handled), e.g., replace the loose full.endswith(short_name) check with a
check that requires a dot separator before short_name (or use a regex like
(^|\\.)short_name$) so only true path-segment suffixes are resolved into
resolved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 08a0283e-778d-424f-8b47-587a914046a1
📒 Files selected for processing (5)
src/axolotl/integrations/expert_parallel/__init__.pysrc/axolotl/integrations/expert_parallel/args.pysrc/axolotl/integrations/expert_parallel/plugin.pysrc/axolotl/integrations/expert_parallel/shard.pysrc/axolotl/utils/collators/mm_chat.py
✅ Files skipped from review due to trivial changes (1)
- src/axolotl/integrations/expert_parallel/init.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/axolotl/integrations/expert_parallel/args.py
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Description
Adds EP via DeepEP
Tested on 2xA100 EP, 4xA100 EP & EP/FSDP and individual bench.
To test:
Very experimental.
Motivation and Context
How has this been tested?
Sweep and E2E tests with loss match with atol and grad norm + correctness check single layer.
AI Usage Disclaimer
Claude heavily
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
New Features
Documentation
Bug Fixes