Skip to content

add gpu correctness tests for scattermoe-lora#3474

Merged
winglian merged 1 commit into
mainfrom
moe-lora-tests
Mar 7, 2026
Merged

add gpu correctness tests for scattermoe-lora#3474
winglian merged 1 commit into
mainfrom
moe-lora-tests

Conversation

@winglian

@winglian winglian commented Mar 7, 2026

Copy link
Copy Markdown
Collaborator

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • Tests
    • Added comprehensive test suite for ScatterMoE LoRA fused kernel implementations, covering forward/backward passes, gradient calculations, numerical stability across precision variants, and autograd integration with edge cases
    • Added integration tests validating LoRA adapter compatibility with OLMoE models, including layout conversions, kernel optimization paths, and gradient correctness validation

@coderabbitai

coderabbitai Bot commented Mar 7, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Introduces two comprehensive integration test suites for ScatterMoE LoRA fused kernels: one validating forward and backward paths across dimensions, data types, and configurations; another testing ScatterMoE integration with OLMoE models and peft LoRA adapters, including layout conversions and kernelized execution paths.

Changes

Cohort / File(s) Summary
ScatterMoE LoRA Kernels Tests
tests/e2e/integrations/test_scattermoe_lora_kernels.py
Comprehensive test suite with reference implementations for routing, forward, and backward computations. Includes forward pass validation, LoRA gradient calculations, autograd integration, edge cases (single expert, empty experts), fused kernel testing (dX, gather), token rounding utilities, and combined optimization validation across multiple dimensions, data types (fp32, bf16, fp16), and grouped/ungrouped configurations.
ScatterMoE LoRA OLMoE Integration Tests
tests/e2e/integrations/test_scattermoe_lora_olmoe.py
Integration tests for ScatterMoE with OLMoE models and peft LoRA adapters, covering layout conversions (peft rank-major to expert-major), forward equivalence validation, LoRA delta computations, CUDA kernel execution via LocalLayerRepository, kernelized forward comparisons, backward gradient consistency, and shared-expert handling with fallback behavior for CPU-only environments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

ready to merge

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.61% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'add gpu correctness tests for scattermoe-lora' is clear, specific, and accurately summarizes the main change—introducing comprehensive GPU correctness tests for ScatterMoE with LoRA fused kernels.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch moe-lora-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
tests/e2e/integrations/test_scattermoe_lora_kernels.py (4)

835-836: Minor: Unused expert_offsets in edge case tests.

The variable is returned but not used. Consider using underscore prefix.

♻️ Suggested fix
-        sorted_expert_idxs, sorted_scattered_idxs, expert_offsets = (
+        sorted_expert_idxs, sorted_scattered_idxs, _expert_offsets = (
             flatten_sort_count_ref(selected_experts, E)
         )

Apply same change to line 879.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` around lines 835 -
836, The tuple returned by flatten_sort_count_ref assigns expert_offsets which
is never used; rename that element to have an underscore prefix (e.g.,
_expert_offsets) wherever the tuple is destructured (the occurrence assigning
sorted_expert_idxs, sorted_scattered_idxs, expert_offsets and the similar
destructuring at the later occurrence around line 879) so the unused variable is
clearly marked as intentional; update both destructuring sites referencing
flatten_sort_count_ref to use _expert_offsets instead of expert_offsets.

74-74: Minor: Unused variable K in unpacking.

The variable K is unpacked but never used in the function. Consider using underscore prefix.

♻️ Suggested fix
-    E, K, N = W.shape
+    E, _K, N = W.shape
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` at line 74, The
unpacking currently binds an unused variable `K` from `W.shape`; change the
unpacking in the test (where `E, K, N = W.shape` appears) to use an
underscore-prefixed name like `E, _K, N = W.shape` (or `E, _, N = W.shape`) so
linters and readers know the middle dimension is intentionally unused; update
any references to `K` if they exist (there should be none).

537-537: Minor: Unused unpacked variables ref_dA and ref_dB.

These are returned by reference_lora_backward but not used since this test only verifies input gradients.

♻️ Suggested fix
-        ref_dX, ref_dA, ref_dB = reference_lora_backward(
+        ref_dX, _ref_dA, _ref_dB = reference_lora_backward(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` at line 537, The test
unpacks three values from reference_lora_backward into ref_dX, ref_dA, ref_dB
but only uses ref_dX; update the unpacking to avoid unused variables by either
assigning only the first return (e.g., ref_dX = reference_lora_backward(...)[0])
or using Python throwaway names (e.g., ref_dX, _, _ =
reference_lora_backward(...)) so ref_dA and ref_dB are not left unused; ensure
the call site remains reference_lora_backward(...) and that ref_dX is used
unchanged.

1281-1289: Minor: Unused unpacked variables in token rounding tests.

padded_ei and padded_si are returned but only padded_offsets and real_offsets are verified in this test.

♻️ Suggested fix
-        padded_ei, padded_si, padded_offsets, real_offsets = (
+        _padded_ei, _padded_si, padded_offsets, real_offsets = (
             lora_ops.round_expert_counts(

Apply same pattern to lines 1344 and 1386.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` around lines 1281 -
1289, The test unpacks padded_ei and padded_si from lora_ops.round_expert_counts
but never uses them; update the unpacking in the test to ignore those values
(e.g., use _ or _padded_ei/_padded_si) when calling lora_ops.round_expert_counts
so only padded_offsets and real_offsets are kept, and apply the same change to
the other similar unpack sites referenced (the calls around the locations noted,
e.g., the other calls to lora_ops.round_expert_counts at the later test blocks).
tests/e2e/integrations/test_scattermoe_lora_olmoe.py (1)

494-505: Consider using underscore prefix for intentionally unused unpacked variables.

The gup_s and down_s scaling values are unpacked but not used in these shape verification tests.

♻️ Suggested fix
         E, r = config.num_experts, 4
-        gup_A, gup_B, gup_s = gup_lora
+        gup_A, gup_B, _gup_s = gup_lora
         assert gup_A.shape == (E * r, config.hidden_size), (
...
         # down_proj W = param.T = [E, inter, hidden], K=inter, N=hidden
-        down_A, down_B, down_s = down_lora
+        down_A, down_B, _down_s = down_lora
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_olmoe.py` around lines 494 - 505,
The test unpacks scaling variables gup_s and down_s but never uses them; change
those unpacked names to use an underscore prefix (e.g., _gup_s and _down_s) to
indicate intentional unused variables in the assertions around gup_A, gup_B,
down_A, down_B; update the unpacking lines where gup_lora and down_lora are
destructured so the unused values are named _gup_s and _down_s respectively
(leaving the rest of the assertions unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/integrations/test_scattermoe_lora_olmoe.py`:
- Around line 1221-1229: The repo_path calculation in
test_scattermoe_lora_olmoe.py duplicates the incorrect relative navigation used
by _get_repo_path() (it builds tests/src/axolotl/... instead of
<repo_root>/src/axolotl/...), so replace the inline Path(...) construction with
a call to the shared helper (e.g. _get_repo_path or a newly exported
get_repo_root helper) and update tests to use that helper; ensure the helper
returns the repository root (not relative to the tests folder) and then append
/"src"/"axolotl"/"integrations"/"kernels"/"libs"/"scattermoe_lora" to that root
to compute repo_path.
- Around line 932-943: The _get_repo_path() helper builds an incorrect path by
stopping at parent.parent.parent (which lands in tests/) then appending "src",
causing tests/src/... to be sought; update the path traversal in
_get_repo_path() to ascend one more level (use parent.parent.parent.parent) so
the constructed path points to the repository root and resolves to
src/axolotl/integrations/kernels/libs/scattermoe_lora accordingly.

---

Nitpick comments:
In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py`:
- Around line 835-836: The tuple returned by flatten_sort_count_ref assigns
expert_offsets which is never used; rename that element to have an underscore
prefix (e.g., _expert_offsets) wherever the tuple is destructured (the
occurrence assigning sorted_expert_idxs, sorted_scattered_idxs, expert_offsets
and the similar destructuring at the later occurrence around line 879) so the
unused variable is clearly marked as intentional; update both destructuring
sites referencing flatten_sort_count_ref to use _expert_offsets instead of
expert_offsets.
- Line 74: The unpacking currently binds an unused variable `K` from `W.shape`;
change the unpacking in the test (where `E, K, N = W.shape` appears) to use an
underscore-prefixed name like `E, _K, N = W.shape` (or `E, _, N = W.shape`) so
linters and readers know the middle dimension is intentionally unused; update
any references to `K` if they exist (there should be none).
- Line 537: The test unpacks three values from reference_lora_backward into
ref_dX, ref_dA, ref_dB but only uses ref_dX; update the unpacking to avoid
unused variables by either assigning only the first return (e.g., ref_dX =
reference_lora_backward(...)[0]) or using Python throwaway names (e.g., ref_dX,
_, _ = reference_lora_backward(...)) so ref_dA and ref_dB are not left unused;
ensure the call site remains reference_lora_backward(...) and that ref_dX is
used unchanged.
- Around line 1281-1289: The test unpacks padded_ei and padded_si from
lora_ops.round_expert_counts but never uses them; update the unpacking in the
test to ignore those values (e.g., use _ or _padded_ei/_padded_si) when calling
lora_ops.round_expert_counts so only padded_offsets and real_offsets are kept,
and apply the same change to the other similar unpack sites referenced (the
calls around the locations noted, e.g., the other calls to
lora_ops.round_expert_counts at the later test blocks).

In `@tests/e2e/integrations/test_scattermoe_lora_olmoe.py`:
- Around line 494-505: The test unpacks scaling variables gup_s and down_s but
never uses them; change those unpacked names to use an underscore prefix (e.g.,
_gup_s and _down_s) to indicate intentional unused variables in the assertions
around gup_A, gup_B, down_A, down_B; update the unpacking lines where gup_lora
and down_lora are destructured so the unused values are named _gup_s and _down_s
respectively (leaving the rest of the assertions unchanged).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8b17951-1cf1-4187-9264-86aa82667a00

📥 Commits

Reviewing files that changed from the base of the PR and between 80f7088 and 0acf21f.

📒 Files selected for processing (2)
  • tests/e2e/integrations/test_scattermoe_lora_kernels.py
  • tests/e2e/integrations/test_scattermoe_lora_olmoe.py

Comment on lines +932 to +943
@staticmethod
def _get_repo_path():
"""Get the path to scattermoe_lora within axolotl's plugin."""
return (
Path(__file__).parent.parent.parent
/ "src"
/ "axolotl"
/ "integrations"
/ "kernels"
/ "libs"
/ "scattermoe_lora"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify the path structure and check if the calculated path is correct

# Find the test file location
fd -t f "test_scattermoe_lora_olmoe.py" --exec dirname {}

# Check what exists at the calculated path (3 levels up)
fd -t f "test_scattermoe_lora_olmoe.py" --exec sh -c 'dirname "$1" | xargs -I{} sh -c "cd {} && cd ../../.. && pwd && ls -la | head -5"' _ {}

# Find the actual scattermoe_lora directory
fd -t d "scattermoe_lora" | head -5

Repository: axolotl-ai-cloud/axolotl

Length of output: 416


Path calculation error: _get_repo_path() navigates to non-existent tests/src/ directory.

From tests/e2e/integrations/test_scattermoe_lora_olmoe.py, parent.parent.parent reaches tests/, then appending "src" creates tests/src/axolotl/... which does not exist. The actual scattermoe_lora directory is at the repository root: src/axolotl/integrations/kernels/libs/scattermoe_lora/.

To reach the repository root, use parent.parent.parent.parent:

Fix
     `@staticmethod`
     def _get_repo_path():
         """Get the path to scattermoe_lora within axolotl's plugin."""
         return (
-            Path(__file__).parent.parent.parent
+            Path(__file__).parent.parent.parent.parent
             / "src"
             / "axolotl"
             / "integrations"
             / "kernels"
             / "libs"
             / "scattermoe_lora"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_olmoe.py` around lines 932 - 943,
The _get_repo_path() helper builds an incorrect path by stopping at
parent.parent.parent (which lands in tests/) then appending "src", causing
tests/src/... to be sought; update the path traversal in _get_repo_path() to
ascend one more level (use parent.parent.parent.parent) so the constructed path
points to the repository root and resolves to
src/axolotl/integrations/kernels/libs/scattermoe_lora accordingly.

Comment on lines +1221 to +1229
repo_path = (
Path(__file__).parent.parent.parent
/ "src"
/ "axolotl"
/ "integrations"
/ "kernels"
/ "libs"
/ "scattermoe_lora"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Duplicated potentially incorrect path calculation.

This path calculation has the same issue as _get_repo_path() - it navigates to tests/src/axolotl/... instead of <repo_root>/src/axolotl/.... Consider extracting this to a shared helper to avoid duplication and ensure consistency.

♻️ Suggested refactor to use the shared helper
         # Kernelize
-        repo_path = (
-            Path(__file__).parent.parent.parent
-            / "src"
-            / "axolotl"
-            / "integrations"
-            / "kernels"
-            / "libs"
-            / "scattermoe_lora"
-        )
+        repo_path = TestKernelizeIntegration._get_repo_path()
         local_repo = LocalLayerRepository(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_olmoe.py` around lines 1221 -
1229, The repo_path calculation in test_scattermoe_lora_olmoe.py duplicates the
incorrect relative navigation used by _get_repo_path() (it builds
tests/src/axolotl/... instead of <repo_root>/src/axolotl/...), so replace the
inline Path(...) construction with a call to the shared helper (e.g.
_get_repo_path or a newly exported get_repo_root helper) and update tests to use
that helper; ensure the helper returns the repository root (not relative to the
tests folder) and then append
/"src"/"axolotl"/"integrations"/"kernels"/"libs"/"scattermoe_lora" to that root
to compute repo_path.

@codecov

codecov Bot commented Mar 7, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@winglian winglian merged commit a36aaa7 into main Mar 7, 2026
13 of 17 checks passed
@winglian winglian deleted the moe-lora-tests branch March 7, 2026 05:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant