fix: ray batch_size derivation, fsdp schema migration, FakeExperts peft 0.19 compat #3671

Merged
winglian merged 2 commits into main from fix/ci-ray-batchsize-fsdp-fakeexperts
May 20, 2026
@winglian winglian commented May 20, 2026

Collaborator

Summary

Three independent CI regressions on main. Two were introduced by afd74ae (#3664 "Fix: ci being broken (fa2, ray)"); the third is from a peft upgrade to 0.19.x. All three reproduce on main independent of any feature branch.

Failing job: test-axolotl-multigpu (130, 13.0.0, 3.11, 2.9.1, 2)

Failure 1 — Ray workers crash with cfg.batch_size = None

tests/e2e/multigpu/test_ray.py::TestMultiGPURay::{test_lora_ddp,test_ds_zero2_packed[1|2]} fail with TypeError: unsupported operand type(s) for /: 'float' and 'NoneType' at trainer.py:519 (calculate_total_num_steps).

Cause: afd74ae wrapped both the batch_size and gradient_accumulation_steps derivations in normalize_config under if not cfg.use_ray:. The Ray worker's re-validation rejects configs that have BOTH set, but it doesn't derive batch_size either, so batch_size stays None and blows up downstream.

Fix: only defer gradient_accumulation_steps (the field the worker validator rejects); always derive batch_size on the controller.
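
For illustration, the shape of this fix might be sketched as follows. This is a simplified stand-in, not the actual normalize_config code: the function name derive_batch_fields, the use of SimpleNamespace as a config object, and the exact derivation formulas are assumptions made for the example.

```python
from types import SimpleNamespace


def derive_batch_fields(cfg):
    """Hypothetical stand-in for the batch-size portion of normalize_config."""
    # Only gradient_accumulation_steps is deferred under Ray, since that is
    # the field the PR says the worker-side validator rejects.
    if (
        not cfg.use_ray
        and cfg.gradient_accumulation_steps is None
        and cfg.batch_size is not None
    ):
        cfg.gradient_accumulation_steps = cfg.batch_size // cfg.micro_batch_size

    # batch_size is derived regardless of use_ray, so later step-count
    # arithmetic that divides by it never sees None.
    if cfg.batch_size is None and cfg.gradient_accumulation_steps is not None:
        cfg.batch_size = cfg.micro_batch_size * cfg.gradient_accumulation_steps
    return cfg


# Ray run where the user set gradient_accumulation_steps: batch_size is derived.
ray_cfg = derive_batch_fields(
    SimpleNamespace(
        use_ray=True,
        micro_batch_size=2,
        gradient_accumulation_steps=4,
        batch_size=None,
    )
)
print(ray_cfg.batch_size)

# Ray run where the user set batch_size: gradient_accumulation_steps stays deferred.
deferred_cfg = derive_batch_fields(
    SimpleNamespace(
        use_ray=True,
        micro_batch_size=2,
        gradient_accumulation_steps=None,
        batch_size=8,
    )
)
print(deferred_cfg.gradient_accumulation_steps)
```

The point of the sketch is only the asymmetry: one derivation sits behind the use_ray guard, the other does not.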

Failure 2 — fsdp schema rejects bool after deprecation shim

tests/e2e/multigpu/test_ray.py::TestMultiGPURay::test_sft_fsdp2_packed[1|2] fails inside the Ray worker's re-validation with pydantic_core._pydantic_core.ValidationError: fsdp / Input should be a valid list [type=list_type, input_value=True, input_type=bool].

Cause: src/axolotl/utils/trainer.py::prepare_optim_env mutates cfg.fsdp = True when migrating from fsdp_config-style configs (line 650). The schema types fsdp as list[str] | None, so the bool fails the worker's re-validation.

Fix: drop the gratuitous mutation. Every downstream caller already handles cfg.fsdp_config or cfg.fsdp (verified across core/builders/base.py, core/builders/rl.py, loaders/model.py, train.py, utils/config/__init__.py, utils/distributed.py). The existing TODO to remove the legacy cfg.fsdp check in 0.12 still stands.
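
A minimal sketch of why the mutation is both harmful and unnecessary, using only the standard library. The validate_fsdp function below is a hand-rolled stand-in for a list[str] | None schema field (it is not pydantic and not axolotl's schema), and fsdp_enabled and the placeholder fsdp_config contents are hypothetical names chosen for the example.

```python
from types import SimpleNamespace


def validate_fsdp(value):
    """Hand-rolled stand-in for a `list[str] | None` field check (not pydantic)."""
    if value is None:
        return None
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    raise TypeError(
        f"fsdp: input should be a valid list, got {type(value).__name__}"
    )


def fsdp_enabled(cfg):
    # Mirrors the downstream pattern quoted in the PR: either field enables FSDP.
    return bool(cfg.fsdp_config or cfg.fsdp)


# Without the mutation, fsdp stays None and fsdp_config carries the settings.
cfg = SimpleNamespace(fsdp=None, fsdp_config={"placeholder": True})
validate_fsdp(cfg.fsdp)          # None passes re-validation
print(fsdp_enabled(cfg))         # FSDP is still detected via fsdp_config

# With the old mutation, the same re-validation would reject the bool.
try:
    validate_fsdp(True)
except TypeError as exc:
    print(f"rejected: {exc}")
```

Because the enablement check already looks at fsdp_config, leaving fsdp untouched loses nothing in this model.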

Failure 3 — FakeExperts mock breaks under peft >= 0.19

tests/utils/schemas/validation/test_moe_quant.py::TestMoeAdapterTrainMergeRoundtrip::test_train_save_merge_no_size_mismatch fails with AttributeError: 'FakeExperts' object has no attribute 'weight' in peft/utils/save_and_load.py:520.

Cause: peft 0.19 added _maybe_shard_state_dict_for_tp (called unconditionally from set_peft_model_state_dict), which reads base_layer.weight.device. FakeExperts uses the target_parameters style (gate_up_proj / down_proj) and legitimately has no .weight. Real nn.Linear-style modules always have one, and peft assumes it is present.

Fix: stub a zero-size buffer on FakeExperts.weight. Optional follow-up: upstream PEFT could guard the call with if hasattr(base_layer, "weight").
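
The failure mode and both remedies can be sketched with plain Python objects. This deliberately avoids torch and peft: FakeExpertsLike, read_device_unguarded, and read_device_guarded are hypothetical names, and the placeholder weight object only imitates an attribute that carries a device. The actual test change stubs a buffer on the torch module instead.

```python
from types import SimpleNamespace


class FakeExpertsLike:
    """Toy module with target_parameters-style attributes and no `.weight`."""

    def __init__(self):
        self.gate_up_proj = object()
        self.down_proj = object()


def read_device_unguarded(base_layer):
    # Shape of the access described in the PR: assumes `.weight` exists.
    return base_layer.weight.device


def read_device_guarded(base_layer, default="cpu"):
    # The optional upstream guard suggested in the PR.
    if hasattr(base_layer, "weight"):
        return base_layer.weight.device
    return default


layer = FakeExpertsLike()
try:
    read_device_unguarded(layer)
except AttributeError as exc:
    print(f"unguarded access fails: {exc}")

print(read_device_guarded(layer))

# Test-side workaround: give the mock a placeholder `.weight` with a device.
layer.weight = SimpleNamespace(device="cpu")
print(read_device_unguarded(layer))
```

Stubbing the attribute on the mock keeps the fix local to the test suite, while the guard would need a change in the library that performs the access.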

Test plan

  • pytest tests/utils/ -k "normalize_config or batch_size" -x — 2 pass
  • pytest tests/patched/test_validation.py -x — 65 pass, 1 skipped
  • pytest tests/utils/schemas/validation/test_moe_quant.py::TestMoeAdapterTrainMergeRoundtrip -x — 1 pass
  • pre-commit run --files <changed files> — all 8 hooks pass
  • Multi-GPU test_ray.py failures are CI-environment-specific; verifying via this PR's CI run

Out of scope (per handoff)

  • The pre-existing "Ray Train Controller actor state" log spam
  • The torchao cpp-extensions import warning
  • The Modal runner kernel-version warning

Summary by CodeRabbit

  • Bug Fixes

    • Fixed FSDP configuration handling during initialization to prevent unintended state mutations and ensure consistent behavior.
  • Tests

    • Enhanced test suite for expert model configurations to ensure better compatibility with distributed training frameworks.
  • Refactor

    • Updated internal code documentation and formatting to improve clarity and maintainability.

…ft 0.19 compat

Three independent CI regressions, all reproducible on main:

1. normalize_config: the `if not cfg.use_ray:` guard introduced in afd74ae
   wrapped BOTH derivations (`gradient_accumulation_steps` and
   `batch_size`). Only `gradient_accumulation_steps` is what the Ray worker
   re-validate rejects when both are set; `batch_size` must still derive on
   the controller because `calculate_total_num_steps` divides by it. Without
   this fix: `TypeError: unsupported operand type(s) for /: 'float' and 'NoneType'`
   at trainer.py:519.

2. prepare_optim_env mutated `cfg.fsdp = True` when migrating from
   fsdp_config-style configs. The schema now types `fsdp: list[str] | None`,
   so the bool fails Ray worker re-validation with `list_type` ValidationError.
   Every downstream caller already handles `cfg.fsdp_config or cfg.fsdp`, so
   the mutation is gratuitous — drop it. The TODO to remove the cfg.fsdp
   check entirely in 0.12 stays.

3. peft 0.19's `_maybe_shard_state_dict_for_tp` reads
   `base_layer.weight.device` unconditionally. The FakeExperts test mock
   uses `target_parameters` style (gate_up_proj / down_proj), so it
   legitimately has no `.weight`. Stub a zero-size buffer.

Failing job: test-axolotl-multigpu (130, 13.0.0, 3.11, 2.9.1, 2)

Signed-off-by: Wing Lian <wing@axolotl.ai>

coderabbitai Bot commented May 20, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c5898e6-f3f0-47cf-8105-1c36272ea92c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR contains three focused changes: a configuration comment and formatting update clarifying Ray-deferred batch computation, removal of forced FSDP mutation in trainer optimization setup with documentation that fsdp_config is the source of truth, and addition of a dummy weight buffer to a test fixture for PEFT compatibility.

Changes

Configuration, trainer setup, and test compatibility updates

| Layer / File(s) | Summary |
|---|---|
| Configuration and trainer optimization setup: src/axolotl/utils/config/__init__.py, src/axolotl/utils/trainer.py | Ray-handling comment in normalize_config is updated for clarity; prepare_optim_env stops mutating cfg.fsdp when cfg.fsdp_config is present, replacing the mutation with documentation that fsdp_config is the source of truth. |
| Test fixture compatibility for PEFT: tests/utils/schemas/validation/test_moe_quant.py | FakeExperts module registers a non-persistent weight buffer to provide the base_layer.weight.device attribute that PEFT accesses unconditionally during parametrized training. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Suggested labels

ready to merge

Suggested reviewers

  • SalmanMohammadi
  • NanoCode012
  • ved1beta
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes all three main fixes in the changeset: Ray batch_size derivation, FSDP schema migration, and FakeExperts PEFT compatibility. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/axolotl/utils/trainer.py`:
- Around line 650-654: Replace the five-line comment near the cfg.fsdp /
fsdp_config logic with a single short line: "# fsdp_config is source of truth;
mutating cfg.fsdp to bool breaks Ray worker schema validation" — update the
comment adjacent to the cfg.fsdp and fsdp_config references (see the cfg.fsdp
assignment and the downstream check `if cfg.fsdp or cfg.fsdp_config:`) so it
conforms to the one-line comment guideline.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c0fb2509-4fe6-4d8f-b2c9-4fb6fd6ef84b

📥 Commits

Reviewing files that changed from the base of the PR and between afd74ae and 49ace58.

📒 Files selected for processing (3)
  • src/axolotl/utils/config/__init__.py
  • src/axolotl/utils/trainer.py
  • tests/utils/schemas/validation/test_moe_quant.py

Comment thread src/axolotl/utils/trainer.py Outdated
Comment on lines +650 to +654
# Don't mutate cfg.fsdp to True when fsdp_config is the source of
# truth: the schema types fsdp as list[str] | None, and Ray workers
# re-validate the controller's dumped config, where a bool would
# fail (`list_type` ValidationError). Downstream callers
# (`if cfg.fsdp or cfg.fsdp_config:`) handle the None case.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Condense comment to maximum one short line.

The comment spans 5 lines but the coding guidelines require comments in src/axolotl/**/*.py to be a maximum of one short line. Consider condensing to: # fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker schema validation

As per coding guidelines, "Comments should be a maximum of one short line" for files matching src/axolotl/**/*.py.

✂️ Proposed condensed comment
-    # Don't mutate cfg.fsdp to True when fsdp_config is the source of
-    # truth: the schema types fsdp as list[str] | None, and Ray workers
-    # re-validate the controller's dumped config, where a bool would
-    # fail (`list_type` ValidationError). Downstream callers
-    # (`if cfg.fsdp or cfg.fsdp_config:`) handle the None case.
+    # fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker schema validation

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@winglian winglian merged commit dc8f7c7 into main May 20, 2026
2 checks passed
@winglian winglian deleted the fix/ci-ray-batchsize-fsdp-fakeexperts branch May 20, 2026 22:31