fix: ray batch_size derivation, fsdp schema migration, FakeExperts peft 0.19 compat by winglian · Pull Request #3671 · axolotl-ai-cloud/axolotl

winglian · 2026-05-20T20:11:16Z

Summary

Three independent CI regressions on main. Two were introduced by afd74ae (#3664 "Fix: ci being broken (fa2, ray)"); the third is from a peft upgrade to 0.19.x. All three reproduce on main independent of any feature branch.

Failing job: test-axolotl-multigpu (130, 13.0.0, 3.11, 2.9.1, 2)

Failure 1 — Ray workers crash with `cfg.batch_size = None`

tests/e2e/multigpu/test_ray.py::TestMultiGPURay::{test_lora_ddp,test_ds_zero2_packed[1|2]} fail with TypeError: unsupported operand type(s) for /: 'float' and 'NoneType' at trainer.py:519 (calculate_total_num_steps).

Cause: afd74ae wrapped both the batch_size and gradient_accumulation_steps derivations in normalize_config under if not cfg.use_ray:. The Ray worker's re-validation rejects a config that has both fields set, but it does not derive batch_size itself, so batch_size stays None and the division in calculate_total_num_steps fails.

Fix: only defer gradient_accumulation_steps (the field the worker validator rejects); always derive batch_size on the controller.
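A minimal sketch of the intended behaviour, not the actual normalize_config code. It assumes the usual relation batch_size = micro_batch_size * gradient_accumulation_steps; the helper name derive_batch_fields and the micro_batch_size field are illustrative assumptions, and cfg is a plain SimpleNamespace rather than the real config object.

```python
from types import SimpleNamespace


def derive_batch_fields(cfg):
    """Simplified, hypothetical stand-in for the batch-size part of normalize_config."""
    # Deferred under Ray: per the PR description, the worker's re-validation
    # rejects the config when this field is filled in alongside batch_size.
    if (
        not cfg.use_ray
        and cfg.gradient_accumulation_steps is None
        and cfg.batch_size is not None
    ):
        cfg.gradient_accumulation_steps = cfg.batch_size // cfg.micro_batch_size

    # Always derived on the controller, Ray or not, so later arithmetic
    # (such as the division in calculate_total_num_steps) never sees None.
    if cfg.batch_size is None and cfg.gradient_accumulation_steps is not None:
        cfg.batch_size = cfg.micro_batch_size * cfg.gradient_accumulation_steps
    return cfg


ray_cfg = derive_batch_fields(
    SimpleNamespace(
        use_ray=True,
        micro_batch_size=2,
        gradient_accumulation_steps=4,
        batch_size=None,
    )
)
print(ray_cfg.batch_size, ray_cfg.gradient_accumulation_steps)
```

With use_ray=True the sketch still fills in batch_size but leaves a missing gradient_accumulation_steps untouched, which is the split described in the fix.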

Failure 2 — `fsdp` schema rejects bool after deprecation shim

tests/e2e/multigpu/test_ray.py::TestMultiGPURay::test_sft_fsdp2_packed[1|2] fails inside the Ray worker's re-validation with pydantic_core._pydantic_core.ValidationError: fsdp / Input should be a valid list [type=list_type, input_value=True, input_type=bool].

Cause: src/axolotl/utils/trainer.py::prepare_optim_env mutates cfg.fsdp = True when migrating from fsdp_config-style configs (line 650). The schema types fsdp: list[str] | None, so the bool fails the worker's re-validate.

Fix: drop the gratuitous mutation. Every downstream caller already handles cfg.fsdp_config or cfg.fsdp (verified across core/builders/base.py, core/builders/rl.py, loaders/model.py, train.py, utils/config/__init__.py, utils/distributed.py). The existing TODO to remove the legacy cfg.fsdp check in 0.12 still stands.
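To illustrate why leaving cfg.fsdp alone is safe, here is a stdlib-only sketch. validate_fsdp is a hand-rolled stand-in for the list[str] | None field (the real schema uses pydantic), fsdp_enabled mirrors the cfg.fsdp_config or cfg.fsdp pattern quoted above, and the fsdp_version key is a made-up placeholder for whatever the config actually holds.

```python
from types import SimpleNamespace


def validate_fsdp(value):
    """Hand-rolled stand-in for a field typed list[str] | None."""
    if value is None:
        return None
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    raise TypeError(
        f"fsdp: input should be a valid list, got {type(value).__name__}"
    )


def fsdp_enabled(cfg):
    """Downstream-style check: either field being set enables FSDP."""
    return bool(cfg.fsdp_config or cfg.fsdp)


# After the fix: fsdp stays None and fsdp_config carries the settings.
cfg = SimpleNamespace(fsdp=None, fsdp_config={"fsdp_version": 2})

print(validate_fsdp(cfg.fsdp))  # None passes the list-or-None check
print(fsdp_enabled(cfg))        # still reports FSDP as enabled

# Before the fix: the shim wrote a bool, which the list-or-None check rejects.
try:
    validate_fsdp(True)
except TypeError as err:
    print(err)
```

The point of the sketch is only that the truthiness check does not need cfg.fsdp to be set once fsdp_config is populated, so the bool never has to exist.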

Failure 3 — `FakeExperts` mock breaks under `peft >= 0.19`

tests/utils/schemas/validation/test_moe_quant.py::TestMoeAdapterTrainMergeRoundtrip::test_train_save_merge_no_size_mismatch fails with AttributeError: 'FakeExperts' object has no attribute 'weight' in peft/utils/save_and_load.py:520.

Cause: peft 0.19 added _maybe_shard_state_dict_for_tp (called unconditionally from set_peft_model_state_dict), which reads base_layer.weight.device. FakeExperts uses the target_parameters style (gate_up_proj / down_proj) and legitimately has no .weight. Real nn.Linear-style modules always have one, and peft assumed every base layer does.

Fix: stub a zero-size buffer on FakeExperts.weight. Optional follow-up: upstream PEFT could guard the call with if hasattr(base_layer, "weight").
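The failure mode and both remedies can be sketched without torch or peft. Everything here is a stand-in: FakeExpertsLike imitates the mock, the two read_device_* helpers imitate an unconditional and a guarded attribute read, and a SimpleNamespace with a device field plays the role of the stubbed buffer. None of these names come from peft.

```python
from types import SimpleNamespace


class FakeExpertsLike:
    """Mock with expert parameters but no .weight attribute."""

    def __init__(self):
        self.gate_up_proj = object()
        self.down_proj = object()


def read_device_unguarded(base_layer):
    # Mirrors an unconditional base_layer.weight.device read.
    return base_layer.weight.device


def read_device_guarded(base_layer):
    # The optional upstream follow-up: skip layers that have no weight.
    if hasattr(base_layer, "weight"):
        return base_layer.weight.device
    return None


mock = FakeExpertsLike()

try:
    read_device_unguarded(mock)
except AttributeError as err:
    print(err)  # same class of error as the CI failure

print(read_device_guarded(mock))  # the guard avoids the attribute read

# The fix taken in this PR: give the mock a placeholder weight.
mock.weight = SimpleNamespace(device="cpu")
print(read_device_unguarded(mock))
```

In the real fixture the placeholder is a non-persistent buffer registered on the module (per the review walkthrough), presumably via nn.Module.register_buffer with persistent=False, so it should not appear in the saved state dict.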

Test plan

pytest tests/utils/ -k "normalize_config or batch_size" -x — 2 pass
pytest tests/patched/test_validation.py -x — 65 pass, 1 skipped
pytest tests/utils/schemas/validation/test_moe_quant.py::TestMoeAdapterTrainMergeRoundtrip -x — 1 pass
pre-commit run --files <changed files> — all 8 hooks pass
Multi-GPU test_ray.py failures are CI-environment-specific; verifying via this PR's CI run

Out of scope (per handoff)

The pre-existing "Ray Train Controller actor state" log spam
The torchao cpp-extensions import warning
The Modal runner kernel-version warning

Summary by CodeRabbit

Bug Fixes
- Fixed FSDP configuration handling during initialization to prevent unintended state mutations and ensure consistent behavior.
Tests
- Enhanced test suite for expert model configurations to ensure better compatibility with distributed training frameworks.
Refactor
- Updated internal code documentation and formatting to improve clarity and maintainability.

…ft 0.19 compat

Three independent CI regressions, all reproducible on main:

1. normalize_config: the `if not cfg.use_ray:` guard introduced in afd74ae wrapped BOTH derivations (`gradient_accumulation_steps` and `batch_size`). Only `gradient_accumulation_steps` is what the Ray worker re-validate rejects when both are set; `batch_size` must still derive on the controller because `calculate_total_num_steps` divides by it. Without this fix: `TypeError: unsupported operand type(s) for /: 'float' and 'NoneType'` at trainer.py:519.

2. prepare_optim_env mutated `cfg.fsdp = True` when migrating from fsdp_config-style configs. The schema now types `fsdp: list[str] | None`, so the bool fails Ray worker re-validation with `list_type` ValidationError. Every downstream caller already handles `cfg.fsdp_config or cfg.fsdp`, so the mutation is gratuitous; drop it. The TODO to remove the cfg.fsdp check entirely in 0.12 stays.

3. peft 0.19's `_maybe_shard_state_dict_for_tp` reads `base_layer.weight.device` unconditionally. The FakeExperts test mock uses `target_parameters` style (gate_up_proj / down_proj), so it legitimately has no `.weight`. Stub a zero-size buffer.

Failing job: test-axolotl-multigpu (130, 13.0.0, 3.11, 2.9.1, 2)

Signed-off-by: Wing Lian <wing@axolotl.ai>

coderabbitai · 2026-05-20T20:14:26Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c5898e6-f3f0-47cf-8105-1c36272ea92c

Walkthrough

This PR contains three focused changes: a configuration comment and formatting update clarifying Ray-deferred batch computation, removal of forced FSDP mutation in trainer optimization setup with documentation that fsdp_config is the source of truth, and addition of a dummy weight buffer to a test fixture for PEFT compatibility.

Changes

Configuration, trainer setup, and test compatibility updates

Layer / File(s)	Summary
Configuration and trainer optimization setup `src/axolotl/utils/config/__init__.py`, `src/axolotl/utils/trainer.py`	Ray-handling comment in `normalize_config` is updated for clarity; `prepare_optim_env` stops mutating `cfg.fsdp` when `cfg.fsdp_config` is present, replacing the mutation with documentation that fsdp_config is the source of truth.
Test fixture compatibility for PEFT `tests/utils/schemas/validation/test_moe_quant.py`	`FakeExperts` module registers a non-persistent `weight` buffer to provide the `base_layer.weight.device` attribute that PEFT accesses unconditionally during parametrized training.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

axolotl-ai-cloud/axolotl#3664: Overlaps with this PR's normalize_config Ray-handling changes in the same function.
axolotl-ai-cloud/axolotl#3170: Related to this PR's FSDP config refactor, which depends on the FSDPConfig schema and downstream checks from that PR.

Suggested labels

ready to merge

Suggested reviewers

SalmanMohammadi
NanoCode012
ved1beta

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes all three main fixes in the changeset: Ray batch_size derivation, FSDP schema migration, and FakeExperts PEFT compatibility.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/axolotl/utils/trainer.py`:
- Around line 650-654: Replace the five-line comment near the cfg.fsdp /
fsdp_config logic with a single short line: "# fsdp_config is source of truth;
mutating cfg.fsdp to bool breaks Ray worker schema validation" — update the
comment adjacent to the cfg.fsdp and fsdp_config references (see the cfg.fsdp
assignment and the downstream check `if cfg.fsdp or cfg.fsdp_config:`) so it
conforms to the one-line comment guideline.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c0fb2509-4fe6-4d8f-b2c9-4fb6fd6ef84b

📥 Commits

Reviewing files that changed from the base of the PR and between afd74ae and 49ace58.

📒 Files selected for processing (3)

src/axolotl/utils/config/__init__.py
src/axolotl/utils/trainer.py
tests/utils/schemas/validation/test_moe_quant.py

coderabbitai · 2026-05-20T20:16:37Z

+        # Don't mutate cfg.fsdp to True when fsdp_config is the source of
+        # truth: the schema types fsdp as list[str] | None, and Ray workers
+        # re-validate the controller's dumped config, where a bool would
+        # fail (`list_type` ValidationError). Downstream callers
+        # (`if cfg.fsdp or cfg.fsdp_config:`) handle the None case.


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Condense comment to maximum one short line.

The comment spans 5 lines but the coding guidelines require comments in src/axolotl/**/*.py to be a maximum of one short line. Consider condensing to: # fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker schema validation

As per coding guidelines, "Comments should be a maximum of one short line" for files matching src/axolotl/**/*.py.

✂️ Proposed condensed comment

- # Don't mutate cfg.fsdp to True when fsdp_config is the source of
- # truth: the schema types fsdp as list[str] | None, and Ray workers
- # re-validate the controller's dumped config, where a bool would
- # fail (`list_type` ValidationError). Downstream callers
- # (`if cfg.fsdp or cfg.fsdp_config:`) handle the None case.
+ # fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker schema validation

codecov · 2026-05-20T20:23:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

coderabbitai Bot reviewed May 20, 2026

less verbosity comment

f035c18

winglian merged commit dc8f7c7 into main May 20, 2026
2 checks passed

winglian deleted the fix/ci-ray-batchsize-fsdp-fakeexperts branch May 20, 2026 22:31

winglian merged 2 commits into main from fix/ci-ray-batchsize-fsdp-fakeexperts
