fix: ray batch_size derivation, fsdp schema migration, FakeExperts peft 0.19 compat#3671
Conversation
…ft 0.19 compat Three independent CI regressions, all reproducible on main: 1. normalize_config: the `if not cfg.use_ray:` guard introduced in afd74ae wrapped BOTH derivations (`gradient_accumulation_steps` and `batch_size`). Only `gradient_accumulation_steps` is what the Ray worker re-validate rejects when both are set; `batch_size` must still derive on the controller because `calculate_total_num_steps` divides by it. Without this fix: `TypeError: unsupported operand type(s) for /: 'float' and 'NoneType'` at trainer.py:519. 2. prepare_optim_env mutated `cfg.fsdp = True` when migrating from fsdp_config-style configs. The schema now types `fsdp: list[str] | None`, so the bool fails Ray worker re-validation with `list_type` ValidationError. Every downstream caller already handles `cfg.fsdp_config or cfg.fsdp`, so the mutation is gratuitous — drop it. The TODO to remove the cfg.fsdp check entirely in 0.12 stays. 3. peft 0.19's `_maybe_shard_state_dict_for_tp` reads `base_layer.weight.device` unconditionally. The FakeExperts test mock uses `target_parameters` style (gate_up_proj / down_proj), so it legitimately has no `.weight`. Stub a zero-size buffer. Failing job: test-axolotl-multigpu (130, 13.0.0, 3.11, 2.9.1, 2) Signed-off-by: Wing Lian <wing@axolotl.ai>
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR contains three focused changes: a configuration comment and formatting update clarifying Ray-deferred batch computation, removal of forced FSDP mutation in trainer optimization setup with documentation that fsdp_config is the source of truth, and addition of a dummy weight buffer to a test fixture for PEFT compatibility. ChangesConfiguration, trainer setup, and test compatibility updates
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/axolotl/utils/trainer.py`:
- Around line 650-654: Replace the five-line comment near the cfg.fsdp /
fsdp_config logic with a single short line: "# fsdp_config is source of truth;
mutating cfg.fsdp to bool breaks Ray worker schema validation" — update the
comment adjacent to the cfg.fsdp and fsdp_config references (see the cfg.fsdp
assignment and the downstream check `if cfg.fsdp or cfg.fsdp_config:`) so it
conforms to the one-line comment guideline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c0fb2509-4fe6-4d8f-b2c9-4fb6fd6ef84b
📒 Files selected for processing (3)
src/axolotl/utils/config/__init__.pysrc/axolotl/utils/trainer.pytests/utils/schemas/validation/test_moe_quant.py
| # Don't mutate cfg.fsdp to True when fsdp_config is the source of | ||
| # truth: the schema types fsdp as list[str] | None, and Ray workers | ||
| # re-validate the controller's dumped config, where a bool would | ||
| # fail (`list_type` ValidationError). Downstream callers | ||
| # (`if cfg.fsdp or cfg.fsdp_config:`) handle the None case. |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Condense comment to maximum one short line.
The comment spans 5 lines but the coding guidelines require comments in src/axolotl/**/*.py to be a maximum of one short line. Consider condensing to: # fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker schema validation
As per coding guidelines, "Comments should be a maximum of one short line" for files matching src/axolotl/**/*.py.
✂️ Proposed condensed comment
- # Don't mutate cfg.fsdp to True when fsdp_config is the source of
- # truth: the schema types fsdp as list[str] | None, and Ray workers
- # re-validate the controller's dumped config, where a bool would
- # fail (`list_type` ValidationError). Downstream callers
- # (`if cfg.fsdp or cfg.fsdp_config:`) handle the None case.
+ # fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker schema validation📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Don't mutate cfg.fsdp to True when fsdp_config is the source of | |
| # truth: the schema types fsdp as list[str] | None, and Ray workers | |
| # re-validate the controller's dumped config, where a bool would | |
| # fail (`list_type` ValidationError). Downstream callers | |
| # (`if cfg.fsdp or cfg.fsdp_config:`) handle the None case. | |
| # fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker schema validation |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/axolotl/utils/trainer.py` around lines 650 - 654, Replace the five-line
comment near the cfg.fsdp / fsdp_config logic with a single short line: "#
fsdp_config is source of truth; mutating cfg.fsdp to bool breaks Ray worker
schema validation" — update the comment adjacent to the cfg.fsdp and fsdp_config
references (see the cfg.fsdp assignment and the downstream check `if cfg.fsdp or
cfg.fsdp_config:`) so it conforms to the one-line comment guideline.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Summary
Three independent CI regressions on
main. Two were introduced byafd74ae(#3664 "Fix: ci being broken (fa2, ray)"); the third is from apeftupgrade to 0.19.x. All three reproduce onmainindependent of any feature branch.Failing job:
test-axolotl-multigpu (130, 13.0.0, 3.11, 2.9.1, 2)Failure 1 — Ray workers crash with
cfg.batch_size = Nonetests/e2e/multigpu/test_ray.py::TestMultiGPURay::{test_lora_ddp,test_ds_zero2_packed[1|2]}fail withTypeError: unsupported operand type(s) for /: 'float' and 'NoneType'attrainer.py:519(calculate_total_num_steps).Cause:
afd74aewrapped bothbatch_sizeandgradient_accumulation_stepsderivations innormalize_configunderif not cfg.use_ray:. The Ray worker's re-validate rejects having BOTH set, but doesn't derivebatch_sizeeither — so it staysNoneand blows up downstream.Fix: only defer
gradient_accumulation_steps(the field the worker validator rejects); always derivebatch_sizeon the controller.Failure 2 —
fsdpschema rejects bool after deprecation shimtests/e2e/multigpu/test_ray.py::TestMultiGPURay::test_sft_fsdp2_packed[1|2]fails inside the Ray worker's re-validation withpydantic_core._pydantic_core.ValidationError: fsdp / Input should be a valid list [type=list_type, input_value=True, input_type=bool].Cause:
src/axolotl/utils/trainer.py::prepare_optim_envmutatescfg.fsdp = Truewhen migrating fromfsdp_config-style configs (line 650). The schema typesfsdp: list[str] | None, so the bool fails the worker's re-validate.Fix: drop the gratuitous mutation. Every downstream caller already handles
cfg.fsdp_config or cfg.fsdp(verified acrosscore/builders/base.py,core/builders/rl.py,loaders/model.py,train.py,utils/config/__init__.py,utils/distributed.py). The existing TODO to remove the legacycfg.fsdpcheck in 0.12 still stands.Failure 3 —
FakeExpertsmock breaks underpeft >= 0.19tests/utils/schemas/validation/test_moe_quant.py::TestMoeAdapterTrainMergeRoundtrip::test_train_save_merge_no_size_mismatchfails withAttributeError: 'FakeExperts' object has no attribute 'weight'inpeft/utils/save_and_load.py:520.Cause: peft 0.19 added
_maybe_shard_state_dict_for_tp(called unconditionally fromset_peft_model_state_dict), which readsbase_layer.weight.device.FakeExpertsuses thetarget_parametersstyle (gate_up_proj / down_proj) and legitimately has no.weight— realnn.Linear-style modules always do, peft assumed it.Fix: stub a zero-size buffer on
FakeExperts.weight. Optional follow-up: upstream PEFT could guard the call withif hasattr(base_layer, "weight").Test plan
pytest tests/utils/ -k "normalize_config or batch_size" -x— 2 passpytest tests/patched/test_validation.py -x— 65 pass, 1 skippedpytest tests/utils/schemas/validation/test_moe_quant.py::TestMoeAdapterTrainMergeRoundtrip -x— 1 passpre-commit run --files <changed files>— all 8 hooks passtest_ray.pyfailures are CI-environment-specific; verifying via this PR's CI runOut of scope (per handoff)
torchaocpp-extensions import warningSummary by CodeRabbit
Bug Fixes
Tests
Refactor