roundup_power2_divisions not needed with newer pytorch versions#3540
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughMultiple changes across CI, dependencies, plugin system, and tests: CI matrix expanded to include Python 3.14 with PyTorch version exclusions; vllm dependency constraint relaxed from exact to lower-bounded; diffusion plugin config assignment made conditional on attribute presence; CUDA memory allocation config restructured for PyTorch version branches; test fixture added for plugin manager cleanup; and e2e tests updated to call prepare_plugins before config validation. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/axolotl/utils/__init__.py (1)
53-64:⚠️ Potential issue | 🟠 MajorPrevent fallback to legacy allocator suffix on PyTorch >=2.9.
At Line 56, if
PYTORCH_ALLOC_CONFis already set, the firstifis skipped; then Line 60elifstill matches and setsPYTORCH_CUDA_ALLOC_CONFwithroundup_power2_divisions:16. That can reintroduce the fragmentation behavior this PR is removing.Proposed fix
- if ( - torch_major == 2 - and torch_minor >= 9 - and os.getenv("PYTORCH_ALLOC_CONF") is None - ): - os.environ["PYTORCH_ALLOC_CONF"] = config_value - elif ( - torch_major == 2 - and torch_minor >= 2 - and os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None - ): - os.environ["PYTORCH_CUDA_ALLOC_CONF"] = config_value + config_older_suffix + if torch_major == 2 and torch_minor >= 9: + if os.getenv("PYTORCH_ALLOC_CONF") is None: + os.environ["PYTORCH_ALLOC_CONF"] = config_value + elif torch_major == 2 and torch_minor >= 2: + if os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None: + os.environ["PYTORCH_CUDA_ALLOC_CONF"] = config_value + config_older_suffix🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/axolotl/utils/__init__.py` around lines 53 - 64, The current elif block can run on PyTorch >=2.9 when PYTORCH_ALLOC_CONF is already set, reintroducing the legacy allocator suffix; update the elif condition for the PYTORCH_CUDA_ALLOC_CONF branch to only apply to older 2.x versions and only when the top-level allocator var is unset by changing the condition to require torch_major == 2, 2 <= torch_minor < 9, os.getenv("PYTORCH_ALLOC_CONF") is None, and os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None so that config_value + config_older_suffix is only written for older PyTorch and never when PYTORCH_ALLOC_CONF is present (referencing torch_major, torch_minor, PYTORCH_ALLOC_CONF, PYTORCH_CUDA_ALLOC_CONF, config_value, and config_older_suffix).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/axolotl/integrations/diffusion/plugin.py`:
- Around line 41-42: The current change assigns trainer.axolotl_cfg directly but
skips DiffusionTrainer.set_config(...) which performs required setup (local
config assignment, token-id cache init, and generation callback wiring); update
the logic to call trainer.set_config(cfg) when the trainer exposes that method
(or is an instance of DiffusionTrainer) so those side effects run, and only fall
back to setting trainer.axolotl_cfg = cfg if set_config is not present.
In `@tests/conftest.py`:
- Around line 477-484: The fixture reset_plugin_manager currently clears
PluginManager._instance and PluginManager.plugins but misses the class-level
PluginManager._cfg; update the fixture (reset_plugin_manager) to also reset
PluginManager._cfg (e.g. assign None or an empty dict) after yield so test
config state is fully isolated between tests.
---
Outside diff comments:
In `@src/axolotl/utils/__init__.py`:
- Around line 53-64: The current elif block can run on PyTorch >=2.9 when
PYTORCH_ALLOC_CONF is already set, reintroducing the legacy allocator suffix;
update the elif condition for the PYTORCH_CUDA_ALLOC_CONF branch to only apply
to older 2.x versions and only when the top-level allocator var is unset by
changing the condition to require torch_major == 2, 2 <= torch_minor < 9,
os.getenv("PYTORCH_ALLOC_CONF") is None, and
os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None so that config_value +
config_older_suffix is only written for older PyTorch and never when
PYTORCH_ALLOC_CONF is present (referencing torch_major, torch_minor,
PYTORCH_ALLOC_CONF, PYTORCH_CUDA_ALLOC_CONF, config_value, and
config_older_suffix).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 64a496b4-1af6-44b4-bed3-e809a84fa73b
📒 Files selected for processing (6)
.github/workflows/tests.ymlsetup.pysrc/axolotl/integrations/diffusion/plugin.pysrc/axolotl/utils/__init__.pytests/conftest.pytests/e2e/test_diffusion.py
| if hasattr(trainer, "axolotl_cfg"): | ||
| trainer.axolotl_cfg = cfg |
There was a problem hiding this comment.
Restore diffusion trainer initialization side effects.
This change no longer calls DiffusionTrainer.set_config(...), which contains required setup (trainer-local config assignment, token-id cache initialization, and generation callback wiring in src/axolotl/integrations/diffusion/trainer.py Lines 26-38). Assigning only axolotl_cfg can leave diffusion behavior partially uninitialized.
Proposed fix
def post_trainer_create(self, cfg: DictDefault, trainer: DiffusionTrainer):
"""Configure trainer after creation."""
if hasattr(trainer, "axolotl_cfg"):
trainer.axolotl_cfg = cfg
+ if hasattr(trainer, "set_config"):
+ trainer.set_config(cfg)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/integrations/diffusion/plugin.py` around lines 41 - 42, The
current change assigns trainer.axolotl_cfg directly but skips
DiffusionTrainer.set_config(...) which performs required setup (local config
assignment, token-id cache init, and generation callback wiring); update the
logic to call trainer.set_config(cfg) when the trainer exposes that method (or
is an instance of DiffusionTrainer) so those side effects run, and only fall
back to setting trainer.axolotl_cfg = cfg if set_config is not present.
There was a problem hiding this comment.
@CodeRabbit did this get fixed properly in the latest changeset?
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
cbf1802 to
6f1d805
Compare
| "lr_scheduler": "cosine", | ||
| "max_steps": 5, | ||
| "flash_attention": True, | ||
| "flash_attention": False, |
| lora_target_parameters: | ||
| - mlp.experts.gate_up_proj | ||
| - mlp.experts.down_proj |
There was a problem hiding this comment.
Should we default this to on?
If so, would we also need to enable the above shared_expert targeting?
We should also update the vram usage for these newer settings
| liger_glu_activation: true | ||
| liger_rms_norm_gated: true | ||
|
|
||
| torch_compile: false |
There was a problem hiding this comment.
it doesn't seem to compose well with everything else in experiments
* nemo gym integration with grpo wip * mostly working * cleanup * simplify * update docs * nemo gym support wip * cleanup * chore: lint * address PR review and add more tests * chore: lint * post merge lora fixes for CI (#3536) [skip ci] * post merge lora fixes for CI * handle lora kernel auto-enable for moe without grouped_mm * prefer not to import torch in schema validation * address pr comments, add timeout, add tests * roundup_power2_divisions not needed with newer pytorch versions (#3540) * roundup_power2_divisions not needed with newer pytorch versions * remove typo * update qwen3.5 moe 35b-a3b yaml for 5090 * more bug fixes * fix tests to match updated trainer * don't use fa2 for hooks test * reset plugins on the instance * retry download * fix references to renamed axolotl_cfg property on trainer * Fix ref to trainer cfg * fix: robust handling of race condition on patching check (#3543) [skip ci] * EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci] * EBFT wip * fixes * more fixeS * add missing strided module * ebft fixes for multi-turn * make ebft work with async * add example for ebft w qwen3.5 * fix for split thinking and update yaml for lora over linear attention only * enforce_eager for vllm arg in schema * fix sync weights * fix multi-gpu * handle updated sig for mm * ddp fixes * improve multi-gpu handling, don't calculate logits, adaptive completion length * chore: lint * chore: lint * support completion_mean * Address corereview feedback * clamp min IS ratio * Address PR code review * more fixes identified * address code review * Fix property from rebase conflict * fix for ebft sync and update docs * make trainer loss patch check a solo test --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
roundup_power2_divisionswas causing fragmentation in newer versions of PyTorch which would oom even at smaller sequence lengths when scaling up even slightly.also misc fixes in tests
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests