roundup_power2_divisions not needed with newer pytorch versions by winglian · Pull Request #3540 · axolotl-ai-cloud/axolotl

winglian · 2026-03-23T14:38:48Z

Description

`roundup_power2_divisions` was causing fragmentation in newer versions of PyTorch, which would OOM even at smaller sequence lengths when scaling up even slightly.

Also includes misc fixes in tests.

Summary by CodeRabbit

New Features
- Added Python 3.14 support in the CI test pipeline.
Bug Fixes
- Improved plugin configuration handling in the diffusion integration.
- Enhanced CUDA memory allocation configuration management for compatibility.
Chores
- Loosened vllm dependency constraint to allow newer compatible versions.
Tests
- Improved plugin isolation in test fixtures.
- Enhanced diffusion test initialization flow.

coderabbitai · 2026-03-23T14:39:14Z

Walkthrough

Multiple changes across CI, dependencies, plugin system, and tests: CI matrix expanded to include Python 3.14 with PyTorch version exclusions; vllm dependency constraint relaxed from exact to lower-bounded; diffusion plugin config assignment made conditional on attribute presence; CUDA memory allocation config restructured for PyTorch version branches; test fixture added for plugin manager cleanup; and e2e tests updated to call prepare_plugins before config validation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI Configuration `.github/workflows/tests.yml` | Updated pytest job matrices to test Python 3.14 alongside 3.12, with exclusions preventing 3.14 runs against PyTorch 2.8.0 and 2.9.1. |
| Dependency Constraints `setup.py` | Relaxed vllm dependency from exact pinned version (`vllm==0.17.1`) to lower-bounded constraint (`vllm>=0.17.1`) for PyTorch versions ≥ (2, 10). |
| Plugin System Updates `src/axolotl/integrations/diffusion/plugin.py` | Modified `post_trainer_create` to conditionally assign config to `trainer.axolotl_cfg` only when the attribute exists, replacing unconditional `set_config()` call. |
| CUDA Configuration `src/axolotl/utils/__init__.py` | Restructured `set_pytorch_cuda_alloc_conf()` to split configuration into base and suffix components, applying suffix only for PyTorch 2.2–2.8 range. |
| Test Infrastructure `tests/conftest.py` | Added autouse fixture to reset `PluginManager` singleton state (clearing `_instance` and `plugins`) after each test function. |
| E2E Diffusion Tests `tests/e2e/test_diffusion.py` | Added `prepare_plugins()` call before config validation in smoke test and SFT labels test. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

upgrade to support latest transformers release #2984 — Modifies vllm/extras dependency handling for specific PyTorch versions in setup.py, directly related to dependency constraint changes in this PR.

Suggested reviewers

djsaunde

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.44%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the primary change: removing `roundup_power2_divisions` from `PYTORCH_CUDA_ALLOC_CONF` for newer PyTorch versions (2.9+), which aligns with the main code modification in `src/axolotl/utils/__init__.py`. |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/axolotl/utils/__init__.py (1)

53-64: ⚠️ Potential issue | 🟠 Major

Prevent fallback to legacy allocator suffix on PyTorch >=2.9.

At Line 56, if PYTORCH_ALLOC_CONF is already set, the first if is skipped; then Line 60 elif still matches and sets PYTORCH_CUDA_ALLOC_CONF with roundup_power2_divisions:16. That can reintroduce the fragmentation behavior this PR is removing.

Proposed fix

-    if (
-        torch_major == 2
-        and torch_minor >= 9
-        and os.getenv("PYTORCH_ALLOC_CONF") is None
-    ):
-        os.environ["PYTORCH_ALLOC_CONF"] = config_value
-    elif (
-        torch_major == 2
-        and torch_minor >= 2
-        and os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None
-    ):
-        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = config_value + config_older_suffix
+    if torch_major == 2 and torch_minor >= 9:
+        if os.getenv("PYTORCH_ALLOC_CONF") is None:
+            os.environ["PYTORCH_ALLOC_CONF"] = config_value
+    elif torch_major == 2 and torch_minor >= 2:
+        if os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None:
+            os.environ["PYTORCH_CUDA_ALLOC_CONF"] = config_value + config_older_suffix
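The nested-branch structure proposed above can be sketched as a small standalone function. This is a minimal illustration, not the code in `src/axolotl/utils/__init__.py`: the function name `set_alloc_conf`, the injectable `env` mapping, and the base string `expandable_segments:True` are assumptions made here for testability. Only the `roundup_power2_divisions:16` suffix and the two environment variable names come from the review.

```python
import os
from typing import MutableMapping, Optional

# Assumed values for illustration; the review only confirms the
# "roundup_power2_divisions:16" suffix, not the base string.
CONFIG_VALUE = "expandable_segments:True"
CONFIG_OLDER_SUFFIX = ",roundup_power2_divisions:16"


def set_alloc_conf(
    torch_major: int,
    torch_minor: int,
    env: Optional[MutableMapping[str, str]] = None,
) -> None:
    """Set the allocator variable for a PyTorch version without overriding user settings."""
    env = os.environ if env is None else env
    if torch_major == 2 and torch_minor >= 9:
        # Newer PyTorch: generic variable, no roundup suffix.
        # If the user already set it, do nothing and never fall through.
        if env.get("PYTORCH_ALLOC_CONF") is None:
            env["PYTORCH_ALLOC_CONF"] = CONFIG_VALUE
    elif torch_major == 2 and torch_minor >= 2:
        # PyTorch 2.2 through 2.8: legacy variable, suffix retained.
        if env.get("PYTORCH_CUDA_ALLOC_CONF") is None:
            env["PYTORCH_CUDA_ALLOC_CONF"] = CONFIG_VALUE + CONFIG_OLDER_SUFFIX


# On 2.9 with a user-provided PYTORCH_ALLOC_CONF, the mapping is left unchanged.
user_env = {"PYTORCH_ALLOC_CONF": "max_split_size_mb:128"}
set_alloc_conf(2, 9, user_env)
```

Checking the version first and the environment variable second is what keeps the 2.9+ case from reaching the legacy branch, which is the failure mode the comment describes.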

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/utils/__init__.py` around lines 53 - 64, The current elif block
can run on PyTorch >=2.9 when PYTORCH_ALLOC_CONF is already set, reintroducing
the legacy allocator suffix; update the elif condition for the
PYTORCH_CUDA_ALLOC_CONF branch to only apply to older 2.x versions and only when
the top-level allocator var is unset by changing the condition to require
torch_major == 2, 2 <= torch_minor < 9, os.getenv("PYTORCH_ALLOC_CONF") is None,
and os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None so that config_value +
config_older_suffix is only written for older PyTorch and never when
PYTORCH_ALLOC_CONF is present (referencing torch_major, torch_minor,
PYTORCH_ALLOC_CONF, PYTORCH_CUDA_ALLOC_CONF, config_value, and
config_older_suffix).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/integrations/diffusion/plugin.py`:
- Around line 41-42: The current change assigns trainer.axolotl_cfg directly but
skips DiffusionTrainer.set_config(...) which performs required setup (local
config assignment, token-id cache init, and generation callback wiring); update
the logic to call trainer.set_config(cfg) when the trainer exposes that method
(or is an instance of DiffusionTrainer) so those side effects run, and only fall
back to setting trainer.axolotl_cfg = cfg if set_config is not present.

In `@tests/conftest.py`:
- Around line 477-484: The fixture reset_plugin_manager currently clears
PluginManager._instance and PluginManager.plugins but misses the class-level
PluginManager._cfg; update the fixture (reset_plugin_manager) to also reset
PluginManager._cfg (e.g. assign None or an empty dict) after yield so test
config state is fully isolated between tests.

---

Outside diff comments:
In `@src/axolotl/utils/__init__.py`:
- Around line 53-64: The current elif block can run on PyTorch >=2.9 when
PYTORCH_ALLOC_CONF is already set, reintroducing the legacy allocator suffix;
update the elif condition for the PYTORCH_CUDA_ALLOC_CONF branch to only apply
to older 2.x versions and only when the top-level allocator var is unset by
changing the condition to require torch_major == 2, 2 <= torch_minor < 9,
os.getenv("PYTORCH_ALLOC_CONF") is None, and
os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None so that config_value +
config_older_suffix is only written for older PyTorch and never when
PYTORCH_ALLOC_CONF is present (referencing torch_major, torch_minor,
PYTORCH_ALLOC_CONF, PYTORCH_CUDA_ALLOC_CONF, config_value, and
config_older_suffix).
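The `tests/conftest.py` finding above asks that `_cfg` be reset along with `_instance` and `plugins`. A rough sketch of that teardown follows, assuming a stand-in `PluginManager` with those three class-level attributes; this class is hypothetical and is not axolotl's implementation. The generator is written without the pytest decorator so it runs with the standard library alone; in a real `conftest.py` it would carry `@pytest.fixture(autouse=True)`.

```python
from typing import Any, Dict, Iterator, Optional


class PluginManager:
    """Stand-in singleton with the class-level attributes named in the review."""

    _instance: Optional["PluginManager"] = None
    _cfg: Optional[Dict[str, Any]] = None
    plugins: Dict[str, Any] = {}

    @classmethod
    def get_instance(cls) -> "PluginManager":
        # Lazily create the shared instance, as a singleton typically does.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


def reset_plugin_manager() -> Iterator[None]:
    """Generator in pytest-fixture shape: yield to the test, then clear state.

    In conftest.py this would be decorated with @pytest.fixture(autouse=True).
    """
    yield
    # Teardown: drop the singleton, its plugins, and the cached config.
    PluginManager._instance = None
    PluginManager.plugins = {}
    PluginManager._cfg = None
```

Everything after `yield` runs once the test body finishes, so state set by one test (including `_cfg`) is not visible to the next.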

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 64a496b4-1af6-44b4-bed3-e809a84fa73b

📥 Commits

Reviewing files that changed from the base of the PR and between 86be9f3 and dee1c14.

📒 Files selected for processing (6)

.github/workflows/tests.yml
setup.py
src/axolotl/integrations/diffusion/plugin.py
src/axolotl/utils/__init__.py
tests/conftest.py
tests/e2e/test_diffusion.py

coderabbitai · 2026-03-23T14:45:21Z

+        if hasattr(trainer, "axolotl_cfg"):
+            trainer.axolotl_cfg = cfg


⚠️ Potential issue | 🟠 Major

Restore diffusion trainer initialization side effects.

This change no longer calls DiffusionTrainer.set_config(...), which contains required setup (trainer-local config assignment, token-id cache initialization, and generation callback wiring in src/axolotl/integrations/diffusion/trainer.py Lines 26-38). Assigning only axolotl_cfg can leave diffusion behavior partially uninitialized.

Proposed fix

 def post_trainer_create(self, cfg: DictDefault, trainer: DiffusionTrainer):
     """Configure trainer after creation."""
     if hasattr(trainer, "axolotl_cfg"):
         trainer.axolotl_cfg = cfg
+    if hasattr(trainer, "set_config"):
+        trainer.set_config(cfg)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/axolotl/integrations/diffusion/plugin.py` around lines 41 - 42, The current change assigns trainer.axolotl_cfg directly but skips DiffusionTrainer.set_config(...) which performs required setup (local config assignment, token-id cache init, and generation callback wiring); update the logic to call trainer.set_config(cfg) when the trainer exposes that method (or is an instance of DiffusionTrainer) so those side effects run, and only fall back to setting trainer.axolotl_cfg = cfg if set_config is not present.
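The fallback order described in the prompt above (call `set_config` when the trainer has it, otherwise assign `axolotl_cfg`) might look like the sketch below. The two stub classes and the `setup_done` flag are hypothetical placeholders for the real trainer and its side effects; the function is written as a free function rather than the plugin method to keep the example self-contained.

```python
from typing import Any


class DiffusionTrainerStub:
    """Stand-in for a trainer whose set_config performs extra setup."""

    def __init__(self) -> None:
        self.axolotl_cfg: Any = None
        self.setup_done = False

    def set_config(self, cfg: Any) -> None:
        self.axolotl_cfg = cfg
        # Placeholder for token-id cache init and callback wiring.
        self.setup_done = True


class PlainTrainerStub:
    """Stand-in for a trainer that only exposes the axolotl_cfg attribute."""

    def __init__(self) -> None:
        self.axolotl_cfg: Any = None


def post_trainer_create(cfg: Any, trainer: Any) -> None:
    """Prefer set_config so its side effects run; otherwise assign the attribute."""
    set_config = getattr(trainer, "set_config", None)
    if callable(set_config):
        set_config(cfg)
    elif hasattr(trainer, "axolotl_cfg"):
        trainer.axolotl_cfg = cfg
```

A trainer with neither `set_config` nor `axolotl_cfg` is left untouched, which matches the conditional behavior the PR introduced.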

@CodeRabbit did this get fixed properly in the latest changeset?

codecov · 2026-03-23T14:52:44Z

Codecov Report

❌ Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/axolotl/cli/checks.py | 50.00% | 1 Missing ⚠️ |

NanoCode012 · 2026-03-24T10:30:04Z

                "lr_scheduler": "cosine",
                "max_steps": 5,
-                "flash_attention": True,
+                "flash_attention": False,


Why is this changed?

NanoCode012 · 2026-03-24T10:31:15Z

+lora_target_parameters:
+  - mlp.experts.gate_up_proj
+  - mlp.experts.down_proj


Should we default this to on?

If so, would we also need to enable the above shared_expert targeting?

We should also update the VRAM usage for these newer settings.

NanoCode012 · 2026-03-24T10:31:43Z

+liger_glu_activation: true
+liger_rms_norm_gated: true
+
+torch_compile: false


Why compile false?

it doesn't seem to compose well with everything else in experiments

* nemo gym integration with grpo wip
* mostly working
* cleanup
* simplify
* update docs
* nemo gym support wip
* cleanup
* chore: lint
* address PR review and add more tests
* chore: lint
* post merge lora fixes for CI (#3536) [skip ci]
* post merge lora fixes for CI
* handle lora kernel auto-enable for moe without grouped_mm
* prefer not to import torch in schema validation
* address pr comments, add timeout, add tests
* roundup_power2_divisions not needed with newer pytorch versions (#3540)
* roundup_power2_divisions not needed with newer pytorch versions
* remove typo
* update qwen3.5 moe 35b-a3b yaml for 5090
* more bug fixes
* fix tests to match updated trainer
* don't use fa2 for hooks test
* reset plugins on the instance
* retry download
* fix references to renamed axolotl_cfg property on trainer
* Fix ref to trainer cfg
* fix: robust handling of race condition on patching check (#3543) [skip ci]
* EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci]
* EBFT wip
* fixes
* more fixeS
* add missing strided module
* ebft fixes for multi-turn
* make ebft work with async
* add example for ebft w qwen3.5
* fix for split thinking and update yaml for lora over linear attention only
* enforce_eager for vllm arg in schema
* fix sync weights
* fix multi-gpu
* handle updated sig for mm
* ddp fixes
* improve multi-gpu handling, don't calculate logits, adaptive completion length
* chore: lint
* chore: lint
* support completion_mean
* Address corereview feedback
* clamp min IS ratio
* Address PR code review
* more fixes identified
* address code review
* Fix property from rebase conflict
* fix for ebft sync and update docs
* make trainer loss patch check a solo test

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Mar 23, 2026

winglian and others added 6 commits March 23, 2026 18:35

roundup_power2_divisions not needed with newer pytorch versions

7399d93

remove typo

b2d80dc

update qwen3.5 moe 35b-a3b yaml for 5090

c452ad2

more bug fixes

89acb7a

fix tests to match updated trainer

b5c9434

don't use fa2 for hooks test

6f1d805

winglian force-pushed the alloc-fragmentation branch from cbf1802 to 6f1d805 Compare March 23, 2026 18:35

NanoCode012 reviewed Mar 24, 2026

winglian and others added 2 commits March 24, 2026 12:59

reset plugins on the instance

70ba03f

retry download

3edefa0

winglian added the scheduled_release label (This PR is slated for the upcoming release) Mar 24, 2026

winglian and others added 2 commits March 24, 2026 12:44

fix references to renamed axolotl_cfg property on trainer

6709283

Fix ref to trainer cfg

690be0e

winglian merged commit e412370 into main Mar 24, 2026
18 of 23 checks passed

winglian deleted the alloc-fragmentation branch March 24, 2026 19:40
