
roundup_power2_divisions not needed with newer pytorch versions#3540

Merged
winglian merged 10 commits into main from alloc-fragmentation
Mar 24, 2026

Conversation

@winglian
Collaborator

@winglian winglian commented Mar 23, 2026

Description

roundup_power2_divisions was causing memory fragmentation in newer versions of PyTorch, which led to OOM errors even at smaller sequence lengths when scaling up only slightly.

Also includes miscellaneous fixes in tests.

Summary by CodeRabbit

  • New Features

    • Added Python 3.14 support in the CI test pipeline.
  • Bug Fixes

    • Improved plugin configuration handling in the diffusion integration.
    • Enhanced CUDA memory allocation configuration management for compatibility.
  • Chores

    • Loosened vllm dependency constraint to allow newer compatible versions.
  • Tests

    • Improved plugin isolation in test fixtures.
    • Enhanced diffusion test initialization flow.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 23, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 491f55c9-9066-42ac-a28a-988cb5b3d091

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

Multiple changes across CI, dependencies, plugin system, and tests: CI matrix expanded to include Python 3.14 with PyTorch version exclusions; vllm dependency constraint relaxed from exact to lower-bounded; diffusion plugin config assignment made conditional on attribute presence; CUDA memory allocation config restructured for PyTorch version branches; test fixture added for plugin manager cleanup; and e2e tests updated to call prepare_plugins before config validation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI Configuration: .github/workflows/tests.yml | Updated pytest job matrices to test Python 3.14 alongside 3.12, with exclusions preventing 3.14 runs against PyTorch 2.8.0 and 2.9.1. |
| Dependency Constraints: setup.py | Relaxed vllm dependency from exact pinned version (vllm==0.17.1) to lower-bounded constraint (vllm>=0.17.1) for PyTorch versions ≥ (2, 10). |
| Plugin System Updates: src/axolotl/integrations/diffusion/plugin.py | Modified post_trainer_create to conditionally assign config to trainer.axolotl_cfg only when the attribute exists, replacing unconditional set_config() call. |
| CUDA Configuration: src/axolotl/utils/__init__.py | Restructured set_pytorch_cuda_alloc_conf() to split configuration into base and suffix components, applying suffix only for PyTorch 2.2–2.8 range. |
| Test Infrastructure: tests/conftest.py | Added autouse fixture to reset PluginManager singleton state (clearing _instance and plugins) after each test function. |
| E2E Diffusion Tests: tests/e2e/test_diffusion.py | Added prepare_plugins() call before config validation in smoke test and SFT labels test. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • djsaunde
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the primary change: removing roundup_power2_divisions from PYTORCH_CUDA_ALLOC_CONF for newer PyTorch versions (2.9+), which aligns with the main code modification in src/axolotl/utils/__init__.py. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/axolotl/utils/__init__.py (1)

53-64: ⚠️ Potential issue | 🟠 Major

Prevent fallback to legacy allocator suffix on PyTorch >=2.9.

At Line 56, if PYTORCH_ALLOC_CONF is already set, the first if is skipped; then Line 60 elif still matches and sets PYTORCH_CUDA_ALLOC_CONF with roundup_power2_divisions:16. That can reintroduce the fragmentation behavior this PR is removing.

Proposed fix
-    if (
-        torch_major == 2
-        and torch_minor >= 9
-        and os.getenv("PYTORCH_ALLOC_CONF") is None
-    ):
-        os.environ["PYTORCH_ALLOC_CONF"] = config_value
-    elif (
-        torch_major == 2
-        and torch_minor >= 2
-        and os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None
-    ):
-        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = config_value + config_older_suffix
+    if torch_major == 2 and torch_minor >= 9:
+        if os.getenv("PYTORCH_ALLOC_CONF") is None:
+            os.environ["PYTORCH_ALLOC_CONF"] = config_value
+    elif torch_major == 2 and torch_minor >= 2:
+        if os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None:
+            os.environ["PYTORCH_CUDA_ALLOC_CONF"] = config_value + config_older_suffix
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/utils/__init__.py` around lines 53 - 64, The current elif block
can run on PyTorch >=2.9 when PYTORCH_ALLOC_CONF is already set, reintroducing
the legacy allocator suffix; update the elif condition for the
PYTORCH_CUDA_ALLOC_CONF branch to only apply to older 2.x versions and only when
the top-level allocator var is unset by changing the condition to require
torch_major == 2, 2 <= torch_minor < 9, os.getenv("PYTORCH_ALLOC_CONF") is None,
and os.getenv("PYTORCH_CUDA_ALLOC_CONF") is None so that config_value +
config_older_suffix is only written for older PyTorch and never when
PYTORCH_ALLOC_CONF is present (referencing torch_major, torch_minor,
PYTORCH_ALLOC_CONF, PYTORCH_CUDA_ALLOC_CONF, config_value, and
config_older_suffix).
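
The branch structure suggested in this comment can be sketched roughly as below. This is a minimal illustration, not the code that was merged: the function name `set_alloc_conf`, the injectable `env` mapping, and the base value `expandable_segments:True` are assumptions made for the example. Only the `roundup_power2_divisions:16` suffix and the two environment variable names come from the review text.

```python
import os
from typing import MutableMapping, Optional

# Assumed values for illustration; only the roundup_power2_divisions:16
# suffix is named in the review comment.
BASE_CONF = "expandable_segments:True"
LEGACY_SUFFIX = ",roundup_power2_divisions:16"


def set_alloc_conf(
    torch_version: str,
    env: Optional[MutableMapping[str, str]] = None,
) -> None:
    """Set an allocator env var without overriding a user-provided value."""
    env = os.environ if env is None else env
    major, minor = (int(part) for part in torch_version.split(".")[:2])

    if major == 2 and minor >= 9:
        # Newer PyTorch: the version check alone selects this branch, so a
        # pre-set PYTORCH_ALLOC_CONF can never fall through to the legacy one.
        env.setdefault("PYTORCH_ALLOC_CONF", BASE_CONF)
    elif major == 2 and minor >= 2:
        # Older 2.x: keep the legacy suffix under the older variable name.
        env.setdefault("PYTORCH_CUDA_ALLOC_CONF", BASE_CONF + LEGACY_SUFFIX)
```

Passing a plain dict as `env` makes the branching easy to check without touching the real process environment; in practice the version string would presumably come from `torch.__version__`.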
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/integrations/diffusion/plugin.py`:
- Around line 41-42: The current change assigns trainer.axolotl_cfg directly but
skips DiffusionTrainer.set_config(...) which performs required setup (local
config assignment, token-id cache init, and generation callback wiring); update
the logic to call trainer.set_config(cfg) when the trainer exposes that method
(or is an instance of DiffusionTrainer) so those side effects run, and only fall
back to setting trainer.axolotl_cfg = cfg if set_config is not present.

In `@tests/conftest.py`:
- Around line 477-484: The fixture reset_plugin_manager currently clears
PluginManager._instance and PluginManager.plugins but misses the class-level
PluginManager._cfg; update the fixture (reset_plugin_manager) to also reset
PluginManager._cfg (e.g. assign None or an empty dict) after yield so test
config state is fully isolated between tests.
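
A rough sketch of the teardown this comment asks for is shown here. The `PluginManager` class is a stand-in written for the example, not axolotl's real class; only the attribute names `_instance`, `plugins`, and `_cfg` and the fixture name `reset_plugin_manager` are taken from the comment. The generator is left undecorated so the block runs without pytest; in `conftest.py` it would presumably carry `@pytest.fixture(autouse=True)`.

```python
from typing import Any, Dict, Iterator, Optional


class PluginManager:
    """Stand-in singleton for illustration, not axolotl's real class."""

    _instance: Optional["PluginManager"] = None
    _cfg: Optional[Dict[str, Any]] = None
    plugins: Dict[str, Any] = {}

    @classmethod
    def get_instance(cls) -> "PluginManager":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


def reset_plugin_manager() -> Iterator[None]:
    """Yield to the test, then clear all class-level state.

    In conftest.py this generator would be decorated with
    @pytest.fixture(autouse=True) so it wraps every test.
    """
    yield
    PluginManager._instance = None
    PluginManager._cfg = None  # the piece the review says was missing
    PluginManager.plugins = {}
```

Rebinding `plugins` to a fresh dict, rather than calling `.clear()`, also avoids leaking state through references that an earlier test may still hold.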


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 64a496b4-1af6-44b4-bed3-e809a84fa73b

📥 Commits

Reviewing files that changed from the base of the PR and between 86be9f3 and dee1c14.

📒 Files selected for processing (6)
  • .github/workflows/tests.yml
  • setup.py
  • src/axolotl/integrations/diffusion/plugin.py
  • src/axolotl/utils/__init__.py
  • tests/conftest.py
  • tests/e2e/test_diffusion.py

Comment on lines +41 to +42
if hasattr(trainer, "axolotl_cfg"):
    trainer.axolotl_cfg = cfg
Contributor


⚠️ Potential issue | 🟠 Major

Restore diffusion trainer initialization side effects.

This change no longer calls DiffusionTrainer.set_config(...), which contains required setup (trainer-local config assignment, token-id cache initialization, and generation callback wiring in src/axolotl/integrations/diffusion/trainer.py Lines 26-38). Assigning only axolotl_cfg can leave diffusion behavior partially uninitialized.

Proposed fix
 def post_trainer_create(self, cfg: DictDefault, trainer: DiffusionTrainer):
     """Configure trainer after creation."""
     if hasattr(trainer, "axolotl_cfg"):
         trainer.axolotl_cfg = cfg
+    if hasattr(trainer, "set_config"):
+        trainer.set_config(cfg)
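
The "prefer `set_config`, otherwise fall back to the attribute" behaviour described in this thread could look roughly like the sketch below. The helper name `apply_trainer_config` and both trainer classes are hypothetical stand-ins for the example. The real `DiffusionTrainer.set_config` is not reproduced here, and its side effects are represented only by a `ready` flag.

```python
from typing import Any


def apply_trainer_config(trainer: Any, cfg: Any) -> None:
    """Prefer trainer.set_config(cfg); fall back to a plain attribute."""
    set_config = getattr(trainer, "set_config", None)
    if callable(set_config):
        set_config(cfg)  # runs whatever setup the trainer ties to its config
    else:
        trainer.axolotl_cfg = cfg


class TrainerWithSetup:
    """Hypothetical trainer whose set_config has side effects."""

    def __init__(self) -> None:
        self.axolotl_cfg = None
        self.ready = False

    def set_config(self, cfg: Any) -> None:
        self.axolotl_cfg = cfg
        self.ready = True  # stands in for cache init and callback wiring


class PlainTrainer:
    """Hypothetical trainer with no set_config method."""

    axolotl_cfg = None
```

Checking for the method rather than the class keeps the plugin usable with trainers that only expose the attribute.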

Collaborator Author


@CodeRabbit did this get fixed properly in the latest changeset?

Comment thread tests/conftest.py
@codecov

codecov Bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/cli/checks.py | 50.00% | 1 Missing ⚠️ |

📢 Thoughts on this report? Let us know!

@winglian winglian force-pushed the alloc-fragmentation branch from cbf1802 to 6f1d805 Compare March 23, 2026 18:35
Comment thread tests/e2e/integrations/test_hooks.py Outdated
"lr_scheduler": "cosine",
"max_steps": 5,
- "flash_attention": True,
+ "flash_attention": False,
Collaborator


Why is this changed?

Comment on lines +50 to +52
lora_target_parameters:
- mlp.experts.gate_up_proj
- mlp.experts.down_proj
Collaborator


Should we default this to on?

If so, would we also need to enable the above shared_expert targeting?

We should also update the vram usage for these newer settings

liger_glu_activation: true
liger_rms_norm_gated: true

torch_compile: false
Collaborator


Why compile false?

Collaborator Author


it doesn't seem to compose well with everything else in experiments

@winglian winglian added the scheduled_release This PR is slated for the upcoming release label Mar 24, 2026
@winglian winglian merged commit e412370 into main Mar 24, 2026
18 of 23 checks passed
@winglian winglian deleted the alloc-fragmentation branch March 24, 2026 19:40
winglian added a commit that referenced this pull request Mar 25, 2026
* nemo gym integration with grpo wip

* mostly working

* cleanup

* simplify

* update docs

* nemo gym support wip

* cleanup

* chore: lint

* address PR review and add more tests

* chore: lint

* post merge lora fixes for CI (#3536) [skip ci]

* post merge lora fixes for CI

* handle lora kernel auto-enable for moe without grouped_mm

* prefer not to import torch in schema validation

* address pr comments, add timeout, add tests

* roundup_power2_divisions not needed with newer pytorch versions (#3540)

* roundup_power2_divisions not needed with newer pytorch versions

* remove typo

* update qwen3.5 moe 35b-a3b yaml for 5090

* more bug fixes

* fix tests to match updated trainer

* don't use fa2 for hooks test

* reset plugins on the instance

* retry download

* fix references to renamed axolotl_cfg property on trainer

* Fix ref to trainer cfg

* fix: robust handling of race condition on patching check (#3543) [skip ci]

* EBFT: Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models (#3527) [skip ci]

* EBFT wip

* fixes

* more fixeS

* add missing strided module

* ebft fixes for multi-turn

* make ebft work with async

* add example for ebft w qwen3.5

* fix for split thinking and update yaml for lora over linear attention only

* enforce_eager for vllm arg in schema

* fix sync weights

* fix multi-gpu

* handle updated sig for mm

* ddp fixes

* improve multi-gpu handling, don't calculate logits, adaptive completion length

* chore: lint

* chore: lint

* support completion_mean

* Address corereview feedback

* clamp min IS ratio

* Address PR code review

* more fixes identified

* address code review

* Fix property from rebase conflict

* fix for ebft sync and update docs

* make trainer loss patch check a solo test

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

scheduled_release This PR is slated for the upcoming release


2 participants