Skip to content

[WIP] Fix SP tests#3309

Merged
winglian merged 1 commit into
transformers-v5from
transformers-v5-fix-accelerate-cp
Dec 16, 2025
Merged

[WIP] Fix SP tests#3309
winglian merged 1 commit into
transformers-v5from
transformers-v5-fix-accelerate-cp

Conversation

@salmanmohammadi

@salmanmohammadi salmanmohammadi commented Dec 8, 2025

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

Release Notes

  • New Features

    • Added FSDP version configuration support with enhanced bidirectional synchronization.
    • Added context-parallel hooks for improved parallelism configuration.
  • Bug Fixes

    • Updated Huggingface CLI commands to use new hf syntax throughout documentation and codebase.
    • Corrected Mistral tokenizer backend patching.
    • Fixed chat template tokenization parameters for improved token handling.
  • Chores

    • Updated dependencies: huggingface-hub, deepspeed, trl, accelerate, and kernels.
    • Optimized test parallelism and GitHub Actions workflows.
    • Removed deprecated test case for Mistral tokenizer patching.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Dec 8, 2025

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

A substantial update spanning dependency versions (huggingface-hub, transformers, accelerate, deepspeed, trl), Hugging Face CLI command migration, Mistral tokenizer backend class refactoring, FSDP configuration with new versioning and bidirectional sync logic, context parallel monkeypatch additions, and test updates reflecting tokenization changes and transformers v5 compatibility adjustments.

Changes

Cohort / File(s) Change Summary
GitHub Actions Workflows
.github/workflows/preview-docs.yml, .github/workflows/tests.yml, cicd/multigpu.sh
Cleanup node removal step added to preview workflow; test CLI commands migrated from huggingface-cli to hf; PyTest parallelism reduced from -n8 to -n4; gate-skip-e2e dependencies narrowed; --maxfail=3 flag added to pytest invocation.
Documentation & Configuration
docs/amd_hpc.qmd, docs/installation.qmd, requirements.txt
Hugging Face commands replaced: huggingface-cli downloadhf download, huggingface-cli loginhf auth login. Requirements updated: huggingface-hub >=0.36.0 → >=1.1.7, transformers from main branch, accelerate 1.11.0 → 1.12.0, deepspeed >=0.17.0 → >=0.18.2, trl 0.25.0 → 0.25.1, kernels >=0.9.0 → >=0.11.2, trackio reintroduced.
CLI & Checks
src/axolotl/cli/checks.py
Error message updated to instruct login via hf auth login instead of huggingface-cli login.
Core Builders
src/axolotl/core/builders/base.py, src/axolotl/core/builders/causal.py
Warmup calculation refactored with explicit type annotation and fallback logic when warmup_ratio > 0.0; include_num_input_tokens_seen aliased to include_tokens_per_second; collator creation gated for single-micro-batch non-eval scenarios.
Mistral Tokenizer Integration
src/axolotl/loaders/patch_manager.py, src/axolotl/loaders/processor.py, src/axolotl/utils/mistral/mistral_tokenizer.py, src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py
Mistral v3 tokenizer patch block removed from patch_manager; patch target changed from MistralCommonTokenizer to MistralCommonBackend in processor and mistral_common_tokenizer; HFMistralTokenizer base class changed from MistralCommonTokenizer to MistralCommonBackend.
Processing & Prompt Strategies
src/axolotl/processing_strategies.py, src/axolotl/prompt_strategies/chat_template.py
Mistral3ProcessingStrategy processor parameter untyped; Mistral3Processor import moved to function scope; chat template tokenization now uses tokenize=True and return_dict=False.
Parallelism & Accelerate
src/axolotl/monkeypatch/accelerate/parallelism_config.py, src/axolotl/utils/trainer.py
New patch_prepare_cp function added to handle context-parallel hooks and cp_comm_strategy configuration; patch_prepare_cp invoked in setup_parallelism_envs when context_parallel_size > 1.
Trainer Monkeypatch
src/axolotl/monkeypatch/transformers/trainer_context_parallel.py
Patched function loading refactored to use dedicated local namespace instead of global namespace pollution.
FSDP Schema & Validation
src/axolotl/utils/schemas/fsdp.py, src/axolotl/utils/schemas/validation.py
New optional version field added to FSDPConfig; check_fsdp_version_in_fsdp_config refactored to provide bidirectional sync between top-level fsdp_version and fsdp_config.version; fsdp_config key normalization broadened to include fsdp_version.
Utilities
src/axolotl/utils/callbacks/perplexity.py
PreTrainedTokenizer import updated with fallback: transformers.tokenization_python → transformers.tokenization_utils on ImportError.
E2E Integration Tests
tests/e2e/integrations/test_*.py
Import paths changed from relative (..utils) to absolute (tests.e2e.utils).
E2E Multi-GPU Tests
tests/e2e/multigpu/test_*.py, tests/e2e/utils.py
Test decorator changed from @require_hopper to @supports_fp8 in fp8_fsdp2; lora kernel flags disabled in test_lora_sft; skip markers added for transformers v5 compatibility breaks; new supports_fp8 utility function added.
Prompt Strategy Tests
tests/prompt_strategies/test_*.py
Token ID expectations updated across alpaca, chat_templates, chat_templates_advanced, and raw_io tests; phi35_tokenizer test commented out due to transformers v5 incompatibility.
Unit & Config Tests
tests/test_*.py, tests/utils/schemas/validation/test_fsdp.py, tests/hf_offline_utils.py
Token ID and expected values updated in test_tokenizers and test_alpaca; test_fsdp tests refactored for version field in fsdp_config; reset_sessions call removed and commented out in hf_offline_utils.
Removed Tests
tests/monkeypatch/test_mistral_tokenizer_patch.py
Integration test module removed (TestMistralTokenizerPatchIntegration class and test_mistral_tokenizer_image_patch method deleted).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • src/axolotl/monkeypatch/accelerate/parallelism_config.py — New patch_prepare_cp function introduces context-parallel hook logic with device mesh and rotation method configuration; requires understanding of accelerate internals and context-parallel mechanics.
  • src/axolotl/utils/schemas/validation.py — Refactored FSDP version sync logic is intricate; bidirectional propagation between top-level and nested fields, plus expanded key normalization, requires careful verification of all code paths and edge cases.
  • Mistral tokenizer changes across multiple files — Coordinated class/target changes in patch_manager, processor, mistral_tokenizer, and monkeypatch files; consistency across files must be verified to ensure patches target the correct backend.
  • requirements.txt dependency upgrade chain — Multiple transitive dependencies updated (huggingface-hub, transformers from main, accelerate, deepspeed, trl, kernels); compatibility and breakage risk should be assessed.
  • Test token ID updates — Widespread token ID expectation changes across multiple test files suggest upstream tokenizer behavior changes; verify these reflect actual tokenizer output after dependency upgrades and are not regressions.

Possibly related PRs

  • Modifications to parallelism configuration and context-parallel patch integration share direct code-level changes with patching logic and device mesh handling across accelerate monkeypatches.
  • FSDP schema extensions and validation/normalization logic changes overlap with prior schema refactoring and fsdp_version/fsdp_config handling improvements.
  • Mistral tokenizer backend class and patching target changes coordinate with prior refactoring of tokenizer monkeypatch implementations and loader logic.

Suggested labels

hold

Suggested reviewers

  • winglian
  • NanoCode012
  • djsaunde

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.58% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The PR title '[WIP] Fix SP tests' is vague and does not clearly convey the actual scope of changes, which include major dependency updates, workflow modifications, tokenizer patches, schema changes, and numerous test updates across multiple areas. Replace with a more descriptive title that captures the main objective, such as 'Update HuggingFace CLI, dependencies, and fix Mistral tokenizer patches' or similar to reflect the substantial changes across the codebase.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch transformers-v5-fix-accelerate-cp

Warning

Tools execution failed with the following error:

Failed to run tools: 14 UNAVAILABLE: read ECONNRESET


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/axolotl/monkeypatch/transformers/trainer_context_parallel.py (1)

49-63: Guard against empty items_to_import to avoid exec syntax errors

Using an isolated namespace for exec is a good improvement, but if items_to_import is ever empty, this line:

exec(f"from {module_name} import ({', '.join(items_to_import)})", namespace)

will evaluate to from {module_name} import (), which is a SyntaxError and will break patching.

A minimal fix is to skip the import exec when there’s nothing to import:

-    # Use a separate namespace to capture the exec'd function
-    namespace = {}
-    exec(f"from {module_name} import ({', '.join(items_to_import)})", namespace)
-    exec(patched_source, namespace)
+    # Use a separate namespace to capture the exec'd function
+    namespace: dict[str, object] = {}
+    if items_to_import:
+        exec(
+            f"from {module_name} import ({', '.join(items_to_import)})",
+            namespace,
+        )
+    exec(patched_source, namespace)
🧹 Nitpick comments (9)
tests/e2e/multigpu/test_gemma3.py (1)

31-33: Make the skip easier to revisit (add TODO-style phrasing).

The skip reason is clear and references the upstream PR; consider explicitly wording it as a TODO (e.g., “TODO: unskip once transformers includes fix from …”) so it’s easy to grep for when upgrading transformers and deciding which tests to re-enable.

tests/e2e/multigpu/test_fsdp1.py (1)

247-247: Clarify skip reason and make it actionable to unskip later.

The blanket reason "broken in transformers v5" works, but you may want to align with the Gemma3 test by referencing the specific upstream issue/PR and/or adding TODO-style wording (e.g., “TODO: unskip when transformers bug X is fixed”) so it’s obvious when this test should be re-enabled.

src/axolotl/core/builders/causal.py (1)

438-440: Clarify early-return semantics for pretraining collator when micro_batch_size == 1

In the training_args.pretraining branch with pretraining_sample_concatenation=True and micro_batch_size==1, this condition now returns None for all training calls (is_eval is False), regardless of sample_packing / pretrain_multipack_attn, because the right-hand side of the or is always true in that case. For eval (is_eval=True), behavior remains return None only when not (sample_packing and pretrain_multipack_attn).

If this is the intended behavior (i.e., rely on the Trainer’s default collator for that pretraining + mbsz=1 training case), you might want to make it more explicit to avoid future misreads:

-            if not (self.cfg.sample_packing and self.cfg.pretrain_multipack_attn) or (
-                self.cfg.micro_batch_size == 1 and is_eval is False
-            ):
-                return None
+            if self.cfg.micro_batch_size == 1 and not is_eval:
+                return None
+            if not (self.cfg.sample_packing and self.cfg.pretrain_multipack_attn):
+                return None

Same logic, but easier to reason about when tweaking pretraining/sample-packing configs later.

src/axolotl/core/builders/base.py (1)

207-241: Warmup ratio → warmup_steps fallback looks correct for transformers v5

The restructuring of warmup logic (typed warmup_steps: int | float plus the warmup_ratio > 0.0 and warmup_steps == 0 fallback) cleanly preserves existing precedence (warmup_steps > warmup_ratio > default) while switching to a single warmup_steps kwarg, and should keep behaviour consistent when total_num_steps is unknown.

If you expect future readers to be confused by warmup_steps sometimes being a ratio (float), a short comment explaining that transformers v5 interprets warmup_steps < 1.0 as a ratio would help maintainability, but the runtime logic itself looks sound.

src/axolotl/utils/schemas/validation.py (1)

865-876: Consider adding conflict detection for version mismatches.

The bidirectional synchronization logic works correctly for the tested scenarios. However, it doesn't detect conflicts when both fsdp_version and fsdp_config.version are set to different values. For example, if a user sets fsdp_version: 2 and fsdp_config.version: 1, neither synchronization condition triggers, potentially leaving an inconsistent state.

Consider adding a check to detect and handle version conflicts:

 @model_validator(mode="before")
 @classmethod
 def check_fsdp_version_in_fsdp_config(cls, data):
     fsdp_config = data.get("fsdp_config") or {}
     fsdp_version = data.get("fsdp_version", None)
+    
+    # Check for conflicts between top-level and nested version
+    if fsdp_version and fsdp_config.get("version") and fsdp_version != fsdp_config.get("version"):
+        raise ValueError(
+            f"Conflicting FSDP versions: fsdp_version={fsdp_version} but "
+            f"fsdp_config.version={fsdp_config.get('version')}. "
+            "Please set only one or ensure they match."
+        )
+    
     if not fsdp_version and fsdp_config and fsdp_config.get("version"):
         data["fsdp_version"] = fsdp_config.get("version")
     if not fsdp_version and fsdp_config and fsdp_config.get("fsdp_version"):
         data["fsdp_version"] = fsdp_config.get("fsdp_version")
     elif fsdp_version and fsdp_config and not fsdp_config.get("version"):
         data["fsdp_config"]["version"] = fsdp_version
     return data
tests/prompt_strategies/test_raw_io.py (1)

79-88: Token‑level expectations are very tokenizer‑version specific

The updated decode string, input_ids (e.g. 21558 for "hello"), and corresponding labels correctly reflect the new single‑token behavior, but they tightly couple this test to the exact tokenizer version and vocab. To reduce future breakage when upgrading models/tokenizers, consider:

  • Deriving specific ids via the tokenizer (e.g., with an encode call) instead of hard‑coding numeric ids.
  • Asserting key subsequences / structure (presence and ordering of segments and <eot> tokens) rather than the full list of ids where possible.

Also applies to: 102-116

src/axolotl/utils/mistral/mistral_tokenizer.py (1)

10-15: Base‑class switch to MistralCommonBackend looks consistent with the new backend flow

Sub‑classing MistralCommonBackend and updating the from_pretrained docstrings / error messages to reference it lines up with the rest of the PR’s mistral‑common backend changes. A couple of follow‑ups:

  • Please verify against the transformers version you pin that transformers.tokenization_mistral_common.MistralCommonBackend exists and that its __init__ / from_pretrained expectations are compatible with how HFMistralTokenizer is constructed here (in particular, the mode kwarg and tokenizer_path argument).
  • (Minor) The phrase [`MistralCommonBackend.tokenization_mistral_common.save_pretrained`] in the docstring is slightly odd; consider simplifying to reference MistralCommonBackend.save_pretrained if that’s what you mean.

Also applies to: 133-181

tests/prompt_strategies/test_chat_templates_advanced.py (1)

24-42: Prefer xfail/skip over commenting out the phi35_tokenizer case

Commenting out the ("phi35_tokenizer", "phi_35", None, "<|end|>") param avoids the current breakage but also removes coverage and makes it easy to forget. Consider keeping the case in PARAMETRIZE_PARAMS and marking it with pytest.mark.xfail (or a conditional skip) explaining the transformers v5 incompatibility. That way CI still tracks when it starts passing again.

src/axolotl/monkeypatch/accelerate/parallelism_config.py (1)

86-105: Consider guarding against missing cp_handler attribute.

If cp_backend is not "deepspeed", line 94 accesses self.parallelism_config.cp_handler.cp_comm_strategy. If cp_handler is None, this will raise AttributeError.

Consider adding a guard:

     def patched_prepare_cp(self, *args):
         if self.parallelism_config.cp_backend == "deepspeed":
             return args
 
+        if self.parallelism_config.cp_handler is None:
+            return args
+
         from accelerate.big_modeling import _attach_context_parallel_hooks
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3f4aa1 and d483aae.

📒 Files selected for processing (37)
  • .github/workflows/preview-docs.yml (1 hunks)
  • .github/workflows/tests.yml (3 hunks)
  • cicd/multigpu.sh (1 hunks)
  • docs/amd_hpc.qmd (1 hunks)
  • docs/installation.qmd (1 hunks)
  • requirements.txt (1 hunks)
  • src/axolotl/cli/checks.py (1 hunks)
  • src/axolotl/core/builders/base.py (3 hunks)
  • src/axolotl/core/builders/causal.py (1 hunks)
  • src/axolotl/loaders/patch_manager.py (0 hunks)
  • src/axolotl/loaders/processor.py (1 hunks)
  • src/axolotl/monkeypatch/accelerate/parallelism_config.py (1 hunks)
  • src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (4 hunks)
  • src/axolotl/monkeypatch/transformers/trainer_context_parallel.py (1 hunks)
  • src/axolotl/processing_strategies.py (2 hunks)
  • src/axolotl/prompt_strategies/chat_template.py (1 hunks)
  • src/axolotl/utils/callbacks/perplexity.py (1 hunks)
  • src/axolotl/utils/mistral/mistral_tokenizer.py (6 hunks)
  • src/axolotl/utils/schemas/fsdp.py (1 hunks)
  • src/axolotl/utils/schemas/validation.py (1 hunks)
  • src/axolotl/utils/trainer.py (1 hunks)
  • tests/e2e/integrations/test_cut_cross_entropy.py (1 hunks)
  • tests/e2e/integrations/test_hooks.py (1 hunks)
  • tests/e2e/multigpu/test_fp8_fsdp2.py (2 hunks)
  • tests/e2e/multigpu/test_fsdp1.py (1 hunks)
  • tests/e2e/multigpu/test_fsdp2.py (1 hunks)
  • tests/e2e/multigpu/test_gemma3.py (1 hunks)
  • tests/e2e/utils.py (1 hunks)
  • tests/hf_offline_utils.py (1 hunks)
  • tests/monkeypatch/test_mistral_tokenizer_patch.py (0 hunks)
  • tests/prompt_strategies/test_alpaca.py (3 hunks)
  • tests/prompt_strategies/test_chat_templates.py (2 hunks)
  • tests/prompt_strategies/test_chat_templates_advanced.py (1 hunks)
  • tests/prompt_strategies/test_raw_io.py (2 hunks)
  • tests/test_normalize_config.py (2 hunks)
  • tests/test_tokenizers.py (3 hunks)
  • tests/utils/schemas/validation/test_fsdp.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/monkeypatch/test_mistral_tokenizer_patch.py
  • src/axolotl/loaders/patch_manager.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-31T11:48:46.313Z
Learnt from: SalmanMohammadi
Repo: axolotl-ai-cloud/axolotl PR: 2994
File: index.qmd:7-8
Timestamp: 2025-07-31T11:48:46.313Z
Learning: In documentation workflow testing PRs, temporary test content like "test test" may be intentionally added to trigger documentation preview actions and validate workflow fixes, rather than being accidental debug text.

Applied to files:

  • .github/workflows/preview-docs.yml
📚 Learning: 2025-09-09T13:50:26.197Z
Learnt from: djsaunde
Repo: axolotl-ai-cloud/axolotl PR: 3067
File: examples/llama-3/diffusion-3.2-1b-sft.yaml:12-18
Timestamp: 2025-09-09T13:50:26.197Z
Learning: Token ID 128002 in Llama 3.2 tokenizer corresponds to `<|reserved_special_token_0|>`, which is one of 250 reserved special tokens designed for custom usage in fine-tuning applications like diffusion masking.

Applied to files:

  • tests/test_tokenizers.py
📚 Learning: 2025-09-22T22:14:35.531Z
Learnt from: gholmes829
Repo: axolotl-ai-cloud/axolotl PR: 3167
File: src/axolotl/utils/schemas/validation.py:819-834
Timestamp: 2025-09-22T22:14:35.531Z
Learning: In the axolotl codebase, validation methods maintain separation of concerns - early validators focus on core logic while `check_fsdp_config_kwargs_prefix` handles deprecated prefix normalization. This pattern should be preserved for consistency rather than mixing prefix handling into individual validators.

Applied to files:

  • tests/test_normalize_config.py
  • src/axolotl/utils/schemas/validation.py
🧬 Code graph analysis (10)
src/axolotl/utils/trainer.py (1)
src/axolotl/monkeypatch/accelerate/parallelism_config.py (1)
  • patch_prepare_cp (80-107)
src/axolotl/loaders/processor.py (1)
src/axolotl/utils/mistral/mistral_tokenizer.py (1)
  • HFMistralTokenizer (14-227)
tests/test_tokenizers.py (1)
tests/test_datasets.py (1)
  • tokenizer (31-35)
tests/e2e/integrations/test_hooks.py (1)
tests/e2e/utils.py (1)
  • check_model_output_exists (199-220)
tests/e2e/multigpu/test_fp8_fsdp2.py (1)
tests/e2e/utils.py (3)
  • most_recent_subdir (35-42)
  • require_torch_2_7_0 (81-90)
  • supports_fp8 (170-174)
src/axolotl/core/builders/causal.py (1)
src/axolotl/integrations/base.py (2)
  • cfg (339-340)
  • cfg (343-344)
tests/e2e/integrations/test_cut_cross_entropy.py (1)
tests/e2e/utils.py (1)
  • check_model_output_exists (199-220)
src/axolotl/processing_strategies.py (1)
src/axolotl/utils/mistral/mistral3_processor.py (1)
  • Mistral3Processor (27-170)
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (2)
src/axolotl/utils/mistral/mistral_tokenizer.py (1)
  • apply_chat_template (74-94)
src/axolotl/utils/mistral/mistral3_processor.py (1)
  • apply_chat_template (74-148)
tests/utils/schemas/validation/test_fsdp.py (2)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/config/__init__.py (1)
  • validate_config (259-305)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.1, vllm, 2, true)
  • GitHub Check: test-axolotl-multigpu (128, 12.8.1, 3.11, 2.9.0, fbgemm-gpu, 2, true)
  • GitHub Check: test-axolotl-multigpu (128, 12.8.1, 3.11, 2.8.0, fbgemm-gpu, 2, true)
  • GitHub Check: preview
  • GitHub Check: PyTest (3.11, 2.7.1)
  • GitHub Check: PyTest from Source Dist (3.11, 2.9.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
  • GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
  • GitHub Check: PyTest (3.11, 2.8.0)
  • GitHub Check: PyTest (3.11, 2.9.0)
🔇 Additional comments (34)
tests/hf_offline_utils.py (1)

16-16: Remove the commented import and function call or provide an explanation.

The reset_sessions function from huggingface_hub.utils has not been removed—it remains available in the library. Commenting out working code without explanation (on lines 16 and 23) should be avoided. Either:

  • Uncomment and restore the original functionality with a note explaining why it's needed
  • Remove the lines entirely if this functionality is no longer required for the test

If there's a specific issue with reset_sessions in the updated huggingface-hub version, document it in the code or commit message.

Also applies to: 23-23

src/axolotl/core/builders/base.py (1)

524-531: Alias from config include_tokens_per_second to HF include_num_input_tokens_seen is consistent

The new arg_map entry correctly routes the existing config field include_tokens_per_second into the updated transformers kwarg include_num_input_tokens_seen, preserving backward compatibility while aligning with the v5 API surface. This looks good, assuming the HF arg name is exactly include_num_input_tokens_seen and there’s no need to support a new config key yet.

src/axolotl/utils/schemas/fsdp.py (1)

15-18: LGTM!

The new version field is well-defined with appropriate type annotation, default value, and schema metadata.

tests/test_normalize_config.py (2)

131-131: LGTM!

The assertion change correctly reflects the new behavior where version is preserved in fsdp_config after migration, following the prefix normalization and bidirectional synchronization logic.


198-198: LGTM!

Consistent with the previous test update, correctly asserting that version is retained in fsdp_config after migration.

tests/utils/schemas/validation/test_fsdp.py (2)

16-25: LGTM!

The new test appropriately validates that fsdp_version is correctly derived from fsdp_config.version, exercising the upward synchronization logic.


27-38: LGTM!

The updated test correctly validates bidirectional synchronization, ensuring that when fsdp_version is set at the top level, it propagates to fsdp_config.version.

src/axolotl/utils/schemas/validation.py (1)

858-858: LGTM!

The updated condition correctly includes fsdp_version in the prefix normalization, stripping the fsdp_ prefix from all keys including fsdp_version to transform it to version.

tests/e2e/integrations/test_cut_cross_entropy.py (1)

13-13: Import path change to shared test utility looks good

Using tests.e2e.utils.check_model_output_exists keeps imports consistent across e2e tests and reduces relative‑import fragility.

tests/e2e/integrations/test_hooks.py (1)

14-14: Consistent use of central test utility module

Updating the import to tests.e2e.utils.check_model_output_exists aligns this test with the rest of the suite and should aid discoverability.

tests/e2e/multigpu/test_fsdp2.py (1)

153-156: Explicitly disabling LoRA kernels in baseline test is a good stabilization

Making test_lora_sft explicitly set all LoRA kernel flags to False isolates it from auto‑enable behavior and leaves test_lora_sft_kernels to cover the kernel‑enabled path. This should reduce hardware/driver‑dependent flakiness in the baseline FSDP2 LoRA test.

tests/e2e/multigpu/test_fp8_fsdp2.py (1)

14-15: Switch to supports_fp8 keeps FP8 gating centralized

Using supports_fp8 instead of a Hopper‑specific decorator centralizes FP8 hardware gating in tests.e2e.utils while leaving the smoke test logic unchanged. This should make future FP8 capability tweaks easier without touching individual tests.

Also applies to: 51-53

docs/installation.qmd (1)

168-168: Authentication command updated consistently.

The login command has been updated to hf auth login, consistent with the change in src/axolotl/cli/checks.py. This aligns with the HF CLI v2 command structure.

Please verify this command works with the updated huggingface_hub>=1.1.7 dependency (same verification as requested for src/axolotl/cli/checks.py).

requirements.txt (3)

16-16: Transformers installed from git main for v5 testing.

The transformers library is now installed directly from the main branch of the git repository. This aligns with the PR title "[WIP] Fix SP tests" and the branch name "transformers-v5-fix-accelerate-cp", indicating this is intentional for testing transformers v5 compatibility.

Since this is a WIP PR, installing from git main is appropriate for testing. However, verify that this doesn't introduce breaking changes by running the test suite. Consider pinning to a specific commit hash for reproducibility once testing is complete.


13-13: Verify compatibility with major huggingface_hub version update.

The huggingface_hub requirement has been updated from >=0.36.0 to >=1.1.7, which is a significant version jump. Please verify that all huggingface_hub APIs used in the codebase are compatible with v1.1.7 and that this update aligns with the HF CLI v2 migration referenced in this PR.


17-22: Multiple dependency version updates require manual verification.

Several key dependencies have been updated:

  • accelerate: 1.11.0 → 1.12.0
  • deepspeed: >=0.17.0 → >=0.18.2
  • trl: 0.25.0 → 0.25.1
  • kernels: >=0.9.0 → >=0.11.2

These updates align with the transformers v5 compatibility work mentioned in the PR. Compatibility between these versions should be verified, particularly regarding the kernels constraint change (>=0.9.0 to >=0.11.2) and whether any hardcoded version checks in the codebase need adjustment.

.github/workflows/preview-docs.yml (1)

30-33: Standard CI disk space optimization.

This cleanup step removes unnecessary system tools to free up disk space on the GitHub Actions runner before building documentation. This is a common and recommended practice, especially for workflows that may download large models or datasets.

The same cleanup step is consistently applied in .github/workflows/tests.yml, which is good for maintainability.

cicd/multigpu.sh (1)

5-5: Fast-fail added for quicker CI feedback.

The --maxfail=3 flag has been added, which will stop the test run after 3 failures. This speeds up CI feedback, especially useful for this WIP PR.

Note that this only affects the first pytest run (line 5). The solo and patched test runs (lines 12 and 17) will continue to run all tests, ensuring full coverage when the initial tests pass.

.github/workflows/tests.yml (4)

112-112: HF CLI download command updated.

The dataset download command has been updated from huggingface-cli download to hf download, consistent with the HF CLI v2 migration across the repository.

Verify this command syntax works with huggingface_hub>=1.1.7 (same verification requested for docs/amd_hpc.qmd).


116-116: Pytest parallelism reduced from -n8 to -n4.

The test parallelism has been reduced in two places:

  • Line 116: Main test run
  • Line 195: Source dist test run

This reduction may be to address stability issues or resource constraints on CI runners.

Please verify whether this change is:

  1. Addressing flakiness/race conditions in tests with high parallelism
  2. Accommodating resource limitations on CI runners
  3. Or a temporary measure for debugging

If this is addressing test stability, consider documenting the reason in a comment for future reference.

Also applies to: 195-195


191-191: HF cache command updated.

The command to show HF cache has been updated from huggingface-cli scan-cache to hf cache ls, consistent with the HF CLI v2 migration.


200-200: Job dependency changes optimize CI execution flow.

The dependency structure has been adjusted:

  • Line 200: gate-skip-e2e now only depends on [pre-commit] (removed pytest)
  • Line 236: docker-e2e-tests-1st now depends on [pre-commit, pytest] (added pytest)

This change allows the gate-skip-e2e job to run in parallel with pytest, while ensuring that expensive E2E tests only run after unit tests pass. This is a good optimization for CI cost and execution time.

The logic ensures:

  1. Gate evaluation can happen quickly without waiting for pytest
  2. E2E tests wait for both pre-commit and pytest to pass
  3. Expensive GPU resources are only used after cheaper validation passes

Also applies to: 236-236

src/axolotl/cli/checks.py (1)

47-47: The error message correctly instructs users to use hf auth login, which is the official authentication command for huggingface_hub 1.1.7 and later as documented by HuggingFace. No changes needed.

tests/test_tokenizers.py (1)

20-42: Avoid hard‑coding tokenizer‑specific IDs in expectations

The updated expectation to [1, 32000, 1792] will be fragile to future tokenizer / added_tokens changes. To make this test more robust, derive the id dynamically from the tokenizer rather than hard‑coding 1792 (e.g., assert that the third element equals the tokenizer’s id for "<|im_start|>", or that len(tokenizer) increased by 1 and that encoding "<|im_start|>user" round‑trips as expected).

[ suggest_recommended_refactor ]

Also applies to: 78-88

src/axolotl/utils/callbacks/perplexity.py (1)

10-14: Conditional PreTrainedTokenizer import looks good

The try/except around transformers.tokenization_python with fallback to transformers.tokenization_utils is a reasonable way to straddle older and newer transformers versions without changing the rest of the logic. Please just double‑check against the exact transformers versions you intend to support that the symbol is available in one of these two locations.

src/axolotl/prompt_strategies/chat_template.py (1)

147-152: Tokenized apply_chat_template usage aligns with strategy expectations

Switching the non‑processor path to:

self.tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=False,
    **chat_template_kwargs,
)

matches what _tokenize_single_prompt expects (a list[int] in the simple path) and is consistent with the updated tests. Just ensure all supported tokenizer backends (including mistral‑common via HFMistralTokenizer) support tokenize/return_dict in apply_chat_template under the transformers versions you target.

src/axolotl/utils/trainer.py (1)

622-640: Context‑parallel patch hook is wired correctly; validate against runtime stack

Hooking patch_prepare_cp() when context_parallel_size > 1 after setting:

  • PARALLELISM_CONFIG_CP_SIZE
  • ACCELERATE_ALLOW_CP_STANDALONE

is a reasonable place to integrate the custom _prepare_cp logic shown in the monkeypatch. Since this path depends on specific accelerate and torch.distributed.tensor.experimental APIs, please sanity‑check on the full range of versions/setups you intend to support (e.g., CP backend "deepspeed" vs "torch", and older torch without CP) to ensure no unexpected ImportError/attribute errors at runtime.

tests/prompt_strategies/test_alpaca.py (1)

71-120: Token ID updates are internally consistent across all three test cases.

The updated token IDs are consistent:

  • input_ids in test_no_double_im_end match labels in test_w_train_on_input
  • The -100 label counts in test_no_train_on_input correctly correspond to the token counts in the other tests

These changes align with the broader tokenization updates across the PR for transformers v5 compatibility.

src/axolotl/loaders/processor.py (1)

24-36: Patch target correctly updated to MistralCommonBackend.

The change from MistralCommonTokenizer to MistralCommonBackend is consistent with:

  1. HFMistralTokenizer now inheriting from MistralCommonBackend
  2. The monkeypatch in mistral_common_tokenizer.py targeting the same class

The docstring accurately describes the purpose for transformers v5 compatibility.

tests/prompt_strategies/test_chat_templates.py (1)

140-172: Phi-3.5 token ID updates are consistent between input_ids and labels.

The updated token IDs for the phi-3.5 test are internally consistent:

  • expected_labels line 159 correctly mirrors expected_input_ids line 149 for the trained assistant output
  • All non-trained positions correctly use -100

These changes align with the broader token ID updates across the PR.

src/axolotl/monkeypatch/accelerate/parallelism_config.py (1)

90-92: Experimental PyTorch APIs require version compatibility monitoring.

The imports torch.distributed.tensor.experimental.context_parallel and torch.distributed.tensor.experimental._attention.set_rotate_method are explicitly experimental/prototype APIs. Per PyTorch documentation, these are subject to change across versions, making this code sensitive to PyTorch version upgrades. Ensure PyTorch version is pinned in dependencies or add compatibility checks to handle potential API breaking changes in future releases.

src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (1)

14-19: Patch target updated to MistralCommonBackend - implementation details unverified.

The change from MistralCommonTokenizer to MistralCommonBackend aligns with the class hierarchy updates. MistralCommonBackend.apply_chat_template exists in the transformers library and does support image handling for multimodal inputs.

However, the exact string being patched (pixel_values = torch.tensor(images)) could not be verified against the transformers source code. If this string does not match the actual implementation in the specific transformers version being used, the patch will silently fail without raising an error (line 85's warning path). Verify the patch target string against the transformers version used in this project.

src/axolotl/processing_strategies.py (2)

431-431: Appropriate removal of type annotation to avoid import-time issues.

The removal of the Mistral3Processor type annotation aligns with moving the import to function scope (lines 463-464), which is a standard pattern for avoiding circular dependencies or optional dependency issues during module initialization.


463-464: LGTM: Local import pattern avoids import-time dependencies.

Moving the Mistral3Processor import to function scope is an appropriate pattern for deferring imports and avoiding circular dependencies or issues with optional dependencies. This change is consistent with the untyped processor parameter at line 431.

Comment thread docs/amd_hpc.qmd

```bash
huggingface-cli download meta-llama/Meta-Llama-3.1-8B --local-dir ~/hfdata/llama3.1-8B
hf download meta-llama/Meta-Llama-3.1-8B --local-dir ~/hfdata/llama3.1-8B

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd requirements.txt

Repository: axolotl-ai-cloud/axolotl

Length of output: 258


🌐 Web query:

huggingface_hub version 1.1.7 hf download command syntax documentation

💡 Result:

Syntax (CLI, huggingface_hub v1.1.7)

  • Basic: hf download <repo_id> [filename1 filename2 ...]
    • Example: hf download gpt2 config.json
  • Download multiple files (shows progress and returns snapshot path): hf download <repo_id> file1 file2
  • Download to a local folder: --local-dir
    • Example: hf download adept/fuyu-8b model-00001-of-00002.safetensors --local-dir fuyu. [1]

Common options

  • --cache-dir (specify custom cache directory).
  • --token <hf_token> (use explicit token for private/gated repos).
  • --quiet (only print final path).
  • Dry-run and other CLI helpers are available in the CLI guide. For download timeouts, use HF_HUB_DOWNLOAD_TIMEOUT environment variable. [1][2]

References
[1] Hugging Face Hub — Download guide and CLI (v1.1.7).
[2] Hugging Face Hub — CLI guide (hf download options).


Correct the hf download command syntax for huggingface_hub v1.1.7.

The command on line 89 omits required filenames. According to huggingface_hub v1.1.7 documentation, hf download requires specific filenames to be listed: hf download <repo_id> file1 file2 --local-dir <path>. Either specify the model files to download (e.g., model.safetensors, config.json) or verify if a different approach is needed for downloading entire repositories.

🤖 Prompt for AI Agents
In docs/amd_hpc.qmd around line 89, the hf download command is missing required
filenames for huggingface_hub v1.1.7; replace the single-repo argument with an
explicit list of files to fetch (for example the model binary like
model.safetensors or pytorch_model.bin, config.json, tokenizer files) or change
the approach to download the whole repo via the appropriate huggingface_hub
helper; update the command to include the exact filenames you want to download
and keep --local-dir <path> as the destination.

Comment thread tests/e2e/utils.py
Comment on lines +170 to +174
def supports_fp8(test_case):
compute_capability = torch.cuda.get_device_capability()
return unittest.skipUnless(
compute_capability >= (9, 0), "test requires h100 or newer GPU"
)(test_case)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard supports_fp8 against non‑CUDA environments to prevent errors

Currently supports_fp8 calls torch.cuda.get_device_capability() unconditionally. On CPU-only or non-CUDA PyTorch builds this will raise a RuntimeError instead of cleanly skipping the test, unlike the existing skip_if_no_cuda pattern in the file which guards with torch.cuda.is_available().

Align with the established guard pattern so FP8 tests are properly skipped when CUDA is unavailable.

Suggested change:

def supports_fp8(test_case):
+    if not torch.cuda.is_available():
+        return unittest.skip("CUDA not available")(test_case)
    compute_capability = torch.cuda.get_device_capability()
    return unittest.skipUnless(
        compute_capability >= (9, 0), "test requires h100 or newer GPU"
    )(test_case)
🤖 Prompt for AI Agents
In tests/e2e/utils.py around lines 170-174, the supports_fp8 decorator calls
torch.cuda.get_device_capability() unconditionally which raises on non-CUDA
builds; first guard with torch.cuda.is_available() like the existing
skip_if_no_cuda pattern, e.g. if not torch.cuda.is_available(): return
unittest.skip("CUDA not available")(test_case), otherwise fetch
compute_capability and then return unittest.skipUnless(compute_capability >= (9,
0), "test requires h100 or newer GPU")(test_case); this ensures tests are
cleanly skipped on CPU-only environments and still checks capability when CUDA
is present.

@github-actions

github-actions Bot commented Dec 8, 2025

Copy link
Copy Markdown
Contributor

📖 Documentation Preview: https://693747a8b9f13a5dc02143dd--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit d483aae

@codecov

codecov Bot commented Dec 8, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...olotl/monkeypatch/accelerate/parallelism_config.py 0.00% 18 Missing ⚠️
src/axolotl/utils/trainer.py 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@winglian winglian changed the base branch from main to transformers-v5 December 9, 2025 14:34
@winglian winglian merged commit adae23f into transformers-v5 Dec 16, 2025
20 of 21 checks passed
@winglian winglian deleted the transformers-v5-fix-accelerate-cp branch December 16, 2025 14:25
winglian pushed a commit that referenced this pull request Dec 17, 2025
winglian pushed a commit that referenced this pull request Dec 30, 2025
winglian added a commit that referenced this pull request Dec 30, 2025
* upgrade dependencies

* don't use reset sessions

* downgrade transformers, upgrade other deps

* upgrade bnb to 0.49.0

* restore s3 cache

* explicit use local files w hub

* decompress and strip top level dir

* use 2 levels for strip components

* try to preserve permissions for symlinks

* use updated tar

* fix #3293 for distributed

* downgrade bnb

* fast fail after 4

* fix total tokens device

* patch accelerate CP/SP (#3309)

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
@coderabbitai coderabbitai Bot mentioned this pull request Jan 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants