CLI: add --launcher option, support launcher args, cleanup, refactor by djsaunde · Pull Request #2924 · axolotl-ai-cloud/axolotl

djsaunde · 2025-07-15T14:07:22Z

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

New Features
- CLI commands now support a unified --launcher option for selecting between "accelerate", "torchrun", or "python", with the ability to pass launcher-specific arguments using --.
- Enhanced documentation for multi-GPU and multi-node training, including clear instructions and examples for using launcher arguments.
- Improved handling and forwarding of extra CLI arguments to launchers.
Bug Fixes
- Resolved deprecated --no-accelerate usage in tests and documentation, ensuring compatibility with new launcher options.
Refactor
- Streamlined CLI command logic for training, evaluation, inference, and weight merging to use a centralized launcher mapping and argument handling.
- Internal restructuring of utility functions for argument parsing, command building, model loading, and GitHub file syncing.
- Removed deprecated environment variable loading and CUDA VRAM optimization calls from CLI scripts.
- Improved code readability and variable naming in configuration handling.
Documentation
- Updated CLI and multi-node training docs to clarify launcher usage and argument passing.
- Removed outdated examples and improved consistency across documentation.
- Reorganized CLI API documentation structure to include new modules and group utility submodules.
Tests
- Expanded test coverage for launcher argument handling, including various launcher types and argument combinations.
- Updated tests to validate new CLI interface and ensure backward compatibility.
- Added utility test helpers for launcher argument validation and rendezvous argument defaults.
Chores
- Removed obsolete utility module and replaced with organized utility submodules.
- Cleaned up import statements and type annotations for clarity and conciseness.

coderabbitai · 2025-07-15T14:07:28Z

📝 Walkthrough

Walkthrough

This update introduces a unified and extensible launcher argument interface for the Axolotl CLI, replacing the previous accelerate/no-accelerate boolean flag with a --launcher option supporting "accelerate", "torchrun", or "python" modes. The change is reflected in CLI code, documentation, cloud integration, and tests. Utilities for CLI argument handling and command building are refactored and modularized, and environment variable loading via dotenv is removed from CLI scripts.

Changes

Files/Groups	Change Summary
docs/cli.qmd, docs/multi-node.qmd	Documentation updated to explain new launcher argument interface, usage examples, and deprecate `--no-accelerate`.
src/axolotl/cli/args.py	Removes `main_process_port` and `num_processes` from `TrainerCliArgs`.
src/axolotl/cli/cloud/init.py, src/axolotl/cli/cloud/base.py, src/axolotl/cli/cloud/modal_.py	Refactors cloud training interface: replaces `accelerate` boolean with `launcher` (Literal), adds `launcher_args`, updates type annotations, and adapts command construction.
src/axolotl/cli/config.py	Refactors config update logic for clarity; no functional change.
src/axolotl/cli/main.py	Refactors CLI commands to use `--launcher`, centralizes launcher logic, updates argument forwarding, and moves environment patching and dotenv loading to CLI group level.
src/axolotl/cli/evaluate.py, src/axolotl/cli/inference.py, src/axolotl/cli/merge_lora.py, src/axolotl/cli/merge_sharded_fsdp_weights.py, src/axolotl/cli/preprocess.py, src/axolotl/cli/train.py, src/axolotl/cli/delinearize_llama4.py	Removes dotenv loading and, where applicable, CUDA VRAM optimization patching.
src/axolotl/cli/utils.py	Deleted: previously contained CLI utility functions, now modularized.
src/axolotl/cli/utils/init.py, src/axolotl/cli/utils/args.py, src/axolotl/cli/utils/fetch.py, src/axolotl/cli/utils/load.py, src/axolotl/cli/utils/train.py	New utility modules: provide argument decorators, command building, config generation, launcher dispatch, model loading, and GitHub fetch helpers.
tests/cli/test_cli_base.py	Updates test helpers and expectations for new launcher interface and argument formatting.
tests/cli/test_cli_evaluate.py, tests/cli/test_cli_inference.py, tests/cli/test_cli_merge_sharded_fsdp_weights.py, tests/cli/test_cli_train.py	Extends tests to cover new launcher argument patterns, argument passing, and backward compatibility.
tests/cli/test_cli_interface.py	Updates assertions for new CLI option formatting and error messages.
tests/cli/test_cli_sweeps.py	Updates import to new utility location.
tests/cli/test_utils.py	Adds new tests for launcher argument handling and rendezvous argument defaults.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Complexity: Complex. The changes span CLI architecture, cloud integration, utility refactoring, and extensive test updates, requiring careful review for interface consistency, backward compatibility, and correct argument handling.

Suggested labels

ready to merge

Suggested reviewers

winglian

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch cli-update

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

github-actions · 2025-07-15T14:12:45Z

📖 Documentation Preview: https://688279ae06edc29f170a8388--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit c9adb88

codecov · 2025-07-15T14:15:30Z

Codecov Report

❌ Patch coverage is 84.81013% with 36 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/cli/utils/train.py	82.92%	14 Missing ⚠️
src/axolotl/cli/cloud/modal_.py	18.18%	9 Missing ⚠️
src/axolotl/cli/utils/fetch.py	85.48%	9 Missing ⚠️
src/axolotl/cli/utils/load.py	88.88%	2 Missing ⚠️
src/axolotl/cli/cloud/__init__.py	66.66%	1 Missing ⚠️
src/axolotl/cli/config.py	80.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (7)

src/axolotl/cli/utils/args.py (3)
23-26: Fix potential type checking issue with NoneType instances.

The current logic uses isinstance(t, NoneType) which could be problematic. Consider using t is type(None) or t is NoneType for more reliable type checking.
-        field_type = next(
-            t for t in get_args(field_type) if not isinstance(t, NoneType)
-        )
+        field_type = next(
+            t for t in get_args(field_type) if t is not type(None)
+        )
68-68: Use isinstance() for type comparisons.

The static analysis correctly identifies that direct type comparison should use isinstance() instead of ==.
-        if field_type == bool:
+        if field_type is bool:
106-106: Use isinstance() for type comparisons.

Same issue as line 68 - use identity comparison for type checking.
-        if field_type == bool:
+        if field_type is bool:
src/axolotl/cli/utils/fetch.py (2)
58-59: Consider making timeout configurable.

The hardcoded 30-second timeout might not be suitable for all network conditions or file sizes.
-    response = requests.get(raw_url, timeout=30)
+    response = requests.get(raw_url, timeout=timeout or 30)
And add timeout: int | None = None to the function parameters.

87-89: Add response validation for GitHub API.

Consider validating the API response structure to handle potential API changes or errors more gracefully.
    response.raise_for_status()
    tree = json.loads(response.text)
+    
+    if "tree" not in tree:
+        raise click.ClickException("Invalid GitHub API response structure")
src/axolotl/cli/cloud/modal_.py (1)
239-240: Consider combining nested with statements.

The static analysis correctly suggests combining the nested with statements for cleaner code.
-        with modal.enable_output():
-            with self.app.run(detach=True):
+        with modal.enable_output(), self.app.run(detach=True):
src/axolotl/cli/utils/train.py (1)
14-43: Move UUID import to module level.

Consider moving the uuid import to the top of the file for better code organization and to avoid repeated imports if this function is called multiple times.
 import os
 import subprocess  # nosec
 import tempfile
+import uuid
 from pathlib import Path
 from typing import Any, Literal

 import yaml
Then remove line 38:
-        import uuid
-
         args.extend(["--rdzv_id", str(uuid.uuid4())[:8]])

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1407aac and 4aeb06c.

📒 Files selected for processing (29)

docs/cli.qmd (3 hunks)
docs/multi-node.qmd (1 hunks)
src/axolotl/cli/args.py (0 hunks)
src/axolotl/cli/cloud/__init__.py (4 hunks)
src/axolotl/cli/cloud/base.py (2 hunks)
src/axolotl/cli/cloud/modal_.py (3 hunks)
src/axolotl/cli/config.py (1 hunks)
src/axolotl/cli/delinearize_llama4.py (0 hunks)
src/axolotl/cli/evaluate.py (0 hunks)
src/axolotl/cli/inference.py (0 hunks)
src/axolotl/cli/main.py (11 hunks)
src/axolotl/cli/merge_lora.py (0 hunks)
src/axolotl/cli/merge_sharded_fsdp_weights.py (0 hunks)
src/axolotl/cli/preprocess.py (0 hunks)
src/axolotl/cli/train.py (0 hunks)
src/axolotl/cli/utils.py (0 hunks)
src/axolotl/cli/utils/__init__.py (1 hunks)
src/axolotl/cli/utils/args.py (1 hunks)
src/axolotl/cli/utils/fetch.py (1 hunks)
src/axolotl/cli/utils/load.py (1 hunks)
src/axolotl/cli/utils/train.py (1 hunks)
tests/cli/test_cli_base.py (3 hunks)
tests/cli/test_cli_evaluate.py (5 hunks)
tests/cli/test_cli_inference.py (3 hunks)
tests/cli/test_cli_interface.py (2 hunks)
tests/cli/test_cli_merge_sharded_fsdp_weights.py (1 hunks)
tests/cli/test_cli_sweeps.py (1 hunks)
tests/cli/test_cli_train.py (5 hunks)
tests/cli/test_utils.py (1 hunks)

💤 Files with no reviewable changes (9)

src/axolotl/cli/delinearize_llama4.py
src/axolotl/cli/merge_sharded_fsdp_weights.py
src/axolotl/cli/inference.py
src/axolotl/cli/merge_lora.py
src/axolotl/cli/preprocess.py
src/axolotl/cli/args.py
src/axolotl/cli/train.py
src/axolotl/cli/evaluate.py
src/axolotl/cli/utils.py

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

docs/multi-node.qmd (1)

Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

docs/cli.qmd (1)

Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

tests/cli/test_cli_inference.py (1)

Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

src/axolotl/cli/main.py (1)

Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

🧬 Code Graph Analysis (5)

tests/cli/test_cli_sweeps.py (1)

src/axolotl/cli/utils/sweeps.py (1)

generate_sweep_configs (8-77)

src/axolotl/cli/utils/load.py (5)

src/axolotl/loaders/processor.py (1)

load_processor (17-56)

src/axolotl/loaders/tokenizer.py (1)

load_tokenizer (122-303)

src/axolotl/loaders/model.py (1)

ModelLoader (63-804)

src/axolotl/utils/dict.py (1)

DictDefault (6-38)

src/axolotl/utils/logging.py (1)

get_logger (42-49)

tests/cli/test_cli_merge_sharded_fsdp_weights.py (2)

src/axolotl/cli/main.py (1)

cli (44-48)

tests/cli/conftest.py (2)

config_path (32-37)

cli_runner (22-23)

tests/cli/test_cli_inference.py (2)

tests/cli/conftest.py (2)

config_path (32-37)

cli_runner (22-23)

src/axolotl/cli/main.py (1)

cli (44-48)

src/axolotl/cli/cloud/modal_.py (1)

cicd/multigpu.py (1)

run_cmd (61-66)

🪛 Ruff (0.12.2)

src/axolotl/cli/utils/args.py

68-68: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

106-106: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

src/axolotl/cli/cloud/modal_.py

239-240: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: preview

🔇 Additional comments (58)

tests/cli/test_cli_sweeps.py (1)

5-5: LGTM: Import path correctly updated for refactored CLI utilities.

The import change aligns with the modularization where generate_sweep_configs was moved from axolotl.cli.main to axolotl.cli.utils.sweeps and re-exported through the utils package.

tests/cli/test_cli_interface.py (2)

21-25: LGTM: Test correctly reflects new CLI argument formatting.

The updated assertions properly test the new unified argument format using equals signs (e.g., --learning-rate=0.0001) instead of separate option/value pairs, which aligns with the CLI refactoring.

40-40: LGTM: Error message assertion updated for new CLI framework.

The expected error message correctly reflects the updated CLI error reporting where invalid options now show "does not exist" instead of "No such option".

src/axolotl/cli/config.py (1)

200-206: LGTM: Improved variable naming and comment formatting.

The refactoring enhances code readability by using descriptive variable names (key, value instead of k, _) while preserving the original logic for configuration parameter handling.

docs/cli.qmd (3)

26-38: LGTM: Excellent documentation for launcher argument forwarding.

The new section clearly explains the -- separator convention for passing launcher-specific arguments, with practical examples for both torchrun and accelerate. This will help users understand the new launcher interface.

97-101: LGTM: Train command examples updated for new launcher interface.

The documentation correctly shows the new --launcher python option replacing --no-accelerate and provides clear examples of launcher argument forwarding using the -- separator.

196-198: LGTM: Consistent launcher argument documentation for evaluate command.

The added example maintains consistency with the train command documentation and shows proper usage of launcher arguments for multi-GPU evaluation.

src/axolotl/cli/cloud/base.py (1)

6-6: LGTM: Well-designed abstract interface update for launcher flexibility.

The updated train method signature effectively replaces the boolean accelerate parameter with a flexible launcher parameter using proper Literal typing, while adding support for launcher arguments and additional configuration options. This provides a more extensible interface for cloud training implementations.

Also applies to: 19-26

docs/multi-node.qmd (2)

72-77: Documentation clearly presents the new launcher interface.

The updated documentation effectively introduces the new --launcher torchrun syntax while maintaining the legacy approach for reference. The structure with "Option 1 (Recommended)" and "Option 2 (Legacy)" provides clear guidance to users.

84-92: Good addition of variable substitution guidance.

The note about substituting placeholder variables helps prevent common user errors, and the explanation of the CLI approach benefits reinforces the value of the new launcher interface.

src/axolotl/cli/utils/load.py (1)

20-52: Well-designed utility function for model loading.

The load_model_and_tokenizer function provides a clean interface for loading models, tokenizers, and processors. The implementation follows good practices:

Clear type hints and documentation

Logical loading sequence with appropriate logging

Proper handling of multimodal models

Consistent with existing axolotl patterns

src/axolotl/cli/utils/__init__.py (1)

1-23: Clean package initialization with good modular organization.

The __init__.py effectively exposes utilities from focused submodules, replacing the previous monolithic approach. The __all__ list properly documents the public API, and the imports are well-organized by functionality.

tests/cli/test_cli_inference.py (4)

15-15: Properly updated to use new launcher interface.

The test correctly replaces the deprecated --no-accelerate flag with --launcher python, aligning with the CLI refactoring.

36-63: Comprehensive test for torchrun launcher arguments.

This test effectively verifies that launcher arguments are correctly passed to torchrun, including proper command construction and argument placement.

65-93: Thorough verification of accelerate launcher integration.

The test properly validates accelerate launcher argument handling, ensuring the accelerate launch command is constructed correctly with forwarded arguments.

125-148: Good backward compatibility test.

This test ensures that using a launcher without additional arguments doesn't introduce unintended parameters, which is important for maintaining clean command execution.

tests/cli/test_cli_base.py (2)

20-27: Correctly updated validation tests for new launcher interface.

The validation tests properly use --launcher python instead of the deprecated --no-accelerate flag, maintaining consistent error handling verification.

30-70: Well-designed enhancement to basic execution test.

The addition of the train parameter and updated expected command construction provides better test flexibility and more precise verification of subprocess calls. The explicit debug flags and conditional shard flag properly reflect the CLI implementation.

src/axolotl/cli/utils/args.py (1)

52-87: Well-structured decorator for dataclass CLI option generation.

The implementation correctly handles boolean flags with paired options and processes fields in reverse order for proper Click option ordering. The metadata usage for help text is a nice touch.

src/axolotl/cli/utils/fetch.py (2)

40-41: Git blob SHA calculation is correctly implemented.

The implementation properly follows Git's blob object format with the "blob {size}\0{content}" structure for SHA-1 calculation. Using usedforsecurity=False is appropriate for content integrity checking.

114-135: Well-implemented concurrent download with proper error handling.

The ThreadPoolExecutor usage is correct with proper resource management via context manager. Error handling covers both request exceptions and IO errors, with results properly collected and categorized.

tests/cli/test_cli_merge_sharded_fsdp_weights.py (4)

14-16: Good update to use new launcher syntax.

The test correctly replaces the deprecated --no-accelerate flag with --launcher python, maintaining the same functionality while adopting the new unified launcher interface.

23-52: Comprehensive test for torchrun launcher argument forwarding.

The test properly verifies that launcher arguments are correctly passed through to the torchrun command, including checking for the module flag and specific torchrun arguments.

46-51: Thorough command verification logic.

The assertions correctly verify the command structure: torchrun executable, launcher arguments, module flag, and target module. This ensures the subprocess command is built correctly.

110-111: Smart backward compatibility verification.

The test ensures no launcher arguments leak into the command when none are provided, preventing argument contamination. The slice logic correctly isolates the launcher section between 'launch' and '-m'.

tests/cli/test_cli_evaluate.py (4)

23-25: Good parameterization of base test method.

Adding the train=False parameter properly distinguishes evaluation tests from training tests, improving test organization and clarity.

41-43: Consistent adoption of new launcher syntax.

The test correctly updates from --no-accelerate to --launcher python, maintaining the same functionality while using the new unified interface.

76-108: Comprehensive launcher argument forwarding test.

The test thoroughly verifies that torchrun launcher arguments are properly passed through, including checking command structure and specific arguments. The verification logic is robust.

172-175: Effective backward compatibility verification.

The test correctly ensures no unintended launcher arguments are included when none are provided, using proper slice logic to isolate and verify the launcher section.

src/axolotl/cli/cloud/modal_.py (4)

233-235: Good type safety with Literal types.

The use of Literal["accelerate", "torchrun", "python"] provides excellent type safety for the launcher parameter, preventing invalid launcher values at the type level.

275-281: Updated function signature aligns with new launcher interface.

The function signature correctly adopts the new launcher parameter pattern, maintaining consistency with the broader CLI refactoring.

287-301: Secure command construction with proper argument handling.

The command construction properly handles different launcher types and safely incorporates launcher arguments. The use of launcher_args or [] provides good default handling.

302-306: Command invocation is safe from shell injection.

The run_cmd implementation uses subprocess.call(cmd.split(), …) without shell=True, so there’s no shell interpretation of the command string and no risk of injection via special characters.

• cmd.split() produces a list of arguments passed directly to the OS executable.
• Even if an argument contains characters like ; or &&, they’re treated as literals, not shell operators.

tests/cli/test_utils.py (3)

77-111: LGTM! Well-structured test utility for launcher verification.

The helper function provides clear validation for launcher arguments in subprocess calls with good error messages.

142-149: LGTM! Clean fixture for test data.

Good practice to centralize test data in a fixture for reuse across tests.

151-236: LGTM! Comprehensive test coverage for RDZV argument handling.

The tests thoroughly cover all scenarios:

Adding defaults when endpoint is present

Preserving existing backend/id

Not adding args when endpoint is missing

Handling when all args are already present

Good use of descriptive test names and assertions.

src/axolotl/cli/cloud/__init__.py (2)

6-6: LGTM! Modern type annotation syntax.

Good update to use the more concise pipe syntax for union types, which is cleaner and more readable.

Also applies to: 14-14, 22-25

33-54: LGTM! Clean integration of launcher parameters.

The function signature properly extends to support the new launcher options while maintaining backward compatibility with the default "accelerate" launcher.

tests/cli/test_cli_train.py (7)

44-46: LGTM! Correct migration to new launcher syntax.

The test properly updates from the deprecated --no-accelerate flag to the new --launcher python option.

67-79: LGTM! Test maintains its purpose with new syntax.

The test correctly validates CLI overrides while using the new launcher option.

81-113: LGTM! Well-structured test for torchrun launcher arguments.

The test properly validates that launcher-specific arguments are correctly passed to the subprocess command.

114-147: LGTM! Comprehensive test for accelerate launcher.

The test properly validates the accelerate launch command structure including the "launch" subcommand.

148-181: LGTM! Important backward compatibility test.

This test ensures that existing usage patterns continue to work without unintended launcher arguments.

182-216: LGTM! Validates mixed argument handling.

The test properly ensures that both regular CLI arguments and launcher-specific arguments are correctly included in the subprocess command.

217-251: LGTM! Cloud integration test.

The test ensures that cloud training properly receives and forwards launcher configuration.

src/axolotl/cli/utils/train.py (6)

45-66: LGTM! Clean command building utility.

The function properly handles None values and follows standard CLI formatting conventions.

93-112: LGTM! Well-structured dispatcher function.

Clean separation of concerns with appropriate delegation to launcher-specific functions.

114-135: LGTM! Clean cloud training integration.

The function properly handles the cloud training case with appropriate parameter forwarding.

137-164: LGTM! Good backward compatibility handling.

The function properly extracts legacy launcher arguments while supporting new launcher args. The use of check=True ensures errors are propagated.

Note: Ensure the # nosec B603 suppression is justified by validating that all inputs to the subprocess command are properly sanitized.

166-188: LGTM! Consistent launcher implementations.

Both _launch_torchrun_training and _launch_python_training follow the established pattern. Good use of the RDZV default arguments for torchrun.

4-4: Subprocess calls are safe with list‐form invocation (shell=False).

The build_command function constructs the cmd list and subprocess.run(cmd, check=True) is invoked without shell=True, so there’s no shell interpolation of user input. Each option is passed as a separate argument ("--key=value"), preventing injection via shell metacharacters. No further validation is required here.

src/axolotl/cli/main.py (7)

47-49: LGTM! Proper initialization order.

Good practice to load environment variables and apply optimizations at CLI group initialization.

57-76: LGTM! Cleaner implementation.

Good refactoring to remove redundant environment patching since it's now handled at the CLI group level.

78-135: LGTM! Well-implemented launcher support.

The train command properly:

Accepts launcher arguments after -- separator

Handles Ray launcher override appropriately

Delegates to utility functions for cleaner code

Maintains backward compatibility with default "accelerate"

Good error handling for sweep failures.

137-179: LGTM! Consistent implementation pattern.

The evaluate command follows the same pattern as train, properly handling launcher arguments and using the command mapping.

181-227: LGTM! Inference command properly implemented.

The command correctly handles both launcher arguments and the gradio flag.

229-273: LGTM! Consistent command implementation.

The merge_sharded_fsdp_weights command follows the established pattern for launcher support.

275-350: LGTM! Appropriate handling of single-GPU commands.

The remaining commands (merge_lora, fetch, vllm_serve, quantize, delinearize_llama4) correctly don't include launcher options as they don't require multi-GPU support. Type annotations have been properly updated.

djsaunde · 2025-07-24T18:16:41Z

FYI, would need to manually test multi-node for this to be merge ready.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

_quarto.yml (2)
34-46: Keep CLI contents alphabetically sorted for long-term maintainability

cli.art, cli.delinearize_llama4, and the repositioned cli.quantize break the otherwise alphabetical ordering in this block (cli.args → cli.art → cli.checks, …). While the generator does not require ordering, keeping it sorted makes future diffs cleaner and lowers merge-conflict frequency.

No functional impact, but consider re-ordering the new entries alphabetically.

50-55: Potential duplicate docs: listing both cli.utils and its sub-modules will create redundant pages

Quarto treats each dotted path as a module; documenting cli.utils plus cli.utils.args/fetch/load/sweeps/train will generate one page for the parent and another for each child.
If cli.utils.__init__ re-exports everything (common in namespace packages) the parent page will repeat the sub-modules’ content, leading to noisy, duplicated docs.

Verify the intended structure:
-        - cli.utils
-        - cli.utils.args
-        - cli.utils.fetch
-        - cli.utils.load
-        - cli.utils.sweeps
-        - cli.utils.train
+        # Either keep only the parent …
+        - cli.utils
+        # … or list only the granular modules, not both
Choose one style and keep it consistent across sections.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4aeb06c and c9adb88.

📒 Files selected for processing (1)

_quarto.yml (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: pre-commit
GitHub Check: preview
GitHub Check: pre-commit

winglian · 2025-07-25T01:22:43Z

manually triggered multi-gpu since that uses the cli to launch tests https://github.com/axolotl-ai-cloud/axolotl/actions/runs/16511480037

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

src/axolotl/cli/utils/args.py (2)
52-87: Fix type comparison for better practices.

The function logic for generating Click options from dataclass fields is solid, but the type comparison should be improved.
-            if field_type == bool:
+            if field_type is bool:
Using is for type comparisons is more explicit and follows Python best practices for singleton types like bool.

90-120: Fix type comparison and consider consistency with dataclass version.

The Pydantic model option generation follows good patterns, but has the same type comparison issue.
-            if field_type == bool:
+            if field_type is bool:
Also note that this function sets default=None for all options while the dataclass version uses field.default. Consider if this difference in behavior is intentional.
src/axolotl/cli/main.py (1)
280-280: Minor: Removal of return type annotations.

The return type annotations were removed from these function signatures. While not incorrect, it reduces type safety information that could be helpful for IDE support and static analysis.

Consider retaining return type annotations for better type safety:
-def merge_lora(config: str, **kwargs):
+def merge_lora(config: str, **kwargs) -> None:

-def fetch(directory: str, dest: Optional[str]):
+def fetch(directory: str, dest: Optional[str]) -> None:

-def delinearize_llama4(model: str, output: str):
+def delinearize_llama4(model: str, output: str) -> None:
Also applies to: 297-297, 335-335

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9adb88 and 7a8b995.

📒 Files selected for processing (30)

_quarto.yml (1 hunks)
docs/cli.qmd (3 hunks)
docs/multi-node.qmd (1 hunks)
src/axolotl/cli/args.py (0 hunks)
src/axolotl/cli/cloud/__init__.py (4 hunks)
src/axolotl/cli/cloud/base.py (2 hunks)
src/axolotl/cli/cloud/modal_.py (3 hunks)
src/axolotl/cli/config.py (1 hunks)
src/axolotl/cli/delinearize_llama4.py (0 hunks)
src/axolotl/cli/evaluate.py (0 hunks)
src/axolotl/cli/inference.py (0 hunks)
src/axolotl/cli/main.py (11 hunks)
src/axolotl/cli/merge_lora.py (0 hunks)
src/axolotl/cli/merge_sharded_fsdp_weights.py (0 hunks)
src/axolotl/cli/preprocess.py (0 hunks)
src/axolotl/cli/train.py (0 hunks)
src/axolotl/cli/utils.py (0 hunks)
src/axolotl/cli/utils/__init__.py (1 hunks)
src/axolotl/cli/utils/args.py (1 hunks)
src/axolotl/cli/utils/fetch.py (1 hunks)
src/axolotl/cli/utils/load.py (1 hunks)
src/axolotl/cli/utils/train.py (1 hunks)
tests/cli/test_cli_base.py (3 hunks)
tests/cli/test_cli_evaluate.py (5 hunks)
tests/cli/test_cli_inference.py (3 hunks)
tests/cli/test_cli_interface.py (2 hunks)
tests/cli/test_cli_merge_sharded_fsdp_weights.py (1 hunks)
tests/cli/test_cli_sweeps.py (1 hunks)
tests/cli/test_cli_train.py (5 hunks)
tests/cli/test_utils.py (1 hunks)

💤 Files with no reviewable changes (9)

src/axolotl/cli/merge_lora.py
src/axolotl/cli/merge_sharded_fsdp_weights.py
src/axolotl/cli/inference.py
src/axolotl/cli/delinearize_llama4.py
src/axolotl/cli/preprocess.py
src/axolotl/cli/train.py
src/axolotl/cli/args.py
src/axolotl/cli/evaluate.py
src/axolotl/cli/utils.py

✅ Files skipped from review due to trivial changes (2)

tests/cli/test_cli_sweeps.py
src/axolotl/cli/utils/init.py

🚧 Files skipped from review as they are similar to previous changes (12)

tests/cli/test_cli_interface.py
docs/cli.qmd
docs/multi-node.qmd
_quarto.yml
src/axolotl/cli/config.py
src/axolotl/cli/cloud/base.py
tests/cli/test_cli_evaluate.py
tests/cli/test_cli_base.py
tests/cli/test_utils.py
tests/cli/test_cli_merge_sharded_fsdp_weights.py
src/axolotl/cli/cloud/init.py
src/axolotl/cli/utils/load.py

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

tests/cli/test_cli_inference.py (1)

Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

src/axolotl/cli/main.py (1)

Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

🧬 Code Graph Analysis (4)

tests/cli/test_cli_inference.py (2)

tests/cli/conftest.py (2)

config_path (32-37)

cli_runner (22-23)

src/axolotl/cli/main.py (1)

cli (45-49)

src/axolotl/cli/utils/fetch.py (1)

src/axolotl/utils/logging.py (1)

get_logger (42-49)

tests/cli/test_cli_train.py (5)

tests/cli/test_cli_base.py (1)

_test_basic_execution (30-70)

tests/cli/conftest.py (3)

cli_runner (22-23)

valid_test_config (27-28)

config_path (32-37)

src/axolotl/cli/main.py (2)

train (99-134)

cli (45-49)

src/axolotl/cli/cloud/base.py (1)

train (19-27)

src/axolotl/cli/cloud/modal_.py (1)

train (230-247)

src/axolotl/cli/cloud/modal_.py (1)

cicd/multigpu.py (1)

run_cmd (61-66)

🪛 Ruff (0.12.2)

src/axolotl/cli/cloud/modal_.py

239-240: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)

src/axolotl/cli/utils/args.py

68-68: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

106-106: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

src/axolotl/cli/utils/train.py

83-83: Use a context manager for opening files

(SIM115)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.7.0)

🔇 Additional comments (34)

tests/cli/test_cli_inference.py (7)

3-4: LGTM: Appropriate pylint suppression for test files.

The duplicate-code disable is justified for test files that naturally share similar patterns and structure.

15-15: LGTM: Correctly updated to use new launcher option.

The migration from --no-accelerate to --launcher python aligns with the broader CLI refactoring to support flexible launcher selection.

28-28: LGTM: Consistent launcher option usage.

The gradio test correctly uses the new --launcher python option, maintaining consistency with the updated CLI interface.

36-63: LGTM: Comprehensive test for torchrun launcher arguments.

The test properly verifies:

Subprocess invocation with correct launcher

Launcher-specific arguments are forwarded

Module execution format (-m axolotl.cli.inference)

Argument separator handling (--)

This provides good coverage for the new launcher argument functionality.

65-93: LGTM: Well-structured test for accelerate launcher arguments.

The test correctly verifies the accelerate launcher behavior including:

Proper accelerate launch command structure

Forwarding of accelerate-specific arguments

Module execution with correct target (axolotl.cli.inference)

The test coverage is comprehensive and follows good testing practices.

95-123: LGTM: Important test for option interaction.

This test ensures that the --gradio flag and launcher arguments work correctly together, which is crucial for verifying that the CLI argument parsing doesn't have conflicts between different option types.

125-148: LGTM: Excellent backward compatibility test.

This test is crucial for ensuring that existing CLI usage patterns continue to work without modification. The specific check on line 148 that verifies no extraneous launcher arguments are injected is particularly valuable for preventing regressions.

src/axolotl/cli/utils/args.py (2)

12-28: LGTM: Solid implementation for optional type handling.

The function correctly uses the typing utilities to extract the underlying type from Optional/Union annotations. The logic properly handles the Union[T, None] pattern commonly used in type hints.

31-49: LGTM: Clean and useful utility decorator.

The decorator correctly filters None values from kwargs while preserving function metadata with @wraps. This is a good utility for CLI argument processing.

src/axolotl/cli/utils/fetch.py (1)

70-142: LGTM: Well-implemented GitHub sync function with good concurrency handling.

The function demonstrates good practices:

Proper use of ThreadPoolExecutor for concurrent downloads

Comprehensive error handling and logging

Clear summary reporting of sync operations

Reasonable defaults for configuration

The implementation is solid and should work reliably for the intended use case.

src/axolotl/cli/cloud/modal_.py (3)

11-11: LGTM: Appropriate type import for launcher parameter.

The Literal import supports the new type-safe launcher parameter that restricts values to valid launcher options.

233-235: LGTM: Updated parameters align with refactored launcher interface.

The new launcher and launcher_args parameters replace the previous boolean accelerate flag, providing more flexibility for multi-GPU launcher configuration.

275-306: LGTM: Well-implemented command construction for new launcher system.

The updated _train function correctly:

Handles all three launcher types (accelerate, torchrun, python)

Properly formats launcher arguments with the -- separator

Uses safe defaults for empty launcher_args

Maintains clean command string with .strip()

The implementation aligns well with the CLI's expected command format.

tests/cli/test_cli_train.py (8)

3-4: LGTM: Consistent pylint suppression across test files.

The duplicate-code disable is appropriately applied for test files that share similar testing patterns.

23-25: LGTM: Correctly updated base test method call.

The addition of train=True parameter aligns with the base class method signature and ensures proper test behavior for the train command.

44-46: LGTM: Correctly migrated to new launcher option.

The change from --no-accelerate to --launcher python maintains the same test behavior while using the updated CLI interface.

67-71: LGTM: Consistent launcher option usage in CLI overrides test.

The test correctly combines the new --launcher python option with other CLI overrides, ensuring the argument parsing works correctly together.

81-147: LGTM: Comprehensive launcher argument tests.

These tests provide excellent coverage for both torchrun and accelerate launchers:

Proper subprocess mocking and verification

Correct command structure validation

Launcher-specific argument forwarding verification

Module execution format checking

The test patterns are consistent and thorough.

148-181: LGTM: Important backward compatibility verification.

This test ensures that existing usage patterns continue to work correctly without unintended launcher argument injection. The verification logic is sound and consistent with similar tests.

182-216: LGTM: Essential test for argument interaction.

This test ensures that regular CLI arguments and launcher arguments can be used together without conflicts, which is crucial for the flexibility of the new launcher system.

217-251: LGTM: Critical test for cloud launcher integration.

This test ensures that launcher arguments are properly forwarded to cloud training functions, which is essential for multi-GPU cloud deployments. The verification of both launcher type and launcher_args is thorough.

src/axolotl/cli/utils/train.py (7)

13-41: LGTM! Robust default argument handling for distributed training.

The function correctly handles missing rendezvous arguments for torchrun, providing sensible defaults (c10d backend and UUID-based rendezvous ID) when rdzv_endpoint is specified. The UUID truncation to 8 characters is appropriate for rendezvous IDs.

44-64: LGTM! Clean command building utility.

The function properly handles argument formatting, skips None values, and converts underscores to hyphens for CLI compatibility. The implementation is straightforward and follows good practices.

94-113: LGTM! Well-structured launcher dispatch logic.

The function provides a clean interface for launching training with different launchers, properly handling cloud vs local execution and delegating to appropriate launcher-specific functions.

115-135: LGTM! Proper cloud training integration.

The function correctly handles cloud training delegation with appropriate parameter passing and default launcher fallback.

138-164: LGTM! Comprehensive accelerate launcher implementation.

The function properly extracts legacy launcher arguments from kwargs for backward compatibility while supporting new launcher args. The subprocess execution with check=True ensures proper error propagation.

167-181: LGTM! Robust torchrun launcher with default argument injection.

The function correctly integrates with the _add_default_rdzv_args function to ensure proper distributed training setup and follows the same pattern as the accelerate launcher.

184-188: LGTM! Simple and direct python launcher.

The function correctly calls the training CLI directly without subprocess overhead, which is appropriate for single-process execution.

src/axolotl/cli/main.py (6)

37-40: LGTM! Clean launcher command mapping.

The dictionary provides a clean way to map launcher names to their base commands, improving maintainability and consistency across CLI commands.

48-49: LGTM! Centralized environment setup.

Moving load_dotenv() and patch_optimized_env() to the main CLI group initialization ensures consistent environment setup across all commands.

79-135: LGTM! Excellent refactoring of the train command.

The train command has been significantly improved:

Clean launcher argument handling with context extraction

Proper integration with the new utility functions

Simplified sweep configuration handling using generators

Proper file cleanup in finally block

Ray launcher override logic preserved

The use of generate_config_files and launch_training from the utility module provides better separation of concerns and code reuse.

137-178: LGTM! Consistent evaluate command implementation.

The evaluate command follows the same pattern as the train command with proper launcher argument extraction and subprocess handling. The fallback to direct CLI call for non-mapped launchers is appropriate.

181-226: LGTM! Well-implemented inference command.

The inference command maintains consistency with other commands while properly handling the additional --gradio flag in the subprocess command construction.

229-272: LGTM! Consistent merge_sharded_fsdp_weights command.

The command follows the same pattern as other launcher-enabled commands, ensuring consistency across the CLI interface.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/axolotl/cli/utils/train.py (1)

67-91: Temporary file handling correctly addresses previous critical issue.

The implementation properly addresses the previous critical issue by using NamedTemporaryFile with delete=False, ensuring files persist for external use while the consuming code handles cleanup in a finally block.

The static analysis hint about using a context manager is still valid for better resource management:

-        # pylint: disable=consider-using-with
-        temp_file = tempfile.NamedTemporaryFile(
-            mode="w",
-            suffix=".yaml",
-            delete=False,
-            encoding="utf-8",
-        )
-        yaml.dump(permutation, temp_file)
-        temp_file.close()
-        yield temp_file.name
+        with tempfile.NamedTemporaryFile(
+            mode="w",
+            suffix=".yaml",
+            delete=False,
+            encoding="utf-8",
+        ) as temp_file:
+            yaml.dump(permutation, temp_file)
+            yield temp_file.name

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97a06e2 and b1eabc2.

📒 Files selected for processing (2)

src/axolotl/cli/utils/train.py (1 hunks)
tests/cli/test_utils.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/cli/test_utils.py

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

src/axolotl/cli/utils/train.py (1)

Learnt from: winglian
PR: #2707
File: src/axolotl/utils/data/sft.py:247-254
Timestamp: 2025-05-29T22:23:39.312Z
Learning: In distributed training scenarios with batch dispatching, placeholder datasets for non-zero ranks may intentionally use temporary files that persist during training. These files are typically very small and don't require explicit cleanup due to their minimal resource impact and specific training requirements.

🪛 Ruff (0.12.2)

src/axolotl/cli/utils/train.py

83-83: Use a context manager for opening files

(SIM115)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: pre-commit
GitHub Check: pre-commit

🔇 Additional comments (5)

src/axolotl/cli/utils/train.py (5)

13-41: LGTM! Well-structured rendezvous argument handling.

The function correctly handles default RDZV arguments for distributed training, with proper argument copying to avoid mutations and conditional UUID generation.

44-64: LGTM! Clean command building utility.

The function properly handles command construction with appropriate formatting and edge case handling for None values.

94-113: LGTM! Clean launcher dispatch logic.

The function provides clear separation of concerns with proper routing to launcher-specific implementations. The parameter types and default handling are well-structured.

115-135: LGTM! Proper cloud training integration.

The function correctly handles cloud training dispatch with appropriate lazy imports and parameter forwarding.

138-188: LGTM! Well-implemented launcher-specific functions.

The three launcher functions each handle their specific requirements correctly:

_launch_accelerate_training: Properly manages legacy kwargs and combines launcher arguments

_launch_torchrun_training: Correctly applies RDZV defaults and follows consistent patterns

_launch_python_training: Clean direct execution approach

The consistent use of subprocess.run with check=True ensures proper error propagation, and the security annotations indicate appropriate security review.

djsaunde force-pushed the cli-update branch from 9244045 to f58dc84 Compare July 24, 2025 13:54

djsaunde changed the title ~~CLI: add --launcher option, cleanup, refactor~~ CLI: add --launcher option, support launcher args, cleanup, refactor Jul 24, 2025

djsaunde marked this pull request as ready for review July 24, 2025 17:38

djsaunde requested review from NanoCode012, salmanmohammadi and winglian and removed request for salmanmohammadi and winglian July 24, 2025 17:38

djsaunde commented Jul 24, 2025

View reviewed changes

Comment thread src/axolotl/cli/utils/load.py

coderabbitai Bot reviewed Jul 24, 2025

View reviewed changes

Comment thread src/axolotl/cli/utils/train.py Outdated

Comment thread tests/cli/test_utils.py

djsaunde requested a review from salmanmohammadi July 24, 2025 17:53

salmanmohammadi marked this pull request as draft July 24, 2025 17:55

salmanmohammadi marked this pull request as ready for review July 24, 2025 17:55

coderabbitai Bot reviewed Jul 24, 2025

View reviewed changes

salmanmohammadi reviewed Jul 25, 2025

View reviewed changes

Comment thread docs/cli.qmd

salmanmohammadi reviewed Jul 25, 2025

View reviewed changes

Comment thread src/axolotl/cli/cloud/modal_.py

salmanmohammadi reviewed Jul 25, 2025

View reviewed changes

Comment thread src/axolotl/cli/train.py

djsaunde added 6 commits July 27, 2025 18:36

add --launcher option; explicit True/False bool args; small cleanup

08d2753

refactor

38a0771

add torchrun, accelerate cli args

67c3977

add rdzv arg default + tests

625c224

update _quarto

7bd4628

coderabbit

7a8b995

djsaunde force-pushed the cli-update branch from 2a648b3 to 7a8b995 Compare July 27, 2025 22:37

coderabbitai Bot reviewed Jul 27, 2025

View reviewed changes

Comment thread src/axolotl/cli/utils/fetch.py

Comment thread src/axolotl/cli/utils/train.py

djsaunde added 4 commits July 27, 2025 18:54

fix

61bdcae

we can't set rdvz_id independently across nodes

a4daffc

coderabbit

97a06e2

fix tests

b1eabc2

coderabbitai Bot reviewed Jul 28, 2025

View reviewed changes

salmanmohammadi reviewed Jul 28, 2025

View reviewed changes

Comment thread docs/cli.qmd

winglian approved these changes Jul 30, 2025

View reviewed changes

Merge branch 'main' into cli-update

bb47485

djsaunde merged commit bb1cae1 into main Jul 30, 2025
11 of 12 checks passed

djsaunde deleted the cli-update branch July 30, 2025 19:46

coderabbitai Bot mentioned this pull request Aug 15, 2025

[GPT-OSS] improve FSDP shard merging and documentation for GPT-OSS #3073

Merged

coderabbitai Bot mentioned this pull request Sep 13, 2025

Debug log, logging improvements #3159

Merged

3 tasks

Uh oh!

Conversation

djsaunde commented Jul 15, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

Uh oh!

github-actions Bot commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

djsaunde commented Jul 24, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

winglian commented Jul 25, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

djsaunde commented Jul 15, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 15, 2025 •

edited

Loading

github-actions Bot commented Jul 15, 2025 •

edited

Loading

codecov Bot commented Jul 15, 2025 •

edited

Loading