CLI: add --launcher option, support launcher args, cleanup, refactor#2924
Conversation
📝 WalkthroughWalkthroughThis update introduces a unified and extensible launcher argument interface for the Axolotl CLI, replacing the previous accelerate/no-accelerate boolean flag with a Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Suggested labels
Suggested reviewers
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
|
📖 Documentation Preview: https://688279ae06edc29f170a8388--resonant-treacle-0fd729.netlify.app Deployed on Netlify from commit c9adb88 |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (7)
src/axolotl/cli/utils/args.py (3)
23-26: Fix potential type checking issue with NoneType instances.The current logic uses
isinstance(t, NoneType)which could be problematic. Consider usingt is type(None)ort is NoneTypefor more reliable type checking.- field_type = next( - t for t in get_args(field_type) if not isinstance(t, NoneType) - ) + field_type = next( + t for t in get_args(field_type) if t is not type(None) + )
68-68: Use isinstance() for type comparisons.The static analysis correctly identifies that direct type comparison should use
isinstance()instead of==.- if field_type == bool: + if field_type is bool:
106-106: Use isinstance() for type comparisons.Same issue as line 68 - use identity comparison for type checking.
- if field_type == bool: + if field_type is bool:src/axolotl/cli/utils/fetch.py (2)
58-59: Consider making timeout configurable.The hardcoded 30-second timeout might not be suitable for all network conditions or file sizes.
- response = requests.get(raw_url, timeout=30) + response = requests.get(raw_url, timeout=timeout or 30)And add
timeout: int | None = Noneto the function parameters.
87-89: Add response validation for GitHub API.Consider validating the API response structure to handle potential API changes or errors more gracefully.
response.raise_for_status() tree = json.loads(response.text) + + if "tree" not in tree: + raise click.ClickException("Invalid GitHub API response structure")src/axolotl/cli/cloud/modal_.py (1)
239-240: Consider combining nested with statements.The static analysis correctly suggests combining the nested
withstatements for cleaner code.- with modal.enable_output(): - with self.app.run(detach=True): + with modal.enable_output(), self.app.run(detach=True):src/axolotl/cli/utils/train.py (1)
14-43: Move UUID import to module level.Consider moving the
uuidimport to the top of the file for better code organization and to avoid repeated imports if this function is called multiple times.import os import subprocess # nosec import tempfile +import uuid from pathlib import Path from typing import Any, Literal import yamlThen remove line 38:
- import uuid - args.extend(["--rdzv_id", str(uuid.uuid4())[:8]])
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
docs/cli.qmd(3 hunks)docs/multi-node.qmd(1 hunks)src/axolotl/cli/args.py(0 hunks)src/axolotl/cli/cloud/__init__.py(4 hunks)src/axolotl/cli/cloud/base.py(2 hunks)src/axolotl/cli/cloud/modal_.py(3 hunks)src/axolotl/cli/config.py(1 hunks)src/axolotl/cli/delinearize_llama4.py(0 hunks)src/axolotl/cli/evaluate.py(0 hunks)src/axolotl/cli/inference.py(0 hunks)src/axolotl/cli/main.py(11 hunks)src/axolotl/cli/merge_lora.py(0 hunks)src/axolotl/cli/merge_sharded_fsdp_weights.py(0 hunks)src/axolotl/cli/preprocess.py(0 hunks)src/axolotl/cli/train.py(0 hunks)src/axolotl/cli/utils.py(0 hunks)src/axolotl/cli/utils/__init__.py(1 hunks)src/axolotl/cli/utils/args.py(1 hunks)src/axolotl/cli/utils/fetch.py(1 hunks)src/axolotl/cli/utils/load.py(1 hunks)src/axolotl/cli/utils/train.py(1 hunks)tests/cli/test_cli_base.py(3 hunks)tests/cli/test_cli_evaluate.py(5 hunks)tests/cli/test_cli_inference.py(3 hunks)tests/cli/test_cli_interface.py(2 hunks)tests/cli/test_cli_merge_sharded_fsdp_weights.py(1 hunks)tests/cli/test_cli_sweeps.py(1 hunks)tests/cli/test_cli_train.py(5 hunks)tests/cli/test_utils.py(1 hunks)
💤 Files with no reviewable changes (9)
- src/axolotl/cli/delinearize_llama4.py
- src/axolotl/cli/merge_sharded_fsdp_weights.py
- src/axolotl/cli/inference.py
- src/axolotl/cli/merge_lora.py
- src/axolotl/cli/preprocess.py
- src/axolotl/cli/args.py
- src/axolotl/cli/train.py
- src/axolotl/cli/evaluate.py
- src/axolotl/cli/utils.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
docs/multi-node.qmd (1)
Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
docs/cli.qmd (1)
Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
tests/cli/test_cli_inference.py (1)
Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
src/axolotl/cli/main.py (1)
Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
🧬 Code Graph Analysis (5)
tests/cli/test_cli_sweeps.py (1)
src/axolotl/cli/utils/sweeps.py (1)
generate_sweep_configs(8-77)
src/axolotl/cli/utils/load.py (5)
src/axolotl/loaders/processor.py (1)
load_processor(17-56)src/axolotl/loaders/tokenizer.py (1)
load_tokenizer(122-303)src/axolotl/loaders/model.py (1)
ModelLoader(63-804)src/axolotl/utils/dict.py (1)
DictDefault(6-38)src/axolotl/utils/logging.py (1)
get_logger(42-49)
tests/cli/test_cli_merge_sharded_fsdp_weights.py (2)
src/axolotl/cli/main.py (1)
cli(44-48)tests/cli/conftest.py (2)
config_path(32-37)cli_runner(22-23)
tests/cli/test_cli_inference.py (2)
tests/cli/conftest.py (2)
config_path(32-37)cli_runner(22-23)src/axolotl/cli/main.py (1)
cli(44-48)
src/axolotl/cli/cloud/modal_.py (1)
cicd/multigpu.py (1)
run_cmd(61-66)
🪛 Ruff (0.12.2)
src/axolotl/cli/utils/args.py
68-68: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
106-106: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
src/axolotl/cli/cloud/modal_.py
239-240: Use a single with statement with multiple contexts instead of nested with statements
Combine with statements
(SIM117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: preview
🔇 Additional comments (58)
tests/cli/test_cli_sweeps.py (1)
5-5: LGTM: Import path correctly updated for refactored CLI utilities.The import change aligns with the modularization where
generate_sweep_configswas moved fromaxolotl.cli.maintoaxolotl.cli.utils.sweepsand re-exported through the utils package.tests/cli/test_cli_interface.py (2)
21-25: LGTM: Test correctly reflects new CLI argument formatting.The updated assertions properly test the new unified argument format using equals signs (e.g.,
--learning-rate=0.0001) instead of separate option/value pairs, which aligns with the CLI refactoring.
40-40: LGTM: Error message assertion updated for new CLI framework.The expected error message correctly reflects the updated CLI error reporting where invalid options now show "does not exist" instead of "No such option".
src/axolotl/cli/config.py (1)
200-206: LGTM: Improved variable naming and comment formatting.The refactoring enhances code readability by using descriptive variable names (
key, valueinstead ofk, _) while preserving the original logic for configuration parameter handling.docs/cli.qmd (3)
26-38: LGTM: Excellent documentation for launcher argument forwarding.The new section clearly explains the
--separator convention for passing launcher-specific arguments, with practical examples for both torchrun and accelerate. This will help users understand the new launcher interface.
97-101: LGTM: Train command examples updated for new launcher interface.The documentation correctly shows the new
--launcher pythonoption replacing--no-accelerateand provides clear examples of launcher argument forwarding using the--separator.
196-198: LGTM: Consistent launcher argument documentation for evaluate command.The added example maintains consistency with the train command documentation and shows proper usage of launcher arguments for multi-GPU evaluation.
src/axolotl/cli/cloud/base.py (1)
6-6: LGTM: Well-designed abstract interface update for launcher flexibility.The updated
trainmethod signature effectively replaces the booleanaccelerateparameter with a flexiblelauncherparameter using proper Literal typing, while adding support for launcher arguments and additional configuration options. This provides a more extensible interface for cloud training implementations.Also applies to: 19-26
docs/multi-node.qmd (2)
72-77: Documentation clearly presents the new launcher interface.The updated documentation effectively introduces the new
--launcher torchrunsyntax while maintaining the legacy approach for reference. The structure with "Option 1 (Recommended)" and "Option 2 (Legacy)" provides clear guidance to users.
84-92: Good addition of variable substitution guidance.The note about substituting placeholder variables helps prevent common user errors, and the explanation of the CLI approach benefits reinforces the value of the new launcher interface.
src/axolotl/cli/utils/load.py (1)
20-52: Well-designed utility function for model loading.The
load_model_and_tokenizerfunction provides a clean interface for loading models, tokenizers, and processors. The implementation follows good practices:
- Clear type hints and documentation
- Logical loading sequence with appropriate logging
- Proper handling of multimodal models
- Consistent with existing axolotl patterns
src/axolotl/cli/utils/__init__.py (1)
1-23: Clean package initialization with good modular organization.The
__init__.pyeffectively exposes utilities from focused submodules, replacing the previous monolithic approach. The__all__list properly documents the public API, and the imports are well-organized by functionality.tests/cli/test_cli_inference.py (4)
15-15: Properly updated to use new launcher interface.The test correctly replaces the deprecated
--no-accelerateflag with--launcher python, aligning with the CLI refactoring.
36-63: Comprehensive test for torchrun launcher arguments.This test effectively verifies that launcher arguments are correctly passed to torchrun, including proper command construction and argument placement.
65-93: Thorough verification of accelerate launcher integration.The test properly validates accelerate launcher argument handling, ensuring the
accelerate launchcommand is constructed correctly with forwarded arguments.
125-148: Good backward compatibility test.This test ensures that using a launcher without additional arguments doesn't introduce unintended parameters, which is important for maintaining clean command execution.
tests/cli/test_cli_base.py (2)
20-27: Correctly updated validation tests for new launcher interface.The validation tests properly use
--launcher pythoninstead of the deprecated--no-accelerateflag, maintaining consistent error handling verification.
30-70: Well-designed enhancement to basic execution test.The addition of the
trainparameter and updated expected command construction provides better test flexibility and more precise verification of subprocess calls. The explicit debug flags and conditional shard flag properly reflect the CLI implementation.src/axolotl/cli/utils/args.py (1)
52-87: Well-structured decorator for dataclass CLI option generation.The implementation correctly handles boolean flags with paired options and processes fields in reverse order for proper Click option ordering. The metadata usage for help text is a nice touch.
src/axolotl/cli/utils/fetch.py (2)
40-41: Git blob SHA calculation is correctly implemented.The implementation properly follows Git's blob object format with the "blob {size}\0{content}" structure for SHA-1 calculation. Using
usedforsecurity=Falseis appropriate for content integrity checking.
114-135: Well-implemented concurrent download with proper error handling.The ThreadPoolExecutor usage is correct with proper resource management via context manager. Error handling covers both request exceptions and IO errors, with results properly collected and categorized.
tests/cli/test_cli_merge_sharded_fsdp_weights.py (4)
14-16: Good update to use new launcher syntax.The test correctly replaces the deprecated
--no-accelerateflag with--launcher python, maintaining the same functionality while adopting the new unified launcher interface.
23-52: Comprehensive test for torchrun launcher argument forwarding.The test properly verifies that launcher arguments are correctly passed through to the torchrun command, including checking for the module flag and specific torchrun arguments.
46-51: Thorough command verification logic.The assertions correctly verify the command structure: torchrun executable, launcher arguments, module flag, and target module. This ensures the subprocess command is built correctly.
110-111: Smart backward compatibility verification.The test ensures no launcher arguments leak into the command when none are provided, preventing argument contamination. The slice logic correctly isolates the launcher section between 'launch' and '-m'.
tests/cli/test_cli_evaluate.py (4)
23-25: Good parameterization of base test method.Adding the
train=Falseparameter properly distinguishes evaluation tests from training tests, improving test organization and clarity.
41-43: Consistent adoption of new launcher syntax.The test correctly updates from
--no-accelerateto--launcher python, maintaining the same functionality while using the new unified interface.
76-108: Comprehensive launcher argument forwarding test.The test thoroughly verifies that torchrun launcher arguments are properly passed through, including checking command structure and specific arguments. The verification logic is robust.
172-175: Effective backward compatibility verification.The test correctly ensures no unintended launcher arguments are included when none are provided, using proper slice logic to isolate and verify the launcher section.
src/axolotl/cli/cloud/modal_.py (4)
233-235: Good type safety with Literal types.The use of
Literal["accelerate", "torchrun", "python"]provides excellent type safety for the launcher parameter, preventing invalid launcher values at the type level.
275-281: Updated function signature aligns with new launcher interface.The function signature correctly adopts the new launcher parameter pattern, maintaining consistency with the broader CLI refactoring.
287-301: Secure command construction with proper argument handling.The command construction properly handles different launcher types and safely incorporates launcher arguments. The use of
launcher_args or []provides good default handling.
302-306: Command invocation is safe from shell injection.The
run_cmdimplementation usessubprocess.call(cmd.split(), …)withoutshell=True, so there’s no shell interpretation of the command string and no risk of injection via special characters.•
cmd.split()produces a list of arguments passed directly to the OS executable.
• Even if an argument contains characters like;or&&, they’re treated as literals, not shell operators.tests/cli/test_utils.py (3)
77-111: LGTM! Well-structured test utility for launcher verification.The helper function provides clear validation for launcher arguments in subprocess calls with good error messages.
142-149: LGTM! Clean fixture for test data.Good practice to centralize test data in a fixture for reuse across tests.
151-236: LGTM! Comprehensive test coverage for RDZV argument handling.The tests thoroughly cover all scenarios:
- Adding defaults when endpoint is present
- Preserving existing backend/id
- Not adding args when endpoint is missing
- Handling when all args are already present
Good use of descriptive test names and assertions.
src/axolotl/cli/cloud/__init__.py (2)
6-6: LGTM! Modern type annotation syntax.Good update to use the more concise pipe syntax for union types, which is cleaner and more readable.
Also applies to: 14-14, 22-25
33-54: LGTM! Clean integration of launcher parameters.The function signature properly extends to support the new launcher options while maintaining backward compatibility with the default "accelerate" launcher.
tests/cli/test_cli_train.py (7)
44-46: LGTM! Correct migration to new launcher syntax.The test properly updates from the deprecated
--no-accelerateflag to the new--launcher pythonoption.
67-79: LGTM! Test maintains its purpose with new syntax.The test correctly validates CLI overrides while using the new launcher option.
81-113: LGTM! Well-structured test for torchrun launcher arguments.The test properly validates that launcher-specific arguments are correctly passed to the subprocess command.
114-147: LGTM! Comprehensive test for accelerate launcher.The test properly validates the accelerate launch command structure including the "launch" subcommand.
148-181: LGTM! Important backward compatibility test.This test ensures that existing usage patterns continue to work without unintended launcher arguments.
182-216: LGTM! Validates mixed argument handling.The test properly ensures that both regular CLI arguments and launcher-specific arguments are correctly included in the subprocess command.
217-251: LGTM! Cloud integration test.The test ensures that cloud training properly receives and forwards launcher configuration.
src/axolotl/cli/utils/train.py (6)
45-66: LGTM! Clean command building utility.The function properly handles None values and follows standard CLI formatting conventions.
93-112: LGTM! Well-structured dispatcher function.Clean separation of concerns with appropriate delegation to launcher-specific functions.
114-135: LGTM! Clean cloud training integration.The function properly handles the cloud training case with appropriate parameter forwarding.
137-164: LGTM! Good backward compatibility handling.The function properly extracts legacy launcher arguments while supporting new launcher args. The use of
check=Trueensures errors are propagated.Note: Ensure the
# nosec B603suppression is justified by validating that all inputs to the subprocess command are properly sanitized.
166-188: LGTM! Consistent launcher implementations.Both
_launch_torchrun_trainingand_launch_python_trainingfollow the established pattern. Good use of the RDZV default arguments for torchrun.
4-4: Subprocess calls are safe with list‐form invocation (shell=False).The
build_commandfunction constructs thecmdlist andsubprocess.run(cmd, check=True)is invoked withoutshell=True, so there’s no shell interpolation of user input. Each option is passed as a separate argument ("--key=value"), preventing injection via shell metacharacters. No further validation is required here.src/axolotl/cli/main.py (7)
47-49: LGTM! Proper initialization order.Good practice to load environment variables and apply optimizations at CLI group initialization.
57-76: LGTM! Cleaner implementation.Good refactoring to remove redundant environment patching since it's now handled at the CLI group level.
78-135: LGTM! Well-implemented launcher support.The train command properly:
- Accepts launcher arguments after
--separator- Handles Ray launcher override appropriately
- Delegates to utility functions for cleaner code
- Maintains backward compatibility with default "accelerate"
Good error handling for sweep failures.
137-179: LGTM! Consistent implementation pattern.The evaluate command follows the same pattern as train, properly handling launcher arguments and using the command mapping.
181-227: LGTM! Inference command properly implemented.The command correctly handles both launcher arguments and the gradio flag.
229-273: LGTM! Consistent command implementation.The merge_sharded_fsdp_weights command follows the established pattern for launcher support.
275-350: LGTM! Appropriate handling of single-GPU commands.The remaining commands (merge_lora, fetch, vllm_serve, quantize, delinearize_llama4) correctly don't include launcher options as they don't require multi-GPU support. Type annotations have been properly updated.
|
FYI, would need to manually test multi-node for this to be merge ready. |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
_quarto.yml (2)
34-46: Keep CLI contents alphabetically sorted for long-term maintainability
cli.art,cli.delinearize_llama4, and the repositionedcli.quantizebreak the otherwise alphabetical ordering in this block (cli.args→cli.art→cli.checks, …). While the generator does not require ordering, keeping it sorted makes future diffs cleaner and lowers merge-conflict frequency.No functional impact, but consider re-ordering the new entries alphabetically.
50-55: Potential duplicate docs: listing bothcli.utilsand its sub-modules will create redundant pagesQuarto treats each dotted path as a module; documenting
cli.utilspluscli.utils.args/fetch/load/sweeps/trainwill generate one page for the parent and another for each child.
Ifcli.utils.__init__re-exports everything (common in namespace packages) the parent page will repeat the sub-modules’ content, leading to noisy, duplicated docs.Verify the intended structure:
- - cli.utils - - cli.utils.args - - cli.utils.fetch - - cli.utils.load - - cli.utils.sweeps - - cli.utils.train + # Either keep only the parent … + - cli.utils + # … or list only the granular modules, not bothChoose one style and keep it consistent across sections.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
_quarto.yml(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: pre-commit
- GitHub Check: preview
- GitHub Check: pre-commit
|
manually triggered multi-gpu since that uses the cli to launch tests https://github.com/axolotl-ai-cloud/axolotl/actions/runs/16511480037 |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/axolotl/cli/utils/args.py (2)
52-87: Fix type comparison for better practices.The function logic for generating Click options from dataclass fields is solid, but the type comparison should be improved.
- if field_type == bool: + if field_type is bool:Using
isfor type comparisons is more explicit and follows Python best practices for singleton types likebool.
90-120: Fix type comparison and consider consistency with dataclass version.The Pydantic model option generation follows good patterns, but has the same type comparison issue.
- if field_type == bool: + if field_type is bool:Also note that this function sets
default=Nonefor all options while the dataclass version usesfield.default. Consider if this difference in behavior is intentional.src/axolotl/cli/main.py (1)
280-280: Minor: Removal of return type annotations.The return type annotations were removed from these function signatures. While not incorrect, it reduces type safety information that could be helpful for IDE support and static analysis.
Consider retaining return type annotations for better type safety:
-def merge_lora(config: str, **kwargs): +def merge_lora(config: str, **kwargs) -> None: -def fetch(directory: str, dest: Optional[str]): +def fetch(directory: str, dest: Optional[str]) -> None: -def delinearize_llama4(model: str, output: str): +def delinearize_llama4(model: str, output: str) -> None:Also applies to: 297-297, 335-335
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (30)
_quarto.yml(1 hunks)docs/cli.qmd(3 hunks)docs/multi-node.qmd(1 hunks)src/axolotl/cli/args.py(0 hunks)src/axolotl/cli/cloud/__init__.py(4 hunks)src/axolotl/cli/cloud/base.py(2 hunks)src/axolotl/cli/cloud/modal_.py(3 hunks)src/axolotl/cli/config.py(1 hunks)src/axolotl/cli/delinearize_llama4.py(0 hunks)src/axolotl/cli/evaluate.py(0 hunks)src/axolotl/cli/inference.py(0 hunks)src/axolotl/cli/main.py(11 hunks)src/axolotl/cli/merge_lora.py(0 hunks)src/axolotl/cli/merge_sharded_fsdp_weights.py(0 hunks)src/axolotl/cli/preprocess.py(0 hunks)src/axolotl/cli/train.py(0 hunks)src/axolotl/cli/utils.py(0 hunks)src/axolotl/cli/utils/__init__.py(1 hunks)src/axolotl/cli/utils/args.py(1 hunks)src/axolotl/cli/utils/fetch.py(1 hunks)src/axolotl/cli/utils/load.py(1 hunks)src/axolotl/cli/utils/train.py(1 hunks)tests/cli/test_cli_base.py(3 hunks)tests/cli/test_cli_evaluate.py(5 hunks)tests/cli/test_cli_inference.py(3 hunks)tests/cli/test_cli_interface.py(2 hunks)tests/cli/test_cli_merge_sharded_fsdp_weights.py(1 hunks)tests/cli/test_cli_sweeps.py(1 hunks)tests/cli/test_cli_train.py(5 hunks)tests/cli/test_utils.py(1 hunks)
💤 Files with no reviewable changes (9)
- src/axolotl/cli/merge_lora.py
- src/axolotl/cli/merge_sharded_fsdp_weights.py
- src/axolotl/cli/inference.py
- src/axolotl/cli/delinearize_llama4.py
- src/axolotl/cli/preprocess.py
- src/axolotl/cli/train.py
- src/axolotl/cli/args.py
- src/axolotl/cli/evaluate.py
- src/axolotl/cli/utils.py
✅ Files skipped from review due to trivial changes (2)
- tests/cli/test_cli_sweeps.py
- src/axolotl/cli/utils/init.py
🚧 Files skipped from review as they are similar to previous changes (12)
- tests/cli/test_cli_interface.py
- docs/cli.qmd
- docs/multi-node.qmd
- _quarto.yml
- src/axolotl/cli/config.py
- src/axolotl/cli/cloud/base.py
- tests/cli/test_cli_evaluate.py
- tests/cli/test_cli_base.py
- tests/cli/test_utils.py
- tests/cli/test_cli_merge_sharded_fsdp_weights.py
- src/axolotl/cli/cloud/init.py
- src/axolotl/cli/utils/load.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
tests/cli/test_cli_inference.py (1)
Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
src/axolotl/cli/main.py (1)
Learnt from: NanoCode012
PR: #2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the --ipc=host flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
🧬 Code Graph Analysis (4)
tests/cli/test_cli_inference.py (2)
tests/cli/conftest.py (2)
config_path(32-37)cli_runner(22-23)src/axolotl/cli/main.py (1)
cli(45-49)
src/axolotl/cli/utils/fetch.py (1)
src/axolotl/utils/logging.py (1)
get_logger(42-49)
tests/cli/test_cli_train.py (5)
tests/cli/test_cli_base.py (1)
_test_basic_execution(30-70)tests/cli/conftest.py (3)
cli_runner(22-23)valid_test_config(27-28)config_path(32-37)src/axolotl/cli/main.py (2)
train(99-134)cli(45-49)src/axolotl/cli/cloud/base.py (1)
train(19-27)src/axolotl/cli/cloud/modal_.py (1)
train(230-247)
src/axolotl/cli/cloud/modal_.py (1)
cicd/multigpu.py (1)
run_cmd(61-66)
🪛 Ruff (0.12.2)
src/axolotl/cli/cloud/modal_.py
239-240: Use a single with statement with multiple contexts instead of nested with statements
Combine with statements
(SIM117)
src/axolotl/cli/utils/args.py
68-68: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
106-106: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
src/axolotl/cli/utils/train.py
83-83: Use a context manager for opening files
(SIM115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
🔇 Additional comments (34)
tests/cli/test_cli_inference.py (7)
3-4: LGTM: Appropriate pylint suppression for test files.The
duplicate-codedisable is justified for test files that naturally share similar patterns and structure.
15-15: LGTM: Correctly updated to use new launcher option.The migration from
--no-accelerateto--launcher pythonaligns with the broader CLI refactoring to support flexible launcher selection.
28-28: LGTM: Consistent launcher option usage.The gradio test correctly uses the new
--launcher pythonoption, maintaining consistency with the updated CLI interface.
36-63: LGTM: Comprehensive test for torchrun launcher arguments.The test properly verifies:
- Subprocess invocation with correct launcher
- Launcher-specific arguments are forwarded
- Module execution format (
-m axolotl.cli.inference)- Argument separator handling (
--)This provides good coverage for the new launcher argument functionality.
65-93: LGTM: Well-structured test for accelerate launcher arguments.The test correctly verifies the accelerate launcher behavior including:
- Proper
accelerate launchcommand structure- Forwarding of accelerate-specific arguments
- Module execution with correct target (
axolotl.cli.inference)The test coverage is comprehensive and follows good testing practices.
95-123: LGTM: Important test for option interaction.This test ensures that the
--gradioflag and launcher arguments work correctly together, which is crucial for verifying that the CLI argument parsing doesn't have conflicts between different option types.
125-148: LGTM: Excellent backward compatibility test.This test is crucial for ensuring that existing CLI usage patterns continue to work without modification. The specific check on line 148 that verifies no extraneous launcher arguments are injected is particularly valuable for preventing regressions.
src/axolotl/cli/utils/args.py (2)
12-28: LGTM: Solid implementation for optional type handling.The function correctly uses the typing utilities to extract the underlying type from Optional/Union annotations. The logic properly handles the Union[T, None] pattern commonly used in type hints.
31-49: LGTM: Clean and useful utility decorator.The decorator correctly filters None values from kwargs while preserving function metadata with
@wraps. This is a good utility for CLI argument processing.src/axolotl/cli/utils/fetch.py (1)
70-142: LGTM: Well-implemented GitHub sync function with good concurrency handling.The function demonstrates good practices:
- Proper use of ThreadPoolExecutor for concurrent downloads
- Comprehensive error handling and logging
- Clear summary reporting of sync operations
- Reasonable defaults for configuration
The implementation is solid and should work reliably for the intended use case.
src/axolotl/cli/cloud/modal_.py (3)
11-11: LGTM: Appropriate type import for launcher parameter.The
Literalimport supports the new type-safe launcher parameter that restricts values to valid launcher options.
233-235: LGTM: Updated parameters align with refactored launcher interface.The new
launcherandlauncher_argsparameters replace the previous boolean accelerate flag, providing more flexibility for multi-GPU launcher configuration.
275-306: LGTM: Well-implemented command construction for new launcher system.The updated
_trainfunction correctly:
- Handles all three launcher types (accelerate, torchrun, python)
- Properly formats launcher arguments with the
--separator- Uses safe defaults for empty launcher_args
- Maintains clean command string with
.strip()The implementation aligns well with the CLI's expected command format.
tests/cli/test_cli_train.py (8)
3-4: LGTM: Consistent pylint suppression across test files.The duplicate-code disable is appropriately applied for test files that share similar testing patterns.
23-25: LGTM: Correctly updated base test method call.The addition of
train=Trueparameter aligns with the base class method signature and ensures proper test behavior for the train command.
44-46: LGTM: Correctly migrated to new launcher option.The change from
--no-accelerateto--launcher pythonmaintains the same test behavior while using the updated CLI interface.
67-71: LGTM: Consistent launcher option usage in CLI overrides test.The test correctly combines the new
--launcher pythonoption with other CLI overrides, ensuring the argument parsing works correctly together.
81-147: LGTM: Comprehensive launcher argument tests.These tests provide excellent coverage for both torchrun and accelerate launchers:
- Proper subprocess mocking and verification
- Correct command structure validation
- Launcher-specific argument forwarding verification
- Module execution format checking
The test patterns are consistent and thorough.
148-181: LGTM: Important backward compatibility verification.This test ensures that existing usage patterns continue to work correctly without unintended launcher argument injection. The verification logic is sound and consistent with similar tests.
182-216: LGTM: Essential test for argument interaction.This test ensures that regular CLI arguments and launcher arguments can be used together without conflicts, which is crucial for the flexibility of the new launcher system.
217-251: LGTM: Critical test for cloud launcher integration.This test ensures that launcher arguments are properly forwarded to cloud training functions, which is essential for multi-GPU cloud deployments. The verification of both launcher type and launcher_args is thorough.
src/axolotl/cli/utils/train.py (7)
13-41: LGTM! Robust default argument handling for distributed training.The function correctly handles missing rendezvous arguments for torchrun, providing sensible defaults (
c10dbackend and UUID-based rendezvous ID) whenrdzv_endpointis specified. The UUID truncation to 8 characters is appropriate for rendezvous IDs.
44-64: LGTM! Clean command building utility.The function properly handles argument formatting, skips None values, and converts underscores to hyphens for CLI compatibility. The implementation is straightforward and follows good practices.
94-113: LGTM! Well-structured launcher dispatch logic.The function provides a clean interface for launching training with different launchers, properly handling cloud vs local execution and delegating to appropriate launcher-specific functions.
115-135: LGTM! Proper cloud training integration.The function correctly handles cloud training delegation with appropriate parameter passing and default launcher fallback.
138-164: LGTM! Comprehensive accelerate launcher implementation.The function properly extracts legacy launcher arguments from kwargs for backward compatibility while supporting new launcher args. The subprocess execution with
check=Trueensures proper error propagation.
167-181: LGTM! Robust torchrun launcher with default argument injection.The function correctly integrates with the
_add_default_rdzv_argsfunction to ensure proper distributed training setup and follows the same pattern as the accelerate launcher.
184-188: LGTM! Simple and direct python launcher.The function correctly calls the training CLI directly without subprocess overhead, which is appropriate for single-process execution.
src/axolotl/cli/main.py (6)
37-40: LGTM! Clean launcher command mapping.The dictionary provides a clean way to map launcher names to their base commands, improving maintainability and consistency across CLI commands.
48-49: LGTM! Centralized environment setup.Moving
load_dotenv()andpatch_optimized_env()to the main CLI group initialization ensures consistent environment setup across all commands.
79-135: LGTM! Excellent refactoring of the train command.The train command has been significantly improved:
- Clean launcher argument handling with context extraction
- Proper integration with the new utility functions
- Simplified sweep configuration handling using generators
- Proper file cleanup in finally block
- Ray launcher override logic preserved
The use of
generate_config_filesandlaunch_trainingfrom the utility module provides better separation of concerns and code reuse.
137-178: LGTM! Consistent evaluate command implementation.The evaluate command follows the same pattern as the train command with proper launcher argument extraction and subprocess handling. The fallback to direct CLI call for non-mapped launchers is appropriate.
181-226: LGTM! Well-implemented inference command.The inference command maintains consistency with other commands while properly handling the additional
--gradioflag in the subprocess command construction.
229-272: LGTM! Consistent merge_sharded_fsdp_weights command.The command follows the same pattern as other launcher-enabled commands, ensuring consistency across the CLI interface.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/axolotl/cli/utils/train.py (1)
67-91: Temporary file handling correctly addresses previous critical issue.The implementation properly addresses the previous critical issue by using
NamedTemporaryFilewithdelete=False, ensuring files persist for external use while the consuming code handles cleanup in afinallyblock.The static analysis hint about using a context manager is still valid for better resource management:
- # pylint: disable=consider-using-with - temp_file = tempfile.NamedTemporaryFile( - mode="w", - suffix=".yaml", - delete=False, - encoding="utf-8", - ) - yaml.dump(permutation, temp_file) - temp_file.close() - yield temp_file.name + with tempfile.NamedTemporaryFile( + mode="w", + suffix=".yaml", + delete=False, + encoding="utf-8", + ) as temp_file: + yaml.dump(permutation, temp_file) + yield temp_file.name
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/axolotl/cli/utils/train.py(1 hunks)tests/cli/test_utils.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/cli/test_utils.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
src/axolotl/cli/utils/train.py (1)
Learnt from: winglian
PR: #2707
File: src/axolotl/utils/data/sft.py:247-254
Timestamp: 2025-05-29T22:23:39.312Z
Learning: In distributed training scenarios with batch dispatching, placeholder datasets for non-zero ranks may intentionally use temporary files that persist during training. These files are typically very small and don't require explicit cleanup due to their minimal resource impact and specific training requirements.
🪛 Ruff (0.12.2)
src/axolotl/cli/utils/train.py
83-83: Use a context manager for opening files
(SIM115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: pre-commit
- GitHub Check: pre-commit
🔇 Additional comments (5)
src/axolotl/cli/utils/train.py (5)
13-41: LGTM! Well-structured rendezvous argument handling.The function correctly handles default RDZV arguments for distributed training, with proper argument copying to avoid mutations and conditional UUID generation.
44-64: LGTM! Clean command building utility.The function properly handles command construction with appropriate formatting and edge case handling for None values.
94-113: LGTM! Clean launcher dispatch logic.The function provides clear separation of concerns with proper routing to launcher-specific implementations. The parameter types and default handling are well-structured.
115-135: LGTM! Proper cloud training integration.The function correctly handles cloud training dispatch with appropriate lazy imports and parameter forwarding.
138-188: LGTM! Well-implemented launcher-specific functions.The three launcher functions each handle their specific requirements correctly:
_launch_accelerate_training: Properly manages legacy kwargs and combines launcher arguments_launch_torchrun_training: Correctly applies RDZV defaults and follows consistent patterns_launch_python_training: Clean direct execution approachThe consistent use of
subprocess.runwithcheck=Trueensures proper error propagation, and the security annotations indicate appropriate security review.
Description
Motivation and Context
How has this been tested?
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
New Features
--launcheroption for selecting between "accelerate", "torchrun", or "python", with the ability to pass launcher-specific arguments using--.Bug Fixes
--no-accelerateusage in tests and documentation, ensuring compatibility with new launcher options.Refactor
Documentation
Tests
Chores