
Conversation

jskswamy

What this PR does / why we need it:

This PR adds support for running a Python script directly in CustomTrainer by specifying a python_file argument. If python_file is set, the job will run the specified script as the main process (python myscript.py) instead of requiring a function. This is mutually exclusive with func; an error is raised if both or neither are set. The original function-based usage is unchanged and fully backward compatible.

  • Simplifies migration from script-based workflows (e.g., Slurm, bash, or direct Kubernetes Jobs) to Kubeflow Trainer.
  • Matches user expectations for direct script execution (python myscript.py).
  • Avoids the indirection and complexity of function serialization and wrapper scripts.
  • Ensures the script runs as the main process, improving signal handling and debugging.

Which issue(s) this PR fixes:

Fixes #47

Summary of changes:

  • CustomTrainer now accepts an optional python_file argument.
  • If python_file is set, the SDK sets the container entrypoint to ["python", python_file].
  • Mutually exclusive with func; validation is added.
  • No changes to existing function-based usage.
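
A minimal usage sketch of the proposed API (the import path and exact signature are illustrative and may differ from the final implementation):

from kubeflow.trainer import CustomTrainer, TrainerClient

# python_file and func are mutually exclusive; setting both (or neither) raises an error.
TrainerClient().train(
    trainer=CustomTrainer(
        python_file="train.py",  # executed as `python train.py` in the trainer container
        num_nodes=2,
    )
)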

Additional context:
See kubeflow/sdk#47 for the original feature request and motivation.


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from 5d1a192 to 4273971 Compare July 24, 2025 13:58
@eoinfennessy
Member

I'm not sure about this approach, as it assumes the training script to be run is already available within the training runtime image. This would require users to rebuild their training images each time a change is made to a training script.

Having said that, I think it would be a big UX improvement to allow users to provide a training script to run instead of a train function, but this script should come from the user's workspace instead of being baked into the training runtime image. This is related to issue #48, which describes making a snapshot of a user's workspaces available in training pods so that [multiple] Python files can be used in training jobs.

@jskswamy
Author

@eoinfennessy! I agree that this would be a big UX improvement. However, I'd like to clarify that the current implementation doesn't necessarily require rebuilding Docker images every time.

The python_file feature is designed to be flexible and can work with various code staging strategies:

Code Staging Options (No Image Rebuild Required)

  1. ConfigMaps: Training scripts can be mounted as ConfigMaps, allowing users to update scripts without rebuilding images (see the sketch after this list)
  2. Object Storage: Scripts can be stored in object storage (S3, GCS, etc.) and mounted as volumes
  3. Init Containers: An init container can clone repositories or download scripts before the training pod starts
  4. PVC with Git Repos: Persistent Volume Claims can mount git repositories that get updated
  5. Runtime Volume Mounts: Scripts can be mounted from the host or other storage systems
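
As an illustration of option 1, a minimal sketch using the standard Kubernetes Python client (the ConfigMap name, namespace, and file path are hypothetical) of publishing a local script without rebuilding the image:

# Hypothetical example: expose a local training script as a ConfigMap so the
# runtime image does not need to be rebuilt when the script changes.
from kubernetes import client, config

config.load_kube_config()

with open("train.py") as f:
    script = f.read()

configmap = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="train-script"),
    data={"train.py": script},
)
client.CoreV1Api().create_namespaced_config_map(namespace="default", body=configmap)

The runtime (or the TrainJob pod spec) would then mount this ConfigMap at the path passed as python_file.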

UX Improvements This Feature Enables

The python_file approach provides significant UX benefits:

  • Simplified Migration: Users can easily migrate from existing script-based workflows (Slurm, bash, direct K8s Jobs) to Kubeflow Trainer
  • Intuitive Execution: Matches user expectations for direct script execution (python myscript.py)
  • Reduced Complexity: Eliminates the indirection and complexity of function serialization and wrapper scripts
  • Better Debugging: Scripts run as the main process, improving signal handling and debugging capabilities
  • Familiar Workflow: Users can continue using their existing training scripts without major modifications

Architecture Flexibility

The current implementation sets the container entrypoint to ["python", python_file], which is intentionally simple and allows the runtime to handle script availability through its own mechanisms. This separation of concerns means:

  • The SDK focuses on execution semantics
  • The runtime handles script staging and availability
  • Users can choose their preferred code staging strategy

Future Integration

You're absolutely right that workspace snapshotting (#48) would be an excellent complementary feature. When that's implemented, users could have their entire workspace available, making the python_file feature even more powerful. But the current implementation provides immediate value without blocking on workspace snapshotting.

@coveralls

Pull Request Test Coverage Report for Build 16498943138

Details

  • 2 of 9 (22.22%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.8%) to 63.939%

Changes Missing Coverage                   Covered Lines   Changed/Added Lines   %
python/kubeflow/trainer/utils/utils.py    2               9                     22.22%

Totals
  • Change from base Build 16463414499: -0.8%
  • Covered Lines: 250
  • Relevant Lines: 391

💛 - Coveralls

@andreyvelich
Member

Thank you for this @jskswamy!
I am wondering what the benefits of such an API are compared to just this:

from custom_script import run_pytorch

TrainerClient().train(
    trainer=CustomTrainer(
        func=run_pytorch,
        num_nodes=5,
    )
)

If the TrainJob contains users' workspace, they should be able to execute the above script.
cc @shravan-achar

@jskswamy
Author

jskswamy commented Aug 7, 2025

The python_file feature gives us the following:

1. Framework Integration is Much Cleaner

# Clean framework API
trainer = CustomTrainer(python_file="train.py", python_args=["--epochs", "100"], num_nodes=5)

# Without it, frameworks need to do the following
def wrapper_function():
    import subprocess
    subprocess.run(["python", "train.py", "--epochs", "100"], check=True)

2. Natural Migration Path

# What users do now (from the shell):
#   python train.py --epochs 100 --batch-size 32

# The same thing, just wrapped:
trainer = CustomTrainer(
    python_file="train.py",
    python_args=["--epochs", "100", "--batch-size", "32"],
    num_nodes=5,
)

3. Better Debugging & Signal Handling

  • Script runs as main process (python train.py --args)
  • Proper signal handling (SIGTERM, SIGINT)
  • Direct access to sys.argv and command-line arguments
  • Native Python debugging and profiling
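
As a small illustration of the signal-handling point (the handler shown is illustrative, not part of this PR), a script running as the container's main process receives the pod's SIGTERM directly:

import signal
import sys

def handle_sigterm(signum, frame):
    # e.g. flush logs or save a checkpoint before the pod is terminated
    print("received SIGTERM, exiting cleanly")
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

With a subprocess wrapper, the signal is delivered to the wrapper process and must be forwarded to the child explicitly, which is part of the indirection this feature avoids.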

4. User Experience is Just Better

# User writes exactly what they want
trainer = CustomTrainer(
    python_file="train.py",
    python_args=["--epochs", "100", "--batch-size", "32"],
    num_nodes=5
)

# Function wrapping is harder
def train_function():
    # Complex argument handling, signal handling, subprocess management
    pass

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from 4273971 to e96e53a Compare August 7, 2025 11:27
@google-oss-prow google-oss-prow bot added size/L and removed size/S labels Aug 7, 2025
@andreyvelich
Member

@jskswamy A few questions:

  1. How can you ensure that the Python file is inside the TrainJob workload?
  2. As you can see, the entrypoint depends on the runtime Trainer. It could be torchrun, mpirun, or python in the case of PlainML.

@jskswamy
Author

jskswamy commented Aug 8, 2025

@andreyvelich Thanks for pointing out the lack of framework-aware command support. I have now added the following support.

The python_file feature now uses the runtime's framework-specific commands instead of hardcoded ["python"]:

  • PyTorch runtime: Uses torchrun train.py --args
  • MPI runtime: Uses mpirun python train.py --args
  • PlainML runtime: Uses python train.py --args
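
A rough sketch of that runtime-aware assembly (the helper name and structure are illustrative; the real SDK code may differ):

def build_command(launcher, python_file, python_args):
    # launcher comes from the runtime, e.g. ["torchrun"], ["mpirun", "python"], or ["python"]
    return [*launcher, python_file, *python_args]

build_command(["torchrun"], "train.py", ["--epochs", "100"])
# -> ["torchrun", "train.py", "--epochs", "100"]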

Regarding Python file availability, we assume the Python file will be made available by the consumer through one of the following:

  1. Embedded in container image - Scripts baked into the Docker image
  2. ConfigMaps - Scripts mounted as ConfigMaps
  3. Volume mounts - Scripts mounted via PVC or other volumes
  4. Init containers - Scripts downloaded by init containers

This design assumption provides flexibility for different deployment patterns while keeping the python_file feature simple and focused on direct script execution.

@kramaranya
Contributor

Thank you @jskswamy!

Have you considered how this affects the plugin architecture when someone writes a custom runtime?

Comment on lines 378 to 380
trainer_crd.command = ["python"]
# Combine python_file with python_args
args = [trainer.python_file]
Contributor

This still hardcodes command = ["python"] for all runtimes, which will not work for distributed training with MPI and PyTorch, right?

if trainer.python_args:
    args.extend(trainer.python_args)
trainer_crd.args = args
return trainer_crd

@andreyvelich
Member

> Embedded in container image - Scripts baked into the Docker image

What would be the difference between using a Python function vs. letting the user pass the Docker image directly to the custom trainer, i.e.

train(
  trainer=CustomTrainer(
    image=...
  )
)

> ConfigMaps - Scripts mounted as ConfigMaps
> Volume mounts - Scripts mounted via PVC or other volumes
> Init containers - Scripts downloaded by init containers

I still think that we should prioritize this feature to make it work: #48
It would be much easier to use if we snapshot the filesystem into the TrainJob, so users don't need to worry about it, and just run:

from train import run_pytorch

TrainerClient().train(
    trainer=CustomTrainer(
        func=run_pytorch,
        num_nodes=5,
    )
)

In that case train.py might import other functions from the user's workspace, and it works.

WDYT @jskswamy ?

@kramaranya
Contributor

Hey @jskswamy! Did you have a chance to check the latest comments?

@jskswamy
Author

jskswamy commented Sep 2, 2025

I went through the comments and am making the necessary changes to accommodate the inputs.

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from e96e53a to f7179e8 Compare September 9, 2025 09:52
@jskswamy
Author

I updated the original approach proposed in PR #49 to address the review concerns and make the feature more extensible and robust.

What changed

  • I replaced python_file/python_args with a new, explicit CommandTrainer that runs an arbitrary command with arguments under the runtime’s launcher.
  • The function path (CustomTrainer) remains intact, without special-casing or hardcoding.

Why this fits the plugin architecture

  • I reuse the runtime’s launcher template (torchrun/mpirun/python) and only replace the payload, so custom runtimes continue to control entrypoints and arguments.
  • No command=["python"] hardcoding; the launcher remains framework-aware. See discussion in PR #49.

Distributed runtimes

  • Torch uses torchrun, MPI uses mpirun (with --user pip installs), PlainML uses python.
  • This works for multi-node and device-aware training consistently.

Shared features are preserved (no early return)

  • Environment variables, resources, and pip installs are applied in the same builder path as functions.
  • The payload includes optional installs followed by the user command.
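
For illustration, the payload could be assembled roughly like this (formatting and pip flags are illustrative; the SDK's actual template may differ):

# Hypothetical sketch: optional pip installs followed by the user command
# (already prefixed with the runtime launcher), joined under a single bash -c.
packages = ["nemo", "deepspeed"]
user_command = "torchrun --standalone train.py"

steps = []
if packages:
    steps.append("pip install --user " + " ".join(packages))
steps.append(user_command)

command = ["bash", "-c", " && ".join(steps)]
# -> ['bash', '-c', 'pip install --user nemo deepspeed && torchrun --standalone train.py']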

File availability assumption

  • Unchanged from the earlier discussion: the script or command is expected to be made available in the container, whether baked into the image, mounted via ConfigMap or volume, or fetched by an init container.

Validation strategy

  • Consistent with CustomTrainer, I validate in the builder and error if command is missing/invalid.
  • I can add __post_init__ later for earlier failure if needed, but builder-level checks match current patterns.

Tests and docs

  • Helper tests (plain/MPI) assert exact launcher shape and multiline payloads.
  • CRD builder tests cover env/resources/num_nodes and MPI --user installs.
  • Backend routing test verifies CommandTrainer dispatch.
  • README includes a minimal CommandTrainer usage example.
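
For context, a sketch along the lines of that README example (names and parameters are taken from this thread; the final API may differ):

from kubeflow.trainer import TrainerClient, types

client = TrainerClient()
runtime = client.get_runtime("torch")  # or "mpi", "plainml"

trainer = types.CommandTrainer(
    command=["torchrun"],
    args=["--standalone", "train.py"],
    packages_to_install=["deepspeed"],  # optional; installed before the command runs
    num_nodes=2,
)

client.train(runtime=runtime, trainer=trainer)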

Outcome

  • Clear, framework-aware command execution without bypassing shared features or limiting plugin extensibility, and a clean separation from the function-based path.

@kramaranya, @andreyvelich, kindly review the changes and let me know what you think about the new approach.

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from f7179e8 to cd8d184 Compare September 10, 2025 12:03
client = TrainerClient()
rt = client.get_runtime("torch") # or "mpi", "plainml"

trainer = types.CommandTrainer(
@andreyvelich
Member

andreyvelich commented Sep 11, 2025

This is an interesting idea and somewhat aligned with what we discussed today at the Kubeflow SDK call with KubernetesTrainer proposed by @szaher: https://youtu.be/mv8GoWdefck?t=832

Since we distinguish the runtime trainers between CustomTrainer and BuiltinTrainer, I am wondering if we want to introduce a CustomTrainerContainer() type which gives users control to configure the image, container, and args instead of passing the training function.

Would that be helpful for integration between KFP and Trainer?

Thoughts @kubeflow/kubeflow-sdk-team @mprahl @franciscojavierarceo @ederign @rudeigerc?

Introduce a new CommandTrainer dataclass to facilitate
the execution of arbitrary commands within the runtime's
launcher template, allowing for configuration of command
arguments, package installations, and environment variables.

Enhance the utility function to build runtime-aware
commands, preserving the launcher and incorporating
optional package installations. This change aims to
simplify the command execution process in various
runtime environments, including support for MPI.

Add corresponding tests to validate the new functionality.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add a new function `get_trainer_crd_from_command_trainer` to build
the Trainer Custom Resource Definition (CRD) for CommandTrainer.
This function preserves the environment variables and resource
settings while utilizing a runtime-aware command assembly helper.

Enhance unit tests to verify that the new function correctly
builds the CRD with the expected configuration, including
environment variables and resource allocation.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add support for CommandTrainer in the Kubernetes backend.
This change ensures that the CommandTrainer can be used with
the appropriate runtime. It raises a ValueError if a
CommandTrainer is used with an incompatible runtime type.

Additionally, update the error message to reflect the new
trainer type support, ensuring clearer communication for
users regarding valid trainer options.

Include a new test to verify that CommandTrainer is correctly
routed to its CRD builder during the training process.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add detailed instructions for using the CommandTrainer
to run custom commands in the runtime. This includes
code snippets for setup, specifying the command,
arguments, and environment variables.

Also, clarify that the launcher is runtime-aware and
provide notes on package installations and script
requirements within the container.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Enhance the trainer parameter in both TrainerClient and
KubernetesBackend to include CommandTrainer.

Including CommandTrainer in the module's exports ensures
that it is readily available for use in other parts of
the application

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Refactor the CommandTrainer class to allow the command attribute to be
optional, enabling more flexible usage scenarios. If no command is
provided, defaults are chosen based on the runtime framework, with args
passed as-is.

Additionally, enhance the get_trainer_crd_from_command_trainer function
to use a bash-wrapped path if packages need to be installed or if no
explicit command is provided. This change preserves installation features
and runtime launcher behavior.

Add unit tests to verify behavior when no command is set and when
commands are passed without installations, ensuring the correct command
and arguments are returned in these scenarios.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from ccf9090 to f1eea03 Compare September 17, 2025 01:49
@jskswamy
Author

Update on CommandTrainer behavior (pass-through + launcher-aware fallback)

  • When command is provided and no packages_to_install: CRD uses command/args directly (no shell wrapper).
  • When packages_to_install is set OR command is not provided: runtime launcher is reused via get_command_using_user_command (bash-wrapped), preserving installs and launcher selection (torchrun/mpirun/python).
  • Mirrors BuiltinTrainer default launcher selection for the fallback path; explicit commands remain unchanged.
  • Tests updated to cover both pass-through and launcher-wrapped paths.

Examples:

# Direct (no wrapper)
trainer = types.CommandTrainer(command=["torchrun"], args=["--standalone", "train.py"])

# Install-aware (bash-wrapped)
trainer = types.CommandTrainer(
  command=["torchrun"],
  args=["--standalone", "train.py"],
  packages_to_install=["nemo", "deepspeed"],
)

Enhance the CommandTrainer class by adding a new attribute,
pip_extra_args, to accommodate additional pip installation flags.
This change improves flexibility in package management during
runtime.

Also update related utility functions to handle the new
parameter seamlessly, ensuring that users can specify extra
pip arguments when building command strings.

- Added pip_extra_args to CommandTrainer
- Updated get_script_for_python_packages and
  get_command_using_user_command functions to include
  pip_extra_args handling
- Included a test case to verify the functionality of
  pip_extra_args in command generation

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Refactor the command handling logic to always produce a bash-wrapped
command, ensuring shell interpolation and preserving runtime launcher
behavior. This change simplifies the logic by removing the conditional
check for package installations, thereby making the behavior consistent
regardless of whether installations are needed.

Update the test case to reflect this change, ensuring that the
command is always wrapped in bash, even when no packages are
installed. This improves predictability and reduces potential
issues related to command execution.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>