
Conversation

jskswamy

What this PR does / why we need it:

This PR adds support for running a Python script directly in CustomTrainer by specifying a python_file argument. If python_file is set, the job will run the specified script as the main process (python myscript.py) instead of requiring a function. This is mutually exclusive with func; an error is raised if both or neither are set. The original function-based usage is unchanged and fully backward compatible.

  • Simplifies migration from script-based workflows (e.g., Slurm, bash, or direct Kubernetes Jobs) to Kubeflow Trainer.
  • Matches user expectations for direct script execution (python myscript.py).
  • Avoids the indirection and complexity of function serialization and wrapper scripts.
  • Ensures the script runs as the main process, improving signal handling and debugging.

Which issue(s) this PR fixes:

Fixes #47

Summary of changes:

  • CustomTrainer now accepts an optional python_file argument.
  • If python_file is set, the SDK sets the container entrypoint to ["python", python_file].
  • Mutually exclusive with func; validation is added.
  • No changes to existing function-based usage.
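
A minimal usage sketch of the proposed API (the import path and exact signature are illustrative and may differ from the final implementation):

from kubeflow.trainer import CustomTrainer, TrainerClient

# python_file and func are mutually exclusive; setting both (or neither) raises an error.
TrainerClient().train(
    trainer=CustomTrainer(
        python_file="train.py",  # executed as `python train.py` in the trainer container
        num_nodes=2,
    )
)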

Additional context:
See kubeflow/sdk#47 for the original feature request and motivation.


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from 5d1a192 to 4273971 Compare July 24, 2025 13:58
@eoinfennessy
Member

I'm not sure about this approach, as it assumes the training script to be run is already available within the training runtime image. This would require users to rebuild their training images each time a change is made to a training script.

Having said that, I think it would be a big UX improvement to allow users to provide a training script to run instead of a train function, but this script should come from the user's workspace instead of being baked into the training runtime image. This is related to issue #48, which describes making a snapshot of a user's workspaces available in training pods so that [multiple] Python files can be used in training jobs.

@jskswamy
Author

@eoinfennessy! I agree that this would be a big UX improvement. However, I'd like to clarify that the current implementation doesn't necessarily require rebuilding Docker images every time.

The python_file feature is designed to be flexible and can work with various code staging strategies:

Code Staging Options (No Image Rebuild Required)

  1. ConfigMaps: Training scripts can be mounted as ConfigMaps, allowing users to update scripts without rebuilding images (see the sketch after this list)
  2. Object Storage: Scripts can be stored in object storage (S3, GCS, etc.) and mounted as volumes
  3. Init Containers: An init container can clone repositories or download scripts before the training pod starts
  4. PVC with Git Repos: Persistent Volume Claims can mount git repositories that get updated
  5. Runtime Volume Mounts: Scripts can be mounted from the host or other storage systems
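
As an illustration of option 1, a minimal sketch using the standard Kubernetes Python client (the ConfigMap name, namespace, and file path are hypothetical) of publishing a local script without rebuilding the image:

# Hypothetical example: expose a local training script as a ConfigMap so the
# runtime image does not need to be rebuilt when the script changes.
from kubernetes import client, config

config.load_kube_config()

with open("train.py") as f:
    script = f.read()

configmap = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="train-script"),
    data={"train.py": script},
)
client.CoreV1Api().create_namespaced_config_map(namespace="default", body=configmap)

The runtime (or the TrainJob pod spec) would then mount this ConfigMap at the path passed as python_file.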

UX Improvements This Feature Enables

The python_file approach provides significant UX benefits:

  • Simplified Migration: Users can easily migrate from existing script-based workflows (Slurm, bash, direct K8s Jobs) to Kubeflow Trainer
  • Intuitive Execution: Matches user expectations for direct script execution (python myscript.py)
  • Reduced Complexity: Eliminates the indirection and complexity of function serialization and wrapper scripts
  • Better Debugging: Scripts run as the main process, improving signal handling and debugging capabilities
  • Familiar Workflow: Users can continue using their existing training scripts without major modifications

Architecture Flexibility

The current implementation sets the container entrypoint to ["python", python_file], which is intentionally simple and allows the runtime to handle script availability through its own mechanisms. This separation of concerns means:

  • The SDK focuses on execution semantics
  • The runtime handles script staging and availability
  • Users can choose their preferred code staging strategy

Future Integration

You're absolutely right that workspace snapshotting (#48) would be an excellent complementary feature. When that's implemented, users could have their entire workspace available, making the python_file feature even more powerful. But the current implementation provides immediate value without blocking on workspace snapshotting.

@coveralls

Pull Request Test Coverage Report for Build 16498943138

Details

  • 2 of 9 (22.22%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.8%) to 63.939%

Changes Missing Coverage                   Covered Lines   Changed/Added Lines   %
python/kubeflow/trainer/utils/utils.py    2               9                     22.22%

Totals
  • Change from base Build 16463414499: -0.8%
  • Covered Lines: 250
  • Relevant Lines: 391

💛 - Coveralls

@andreyvelich
Member

Thank you for this @jskswamy!
I am wondering what the benefits of such an API are compared to just this:

from custom_script import run_pytorch

TrainerClient().train(
    trainer=CustomTrainer(
        func=run_pytorch,
        num_nodes=5,
    )
)

If the TrainJob contains users' workspace, they should be able to execute the above script.
cc @shravan-achar

@jskswamy
Author

jskswamy commented Aug 7, 2025

The python_file feature gives us the following:

1. Framework Integration is Much Cleaner

# Clean framework API
trainer = CustomTrainer(python_file="train.py", python_args=["--epochs", "100"], num_nodes=5)

# Without it, frameworks need to do the following
def wrapper_function():
    import subprocess
    subprocess.run(["python", "train.py", "--epochs", "100"], check=True)

2. Natural Migration Path

# What users do now (from the shell):
#   python train.py --epochs 100 --batch-size 32

# The same thing, just wrapped:
trainer = CustomTrainer(
    python_file="train.py",
    python_args=["--epochs", "100", "--batch-size", "32"],
    num_nodes=5,
)

3. Better Debugging & Signal Handling

  • Script runs as main process (python train.py --args)
  • Proper signal handling (SIGTERM, SIGINT)
  • Direct access to sys.argv and command-line arguments
  • Native Python debugging and profiling
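
As a small illustration of the signal-handling point (the handler shown is illustrative, not part of this PR), a script running as the container's main process receives the pod's SIGTERM directly:

import signal
import sys

def handle_sigterm(signum, frame):
    # e.g. flush logs or save a checkpoint before the pod is terminated
    print("received SIGTERM, exiting cleanly")
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

With a subprocess wrapper, the signal is delivered to the wrapper process and must be forwarded to the child explicitly, which is part of the indirection this feature avoids.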

4. User Experience is Just Better

# User writes exactly what they want
trainer = CustomTrainer(
    python_file="train.py",
    python_args=["--epochs", "100", "--batch-size", "32"],
    num_nodes=5
)

# Function wrapping is harder
def train_function():
    # Complex argument handling, signal handling, subprocess management
    pass

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from 4273971 to e96e53a Compare August 7, 2025 11:27
@google-oss-prow google-oss-prow bot added size/L and removed size/S labels Aug 7, 2025
@andreyvelich
Member

@jskswamy A few questions:

  1. How can you ensure that the Python file is inside the TrainJob workload?
  2. As you can see, the entrypoint depends on the runtime Trainer. It could be torchrun, mpirun, or python in the case of PlainML.

@jskswamy
Author

jskswamy commented Aug 8, 2025

@andreyvelich Thanks for pointing out the lack of framework-aware command support. I have now added the following support.

The python_file feature now uses the runtime's framework-specific commands instead of hardcoded ["python"]:

  • PyTorch runtime: Uses torchrun train.py --args
  • MPI runtime: Uses mpirun python train.py --args
  • PlainML runtime: Uses python train.py --args
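
A rough sketch of that runtime-aware assembly (the helper name and structure are illustrative; the real SDK code may differ):

def build_command(launcher, python_file, python_args):
    # launcher comes from the runtime, e.g. ["torchrun"], ["mpirun", "python"], or ["python"]
    return [*launcher, python_file, *python_args]

build_command(["torchrun"], "train.py", ["--epochs", "100"])
# -> ["torchrun", "train.py", "--epochs", "100"]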

Regarding Python file availability, we assume the Python file will be made available by the consumer through one of the following:

  1. Embedded in container image - Scripts baked into the Docker image
  2. ConfigMaps - Scripts mounted as ConfigMaps
  3. Volume mounts - Scripts mounted via PVC or other volumes
  4. Init containers - Scripts downloaded by init containers

This design assumption provides flexibility for different deployment patterns while keeping the python_file feature simple and focused on direct script execution.

@kramaranya
Contributor

Thank you @jskswamy!

Have you considered how this affects the plugin architecture when someone writes a custom runtime?

Comment on lines 378 to 380
trainer_crd.command = ["python"]
# Combine python_file with python_args
args = [trainer.python_file]
Contributor

This still hardcodes command = ["python"] for all runtimes, which will not work for distributed training with MPI and PyTorch, right?

if trainer.python_args:
    args.extend(trainer.python_args)
trainer_crd.args = args
return trainer_crd

@andreyvelich
Member

> Embedded in container image - Scripts baked into the Docker image

What would be the difference between using a Python function vs. letting the user pass the Docker image directly to the custom trainer, i.e.

train(
  trainer=CustomTrainer(
    image=...
  )
)

> ConfigMaps - Scripts mounted as ConfigMaps
> Volume mounts - Scripts mounted via PVC or other volumes
> Init containers - Scripts downloaded by init containers

I still think that we should prioritize this feature to make it work: #48
It would be much easier to use if we snapshot the filesystem into the TrainJob, so users don't need to worry about it, and just run:

from train import run_pytorch

TrainerClient().train(
    trainer=CustomTrainer(
        func=run_pytorch,
        num_nodes=5,
    )
)

In that case train.py might import other functions from the user's workspace, and it works.

WDYT @jskswamy ?

@kramaranya
Contributor

Hey @jskswamy! Did you have a chance to check the latest comments?

@jskswamy
Author

jskswamy commented Sep 2, 2025

I went through the comments and am making the necessary changes to accommodate the inputs.

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from e96e53a to f7179e8 Compare September 9, 2025 09:52
@jskswamy
Author

I updated the original approach proposed in PR #49 to address the review concerns and make the feature more extensible and robust.

What changed

  • I replaced python_file/python_args with a new, explicit CommandTrainer that runs an arbitrary command with arguments under the runtime’s launcher.
  • The function path (CustomTrainer) remains intact, without special-casing or hardcoding.

Why this fits the plugin architecture

  • I reuse the runtime’s launcher template (torchrun/mpirun/python) and only replace the payload, so custom runtimes continue to control entrypoints and arguments.
  • No command=["python"] hardcoding; the launcher remains framework-aware. See discussion in PR #49.

Distributed runtimes

  • Torch uses torchrun, MPI uses mpirun (with --user pip installs), PlainML uses python.
  • This works for multi-node and device-aware training consistently.

Shared features are preserved (no early return)

  • Environment variables, resources, and pip installs are applied in the same builder path as functions.
  • The payload includes optional installs followed by the user command.
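
For illustration, the payload could be assembled roughly like this (formatting and pip flags are illustrative; the SDK's actual template may differ):

# Hypothetical sketch: optional pip installs followed by the user command
# (already prefixed with the runtime launcher), joined under a single bash -c.
packages = ["nemo", "deepspeed"]
user_command = "torchrun --standalone train.py"

steps = []
if packages:
    steps.append("pip install --user " + " ".join(packages))
steps.append(user_command)

command = ["bash", "-c", " && ".join(steps)]
# -> ['bash', '-c', 'pip install --user nemo deepspeed && torchrun --standalone train.py']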

File availability assumption

  • Unchanged from the earlier discussion: the script or command is expected to be made available in the container, whether baked into the image, mounted via ConfigMap or volume, or fetched by an init container.

Validation strategy

  • Consistent with CustomTrainer, I validate in the builder and error if command is missing/invalid.
  • I can add __post_init__ later for earlier failure if needed, but builder-level checks match current patterns.

Tests and docs

  • Helper tests (plain/MPI) assert exact launcher shape and multiline payloads.
  • CRD builder tests cover env/resources/num_nodes and MPI --user installs.
  • Backend routing test verifies CommandTrainer dispatch.
  • README includes a minimal CommandTrainer usage example.
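
For context, a sketch along the lines of that README example (names and parameters are taken from this thread; the final API may differ):

from kubeflow.trainer import TrainerClient, types

client = TrainerClient()
runtime = client.get_runtime("torch")  # or "mpi", "plainml"

trainer = types.CommandTrainer(
    command=["torchrun"],
    args=["--standalone", "train.py"],
    packages_to_install=["deepspeed"],  # optional; installed before the command runs
    num_nodes=2,
)

client.train(runtime=runtime, trainer=trainer)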

Outcome

  • Clear, framework-aware command execution without bypassing shared features or limiting plugin extensibility, and a clean separation from the function-based path.

@kramaranya, @andreyvelich, kindly review the changes and let me know what you think about the new approach.

@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from f7179e8 to cd8d184 Compare September 10, 2025 12:03
client = TrainerClient()
rt = client.get_runtime("torch") # or "mpi", "plainml"

trainer = types.CommandTrainer(
@andreyvelich
Member

andreyvelich commented Sep 11, 2025

This is an interesting idea and somewhat aligned with what we discussed today at the Kubeflow SDK call with KubernetesTrainer proposed by @szaher: https://youtu.be/mv8GoWdefck?t=832

Since we distinguish the runtime trainers between CustomTrainer and BuiltinTrainer, I am wondering if we want to introduce a CustomTrainerContainer() type which gives users control to configure the image, container, and args instead of passing the training function.

Would that be helpful for integration between KFP and Trainer?

Thoughts @kubeflow/kubeflow-sdk-team @mprahl @franciscojavierarceo @ederign @rudeigerc?

Introduce a new CommandTrainer dataclass to facilitate
the execution of arbitrary commands within the runtime's
launcher template, allowing for configuration of command
arguments, package installations, and environment variables.

Enhance the utility function to build runtime-aware
commands, preserving the launcher and incorporating
optional package installations. This change aims to
simplify the command execution process in various
runtime environments, including support for MPI.

Add corresponding tests to validate the new functionality.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add a new function `get_trainer_crd_from_command_trainer` to build
the Trainer Custom Resource Definition (CRD) for CommandTrainer.
This function preserves the environment variables and resource
settings while utilizing a runtime-aware command assembly helper.

Enhance unit tests to verify that the new function correctly
builds the CRD with the expected configuration, including
environment variables and resource allocation.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add support for CommandTrainer in the Kubernetes backend.
This change ensures that the CommandTrainer can be used with
the appropriate runtime. It raises a ValueError if a
CommandTrainer is used with an incompatible runtime type.

Additionally, update the error message to reflect the new
trainer type support, ensuring clearer communication for
users regarding valid trainer options.

Include a new test to verify that CommandTrainer is correctly
routed to its CRD builder during the training process.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add detailed instructions for using the CommandTrainer
to run custom commands in the runtime. This includes
code snippets for setup, specifying the command,
arguments, and environment variables.

Also, clarify that the launcher is runtime-aware and
provide notes on package installations and script
requirements within the container.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Enhance the trainer parameter in both TrainerClient and
KubernetesBackend to include CommandTrainer.

Including CommandTrainer in the module's exports ensures
that it is readily available for use in other parts of
the application

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Refactor the CommandTrainer class to allow the command attribute to be
optional, enabling more flexible usage scenarios. If no command is
provided, defaults are chosen based on the runtime framework, with args
passed as-is.

Additionally, enhance the get_trainer_crd_from_command_trainer function
to use a bash-wrapped path if packages need to be installed or if no
explicit command is provided. This change preserves installation features
and runtime launcher behavior.

Add unit tests to verify behavior when no command is set and when
commands are passed without installations, ensuring the correct command
and arguments are returned in these scenarios.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
@jskswamy jskswamy force-pushed the feature/customtrainer-python-file branch from ccf9090 to f1eea03 Compare September 17, 2025 01:49
@jskswamy
Author

Update on CommandTrainer behavior (pass-through + launcher-aware fallback)

  • When command is provided and no packages_to_install: CRD uses command/args directly (no shell wrapper).
  • When packages_to_install is set OR command is not provided: runtime launcher is reused via get_command_using_user_command (bash-wrapped), preserving installs and launcher selection (torchrun/mpirun/python).
  • Mirrors BuiltinTrainer default launcher selection for the fallback path; explicit commands remain unchanged.
  • Tests updated to cover both pass-through and launcher-wrapped paths.

Examples:

# Direct (no wrapper)
trainer = types.CommandTrainer(command=["torchrun"], args=["--standalone", "train.py"])

# Install-aware (bash-wrapped)
trainer = types.CommandTrainer(
  command=["torchrun"],
  args=["--standalone", "train.py"],
  packages_to_install=["nemo", "deepspeed"],
)

Enhance the CommandTrainer class by adding a new attribute,
pip_extra_args, to accommodate additional pip installation flags.
This change improves flexibility in package management during
runtime.

Also update related utility functions to handle the new
parameter seamlessly, ensuring that users can specify extra
pip arguments when building command strings.

- Added pip_extra_args to CommandTrainer
- Updated get_script_for_python_packages and
  get_command_using_user_command functions to include
  pip_extra_args handling
- Included a test case to verify the functionality of
  pip_extra_args in command generation

Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Refactor the command handling logic to always produce a bash-wrapped
command, ensuring shell interpolation and preserving runtime launcher
behavior. This change simplifies the logic by removing the conditional
check for package installations, thereby making the behavior consistent
regardless of whether installations are needed.

Update the test case to reflect this change, ensuring that the
command is always wrapped in bash, even when no packages are
installed. This improves predictability and reduces potential
issues related to command execution.

Signed-off-by: Krishnaswamy Subramanian <[email protected]>