Allow CustomTrainer to run a Python script directly via python_file #49
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of the affected files; approvers can indicate their approval by writing the appropriate command.
Force-pushed from 5d1a192 to 4273971
I'm not sure about this approach, as it assumes the training script is available within the training runtime image. This would require users to rebuild training images each time a change is made to a training script. Having said that, I think it would be a big UX improvement to allow users to provide a training script to run instead of a train function, but this script should come from the user's workspace instead of being baked into the training runtime image. This is related to issue #48, which describes making a snapshot of a user's workspace available in training pods so that [multiple] Python files can be used in training jobs.
@eoinfennessy! I agree that this would be a big UX improvement. However, I'd like to clarify that the current implementation doesn't necessarily require rebuilding Docker images every time.

Code Staging Options (No Image Rebuild Required)

UX Improvements This Feature Enables

Architecture Flexibility
The current implementation sets the container entrypoint to ["python", python_file].

Future Integration
You're absolutely right that workspace snapshotting (#48) would be an excellent complementary feature. When that's implemented, users could have their entire workspace available in training pods.
Pull Request Test Coverage Report for Build 16498943138 (Coveralls)
Thank you for this @jskswamy!
I am wondering what the benefits of such an API are compared to just this:
from custom_script import run_pytorch
TrainerClient().train(
trainer=CustomTrainer(
func=run_pytorch,
num_nodes=5
)
)
If the TrainJob contains users' workspace, they should be able to execute the above script.
cc @shravan-achar
1. Framework Integration is Much Cleaner

# Clean framework API
trainer = CustomTrainer(python_file="train.py", python_args=["--epochs", "100"], num_nodes=5)

# Without it, frameworks need to do the following
def wrapper_function():
    import subprocess
    subprocess.run(["python", "train.py", "--epochs", "100"], check=True)

2. Natural Migration Path

# What users do now
python train.py --epochs 100 --batch-size 32

# Same thing, just wrapped
trainer = CustomTrainer(
    python_file="train.py",
    python_args=["--epochs", "100", "--batch-size", "32"],
    num_nodes=5
)

3. Better Debugging & Signal Handling

4. User Experience is Just Better

# User writes exactly what they want
trainer = CustomTrainer(
    python_file="train.py",
    python_args=["--epochs", "100", "--batch-size", "32"],
    num_nodes=5
)

# Function wrapping is harder
def train_function():
    # Complex argument handling, signal handling, subprocess management
    pass
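To make the comparison above concrete, here is a minimal, hypothetical sketch of how a python_file argument could be translated into the container's command and args. The dataclass and helper below are illustrative stand-ins, not the actual Kubeflow SDK internals.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative stand-in for the SDK's trainer type (an assumption,
# not the real kubeflow.trainer implementation).
@dataclass
class CustomTrainer:
    python_file: Optional[str] = None
    python_args: List[str] = field(default_factory=list)
    num_nodes: int = 1

def build_entrypoint(trainer: CustomTrainer) -> Tuple[List[str], List[str]]:
    """Map python_file/python_args onto a container (command, args) pair."""
    if trainer.python_file is None:
        raise ValueError("python_file must be set")
    return ["python"], [trainer.python_file, *trainer.python_args]

command, args = build_entrypoint(
    CustomTrainer(python_file="train.py", python_args=["--epochs", "100"])
)
print(command, args)  # ['python'] ['train.py', '--epochs', '100']
```

The point of the sketch is that the user-facing API stays declarative while the subprocess plumbing disappears from user code.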
Force-pushed from 4273971 to e96e53a
@jskswamy A few questions:
@andreyvelich Thanks for pointing out the lack of framework-aware command support. I have added that support now.
On the Python file availability, we are making an assumption that the Python file will be made available by the consumer through:
This design assumption provides flexibility for different deployment patterns.
Thank you @jskswamy!
Have you considered how this affects the plugin architecture when someone writes a custom runtime?
trainer_crd.command = ["python"]
# Combine python_file with python_args
args = [trainer.python_file]
This still hardcodes command = ["python"] for all runtimes, which will not work for distributed training with MPI and PyTorch, right?
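One way to address this hardcoding concern is a runtime-aware launcher table. The mapping below is a hedged sketch: the framework names and launcher commands are assumptions for illustration, not the SDK's actual values.

```python
# Assumed framework-to-launcher mapping; real runtimes may differ.
LAUNCHERS = {
    "torch": ["torchrun"],
    "mpi": ["mpirun", "python"],
    "plainml": ["python"],
}

def runtime_command(framework, python_file, python_args=()):
    """Pick a launcher for the runtime instead of hardcoding ["python"]."""
    launcher = LAUNCHERS.get(framework, ["python"])
    return [*launcher, python_file, *python_args]

print(runtime_command("torch", "train.py", ["--standalone"]))
# ['torchrun', 'train.py', '--standalone']
```

With a table like this, distributed runtimes keep their own launchers while plain Python runtimes fall back to the interpreter.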
if trainer.python_args:
    args.extend(trainer.python_args)
trainer_crd.args = args
return trainer_crd
This early return skips several important features that users reasonably expect to work, such as:
- package installation: https://github.com/jskswamy/kubeflow-sdk/blob/feature/customtrainer-python-file/python/kubeflow/trainer/utils/utils.py#L391-L397
- environment variable handling: https://github.com/jskswamy/kubeflow-sdk/blob/feature/customtrainer-python-file/python/kubeflow/trainer/utils/utils.py#L400-L404
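A sketch of how the python_file path could keep those shared features instead of returning early. The helper names mimic the linked utils, but the bodies are assumptions, not the real SDK code.

```python
def get_script_for_python_packages(packages):
    """Assumed shape of the install snippet the SDK prepends."""
    if not packages:
        return ""
    return "pip install " + " ".join(packages) + " && "

def build_trainer_args(python_file, python_args=(), packages_to_install=(), env=None):
    # Preserve package installation by bash-wrapping instead of returning early.
    script = get_script_for_python_packages(packages_to_install)
    script += "python " + " ".join([python_file, *python_args])
    command = ["bash", "-c"]
    # Preserve environment variable handling alongside the command.
    return command, [script], dict(env or {})

cmd, args, env = build_trainer_args(
    "train.py", ["--epochs", "3"], ["numpy"], {"RANK": "0"}
)
print(args[0])  # pip install numpy && python train.py --epochs 3
```

The bash wrapper keeps the early-return convenience of python_file while still routing through the shared installation and env-var logic.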
What would be the difference between using a Python function and letting the user pass the Docker image directly in the custom trainer, i.e.:

train(
    trainer=CustomTrainer(
        image=...
    )
)

I still think that we should prioritize making this feature work: #48

from train import run_pytorch

TrainerClient().train(
    trainer=CustomTrainer(
        func=run_pytorch,
        num_nodes=5
    )
)

In that case, WDYT @jskswamy?
Hey @jskswamy! Did you have a chance to check the latest comments?
I went through the comments and am making the necessary changes to accommodate the inputs.
Force-pushed from e96e53a to f7179e8
I updated the original approach proposed in PR #49 to address the review concerns and make the feature more extensible and robust.

What changed
Why this fits the plugin architecture
Distributed runtimes
Shared features are preserved (no early return)
File availability assumption
Validation strategy
Tests and docs
Outcome
@kramaranya, @andreyvelich, kindly review the changes and let me know what you think about the new approach.
Force-pushed from f7179e8 to cd8d184
client = TrainerClient()
rt = client.get_runtime("torch")  # or "mpi", "plainml"

trainer = types.CommandTrainer(
This is an interesting idea and somewhat aligned with what we discussed today at the Kubeflow SDK call with the KubernetesTrainer proposed by @szaher: https://youtu.be/mv8GoWdefck?t=832

Since we distinguish the runtime trainers between CustomTrainer and BuiltinTrainer, I am wondering if we want to introduce a CustomTrainerContainer() type which gives users control to configure image, container, and args instead of passing the training function.

Would that be helpful for integration between KFP and Trainer?

Thoughts @kubeflow/kubeflow-sdk-team @mprahl @franciscojavierarceo @ederign @rudeigerc?
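To make the proposal concrete, here is a hypothetical shape for such a type. The name and field ideas come from the comment above; everything else (field types, defaults, the example image) is an assumption, not a committed design.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical container-level trainer: image/command/args instead of a
# training function. Not part of the current SDK.
@dataclass
class CustomTrainerContainer:
    image: str
    container: Optional[str] = None
    command: List[str] = field(default_factory=list)
    args: List[str] = field(default_factory=list)
    num_nodes: int = 1

t = CustomTrainerContainer(
    image="ghcr.io/example/trainer:latest",  # hypothetical image name
    command=["python"],
    args=["train.py", "--epochs", "10"],
)
print(t.image, t.command + t.args)
```

A type like this would let pipeline tools such as KFP hand a prebuilt image straight to the trainer without wrapping anything in a Python function.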
Introduce a new CommandTrainer dataclass to facilitate the execution of arbitrary commands within the runtime's launcher template, allowing for configuration of command arguments, package installations, and environment variables. Enhance the utility function to build runtime-aware commands, preserving the launcher and incorporating optional package installations. This change aims to simplify the command execution process in various runtime environments, including support for MPI. Add corresponding tests to validate the new functionality. Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add a new function `get_trainer_crd_from_command_trainer` to build the Trainer Custom Resource Definition (CRD) for CommandTrainer. This function preserves the environment variables and resource settings while utilizing a runtime-aware command assembly helper. Enhance unit tests to verify that the new function correctly builds the CRD with the expected configuration, including environment variables and resource allocation. Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add support for CommandTrainer in the Kubernetes backend. This change ensures that the CommandTrainer can be used with the appropriate runtime. It raises a ValueError if a CommandTrainer is used with an incompatible runtime type. Additionally, update the error message to reflect the new trainer type support, ensuring clearer communication for users regarding valid trainer options. Include a new test to verify that CommandTrainer is correctly routed to its CRD builder during the training process. Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Add detailed instructions for using the CommandTrainer to run custom commands in the runtime. This includes code snippets for setup, specifying the command, arguments, and environment variables. Also, clarify that the launcher is runtime-aware and provide notes on package installations and script requirements within the container. Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Enhance the trainer parameter in both TrainerClient and KubernetesBackend to include CommandTrainer. Including CommandTrainer in the module's exports ensures that it is readily available for use in other parts of the application. Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Refactor the CommandTrainer class to allow the command attribute to be optional, enabling more flexible usage scenarios. If no command is provided, defaults are chosen based on the runtime framework, with args passed as-is. Additionally, enhance the get_trainer_crd_from_command_trainer function to use a bash-wrapped path if packages need to be installed or if no explicit command is provided. This change preserves installation features and runtime launcher behavior. Add unit tests to verify behavior when no command is set and when commands are passed without installations, ensuring the correct command and arguments are returned in these scenarios. Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Force-pushed from ccf9090 to f1eea03
Update on CommandTrainer behavior (pass-through + launcher-aware fallback)
Examples:

# Direct (no wrapper)
trainer = types.CommandTrainer(command=["torchrun"], args=["--standalone", "train.py"])

# Install-aware (bash-wrapped)
trainer = types.CommandTrainer(
    command=["torchrun"],
    args=["--standalone", "train.py"],
    packages_to_install=["nemo", "deepspeed"],
)
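The bash-wrapping behavior described above might be assembled roughly like this. The quoting strategy and the exact pip install string are assumptions about the implementation, not the SDK's real helper.

```python
import shlex

def wrap_command(command, args, packages_to_install=(), pip_extra_args=()):
    """Produce a bash-wrapped command so shell interpolation works."""
    # Quote each part so filenames and flags survive the shell.
    user_cmd = " ".join(shlex.quote(part) for part in [*command, *args])
    if packages_to_install:
        install = " ".join(["pip", "install", *pip_extra_args, *packages_to_install])
        user_cmd = install + " && " + user_cmd
    return ["bash", "-c", user_cmd]

print(wrap_command(["torchrun"], ["--standalone", "train.py"],
                   packages_to_install=["nemo", "deepspeed"]))
# ['bash', '-c', 'pip install nemo deepspeed && torchrun --standalone train.py']
```

Running the install step and the user command in one bash -c string is what lets packages_to_install and pip_extra_args compose with any launcher.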
Enhance the CommandTrainer class by adding a new attribute, pip_extra_args, to accommodate additional pip installation flags. This change improves flexibility in package management during runtime. Also update related utility functions to handle the new parameter seamlessly, ensuring that users can specify extra pip arguments when building command strings.
- Added pip_extra_args to CommandTrainer
- Updated get_script_for_python_packages and get_command_using_user_command functions to include pip_extra_args handling
- Included a test case to verify the functionality of pip_extra_args in command generation
Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Refactor the command handling logic to always produce a bash-wrapped command, ensuring shell interpolation and preserving runtime launcher behavior. This change simplifies the logic by removing the conditional check for package installations, thereby making the behavior consistent regardless of whether installations are needed. Update the test case to reflect this change, ensuring that the command is always wrapped in bash, even when no packages are installed. This improves predictability and reduces potential issues related to command execution. Signed-off-by: Krishnaswamy Subramanian <[email protected]>
Force-pushed from 3f4b938 to b06bb29
What this PR does / why we need it:
This PR adds support for running a Python script directly in CustomTrainer by specifying a python_file argument. If python_file is set, the job will run the specified script as the main process (python myscript.py) instead of requiring a function. This is mutually exclusive with func; an error is raised if both or neither are set. The original function-based usage is unchanged and fully backward compatible.

Which issue(s) this PR fixes:
Fixes #47

Summary of changes:
- CustomTrainer now accepts an optional python_file argument.
- If python_file is set, the SDK sets the container entrypoint to ["python", python_file].
- python_file is mutually exclusive with func; validation is added.

Additional context:
See kubeflow/sdk#47 for the original feature request and motivation.
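The mutual-exclusivity rule described above could be validated along these lines. This is a minimal sketch; the SDK's actual check and error message may differ.

```python
def validate_trainer(func=None, python_file=None):
    """Require exactly one of func or python_file."""
    if (func is None) == (python_file is None):
        raise ValueError("Set exactly one of func or python_file")

validate_trainer(python_file="myscript.py")  # passes

try:
    validate_trainer()  # neither set: rejected
except ValueError as e:
    print(e)  # Set exactly one of func or python_file
```

The same check rejects the both-set case, since the equality test fires whenever the two arguments are both provided or both omitted.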