
chore(sdk): check that requested memory less than or equals according limits #11240

Open
wants to merge 1 commit into base: master

Conversation

@ntny (Contributor) commented Sep 23, 2024

Description of your changes: raise an error with a descriptive message if the requested resources (memory/CPU) exceed the corresponding limit when defining the pipeline.

examples:

from kfp.dsl import pipeline_task
task = pipeline_task.PipelineTask(...).set_memory_request('2M').set_memory_limit('1M')
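For illustration, a minimal sketch of the kind of check this change introduces, assuming memory values have already been normalized to a common unit (the helper name and error message below are assumptions, not the actual SDK code):

def check_memory_request_within_limit(request_bytes: float, limit_bytes: float) -> None:
    # Hypothetical helper: reject a memory request that exceeds its limit.
    if request_bytes > limit_bytes:
        raise ValueError(
            f'Requested memory ({request_bytes} bytes) must be less than or '
            f'equal to the memory limit ({limit_bytes} bytes).')

With a check like this, the example above (request '2M' with limit '1M') fails at pipeline-definition time instead of at run time.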


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Hi @ntny. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ntny ntny force-pushed the validate-that-requested-resource-meet-limits branch from 2887cb9 to 0d429d2 on September 25, 2024 09:05
@hbelmiro (Contributor)

/ok-to-test
/rerun-all

@rimolive (Member) commented Oct 1, 2024

@ntny Please take a look at https://github.com/kubeflow/pipelines/blob/master/sdk/CONTRIBUTING.md#code-style to fix the failing YAPF tests.

@HumairAK HumairAK added this to the KFP SDK 2.10 milestone Oct 10, 2024
@HumairAK (Collaborator)

Looks like there are sdk ci failures, @ntny can you take a look?

@HumairAK HumairAK modified the milestones: KFP SDK 2.10, KFP SDK 2.11 Oct 10, 2024
@ntny (Contributor, Author) commented Oct 10, 2024

Looks like there are sdk ci failures, @ntny can you take a look?

Sure, in a few days.

@ntny ntny force-pushed the validate-that-requested-resource-meet-limits branch from 0d429d2 to 80df2b0 on October 14, 2024 15:12
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Oct 14, 2024
@ntny ntny changed the title from "chore(sdk): check that requested cpu/memory less than or equals according limits" to "chore(sdk): check that requested memory less than or equals according limits" Oct 14, 2024
@ntny (Contributor, Author) commented Oct 14, 2024

/rerun-all

@ntny ntny force-pushed the validate-that-requested-resource-meet-limits branch from 80df2b0 to 15aef12 on October 14, 2024 15:33
@ntny (Contributor, Author) commented Oct 14, 2024

/rerun-all

1 similar comment
@ntny (Contributor, Author) commented Oct 14, 2024

/rerun-all

@ntny (Contributor, Author) commented Oct 14, 2024

For validation purposes, I had to restore the code fragment that converts the requested memory values to floats so that requests and limits can be compared.
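A rough sketch of that kind of conversion, assuming Kubernetes-style memory suffixes (this is only an illustration of the idea, not the restored fragment itself):

# Illustrative only: map memory suffixes to multipliers so that values such as
# '2M' and '512Mi' can be compared numerically.
MEMORY_MULTIPLIERS = {
    'K': 1e3, 'M': 1e6, 'G': 1e9, 'T': 1e12, 'P': 1e15, 'E': 1e18,
    'Ki': 2**10, 'Mi': 2**20, 'Gi': 2**30, 'Ti': 2**40, 'Pi': 2**50, 'Ei': 2**60,
}

def memory_to_bytes(value: str) -> float:
    # Hypothetical helper: convert a memory string such as '512Mi' to a float
    # number of bytes; a bare number is treated as bytes.
    for suffix in sorted(MEMORY_MULTIPLIERS, key=len, reverse=True):
        if value.endswith(suffix):
            return float(value[:-len(suffix)]) * MEMORY_MULTIPLIERS[suffix]
    return float(value)

With such a helper, memory_to_bytes('2M') > memory_to_bytes('1M') holds, which is exactly the condition the new check rejects.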

@ntny ntny force-pushed the validate-that-requested-resource-meet-limits branch from 15aef12 to e760a06 on October 14, 2024 17:52
@ntny (Contributor, Author) commented Oct 14, 2024

@HumairAK The latest CI check failed due to a timeout.
I suppose it might be unrelated to this PR.

@ntny (Contributor, Author) commented Oct 14, 2024

/retest

@github-actions github-actions bot added the ci-passed All CI tests on a pull request have passed label Oct 14, 2024
@gregsheremeta (Contributor) commented Oct 14, 2024

Thanks for the PR.

This kind of thing -- baking Kubernetes logic into KFP -- makes me a bit nervous. I already don't like that the kfp-kubernetes SDK basically "chases" the official python client, and the slope here seems slippery.

Using a different example, today we allow mounting a secret as a volume. Is it KFP's responsibility to make sure that the secret name is a valid Kubernetes name, or should we just rely on the Kubernetes server to deal with that? Going further, should we be pre-checking that the secret exists, or should we just let Kubernetes tell us that the volume mount failed at pipeline run time?

I've been mostly thinking we should just leave these sorts of things up to Kubernetes, and treat kfp-kubernetes as basically a "dumb" yaml generator.

This change seems fairly innocuous, but it's more code to maintain, and we're on the hook if the Kubernetes behavior changes someday.

I'd be a little more receptive to the change if we could somehow use the official python client. Did you by any chance look into that?

What do others think?

@ntny (Contributor, Author) commented Oct 15, 2024


Thanks, I agree with your points. My reasoning for this PR:

  1. pipeline_task already performs a similar memory check on input parameters, and it is also Kubernetes-specific validation: https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/pipeline_task.py#L457. I am not sure my check adds anything different.
  2. Requesting resources beyond the limits looks abnormal, and this behavior isn’t specific to Kubernetes in my view. It seems to me that any API providing request/limit options should handle this validation.

But overall, I agree with your arguments about maintaining the code, and it’s up to you to decide here.
Regarding the Python Kubernetes client, thanks, I will take a look at it.

@gregsheremeta (Contributor)

@chensun do you have any thoughts on the above two comments? I'm not sure how much "Kubernetes logic" (for lack of a better term) we should be baking into KFP. Anton has some good points.

@ntny (Contributor, Author) commented Nov 2, 2024

Hi @gregsheremeta!
Regarding the use of the official Python client for validation at the KFP SDK level, I might be mistaken, but in my view it is not suitable here. The SDK generates its own specification (or, in previous versions, worked directly with Argo) and does not operate within a Kubernetes context.
The official client, on the other hand, works with Kubernetes entities.
The official Python client also supports a dry-run pattern as an alternative form of validation. I'm not sure whether that would be useful for the entire pipeline, but in my opinion it would have to be implemented in the KFP driver rather than in the SDK, and as a consequence validation would occur at a different stage of the run.
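For reference, a dry-run with the official Kubernetes Python client validates a concrete Kubernetes object against the API server rather than the pipeline spec, which is why it would sit at a different stage. A minimal sketch, assuming a reachable cluster and a purely illustrative pod spec:

from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig is available

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name='dry-run-example'),
    spec=client.V1PodSpec(containers=[
        client.V1Container(
            name='main',
            image='python:3.11',
            resources=client.V1ResourceRequirements(
                requests={'memory': '2M'},
                limits={'memory': '1M'}),
        )
    ]))

# The API server validates the object without persisting it; a request larger
# than its limit is rejected here, at submission time rather than compile time.
client.CoreV1Api().create_namespaced_pod(
    namespace='default', body=pod, dry_run='All')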

@chensun (Member) commented Nov 6, 2024

I've been mostly thinking we should just leave these sorts of things up to Kubernetes, and treat kfp-kubernetes as basically a "dumb" yaml generator.

I personally lean towards keeping the SDK as a "dumb" yaml generator and not baking too much logic into it. Otherwise, the SDK might become a bottleneck, and we would need to keep chasing trivial changes from the Kubernetes side.

pipeline_task already performs a similar memory check on input parameters, and it is also Kubernetes-specific validation: https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/pipeline_task.py#L457. I am not sure my check adds anything different.

This check is something we inherited from the KFP v1 codebase; it's probably much easier to keep it as-is than to reason the other way around from it.
While I agree that client-side validation is nice to have and catching errors earlier is often desirable UX, I do want to point out that such a check has limited use, and the potential downside is inconsistent UX. For input parameter validation, the SDK can only check when the value is available at compilation or submission time, which is not always the case. For example,

some_component().set_memory_limit('8G') # SDK validation can check and throw early error on incorrect format

upstream_task = upstream()
downstream().set_memory_limit(upstream_task.output) # SDK validation cannot do anything with the runtime value

I personally prefer having consistent behavior and error messages in both cases.

@chensun chensun closed this Nov 6, 2024
@chensun chensun reopened this Nov 6, 2024
Labels: ci-passed (All CI tests on a pull request have passed), ok-to-test, size/L

6 participants