fix(backend): Remove resource from workflow-controller-configmap. Closes #9087 #9089

Merged
merged 2 commits into kubeflow:master on Apr 5, 2023

Conversation

gkcalat
Member

@gkcalat gkcalat commented Apr 2, 2023

Description of your changes:

Note: I made this as a separate PR for easier tracking

Issue: #9087

Changes:

  • Removed resource requests and limits for CPU and memory in workflow-controller-configmap (see the sketch below)
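For context, Argo's workflow-controller-configmap can set default container resources (e.g. under its executor or mainContainer fields). The block below is only an illustrative sketch of such defaults, not the exact lines removed by this PR:

executor: |
  imagePullPolicy: IfNotPresent
  resources:
    requests:
      cpu: 0.01
      memory: 32Mi
    limits:
      cpu: 0.5
      memory: 512Mi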

Context:

Tests:

  • Standalone deployment and full Kubeflow v1.7.0 @ GKE 1.24 with removed resources in workflow-controller-configmap
  • v1 sample pipelines - succeeded
  • v1 add-two-numbers pipeline compiled without resource limits and requests
    • empty spec.resources{} for other containers
  • v1 add-two-numbers pipeline compiled with custom resource limits and requests (see the sketch after this list)
    • correctly specified pod's resources in user container only (python:3.7)
    • empty spec.resources{} for other containers
  • v2 hello-world pipeline without resource limits and requests - succeeded
    • empty pod's spec.resources{}
  • v2 hello-world pipeline with resource limits and requests - succeeded
    • correctly specified pod's resources in user container only (python:3.7)
    • empty spec.resources{} for other containers
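
For reference, a minimal sketch of the kind of v1 pipeline used in the tests above, assuming the kfp v1 SDK (names and values are illustrative, not the exact test script):

import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def add(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b

# Lightweight Python component that runs in the python:3.7 user container.
add_op = create_component_from_func(add, base_image='python:3.7')

@dsl.pipeline(name='add-two-numbers')
def add_pipeline(a: float = 1.0, b: float = 2.0):
    # Custom limits/requests should land only on the user container;
    # after this PR the other containers keep an empty spec.resources{}.
    add_op(a, b) \
        .set_cpu_request('0.01').set_cpu_limit('0.5') \
        .set_memory_request('32Mi').set_memory_limit('512Mi')

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(add_pipeline, 'add_pipeline.yaml')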

Checklist:

@gkcalat
Member Author

gkcalat commented Apr 3, 2023

/test kubeflow-pipelines-samples-v2

@gkcalat
Member Author

gkcalat commented Apr 4, 2023

/retest

2 similar comments
@gkcalat
Member Author

gkcalat commented Apr 4, 2023

/retest

@gkcalat
Member Author

gkcalat commented Apr 4, 2023

/retest

@google-oss-prow

@gkcalat: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: kubeflow-pipelines-samples-v2 | Commit: 17352b4 | Required: true | Rerun command: /test kubeflow-pipelines-samples-v2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Member

@chensun chensun left a comment

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Apr 4, 2023
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chensun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 59688a3 into kubeflow:master Apr 5, 2023
@jsitu777
Contributor

jsitu777 commented Apr 6, 2023

@chensun @gkcalat
I removed the resources from the workflow-controller configmap and specified the resource quota in the profile like this:

apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    name: [email protected]

  resourceQuotaSpec:    # resource quota can be set optionally
    hard:
      limits.cpu: "4"
      requests.cpu: "2"
      limits.memory: "4Gi"
      requests.memory: "2Gi"

I get an error message like this when running an XGBoost pipeline:
This step is in Error state with this message: task 'train-until-good-pipeline-pwsbk.chicago-taxi-trips-dataset' errored: pods "train-until-good-pipeline-pwsbk-1892129756" is forbidden: failed quota: kf-resource-quota: must specify limits.cpu for: main; limits.memory for: main

@gkcalat
Member Author

gkcalat commented Apr 6, 2023

Hi @jsitu777,

Can you provide more info:

  1. Are you using a full Kubeflow deployment?
  2. Where did you deploy it?
  3. How did you deploy it?
  4. What SDK version are you using?
  5. Can you share your pipeline and component definitions (SDK code and YAML/JSON)?

@jsitu777
Contributor

jsitu777 commented Apr 6, 2023

@gkcalat

  1. Yes, full Kubeflow deployment (with other components such as Katib, KServe, etc.)
  2. Kubeflow on AWS
  3. kustomize manifest
  4. v1
  5. I was just running the Demo pipeline: [Demo] XGBoost - Iterative model training

@jsitu777
Contributor

jsitu777 commented Apr 6, 2023

The workflow-controller-configmap after removing the resources:
kubectl get configmaps workflow-controller-configmap -n kubeflow --output=yaml > workflow-controller-configmap.yaml

apiVersion: v1
data:
  artifactRepository: |
    archiveLogs: true
    s3:
      endpoint: "minio-service.kubeflow:9000"
      bucket: "mlpipeline"
      # keyFormat is a format pattern to define how artifacts will be organized in a bucket.
      # It can reference workflow metadata variables such as workflow.namespace, workflow.name,
      # pod.name. Can also use strftime formating of workflow.creationTimestamp so that workflow
      # artifacts can be organized by date. If omitted, will use `{{workflow.name}}/{{pod.name}}`,
      # which has potential for have collisions, because names do not guarantee they are unique
      # over the lifetime of the cluster.
      # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/names/.
      #
      # The following format looks like:
      # artifacts/my-workflow-abc123/2018/08/23/my-workflow-abc123-1234567890
      # Adding date into the path greatly reduces the chance of {{pod.name}} collision.
      keyFormat: "artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}"
      # insecure will disable TLS. Primarily used for minio installs not configured with TLS
      insecure: true
      accessKeySecret:
        name: mlpipeline-minio-artifact
        key: accesskey
      secretKeySecret:
        name: mlpipeline-minio-artifact
        key: secretkey
  containerRuntimeExecutor: emissary
  executor: |
    imagePullPolicy: IfNotPresent
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"artifactRepository":"archiveLogs: true\ns3:\n  endpoint: \"minio-service.kubeflow:9000\"\n  bucket: \"mlpipeline\"\n  # keyFormat is a format pattern to define how artifacts will be organized in a bucket.\n  # It can reference workflow metadata variables such as workflow.namespace, workflow.name,\n  # pod.name. Can also use strftime formating of workflow.creationTimestamp so that workflow\n  # artifacts can be organized by date. If omitted, will use `{{workflow.name}}/{{pod.name}}`,\n  # which has potential for have collisions, because names do not guarantee they are unique\n  # over the lifetime of the cluster.\n  # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/names/.\n  #\n  # The following format looks like:\n  # artifacts/my-workflow-abc123/2018/08/23/my-workflow-abc123-1234567890\n  # Adding date into the path greatly reduces the chance of {{pod.name}} collision.\n  keyFormat: \"artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}\"\n  # insecure will disable TLS. Primarily used for minio installs not configured with TLS\n  insecure: true\n  accessKeySecret:\n    name: mlpipeline-minio-artifact\n    key: accesskey\n  secretKeySecret:\n    name: mlpipeline-minio-artifact\n    key: secretkey\n","containerRuntimeExecutor":"emissary","executor":"imagePullPolicy: IfNotPresent\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"application-crd-id":"kubeflow-pipelines"},"name":"workflow-controller-configmap","namespace":"kubeflow"}}
  creationTimestamp: "2023-04-05T23:28:52Z"
  labels:
    application-crd-id: kubeflow-pipelines
  name: workflow-controller-configmap
  namespace: kubeflow
  resourceVersion: "3717030"
  uid: d5580735-ad70-495c-9384-724da62a7783

@gkcalat
Member Author

gkcalat commented Apr 6, 2023

Which version of Kubeflow are you using?

@gkcalat
Member Author

gkcalat commented Apr 6, 2023

Could you also provide the manifest of the failed pod train-until-good-pipeline-pwsbk-1892129756?

kubectl describe -n <namespace> pod train-until-good-pipeline-pwsbk-1892129756

@jsitu777
Contributor

jsitu777 commented Apr 6, 2023

Which version of Kubeflow are you using?

1.7.0

@jsitu777
Contributor

jsitu777 commented Apr 6, 2023

Could you also provide the manifest of the failed pod train-until-good-pipeline-pwsbk-1892129756?

kubectl describe -n <namespace> pod train-until-good-pipeline-pwsbk-1892129756

It will not describe; the pod is killed at the beginning. Example:

kubectl describe pod train-until-good-pipeline-sb8tz-702192212 -n kubeflow-user-example-com
Error from server (NotFound): pods "train-until-good-pipeline-sb8tz-702192212" not found

@jsitu777
Contributor

jsitu777 commented Apr 6, 2023

Another note: if I do not specify a quota in the profile and also remove the resources in the workflow-controller configmap as in this PR, the pipeline will run:

apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    name: [email protected]

@jsitu777
Contributor

jsitu777 commented Apr 6, 2023

I described the pod when it finishes. It seems the pipeline didn't specify a limits field; is that the reason I should also not specify limits.cpu and limits.memory in my profile?

Name:         train-until-good-pipeline-pq8vx-2571807482
Namespace:    kubeflow-user-example-com
Priority:     0
Node:         ip-192-168-74-129.ca-central-1.compute.internal/192.168.74.129
Start Time:   Thu, 06 Apr 2023 13:29:50 -0700
Labels:       pipeline/runid=3b5629c1-a45f-4202-adaa-3bd639c08290
              pipelines.kubeflow.org/cache_enabled=true
              pipelines.kubeflow.org/cache_id=337
              pipelines.kubeflow.org/enable_caching=true
              pipelines.kubeflow.org/kfp_sdk_version=1.8.10
              pipelines.kubeflow.org/metadata_execution_id=1
              pipelines.kubeflow.org/metadata_written=true
              pipelines.kubeflow.org/pipeline-sdk-type=kfp
              pipelines.kubeflow.org/reused_from_cache=true
              workflows.argoproj.io/completed=true
              workflows.argoproj.io/workflow=train-until-good-pipeline-pq8vx
Annotations:  author: Alexey Volkov <[email protected]>
              kubectl.kubernetes.io/default-container: main
              pipelines.kubeflow.org/arguments.parameters:
                {"Format": "csv", "Limit": "10000", "Select": "tips,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tolls,extras...
              pipelines.kubeflow.org/component_ref:
                {"digest": "ecf2f2840c57bd9cb2778c8f529da9b938b81f59294b3f7271cb23b363640343", "url": "https://raw.githubusercontent.com/kubeflow/pipeline...
              pipelines.kubeflow.org/component_spec:
                {"description": "City of Chicago Taxi Trips dataset: https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew\n\nThe input param...
              pipelines.kubeflow.org/execution_cache_key: 0eab51df0100658542f19a2ccbc511a16d583732bb9c4f13df1493b19bb2c417
              sidecar.istio.io/inject: false
              workflows.argoproj.io/node-id: train-until-good-pipeline-pq8vx-2571807482
              workflows.argoproj.io/node-name: train-until-good-pipeline-pq8vx.chicago-taxi-trips-dataset
              workflows.argoproj.io/outputs:
                {"artifacts":[{"name":"chicago-taxi-trips-dataset-Table","path":"/tmp/outputs/Table/data","s3":{"key":"artifacts/train-until-good-pipeline...
Status:       Succeeded
IP:           192.168.71.171
IPs:
  IP:           192.168.71.171
Controlled By:  Workflow/train-until-good-pipeline-pq8vx
Containers:
  main:
    Container ID:  containerd://e7574c75123e288d3d75c431f15cce9e493fa54c8d3eb23fab430f5d3de792ae
    Image:         gcr.io/google-containers/busybox
    Image ID:      sha256:36a4dca0fe6fb2a5133dc11a6c8907a97aea122613fa3e98be033959a0821a1f
    Port:          <none>
    Host Port:     <none>
    Command:
      echo
      "This step output is taken from cache."
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 06 Apr 2023 13:29:52 -0700
      Finished:     Thu, 06 Apr 2023 13:29:52 -0700
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     16Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bfb7s (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mlpipeline-minio-artifact:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mlpipeline-minio-artifact
    Optional:    false
  kube-api-access-bfb7s:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m36s  default-scheduler  Successfully assigned kubeflow-user-example-com/train-until-good-pipeline-pq8vx-2571807482 to ip-192-168-74-129.ca-central-1.compute.internal
  Normal  Pulling    3m35s  kubelet            Pulling image "gcr.io/google-containers/busybox"
  Normal  Pulled     3m35s  kubelet            Successfully pulled image "gcr.io/google-containers/busybox" in 405.598782ms (405.615571ms including waiting)
  Normal  Created    3m35s  kubelet            Created container main
  Normal  Started    3m34s  kubelet            Started container main

@gkcalat
Member Author

gkcalat commented Apr 6, 2023

@jsitu777 thank you for reporting this.

The Profile constrains user workloads to stay within the quota. So, if you have a hard quota on CPU and memory, it will reject all workloads that do not satisfy it. Unfortunately, it seems that containers without resources will also be rejected.

Adding resources to Argo's workflow-controller-configmap results in all containers being overwritten to have the same resources, which is probably the wrong way to fix the issue described above.

I feel we may have better solutions:

  1. Allowing default values in the Profile, which would be used to fill in resources for containers that do not have them. This needs to be done in the Profile component. /cc @kimwnasptd
  2. Advising users to create a LimitRange whenever a Profile gets created. We need to update the documentation.

Example LimitRange:

apiVersion: v1
kind: LimitRange
metadata:
  name: <user>-limit-range
spec:
  limits:
  - default:
      cpu: 0.5
      memory: 512Mi
    defaultRequest:
      cpu: 0.01
      memory: 32Mi
    type: Container

This can be applied to a user namespace:

kubectl apply -f <yaml> -n <user>
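
If needed, the defaults can be verified with (same placeholders as above):

kubectl describe limitrange <user>-limit-range -n <user>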

/cc @chensun

@google-oss-prow google-oss-prow bot requested a review from chensun April 6, 2023 21:47
@jsitu777
Contributor

jsitu777 commented Apr 7, 2023

@gkcalat
So the bug was never resolved, basically? Or do pipelines have a way to specify those constraints when creating pipelines or runs?

@gkcalat
Member Author

gkcalat commented Apr 7, 2023

@gkcalat So the bug was never resolved, basically? Or do pipelines have a way to specify those constraints when creating pipelines or runs?

The kfp SDK has methods to specify resources (requests are not available yet in kfp v2). The full Kubeflow 1.7.0 (the latest release) uses KFP v2.0.0-alpha.7, whose workflow-controller-configmap ships with default values for resources. So by default, whatever the user specifies via the SDK's methods will be overwritten. To fix it, one needs to change the configmap.

As we approach KFP v2 GA, we will update the user documentation (including instructions on creating a Profile). An issue similar to the one we have in KFP was reported for TensorBoard: kubeflow/kubeflow#7077. I created a feature request: kubeflow/kubeflow#7086
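
For illustration, a minimal sketch of setting limits with the kfp v2 SDK (assuming a recent v2 SDK; as noted above, request setters were not yet available at the time):

from kfp import dsl, compiler

@dsl.component(base_image='python:3.7')
def hello() -> str:
    return 'hello world'

@dsl.pipeline(name='hello-world')
def hello_pipeline():
    # Limits set via the SDK should end up on the user container only.
    hello().set_cpu_limit('0.5').set_memory_limit('512Mi')

compiler.Compiler().compile(hello_pipeline, 'hello_world.yaml')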

@jsitu777
Contributor

@gkcalat I tried to use the kfp v1 SDK to specify resources and keep the resource quota in the profile, but the pipeline still cannot run.

apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    name: [email protected]

  resourceQuotaSpec:    # resource quota can be set optionally
    hard:
      limits.cpu: "0.5"
      limits.memory: "512Mi"
      requests.cpu: "0.01"
      requests.memory: "32Mi"

same error message:
This step is in Error state with this message: task 'xgboost-p299q.chicago-taxi-trips-dataset' errored: pods "xgboost-p299q-3112739778" is forbidden: failed quota: kf-resource-quota: must specify limits.cpu for: init,main,wait; limits.memory for: init,main,wait; requests.cpu for: init,main,wait; requests.memory for: init,main,wait

I use the ContainerOp API (https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html) to specify the resources, for example by changing https://github.com/kubeflow/pipelines/blob/master/samples/core/XGBoost/xgboost_sample.py#L40 to ).set_memory_limit('1Gi').set_memory_request('32Mi').set_cpu_request('0.01').set_cpu_limit('0.5').outputs['model']

Then I compile it with dsl-compile and upload the pipeline to run.
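
For reference, a sketch of one way to submit the compiled package with the kfp v1 SDK (file names are placeholders; in a full multi-user Kubeflow deployment the client also needs authentication and the profile namespace):

import kfp

# Assumes 'xgboost_sample.yaml' was produced beforehand, e.g. with:
#   dsl-compile --py xgboost_sample.py --output xgboost_sample.yaml
client = kfp.Client()  # host/credentials depend on the deployment
run = client.create_run_from_pipeline_package('xgboost_sample.yaml', arguments={})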

@gkcalat
Member Author

gkcalat commented Apr 19, 2023

@jsitu777

Take a look at option 2 in #9089 (comment).

The hard quota in Profile requires all containers to have resources. Adding a LimitRange should resolve the issue with containers without resources.

Also, note that your containers must comply with whatever you set in the Profile. In your example, set_memory_limit('1Gi') is above the limits.memory: "512Mi". I don't know if that will result in the pod being deleted, but keep that in mind if you encounter other issues.

@jsitu777
Contributor

@jsitu777

Take a look at the option 2 in #9089 (comment).

The hard quota in Profile requires all containers to have resources. Adding a LimitRange should resolve the issue with containers without resources.

Also, note that your containers must comply to whatever you set in Profile. In your example set_memory_limit('1Gi') is above the limits.cpu: "0.5". I don't know if that will result in the pod being deleted, but keep that in mind if you encounter other issues.

Why are you comparing memory and cpu? I thought they are two different fields

@gkcalat
Member Author

gkcalat commented Apr 19, 2023

@jsitu777
Take a look at the option 2 in #9089 (comment).
The hard quota in Profile requires all containers to have resources. Adding a LimitRange should resolve the issue with containers without resources.
Also, note that your containers must comply to whatever you set in Profile. In your example set_memory_limit('1Gi') is above the limits.cpu: "0.5". I don't know if that will result in the pod being deleted, but keep that in mind if you encounter other issues.

Why are you comparing memory and cpu? I thought they are two different fields

It was a typo. Corrected.

@jsitu777
Contributor

So I applied the LimitRange and it's still not resolved:

apiVersion: v1
kind: LimitRange
metadata:
  name: kubeflow-limit-range
  namespace: kubeflow
spec:
  limits:
  - default:
      cpu: 0.5
      memory: 512Mi
    defaultRequest:
      cpu: 0.01
      memory: 256Mi
    type: Container

I have also set my profile quota to be much higher:

apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: $(profile-name)
spec:
  owner:
    kind: User
    name: $(user)
  resourceQuotaSpec:    # resource quota can be set optionally
    hard:
      limits.cpu: "10"
      limits.memory: "2Gi"
      requests.cpu: "0.1"
      requests.memory: "32Mi"

@jsitu777
Contributor

Never mind. I was setting the LimitRange in the wrong namespace.
It runs if the LimitRange is in the profile namespace:

apiVersion: v1
kind: LimitRange
metadata:
  name: kubeflow-limit-range
  namespace: kubeflow-user-example-com
spec:
  limits:
  - default:
      cpu: 0.5
      memory: 512Mi
    defaultRequest:
      cpu: 0.01
      memory: 256Mi
    type: Container
