
Is there a way to disable cache on a specific pipeline (created through component yml) #4857

Closed
carlosbertoncelli opened this issue Dec 2, 2020 · 15 comments

@carlosbertoncelli commented Dec 2, 2020

What did you expect to happen:

Is there a way to disable cache on a specific pipeline (created through component yml) using Kubeflow Pipelines on GCP?

I have a pipeline that must run once a week. Because of the caching behavior, some nodes are not executed again, since their inputs/parameters are always the same, even though internally each run does a SELECT inside BigQuery (which fetches the updated data to preprocess). If I could disable this behavior, the pipeline would work as expected.

PS: I've tried these steps: https://www.kubeflow.org/docs/pipelines/caching/ but they didn't work with GCP Pipelines.

Any ideas?

Environment:

Google Cloud Platform

How did you deploy Kubeflow Pipelines (KFP)?
Through Google Cloud Platform (AI Platform -> Pipelines)

/kind question

@rui5i (Contributor) commented Dec 2, 2020

Have you tried setting max_cache_staleness to 0 on the step in question? https://www.kubeflow.org/docs/pipelines/caching/#managing-caching-staleness

@carlosbertoncelli (Author) commented Dec 2, 2020

Have you tried setting max_cache_staleness to 0 on the step in question? https://www.kubeflow.org/docs/pipelines/caching/#managing-caching-staleness

How can I specify it in a YAML pipeline? I didn't find an example.

My YAML file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: project-pipeline
  annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.1.1, pipelines.kubeflow.org/pipeline_compilation_time: '2020-12-01T17:23:37.893312',
    pipelines.kubeflow.org/pipeline_spec: '{"description": "Kubeflow pipeline for
      stock availability project", "name": "Project pipeline"}'}
  labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.1.1}
spec:
  entrypoint: kedro-pipeline
  templates:
  - name: computing-loss-corrected-montly-mape
    container:
      args: [run, --node, computing_loss_corrected_montly_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: computing-mape
    container:
      args: [run, --node, computing_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: computing-montly-mape
    container:
      args: [run, --node, computing_montly_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: creating-master
    container:
      args: [run, --node, creating_master]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: dcm-sku-query
    container:
      args: [run, --node, dcm_sku_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-loss-meta
    container:
      args: [run, --node, estimating_loss_meta]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-loss-total
    container:
      args: [run, --node, estimating_loss_total]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-ts
    container:
      args: [run, --node, estimating_ts]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: generating-daily-measurements
    container:
      args: [run, --node, generating_daily_measurements]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: kedro-pipeline
    dag:
      tasks:
      - name: computing-loss-corrected-montly-mape
        template: computing-loss-corrected-montly-mape
        dependencies: [posprocessing-measurements]
      - name: computing-mape
        template: computing-mape
        dependencies: [posprocessing-measurements]
      - name: computing-montly-mape
        template: computing-montly-mape
        dependencies: [posprocessing-measurements]
      - name: creating-master
        template: creating-master
        dependencies: [resampling-orders, resampling-stock]
      - {name: dcm-sku-query, template: dcm-sku-query}
      - name: estimating-loss-meta
        template: estimating-loss-meta
        dependencies: [dcm-sku-query, estimating-loss-total, preprocessing-orders,
          sku-filter-query]
      - name: estimating-loss-total
        template: estimating-loss-total
        dependencies: [posprocessing-measurements]
      - name: estimating-ts
        template: estimating-ts
        dependencies: [creating-master]
      - name: generating-daily-measurements
        template: generating-daily-measurements
        dependencies: [posprocessing-measurements]
      - {name: orders-query, template: orders-query}
      - name: posprocessing-measurements
        template: posprocessing-measurements
        dependencies: [estimating-ts, sku-target-query]
      - name: preprocessing-orders
        template: preprocessing-orders
        dependencies: [orders-query]
      - name: preprocessing-stock
        template: preprocessing-stock
        dependencies: [stock-query]
      - name: resampling-orders
        template: resampling-orders
        dependencies: [preprocessing-orders]
      - name: resampling-stock
        template: resampling-stock
        dependencies: [preprocessing-stock]
      - {name: sku-filter-query, template: sku-filter-query}
      - {name: sku-target-query, template: sku-target-query}
      - {name: stock-query, template: stock-query}
  - name: orders-query
    container:
      args: [run, --node, orders_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: posprocessing-measurements
    container:
      args: [run, --node, posprocessing_measurements]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: preprocessing-orders
    container:
      args: [run, --node, preprocessing_orders]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: preprocessing-stock
    container:
      args: [run, --node, preprocessing_stock]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: resampling-orders
    container:
      args: [run, --node, resampling_orders]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: resampling-stock
    container:
      args: [run, --node, resampling_stock]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: sku-filter-query
    container:
      args: [run, --node, sku_filter_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: sku-target-query
    container:
      args: [run, --node, sku_target_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: stock-query
    container:
      args: [run, --node, stock_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  arguments:
    parameters: []
  serviceAccountName: pipeline-runner

Thanks in advance.

@rui5i (Contributor) commented Dec 3, 2020

Can you try adding "pipelines.kubeflow.org/cache_enabled:false" to your pipeline YAML's labels and see if this works?

@carlosbertoncelli (Author)

Can you try adding "pipelines.kubeflow.org/cache_enabled:false" to your pipeline YAML's labels and see if this works?

I did as you suggested and the pipeline loaded as expected, but when I try to run it, it throws the following error:

Run creation failed
{"error":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec","message":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec","code":3,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Please provide a valid pipeline spec","error_details":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec"}]}

@rui5i (Contributor) commented Dec 3, 2020

Can you also try setting "pipelines.kubeflow.org/max_cache_staleness: 'P0D'" in the YAML annotations and removing the labels?

@carlosbertoncelli (Author)

Can you also try setting "pipelines.kubeflow.org/max_cache_staleness: 'P0D'" in the YAML annotations and removing the labels?

The nodes keep using the cached executions, even after these modifications.

@rui5i (Contributor) commented Dec 4, 2020

Hi Alexey, can you help take a look at this issue?

/assign @Ark-kun

@Ark-kun (Contributor) commented Dec 8, 2020

@cabjr Hello.
Let me help you with this issue.
The supported way to disable caching for a certain step is described in the documentation:

def some_pipeline():
    # task_never_use_cache is the target step in the pipeline
    task_never_use_cache = some_op()
    task_never_use_cache.execution_options.caching_strategy.max_cache_staleness = "P0D"

Please try this and tell us whether it helps.

The format of the produced workflow files is an implementation detail and is subject to change.
KFP supports pipelines produced by the KFP SDK. As you can see from the error ("Invalid input error: Please provide a valid pipeline spec"), manually editing the compiled workflow files can lead to an invalid Kubernetes object format.
If you're interested, you can observe the changes in the compiled pipeline YAML when you set .execution_options.caching_strategy.max_cache_staleness = "P0D". It results in pipelines.kubeflow.org/max_cache_staleness: 'P0D' being added to the metadata annotations section of the corresponding workflow template:

  - name: some-name
    metadata:
      annotations:
        "pipelines.kubeflow.org/max_cache_staleness": P0D
    container: ...

@Ark-kun (Contributor) commented Dec 8, 2020

P.S. I've noticed that your pipeline does not use any data passing: I see no components, no inputs or outputs, and no argument passing. System-managed data passing is one of the most important features of KFP and is key to getting value from it. The caching system relies on the data-passing information to decide when to reuse an execution (a cached result is reused when the component is the same and all input arguments are the same).

Perhaps you can create KFP components with inputs and outputs for your pipeline steps and create a pipeline where they pass data explicitly. Then caching will start working better for you without needing tweaks.

Please check the following tutorial: https://github.com/Ark-kun/kfp_samples/blob/ae1a5b6/2019-10%20Kubeflow%20summit/106%20-%20Creating%20components%20from%20command-line%20programs/106%20-%20Creating%20components%20from%20command-line%20programs.ipynb
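
For illustration, here is a minimal sketch of the kind of component described above, assuming the KFP v1 SDK; the preprocess function, its parameters, and the base image are hypothetical stand-ins rather than parts of this pipeline:

from kfp import components, dsl
from kfp.components import OutputPath

# Hypothetical component: an explicit input and a tracked output give the
# caching system real arguments to key on.
def preprocess(rows: int, output_path: OutputPath(str)):
    with open(output_path, "w") as f:
        f.write("processed %d rows" % rows)

preprocess_op = components.create_component_from_func(
    preprocess, base_image="python:3.8")

@dsl.pipeline(name="example-pipeline")
def example_pipeline(rows: int = 100):
    task = preprocess_op(rows)
    # The same per-step cache opt-out as in the snippet above:
    task.execution_options.caching_strategy.max_cache_staleness = "P0D"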

@carlosbertoncelli (Author)

Hi @Ark-kun, I've tried adding the "pipelines.kubeflow.org/max_cache_staleness": P0D specification inside the annotations, but it doesn't seem to work.

I cannot specify task_never_use_cache.execution_options.caching_strategy.max_cache_staleness = "P0D" because my pipelines are generated (as YAML) from a kedro pipeline (an ML framework for experimentation). That's also why I can't specify/use data inputs and outputs through Kubeflow Pipelines itself: internally my image already uses a data catalog that points to GCS and BigQuery. That's why I'm trying to disable the caching behavior.

The main idea of my project is to let the DS team prototype with kedro and then deploy to Kubeflow Pipelines with minimal (ideally no) modification. Since it is supposed to run as recurring runs (jobs), the caching behavior is a problem for us.
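
One way to keep the kedro-first workflow on the supported path could be to generate the pipeline with the KFP SDK instead of emitting Argo YAML by hand, since the SDK exposes the caching knobs. A rough sketch, assuming the KFP v1 SDK; the node names and image are taken from the YAML above, and the dependency wiring is deliberately abbreviated:

import kfp
from kfp import dsl

# Subset of the kedro nodes from the YAML above; in practice this list
# would be derived from the kedro pipeline itself.
KEDRO_NODES = ["orders_query", "preprocessing_orders", "resampling_orders"]

def kedro_op(node_name: str) -> dsl.ContainerOp:
    op = dsl.ContainerOp(
        name=node_name.replace("_", "-"),
        image="gcr.io/sandbox-ml-pipeline/stock_availability",
        command=["kedro"],
        arguments=["run", "--node", node_name],
    )
    # Opt every generated step out of cache reuse (zero staleness tolerated):
    op.execution_options.caching_strategy.max_cache_staleness = "P0D"
    return op

@dsl.pipeline(name="Project pipeline")
def project_pipeline():
    ops = {name: kedro_op(name) for name in KEDRO_NODES}
    # Dependencies mirrored from the hand-written DAG (abbreviated):
    ops["preprocessing_orders"].after(ops["orders_query"])
    ops["resampling_orders"].after(ops["preprocessing_orders"])

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(project_pipeline, "project_pipeline.yaml")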

@Bobgy (Contributor) commented Dec 17, 2020

@cabjr you might want to follow the instructions in https://www.kubeflow.org/docs/pipelines/caching/#disabling-caching-in-your-kubeflow-pipelines-deployment to disable caching for your whole KFP instance, so that no pipelines are cached.

@Bobgy (Contributor) commented Dec 17, 2020

But a reminder: running arbitrary Argo workflows with KFP may not keep working in the future; KFP has its own SDK for building workflows.

@carlosbertoncelli (Author)

I've done as Bobgy suggested and disabled caching for the entire KFP instance. It might not be the ideal solution, but it works as expected.

Thanks for the help.

@augustovictor

Disabling caching for Pipelines v2:

def some_pipeline():
    # task_never_use_cache is the target step in the pipeline
    task_never_use_cache = some_op()
    task_never_use_cache.enable_caching = False
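
For reference, the v2 SDK also exposes a documented per-task method, set_caching_options. A minimal sketch, assuming kfp>=2.0, with some_op as a hypothetical component:

from kfp import dsl

@dsl.component  # some_op is a stand-in component for illustration
def some_op() -> str:
    return "done"

@dsl.pipeline(name="example")
def some_pipeline():
    task_never_use_cache = some_op()
    # Equivalent, documented form of the attribute assignment above:
    task_never_use_cache.set_caching_options(False)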

@tjhorner commented Jul 9, 2024

For those running into this and going insane because cached executions still appear even though you've disabled caching: there's a bug in the SDK. See my comment here: #10966 (comment)
