A pipeline comprises one or more nodes that are (in many cases) connected to define execution dependencies. Each node is implemented by a component and typically performs only a single task, such as loading data, processing data, training a model, or sending an email.
A generic pipeline comprises nodes that are implemented using generic components. In the current release Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. Generic components have in common that they are supported in every Elyra pipelines runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow.
The Introduction to generic pipelines tutorial outlines how to create a generic pipeline using the Visual Pipeline Editor.
In this intermediate tutorial you will learn how to run a generic pipeline on Kubeflow Pipelines, monitor pipeline execution using the Kubeflow Central Dashboard, and access the outputs.
The tutorial instructions were last updated using Elyra v3.0 and Kubeflow v1.3.0.
- JupyterLab 3.x with the Elyra extension v3.x (or newer) installed.
- Access to a local or cloud Kubeflow Pipelines deployment.
Collect the following information for your Kubeflow Pipelines installation:
- API endpoint, e.g. http://kubernetes-service.ibm.com/pipeline
- Namespace, for a multi-user, auth-enabled Kubeflow installation, e.g. mynamespace
- Username, for a multi-user, auth-enabled Kubeflow installation, e.g. jdoe
- Password, for a multi-user, auth-enabled Kubeflow installation, e.g. passw0rd
- Workflow engine type, which should be Argo or Tekton. Contact your administrator if you are unsure which engine your deployment utilizes.
Elyra utilizes S3-compatible cloud storage to make data available to Jupyter notebooks and R or Python scripts while they are executed. Any kind of cloud storage should work (e.g. IBM Cloud Object Storage or Minio) as long as it can be accessed from the machine where JupyterLab is running and from the Kubeflow Pipelines cluster.
Elyra also writes the run output (STDOUT, including STDERR) to a file when the environment variable ELYRA_GENERIC_NODES_ENABLE_SCRIPT_OUTPUT_TO_S3 is set to true or is not present in the runtime container, which is the default.
This happens in addition to logging and writing to STDOUT and STDERR at runtime.
For .ipynb files, the execution run/STDOUT output is written to S3-compatible object storage in the following files:
- <notebook name>-output.ipynb
- <notebook name>.html

For .r and .py files, the execution run/STDOUT output is written to S3-compatible object storage in the following file:
- <r or python filename>.log
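The naming rules above can be sketched as a small helper. This is only an illustration of the listed file names, not Elyra's implementation; the function name is hypothetical, and it assumes that `<r or python filename>` refers to the file name without its extension.

```python
from pathlib import PurePosixPath
from typing import List


def expected_run_output_files(filename: str) -> List[str]:
    """Return the run-output file names described above for one node file.

    Hypothetical helper; assumes the .log name is built from the file
    name without its extension.
    """
    path = PurePosixPath(filename)
    suffix = path.suffix.lower()
    if suffix == ".ipynb":
        # Notebooks: the executed notebook plus an HTML rendering of it.
        return [f"{path.stem}-output.ipynb", f"{path.stem}.html"]
    if suffix in (".py", ".r"):
        # Scripts: a single log file with the captured STDOUT/STDERR.
        return [f"{path.stem}.log"]
    raise ValueError(f"not a notebook, Python script, or R script: {filename}")
```

For example, `expected_run_output_files("load_data.ipynb")` yields the two notebook output names, while a `.py` or `.r` file yields a single `.log` name.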
Note: If you prefer to use S3-compatible storage only for transferring files between pipeline steps, and not for the logging information / run output of R, Python, and Jupyter Notebook files, set the environment variable ELYRA_GENERIC_NODES_ENABLE_SCRIPT_OUTPUT_TO_S3 to false. You can do this in runtime container builds, or pass the value explicitly in the environment variables section of the pipeline editor, at either Pipeline Properties - Generic Node Defaults - Environment Variables or Node Properties - Additional Properties - Environment Variables.
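The on/off behavior described above (enabled when the variable is true or absent) can be expressed as a short check. This is a sketch of the documented semantics only, not Elyra's code; the function name is hypothetical, and treating the value case-insensitively is an assumption.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "ELYRA_GENERIC_NODES_ENABLE_SCRIPT_OUTPUT_TO_S3"


def script_output_to_s3_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if run output should also be written to object storage."""
    env = os.environ if env is None else env
    value = env.get(ENV_VAR)
    if value is None:
        # Variable not present in the container: the documented default is "on".
        return True
    # Assumption: the comparison ignores case and surrounding whitespace.
    return value.strip().lower() == "true"
```

Passing a plain dictionary instead of `os.environ` makes it easy to try the three cases: unset, `true`, and `false`.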
Collect the following information:
- S3-compatible object storage endpoint, e.g. http://minio-service.kubernetes:9000
- S3 object storage username, e.g. minio
- S3 object storage password, e.g. minio123
- S3 object storage bucket, e.g. pipelines-artifacts
This tutorial uses the run-generic-pipelines-on-kubeflow-pipelines sample from the https://github.com/elyra-ai/examples GitHub repository.
- Launch JupyterLab.
- Open the Git clone wizard (Git > Clone A Repository).
- Enter https://github.com/elyra-ai/examples.git as Clone URI.
- In the File Browser navigate to examples/pipelines/run-generic-pipelines-on-kubeflow-pipelines.

The cloned repository includes a set of Jupyter notebooks and a Python script that download a weather data set from an open data directory called the Data Asset Exchange, cleanse the data, analyze the data, and perform time-series predictions. The repository also includes a pipeline named hello-generic-world that runs the files in the appropriate order.
You are ready to start the tutorial.
- Open the hello-generic-world pipeline file.
- Right click generic node Load weather data and select Open Properties to review its configuration.

A generic node configuration identifies the runtime environment, input artifacts (file to be executed, file dependencies and environment variables), and output files.
Each generic node is executed in a separate container, which is instantiated using the configured runtime image.
All nodes in this tutorial pipeline are configured to utilize a pre-configured public container image that has Python and the Pandas package preinstalled. For your own pipelines you should always utilize custom-built container images that have the appropriate prerequisites installed. Refer to the runtime image configuration topic in the User Guide for more information.

If the container requires a specific minimum amount of resources during execution, you can specify them. For example, to speed up model training, you might want to make GPUs available.
If no custom resource requirements are defined, the defaults in the Kubeflow Pipelines environment are used.
Containers in which the notebooks or scripts are executed don't share a file system. Elyra utilizes S3-compatible cloud storage to facilitate the transfer of files from the JupyterLab environment to the containers and between containers.
Therefore, you must declare the files that the notebook or script requires and the files that it produces. The node you are inspecting does not have any file input dependencies, but it does produce an output file.
Notebooks and scripts can be parameterized using environment variables. The node you are looking at requires a variable that identifies the download location of a data file.
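As an illustration of this kind of parameterization, a notebook cell or script might read its input location from the environment as sketched below. The variable name `DATASET_URL` and the function name are hypothetical and used only for this example; check the node's properties for the name the sample pipeline actually uses.

```python
import os
from typing import Mapping, Optional


def get_download_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the data file location from an environment variable.

    DATASET_URL is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("DATASET_URL", "").strip()
    if not url:
        # Fail early with a clear message instead of a late download error.
        raise RuntimeError("Environment variable DATASET_URL is not set.")
    return url
```

Failing fast when the variable is missing makes a misconfigured node easier to diagnose from its execution log.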
Refer to Best practices for file-based pipeline nodes in the User Guide to learn more about considerations for each configuration setting.
A runtime configuration in Elyra contains connectivity information for a Kubeflow Pipelines instance and S3-compatible cloud storage. In this tutorial you will use the GUI to define the configuration, but you can also use the CLI.
- From the pipeline editor tool bar (or the JupyterLab sidebar on the left side) choose Runtimes to open the runtime management panel.
- Click + and New Kubeflow Pipelines runtime to create a new configuration for your Kubeflow Pipelines deployment.
- Enter a name and a description for the configuration and optionally assign tags to support searching.
- Enter the connectivity information for your Kubeflow Pipelines deployment:
  - Kubeflow Pipelines API endpoint, e.g. https://kubernetes-service.ibm.com/pipeline (do not specify the namespace in the API endpoint)
  - User namespace used to run pipelines, e.g. mynamespace
  - User credentials, if the deployment is multi-user and auth-enabled using Dex
  - Workflow engine type, which is either Argo (default) or Tekton. Check with an administrator if you are unsure which workflow engine your deployment utilizes.

  Refer to the runtime configuration documentation for a description of each input field.
- Enter the connectivity information for your S3-compatible cloud storage:
  - The cloud object storage endpoint URL, e.g. https://minio-service.kubeflow:9000
  - Username, e.g. minio
  - Password, e.g. minio123
  - Bucket name, where Elyra will store the pipeline input and output artifacts, e.g. test-bucket

  Refer to this topic for important information about the optional credentials secret.
- Save the runtime configuration.
- Expand the twistie in front of the configuration entry.

The displayed links provide access to the configured Kubeflow Pipelines Central Dashboard and the cloud storage UI (if one is available at the specified URL). Open the links to confirm connectivity.

If you are accessing the Kubeflow Pipelines Dashboard for the first time, an error might be raised (e.g. "Failed to retrieve list of pipelines") if namespaces are configured. To resolve this issue, manually open the Kubeflow Pipelines Central Dashboard (e.g. https://kubernetes-service.ibm.com/ instead of https://kubernetes-service.ibm.com/pipeline), select a namespace, and then try opening the link again.
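The Central Dashboard address in the example above is the API endpoint with its trailing /pipeline segment removed. A minimal sketch of that derivation follows; the function name is hypothetical, and the assumption that your deployment follows the same URL layout as the example may not hold everywhere.

```python
from urllib.parse import urlsplit, urlunsplit


def central_dashboard_url(api_endpoint: str) -> str:
    """Derive the Central Dashboard URL from a Kubeflow Pipelines API endpoint.

    Assumes the endpoint ends in "/pipeline", as in the example above.
    """
    parts = urlsplit(api_endpoint)
    path = parts.path.rstrip("/")
    if path.endswith("/pipeline"):
        # Drop only the final "/pipeline" segment, keep any path prefix.
        path = path[: -len("/pipeline")]
    return urlunsplit((parts.scheme, parts.netloc, path + "/", "", ""))
```

For the endpoint used in this tutorial, the function returns https://kubernetes-service.ibm.com/.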
You can run pipelines from the Visual Pipeline Editor or using the elyra-pipeline command line interface.
- Open the run wizard.
- The Pipeline Name is pre-populated with the pipeline file name. The specified name is used to name the pipeline and experiment in Kubeflow Pipelines.
- Select Kubeflow Pipelines as Runtime platform.
- From the Runtime configuration drop down select the runtime configuration you just created.
- Start the pipeline run. The pipeline artifacts (notebooks, Python scripts and file input dependencies) are gathered, packaged, and uploaded to cloud storage. The pipeline is compiled, uploaded to Kubeflow Pipelines, and executed in an experiment.

Elyra automatically creates a Kubeflow Pipelines experiment using the pipeline name. For example, if you named the pipeline hello-generic-world, Elyra creates an experiment named hello-generic-world.

Each time you run a pipeline with the same name, it is uploaded as a new version, allowing for comparison between pipeline runs.
The confirmation message contains two links:
- Run details: provides access to the Kubeflow Pipelines UI where you monitor the pipeline execution progress.
- Object storage: provides access to the cloud storage where you access the input artifacts and output artifacts. (This link might not work if the configured cloud storage does not have a GUI or if its URL is different from the endpoint URL you've configured.)
Elyra does not provide a monitoring interface for Kubeflow Pipelines. However, it does provide a link to the Kubeflow Central Dashboard's pipeline runs panel.
- Open the Run Details link. The runs panel is displayed, depicting the in-progress execution graph for the pipeline. Only nodes that are currently executing or have already executed are displayed. Note that the run name is derived from the pipeline name and a timestamp, e.g. hello-generic-world-0716111722.
- Select the first node. A side panel opens, displaying information about the node.
- Open the Logs tab to access the node's execution log file.
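The run name mentioned in the first step combines the pipeline name with a timestamp. The sketch below reproduces the shape of the example name; the month-day-hour-minute-second layout is inferred from that single example and is an assumption, not a documented format, and the function name is hypothetical.

```python
from datetime import datetime
from typing import Optional


def kubeflow_run_name(pipeline_name: str, now: Optional[datetime] = None) -> str:
    """Build a run name of the form <pipeline name>-<timestamp>.

    Assumption: the timestamp is MMDDHHMMSS, inferred from the example
    hello-generic-world-0716111722.
    """
    now = datetime.now() if now is None else now
    return f"{pipeline_name}-{now.strftime('%m%d%H%M%S')}"
```

Knowing the pattern helps when you later look for the matching "directory" in the object storage bucket.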
Output that notebooks or scripts produce is captured by Elyra and automatically uploaded to the cloud storage bucket you've specified in the runtime configuration.
If desired, you can visualize results directly in the Kubeflow Pipelines UI. For example, if a notebook trains a classification model, you could visualize its accuracy using a confusion matrix by producing metadata in Kubeflow Pipelines output viewer compatible format.
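As a rough illustration, markdown metadata for the Kubeflow Pipelines (v1) output viewer can be produced by writing a small JSON document. The structure below follows the output viewer's inline markdown format; the function name is hypothetical, and the default file name and its location in the working directory are assumptions to verify against the visualization guide for your Elyra and Kubeflow versions.

```python
import json
from pathlib import Path


def write_markdown_metadata(markdown: str,
                            path: str = "mlpipeline-ui-metadata.json") -> dict:
    """Write output-viewer metadata that renders an inline markdown snippet.

    The default file name and location are assumptions for this sketch.
    """
    metadata = {
        "outputs": [
            {
                "storage": "inline",   # the content is embedded, not a file reference
                "source": markdown,    # the markdown text to render
                "type": "markdown",
            }
        ]
    }
    Path(path).write_text(json.dumps(metadata))
    return metadata
```

A notebook's last code cell could call `write_markdown_metadata("# Data set\nDownloaded from ...")` to describe where its data came from.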
- Wait until node processing has completed before continuing.
- Open the Visualizations tab. For illustrative purposes the first node that runs the load_data notebook produces metadata in markdown format, which identifies the data set download location. The code that produces the metadata is located in the notebook's last code cell.

  Refer to Visualizing output from your notebooks or Python scripts in the Kubeflow Pipelines UI to learn more about adding visualizations.
- Wait for the pipeline run to finish.
Pipelines that execute on Kubeflow Pipelines store the pipeline run outputs (completed notebooks, script output, and declared output files) in the cloud storage bucket you've configured in the runtime configuration.
- Open the object storage link and, if required, log in.
- Navigate to the bucket that you've specified in the runtime configuration to review the content. Note that the bucket contains a "directory" with the pipeline's Kubeflow run name, e.g. hello-generic-world-0716111722.

  The bucket contains the following artifacts for each node:
  - a tar.gz archive containing the original notebook or script and, if applicable, its declared file dependencies
  - if the node is associated with a notebook, the artifacts include the completed notebook with its populated output cells and an HTML version of the completed notebook
  - if the node is associated with a script, the artifacts include the console output that the script produced
  - if applicable, the declared output files

  For example, for the load_data notebook the following artifacts should be present:
  - load_data-<UUID>.tar.gz (input artifacts)
  - load_data.ipynb (output artifact)
  - load_data.html (output artifact)
  - data/noaa-weather-data-jfk-airport/jfk_weather.csv (output artifact)
- Download the output artifacts to your local machine and inspect them.
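If you list the bucket content programmatically, the artifacts of one run can be separated into input archives and outputs as sketched below. This is an illustrative helper with a hypothetical name, not part of Elyra; it assumes object keys are prefixed with the run name and treats every tar.gz object as an input archive, which would misclassify a declared output file that happens to be a tar.gz.

```python
from typing import Dict, Iterable, List


def split_run_artifacts(object_keys: Iterable[str], run_name: str) -> Dict[str, List[str]]:
    """Group the object keys of one pipeline run into input and output artifacts."""
    prefix = run_name.rstrip("/") + "/"
    result: Dict[str, List[str]] = {"input": [], "output": []}
    for key in object_keys:
        if not key.startswith(prefix):
            continue  # the object belongs to a different run
        relative = key[len(prefix):]
        # Assumption: only the packaged input archives end in .tar.gz.
        kind = "input" if relative.endswith(".tar.gz") else "output"
        result[kind].append(relative)
    return result
```

The key list itself would come from whatever S3 client you use; the helper only works on the returned names.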
When you run a pipeline from the pipeline editor, Elyra compiles the pipeline, uploads the compiled pipeline, creates an experiment, and runs the experiment. If you want to run the pipeline at a later time outside of Elyra, you can export it.
- Open the pipeline in the Pipeline Editor.
- Click the Export Pipeline button.
- Select Kubeflow Pipelines as the runtime platform, then select the runtime configuration you've created. Export the pipeline.

  An exported pipeline comprises two parts: the pipeline definition and the input artifact archives that were uploaded to cloud storage.

  In order to run the exported pipeline, the generated YAML-formatted static configuration file must be manually uploaded using the Kubeflow Pipelines Dashboard. Elyra compiles the pipeline for the engine (Argo or Tekton) that you've defined in the runtime configuration.
- Locate the generated hello-generic-world.yaml configuration file in the File Browser.
- Open the exported file and briefly review the content.

  Note the references to the artifact archives and the embedded cloud storage connectivity information. If you've preconfigured a Kubernetes secret and specified its name in the runtime configuration, only the secret name is stored in the exported files.
This concludes the Run generic pipelines on Kubeflow Pipelines tutorial. You've learned how to:
- create a Kubeflow Pipelines runtime configuration
- run a pipeline on Kubeflow Pipelines
- monitor the pipeline run progress in the Kubeflow Pipelines Central Dashboard
- review output visualizations a notebook or script produces
- access the pipeline run output on cloud storage
- export a pipeline to a Kubeflow Pipelines native format

Related resources:
- Pipelines topic in the Elyra User Guide
- Pipeline components topic in the Elyra User Guide
- Best practices for file-based pipeline nodes topic in the Elyra User Guide
- Runtime configuration topic in the Elyra User Guide
- Runtime image configuration topic in the Elyra User Guide
- Command line interface topic in the Elyra User Guide