Data Fusion CI/CD

A detailed article is coming soon.

TODO: architecture overview of the deployment pipeline

Repository structure

├── artifact
│   ├── <artifact-name>
│   │   ├── <artifact-name>-<artifact-version>.jar
│   │   └── <artifact-name>-<artifact-version>.json
├── pipeline
├── plugin
├── profile 
├── cloudbuild.yaml
├── datafusion.py
├── trigger.py
├── deploy.py
├── df_parameter.yaml
├── README.md
├── requirements.txt
├── trigger.yaml
  • artifact folder containing the drivers and plugins (.jar and .json files) deployed on Data Fusion; as of today multiple artifact versions are supported only with multiple deployments, so only one version of each artifact MUST be stored in this folder
  • pipeline folder containing the exports (.json files) of the pipelines deployed on Data Fusion
  • profile folder containing the exports (.json files) of the compute profiles deployed on Data Fusion; for better management of secrets this folder can be retrieved from GCS based on the namespace
  • cloudbuild.yaml Google Cloud Build deployment configuration file
  • datafusion.py Python utility class for Data Fusion communication
  • trigger.py Python script for the automated creation of the trigger.yaml file (contains the list of pipelines in trigger order)
  • deploy.py Python script for Data Fusion pipeline and trigger deployment
  • df_parameter.yaml YAML file containing Data Fusion secrets; for better management of secrets this file can be retrieved from GCS based on the namespace
  • README.md it's me
  • requirements.txt Python dependencies for deploy.py and datafusion.py
  • trigger.yaml configuration file for schedule/trigger deployment

Deployment

As of today the deployment script manages:

  • namespace
  • compute profile
  • artifact
  • parameter
  • secure parameter
  • pipeline
  • trigger

If a resource is not available, it will be automatically created by the deployment script.

Deployment is managed by a Python script, whose entry point is deploy.py, which can be executed locally or by Cloud Build based on the cloudbuild.yaml configuration. The deployment script performs the following steps (a sketch of the pipeline diff step follows the list):

  • check if the namespace exists, otherwise create it
  • check if the user (not system) compute profiles in the profile folder exist, otherwise create them
  • check if the artifacts in the artifact folder exist, otherwise create them (new version, or new artifact and version)
  • update parameters and secure parameters based on df_parameter.yaml
  • get the pipelines deployed on Data Fusion
  • get the pipelines stored in the repository in the pipeline folder
  • diff the pipelines deployed on Data Fusion against the ones stored in the repository to identify the action for each pipeline: Create, Update, Delete
  • [OPTIONAL] update pipeline versions
  • execute the Create, Update, Delete actions for each pipeline based on the output of the previous step
  • for each pipeline, delete all triggers except the default one
  • create triggers based on the configuration (trigger.yaml file)
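
A minimal sketch of the diff step, assuming the deployed pipelines are listed through the CDAP REST API; the helper names and the authentication detail are illustrative, not the actual datafusion.py implementation:

import json
import os
from pathlib import Path

import google.auth
import google.auth.transport.requests
import requests

DF_ENDPOINT = os.environ["DF_ENDPOINT"]
NAMESPACE = os.environ.get("NAMESPACE", "default")
PIPELINE_FOLDER = os.environ.get("PIPELINE_FOLDER", "pipeline")


def _auth_headers():
    # Uses application-default credentials (gcloud auth application-default login).
    credentials, _ = google.auth.default()
    credentials.refresh(google.auth.transport.requests.Request())
    return {"Authorization": f"Bearer {credentials.token}"}


def deployed_pipelines():
    # CDAP REST API: list the applications deployed in the namespace.
    response = requests.get(
        f"{DF_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps", headers=_auth_headers()
    )
    response.raise_for_status()
    return {app["name"] for app in response.json()}


def repository_pipelines():
    # Pipeline exports are stored as <pipeline-name>-cdap-data-pipeline.json.
    return {
        json.loads(path.read_text())["name"]
        for path in Path(PIPELINE_FOLDER).glob("*.json")
    }


def diff_pipelines():
    deployed, local = deployed_pipelines(), repository_pipelines()
    return {
        "create": local - deployed,   # only in the repository -> deploy
        "update": local & deployed,   # in both -> update (or delete/recreate with FORCE_DELETE)
        "delete": deployed - local,   # only on Data Fusion -> remove
    }


if __name__ == "__main__":
    print(diff_pipelines())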

Artifact configuration

The artifact folder is composed of a list of folders, each containing a .jar and a .json file holding the source code and its configuration.

├── artifact
│   ├── <artifact-name>
│   │   ├── <artifact-name>-<artifact-version>.jar
│   │   └── <artifact-name>-<artifact-version>.json
│   ├── <artifact-name>
...

The .json file has the following keys; for details refer to the official documentation (a hypothetical example follows the table).

| field      | description                      |
|------------|----------------------------------|
| properties | dict of artifact definition      |
| parents    | list of parents for the artifact |
| plugins    | dict of plugin components        |
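
A minimal, hypothetical descriptor for an artifact, written here as a Python snippet that produces the .json file expected next to the .jar; the artifact name, parent range and file path are placeholders, not values taken from this repository:

import json

# Hypothetical descriptor for my-plugin-1.0.0.jar; align the keys with the official documentation.
descriptor = {
    # properties: dict of artifact definition (free-form key/value metadata)
    "properties": {},
    # parents: list of parents for the artifact (placeholder version range)
    "parents": ["system:cdap-data-pipeline[6.0.0,7.0.0)"],
    # plugins: dict of plugin components exposed by the jar (empty placeholder)
    "plugins": {},
}

with open("artifact/my-plugin/my-plugin-1.0.0.json", "w") as handle:
    json.dump(descriptor, handle, indent=2)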

Trigger configuration

The trigger.yaml file contains the schedule/trigger definition for each pipeline. It is a YAML file with a list of objects, each containing a trigger definition with the following structure:

sourcePipelineName: name of the pipeline that fires the trigger (this is not the pipeline for which we are deploying the trigger but the one that triggers it)
targetPipelineName: name of the pipeline started by the trigger (this is the pipeline for which we are deploying the trigger, not the one triggering it)
pipelineStatus: list of statuses triggering the event [COMPLETED, FAILED, KILLED]
macrosMapping: OPTIONAL list of source/target macro mappings

For example, the following is the definition of a trigger on target_pipeline, fired when source_pipeline finishes its execution with COMPLETED status, passing the my_parameter macro.

- sourcePipelineName: source_pipeline
  targetPipelineName: target_pipeline
  pipelineStatus:
    - COMPLETED
  macrosMapping:
    - source: my_parameter
      target: my_parameter
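
A minimal sketch of how such a file could be loaded and validated before deployment, assuming PyYAML is available; this is illustrative, not the actual logic in deploy.py or trigger.py:

import yaml

REQUIRED_KEYS = {"sourcePipelineName", "targetPipelineName", "pipelineStatus"}
ALLOWED_STATUSES = {"COMPLETED", "FAILED", "KILLED"}


def load_triggers(path="trigger.yaml"):
    with open(path) as handle:
        triggers = yaml.safe_load(handle) or []
    for trigger in triggers:
        missing = REQUIRED_KEYS - trigger.keys()
        if missing:
            raise ValueError(f"trigger {trigger} is missing keys: {missing}")
        invalid = set(trigger["pipelineStatus"]) - ALLOWED_STATUSES
        if invalid:
            raise ValueError(f"unsupported pipeline statuses: {invalid}")
    return triggers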

Script environment variables

| var                | description |
|--------------------|-------------|
| DF_ENDPOINT        | Data Fusion endpoint: https://<df instance name>-<gcp project>-dot-<gcp region short notation (ex. euw3)>.datafusion.googleusercontent.com/api |
| NAMESPACE          | namespace where pipelines and triggers should be deployed |
| PIPELINE_FOLDER    | folder containing the pipeline .json exports, default pipeline |
| UPGRADE_PIPELINE   | whether or not to upgrade the pipeline version |
| OVERWRITE_PIPELINE | whether or not to reflect the pipeline upgrade on the local .json file |
| DF_VERSION         | Data Fusion version (may be retrieved directly from Data Fusion) |
| FORCE_DELETE       | flag stating whether, on pipeline update, the pipeline should be deleted and recreated instead of only updated; this is useful to force a system preference update on pipeline parameters |
| LOG_LEVEL          | script log level, allowed values are the Python log levels |
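
A minimal sketch of how these variables might be read at the top of deploy.py; which variables are required and which defaults apply are assumptions here, except PIPELINE_FOLDER whose default is documented above:

import logging
import os

# Required: Data Fusion API endpoint and target namespace.
DF_ENDPOINT = os.environ["DF_ENDPOINT"]
NAMESPACE = os.environ["NAMESPACE"]

# Optional, with the documented default for the pipeline folder.
PIPELINE_FOLDER = os.environ.get("PIPELINE_FOLDER", "pipeline")
DF_VERSION = os.environ.get("DF_VERSION")

# Boolean flags passed as strings ("true"/"false").
UPGRADE_PIPELINE = os.environ.get("UPGRADE_PIPELINE", "false").lower() == "true"
OVERWRITE_PIPELINE = os.environ.get("OVERWRITE_PIPELINE", "false").lower() == "true"
FORCE_DELETE = os.environ.get("FORCE_DELETE", "false").lower() == "true"

# Log level: any Python logging level name (DEBUG, INFO, WARNING, ...).
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))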

For local deployment, the gcloud CLI should be set up:

# login with provided GCP credential
gcloud auth login

# set default project
gcloud config set project <PROJECT_ID>

# generate application-default credentials for Python client authentication
gcloud auth application-default login

# execute script
python deploy.py

Cloud Build

Cloud Build helps automate the release process: it creates the start-pipeline Cloud Function and synchronizes Data Fusion pipelines and triggers.

Env var

| Env Var           | Description |
|-------------------|-------------|
| _DF_ENDPOINT      | Data Fusion endpoint: https://<df instance name>-<gcp project>-dot-<gcp region short notation (ex. euw3)>.datafusion.googleusercontent.com/api |
| _DF_VERSION       | Data Fusion version |
| _NAMESPACE        | namespace where pipelines and triggers should be deployed [dev, test, prod] |
| _PIPELINE_FOLDER  | folder containing the pipeline .json exports, default pipeline |
| _SECRET_BUCKET    | bucket managing secrets, compute profiles and Data Fusion parameters |
| _UPGRADE_PIPELINE | flag stating whether or not to upgrade the pipeline version |
| _FORCE_DELETE     | flag stating whether, on pipeline update, the pipeline should be deleted and recreated instead of only updated; this is useful to force a system preference update on pipeline parameters |

Secret management

For better management of secrets, the profile folder and the df_parameter.yaml file are stored in GCS under the corresponding namespace folder. For example, the GCS bucket <SECRET_BUCKET> will have the following structure:

├── namespace1
│   ├── profile
│   │   ├── profile1.json
│   │   └── profile2.json
│   └── df_parameter.yaml
├── namespace2
...

For local tests these files should be downloaded or created manually (see the sketch below). If the number of artifacts increases, it should be evaluated whether to download the artifacts from GCS as well, or to build them locally from source, possibly creating two release pipelines: one for artifacts and one for pipelines.
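
A minimal sketch of pulling the namespace secrets for a local test, assuming the google-cloud-storage client library and the bucket layout above; the bucket and namespace names are placeholders:

from pathlib import Path

from google.cloud import storage

SECRET_BUCKET = "my-secret-bucket"   # placeholder, use your <SECRET_BUCKET>
NAMESPACE = "namespace1"             # placeholder namespace folder


def download_secrets():
    client = storage.Client()
    # Download everything under <namespace>/ (profile/*.json and df_parameter.yaml).
    for blob in client.list_blobs(SECRET_BUCKET, prefix=f"{NAMESPACE}/"):
        if blob.name.endswith("/"):
            continue  # skip folder placeholder objects
        local_path = Path(blob.name).relative_to(NAMESPACE)
        local_path.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(local_path))


if __name__ == "__main__":
    download_secrets()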

Naming convention

  • pipeline export json <name of the pipeline>-cdap-data-pipeline.json

the name of the pipeline should follow the snake_case convention

  • trigger/schedule <target pipeline name>.<namespace>.<source pipeline name>.<namespace>

where <source pipeline name> is the pipeline that fires the event to start the <target pipeline name>

  • artifact <artifact name>-

the name of the artifact should follow the kebab-case convention
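
A small sketch that checks pipeline export filenames and artifact folder names against these conventions; the regexes are an interpretation of the rules above, not taken from the repository:

import re
from pathlib import Path

# snake_case pipeline name followed by the documented suffix.
PIPELINE_EXPORT = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*-cdap-data-pipeline\.json$")
# kebab-case artifact name.
ARTIFACT_NAME = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def check_pipeline_exports(folder="pipeline"):
    return [p.name for p in Path(folder).glob("*.json") if not PIPELINE_EXPORT.match(p.name)]


def check_artifact_folders(folder="artifact"):
    return [p.name for p in Path(folder).iterdir() if p.is_dir() and not ARTIFACT_NAME.match(p.name)]


if __name__ == "__main__":
    print("non-conforming pipeline exports:", check_pipeline_exports())
    print("non-conforming artifact folders:", check_artifact_folders())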

References