Detailed article coming soon.

TODO: architecture overview of the deployment pipeline.

Repository structure:
```
├── artifact
│   ├── <artifact-name>
│   │   ├── <artifact-name>-<artifact-version>.jar
│   │   └── <artifact-name>-<artifact-version>.json
├── pipeline
├── plugin
├── profile
├── cloudbuild.yaml
├── datafusion.py
├── trigger.py
├── deploy.py
├── df_parameter.yaml
├── README.md
├── requirements.txt
├── trigger.yaml
```
- `artifact`: folder containing the drivers and plugins (`.jar` and `.json` files) deployed on Data Fusion. As of today multiple artifact versions are supported only with multiple deployments, so exactly one version of each artifact MUST be stored in this folder.
- `pipeline`: folder containing the exports (`.json` files) of the pipelines deployed on Data Fusion.
- `profile`: folder containing the exports (`.json` files) of the compute profiles deployed on Data Fusion; for better management of secrets this folder can be retrieved from GCS based on the namespace.
- `cloudbuild.yaml`: Google Cloud Build deployment configuration file.
- `datafusion.py`: Python utility class for Data Fusion communication.
- `trigger.py`: Python script for automated creation of the `trigger.yaml` file (contains the list of pipelines in trigger order).
- `deploy.py`: Python script for Data Fusion pipeline and trigger deployment.
- `df_parameter.yaml`: YAML file containing Data Fusion secrets; for better management of secrets this file can be retrieved from GCS based on the namespace.
- `README.md`: it's me.
- `requirements.txt`: Python dependencies for `deploy.py` and `datafusion.py`.
- `trigger.yaml`: configuration file for schedule/trigger deployment.
As of today the deployment script manages:
- namespace
- compute profile
- artifact
- parameter
- secure parameter
- pipeline
- trigger
If a resource is not available, it is created automatically by the deployment script.
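As an illustration of this "create if missing" behaviour, the sketch below checks whether the target namespace exists and creates it when absent. The `ensure_namespace` helper is hypothetical; the `/v3/namespaces/<name>` paths follow the CDAP REST API, and the actual implementation lives in `deploy.py`/`datafusion.py`.

```python
import requests


def ensure_namespace(df_endpoint: str, namespace: str, token: str) -> None:
    """Create the namespace on Data Fusion if it does not exist yet (idempotent)."""
    headers = {"Authorization": f"Bearer {token}"}
    url = f"{df_endpoint}/v3/namespaces/{namespace}"

    response = requests.get(url, headers=headers)
    if response.status_code == 404:
        # Namespace is missing: create it with an empty configuration.
        requests.put(url, headers=headers, json={}).raise_for_status()
    else:
        response.raise_for_status()
```

The same check-then-create pattern applies to compute profiles and artifacts.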
Deployment is managed by a Python script whose entry point is `deploy.py`; it can be executed locally or by Cloud Build, based on the `cloudbuild.yaml` configuration. The deployment script performs the following steps:
- check if the namespace exists, otherwise create it
- check if the user (not system) compute profiles in the `profile` folder exist, otherwise create them
- check if the artifacts in the `artifact` folder exist, otherwise create them (new version, or new artifact and version)
- update parameters and secure parameters based on `df_parameter.yaml`
- get the pipelines deployed on Data Fusion
- get the pipelines stored in the repository in the `pipeline` folder
- diff the pipelines deployed on Data Fusion against those stored in the repository to identify the action for each pipeline (Create, Update, Delete)
- [OPTIONAL] update pipeline versions
- execute the Create, Update, Delete actions for each pipeline based on the output of the previous step
- for each pipeline, delete all triggers except the default one
- create the triggers based on the configuration in the `trigger.yaml` file
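The diff step reduces to a set comparison between the pipelines deployed on Data Fusion and the pipelines exported in the repository. A minimal sketch, assuming both sides are represented as name-to-JSON dictionaries (the function name `diff_pipelines` is illustrative, not the actual API of `deploy.py`):

```python
def diff_pipelines(deployed: dict, repository: dict) -> dict:
    """Classify pipelines into Create / Update / Delete actions.

    deployed:   {pipeline_name: pipeline_json} currently on Data Fusion
    repository: {pipeline_name: pipeline_json} exported in the pipeline folder
    """
    deployed_names = set(deployed)
    repository_names = set(repository)

    return {
        # in the repository but not yet on Data Fusion
        "create": sorted(repository_names - deployed_names),
        # present on both sides: redeploy (or skip when unchanged)
        "update": sorted(repository_names & deployed_names),
        # deployed but no longer in the repository
        "delete": sorted(deployed_names - repository_names),
    }
```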
The `artifact` folder is composed of a list of folders, each of them containing the `.jar` and `.json` files for source code and configuration:
```
├── artifact
│   ├── <artifact-name>
│   │   ├── <artifact-name>-<artifact-version>.jar
│   │   └── <artifact-name>-<artifact-version>.json
│   ├── <artifact-name>
...
```
The `.json` file has the following keys; for details refer to the official documentation:
field | description |
---|---|
properties | dict of artifact definition |
parents | list of parents for the artifact |
plugins | dict of plugin components |
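As an illustration, the deployment script can pair each `.jar` with its sibling `.json` descriptor and read these keys before uploading the artifact. A minimal sketch assuming the folder layout above (the helper name `collect_artifacts` is hypothetical):

```python
import json
from pathlib import Path


def collect_artifacts(artifact_root: str = "artifact") -> list[dict]:
    """Pair every <artifact-name>-<artifact-version>.jar with its .json descriptor."""
    artifacts = []
    for jar_path in Path(artifact_root).glob("*/*.jar"):
        descriptor = json.loads(jar_path.with_suffix(".json").read_text())

        # Naive split on the last dash: <artifact-name>-<artifact-version>
        # (breaks for versions such as 1.0.0-SNAPSHOT).
        name, _, version = jar_path.stem.rpartition("-")
        artifacts.append({
            "name": name,
            "version": version,
            "jar": jar_path,
            "properties": descriptor.get("properties", {}),
            "parents": descriptor.get("parents", []),
            "plugins": descriptor.get("plugins", {}),
        })
    return artifacts
```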
The `trigger.yaml` file contains the definition of the schedule/trigger for each pipeline. It is a YAML file with a list of objects, each of them containing a trigger definition according to the following structure:

- `sourcePipelineName`: name of the pipeline triggering the pipeline (this is not the pipeline for which we are deploying the trigger, but the one that triggers it)
- `targetPipelineName`: name of the pipeline triggered by the trigger (this is the pipeline for which we are deploying the trigger, not the one triggering it)
- `pipelineStatus`: list of statuses triggering the event [COMPLETED, FAILED, KILLED]
- `macrosMapping`: OPTIONAL list of source/target macro mappings
For example, the following defines a trigger on `target_pipeline` that fires when `source_pipeline` finishes with COMPLETED status, passing the `my_parameter` macro:
```yaml
- sourcePipelineName: source_pipeline
  targetPipelineName: target_pipeline
  pipelineStatus:
    - COMPLETED
  macrosMapping:
    - source: my_parameter
      target: my_parameter
```
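To give an idea of how this file is consumed, the sketch below parses `trigger.yaml` and derives the schedule name according to the `<target pipeline name>.<namespace>.<source pipeline name>.<namespace>` convention described later in this README. The function is illustrative; PyYAML is assumed to be among the dependencies in `requirements.txt`, and the actual deployment call is implemented in `deploy.py`/`datafusion.py`.

```python
import yaml  # PyYAML, assumed to be listed in requirements.txt


def load_triggers(path: str = "trigger.yaml", namespace: str = "dev") -> list[dict]:
    """Read trigger definitions and derive the schedule name for each one."""
    with open(path) as handle:
        triggers = yaml.safe_load(handle) or []

    for trigger in triggers:
        # Naming convention: <target pipeline name>.<namespace>.<source pipeline name>.<namespace>
        trigger["scheduleName"] = (
            f"{trigger['targetPipelineName']}.{namespace}"
            f".{trigger['sourcePipelineName']}.{namespace}"
        )
    return triggers
```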
The deployment script reads the following environment variables:

var | description |
---|---|
DF_ENDPOINT | Data Fusion endpoint: https://<df instance name>-<gcp project>-dot-<gcp region short notation (e.g. euw3)>.datafusion.googleusercontent.com/api |
NAMESPACE | namespace where pipelines and triggers should be deployed |
PIPELINE_FOLDER | folder containing the pipeline .json exports, default `pipeline` |
UPGRADE_PIPELINE | whether or not to upgrade the pipeline version |
OVERWRITE_PIPELINE | whether or not to reflect the pipeline upgrade in the local .json file |
DF_VERSION | Data Fusion version (may be retrieved directly from Data Fusion) |
FORCE_DELETE | flag stating whether, on pipeline update, the pipeline should be deleted and recreated instead of only updated; useful to force system preference updates on pipeline parameters |
LOG_LEVEL | script log level; allowed values are the Python log levels |
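A minimal sketch of how these variables could be read at the top of `deploy.py`; the defaults and the boolean parsing shown here are assumptions for illustration:

```python
import os


def str_to_bool(value: str) -> bool:
    """Interpret common truthy strings coming from environment variables."""
    return value.strip().lower() in ("1", "true", "yes", "y")


# Illustrative configuration loading; the real defaults live in deploy.py.
CONFIG = {
    "df_endpoint": os.environ["DF_ENDPOINT"],  # required
    "namespace": os.environ["NAMESPACE"],      # required
    "pipeline_folder": os.getenv("PIPELINE_FOLDER", "pipeline"),
    "upgrade_pipeline": str_to_bool(os.getenv("UPGRADE_PIPELINE", "false")),
    "overwrite_pipeline": str_to_bool(os.getenv("OVERWRITE_PIPELINE", "false")),
    "df_version": os.getenv("DF_VERSION", ""),
    "force_delete": str_to_bool(os.getenv("FORCE_DELETE", "false")),
    "log_level": os.getenv("LOG_LEVEL", "INFO"),
}
```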
For local deployment, the gcloud CLI should be set up:

```bash
# login with the provided GCP credentials
gcloud auth login
# set the default project
gcloud config set project <PROJECT_ID>
# generate application default credentials for Python SDK authentication
gcloud auth application-default login
# execute the script
python deploy.py
```
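Once Application Default Credentials are available, the script can obtain a bearer token for the Data Fusion (CDAP) REST API. A minimal sketch using `google-auth` and `requests` (both assumed to be among the dependencies in `requirements.txt`); the `/apps` listing shown here follows the CDAP REST API:

```python
import os

import google.auth
import google.auth.transport.requests
import requests

# Obtain Application Default Credentials (created by `gcloud auth application-default login`)
# and refresh them to get a bearer token.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())
headers = {"Authorization": f"Bearer {credentials.token}"}

# Example call: list the applications (pipelines) deployed in the target namespace.
df_endpoint = os.environ["DF_ENDPOINT"]
namespace = os.getenv("NAMESPACE", "default")
response = requests.get(f"{df_endpoint}/v3/namespaces/{namespace}/apps", headers=headers)
response.raise_for_status()
print([app["name"] for app in response.json()])
```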
Cloud Build helps automate the release process, creating the start-pipeline Cloud Function and synchronizing Data Fusion pipelines and triggers.
Cloud Build uses the following substitution variables:

Env Var | Description |
---|---|
_DF_ENDPOINT | Data Fusion endpoint: https://<df instance name>-<gcp project>-dot-<gcp region short notation (e.g. euw3)>.datafusion.googleusercontent.com/api |
_DF_VERSION | Data Fusion version |
_NAMESPACE | namespace where pipelines and triggers should be deployed [dev, test, prod] |
_PIPELINE_FOLDER | folder containing the pipeline .json exports, default `pipeline` |
_SECRET_BUCKET | bucket managing secrets, compute profiles and Data Fusion parameters |
_UPGRADE_PIPELINE | flag stating whether or not to upgrade the pipeline version |
_FORCE_DELETE | flag stating whether, on pipeline update, the pipeline should be deleted and recreated instead of only updated; useful to force system preference updates on pipeline parameters |
For better management of secrets, the `profile` folder and the `df_parameter.yaml` file are stored in GCS in the corresponding namespace folder. For example, the GCS bucket `<SECRET_BUCKET>` will have the following structure:
```
├── namespace1
│   ├── profile
│   │   ├── profile1.json
│   │   └── profile2.json
│   └── df_parameter.yaml
├── namespace2
...
```
For local tests these files should be downloaded or created manually. If the number of artifacts increases, it should be evaluated whether to download the artifacts from GCS as well, or to build them locally from source, possibly creating two release pipelines: one for artifacts and one for pipelines.
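A minimal sketch of how the namespace folder could be pulled from the secret bucket with `google-cloud-storage` (the helper name is hypothetical; bucket and namespace are placeholders):

```python
from pathlib import Path

from google.cloud import storage  # google-cloud-storage


def download_namespace_secrets(bucket_name: str, namespace: str, destination: str = ".") -> None:
    """Download <namespace>/profile/*.json and <namespace>/df_parameter.yaml from GCS."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=f"{namespace}/"):
        if blob.name.endswith("/"):
            continue  # skip folder placeholder objects
        # Strip the namespace prefix so files land in ./profile/... and ./df_parameter.yaml
        local_path = Path(destination) / Path(blob.name).relative_to(namespace)
        local_path.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(local_path))
```

For example, `download_namespace_secrets("<SECRET_BUCKET>", "namespace1")` would recreate the `profile` folder and `df_parameter.yaml` locally.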
The repository follows these naming conventions (see the validation sketch after this list):

- pipeline export json: `<name of the pipeline>-cdap-data-pipeline.json`; the name of the pipeline should follow the snake_case convention
- trigger/schedule: `<target pipeline name>.<namespace>.<source pipeline name>.<namespace>`, where `<source pipeline name>` is the pipeline that fires the event to start `<target pipeline name>`
- artifact: `<artifact name>-<artifact version>`; the name of the artifact should follow the kebab-case convention
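A small sketch of how these conventions could be checked before deployment; the regular expressions and the helper are assumptions, not part of the actual scripts:

```python
import re

# snake_case pipeline exports, e.g. my_pipeline-cdap-data-pipeline.json
PIPELINE_EXPORT_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*-cdap-data-pipeline\.json$")
# kebab-case artifacts, e.g. my-plugin-1.0.0.jar
ARTIFACT_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*-\d+(\.\d+)*(-SNAPSHOT)?\.jar$")


def naming_violations(pipeline_files: list[str], artifact_files: list[str]) -> list[str]:
    """Return the file names that do not follow the naming conventions."""
    violations = [f for f in pipeline_files if not PIPELINE_EXPORT_RE.match(f)]
    violations += [f for f in artifact_files if not ARTIFACT_RE.match(f)]
    return violations
```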
References:

- CDAP documentation
- CDAP medium documentation
- CDAP REST API documentation
- Data Fusion documentation
- CI/CD CDAP pt.1
- CI/CD CDAP pt.2
- CI/CD CDAP pt.3
- CI/CD CDAP pt.4 TBD
- REST API CDAP pipeline deploy
- Python GCP Bearer token
- Data Fusion CDAP API documentation
- CDAP secure parameter
- Reuse Dataproc cluster
- Terraform support for Data Fusion