Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: cdk toolchain #2093

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 115 additions & 0 deletions design/toolchain.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# The AWS CDK Toolchain

The AWS CDK toolchain is the chain of tasks used to take a CDK app and deploy it into the AWS cloud. This document describes the various components in the toolchain and how they interact together.

## Current State

CDK apps can be deployed today using the CDK CLI (or CDK Toolkit). The toolkit is implemented as a monolithic program with no clear boundaries between the various stages. We would like to break up the monolithic process executed by the toolkit in order to synthesize, package and deploy a CDK app.

There a few reasons why we want to do this:

1. __Security__: isolate the "deploy" step such that no user code needs to run. This is very important from a security perspective because deployment commonly require administrator privileges on the AWS account, and we need to reduce the attack surface during that time (see #2073, which currently has to run both build and deploy together in the same CodeBuild task).
2. __Modularity__: Allow tools to utilize the various steps used to deploy a CDK app inside other tools such as IDEs, deployment tools, etc.
3. __Code Quality__: the CLI's code base needs a fresh rewrite, along with complete unit test coverage and this is an opportunity to do that.

## Requirements

* It should be possible to execute each component in the toolchain individually by feeding it the output from the previous step.
* Given a specific input, the output from each step must be completely reproducible (no side effects).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the initial step (what currently is synth), this assumes the CDK App is fully deterministic. This is not always true (and I don't think it's an issue, really).

* Different components may require different execution environments and/or permissions to run. For example `cdk-synth` may need to be able to query the target AWS account in order to resolve environmental context, `cdk-bundle` may need to build docker images, `cdk-deploy` will need admin permissions in order to deploy the app.
eladb marked this conversation as resolved.
Show resolved Hide resolved
* It should be possible to invoke each component as a jsii library from all supported languages.

## Design

Let's start by formally defining the states and steps in a CDK application's lifecycle:

```
+-----+ +---------+ +----------+ +------------+
| App |-- synthesis -->| Staging |-- bundle -->| Assembly |-- deploy -->| Deployment |
+-----+ +---------+ +----------+ +------------+
```

### cdk-synth

We start from a ready-to-run (__compiled__) CDK app.

Some languages require compilation (such as Java, Go, TypeScript) and in others (JavaScript, Ruby, Python), there is no need to compile. Compiling code is __not in scope__ for the CDK toolchain in order to allow developers to use their favorite IDEs and build toolchain idiomatically.

The first component in the toolchain is called `cdk-synth`. It's purpose is to execute the CDK app so it can synthesize the CDK tree (through the internal app phases which are "init", "prepare", "validate", "synthesize"). Eventually the CDK tree can synthesize both final artifacts such as CloudFormation templates and intermediate artifacts such as unpackaged assets.

The protocol between the CDK app `cdk-synth`:

- Context is passed into the app via a JSON map serialized into an environment variable called `CONTEXT_ENV`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't help feel those are clumsy. Also, all environment variables used by the CDK should have names starting with CDK_ in my opinion.

- The `CDK_OUTDIR` points to a directory into which artifacts should be emitted by the app.

The app will emit into `CDK_OUTDIR` a directory structure as follows:

```
.
├── missing.json
├── bundle.json
├── assembly
│ ├── manifest.json
│ └── stack1.template.json
└── staging
└── intermediate-asset
└── file1.txt
```

If the file `missing.json` exists in the output, it means that the synthesis output cannot be used and relies on missing context information, which must be populated in the context map. `cdk-synth` is responsible to resolve this context information through the environmental context provider system (for example, query the account for which availability zones are available in a specific region, read SSM parameters, etc). Then, `cdk-synth` must re-invoke the app with this context. Bear in mind that normally context information is cached in `cdk.context.json`, so subsequent executions will not require resolution.
eladb marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can relax this. Missing data should be an attribute of a stack. If context info for Stack A in account 12345 is missing, that shouldn't mean I can't deploy fully synthesized Stack B in account 67890.


`cdk-synth` may repeat this process up to 5 times (theoretically there could be missing context that the app will only know that it needs after it used some information from a resolved context from previous iteration).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is possibly more than 5 times then. Can be any number of times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah 5 seems to come out of nowhere. Are you trying to say something like "the toolkit may impose a limit on the amount of reinvocations to guard against programming errors." ?


Apart from reading `missing.json`, `cdk-synth` does not care about the structure of the output directory.

The output directory is expected to be *self-contained* in a sense that it does not rely on any files that may have only been available during the CDK app execution (e.g. files that are embedded as resources in library dependencies and cannot be accessed from outside the runtime (see [#1716](https://github.com/awslabs/aws-cdk/issues/1716)). It should be possible to pick up the output directory produced by "cdk-synth" and execute "cdk-bundle" (the next step) on a separate machine, possible with a different environment.
eladb marked this conversation as resolved.
Show resolved Hide resolved

### cdk-bundle

The next component in the toolchain is called `cdk-pack`. The purpose of this step is to take the synthesis output and produce a ready-to-deploy, self-contained [cloud assembly](https://github.com/awslabs/aws-cdk/blob/master/design/cloud-assembly.md),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cdk-bundle or cdk-pack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bundle. fixed.


The input to `cdk-bundle` is the output directory produced by `cdk-synth`.

It reads the `bundle.json` file from the root of the output directory and executes the instructions described there. The bundle file describes
a set of bundling steps and dependencies between them. The file is produced by the CDK app with the goal to turn the intermediate CDK output directory into a final cloud assembly under the `assembly/` directory.

The following `bundle.json` file will archive the files from `staging/dir1` into `assembly/aaabbb.zip` and then duplicate `assembly/aaabbb.zip` to `assembly/dup.zip`:

```json
{
"steps": {
"zip_dir1": {
"type": "zip",
"parameters": {
"src": "staging/dir1",
"dest": "assembly/aaabbb.zip"
}
},
"dup_zip_result": {
"type": "copy",
"parameters": {
"src": "assembly/aaabbb.zip",
"dest": "assembly/dup.zip"
},
"depends": [ "zip_dir1" ]
}
}
}
```

The bundler will support a growing set of step types. Initially:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hook for pluggability?


- **zip**: archive a directory
- **copy**: copy files
- **docker**: build a docker image
- **lambda**: produce an AWS Lambda runtime bundle for a specific language

### cdk-deploy

The last component in the toolchain is to deploy the CDK app into the AWS Cloud.

The input to this stage is a self-contained cloud assembly as produced from `cdk-bundle`. See the [cloud assembly specification](https://github.com/awslabs/aws-cdk/blob/master/design/cloud-assembly.md) for details on the structure and capabilities of the cloud assembly.

At a high level, the assembly defines a set of artifacts with dependencies and `cdk-deploy` is responsible to deploy artifacts in topological order. Artifacts could include CloudFormation stacks (deployed through CloudFormation), files (uploaded to S3), Docker images (pushed to ECR), etc.

`cdk-deploy` should also accept an optional list of artifacts to deploy, in which case it will only deploy those artifacts (and optionally their dependencies).