
Deployment


Overview

Deployment is either the most complicated part of Greengrass Nucleus or the second-most, behind only service lifecycle management. Deployment has essentially one job: wait for a deployment job, then execute it to get the device into the state the job describes. Deployment jobs are not deltas; rather, they describe the desired state of the device, and the device must figure out how to get from its current state to that desired state.

Wait for deployments

Deployments currently come from three sources: 1) locally from the Greengrass CLI (local deployment), 2) the AWS cloud via IoT Shadow (individual device deployment), and 3) the AWS cloud via IoT Jobs (group deployment). Internally, Shadow and Jobs are handled by separate classes which both ultimately insert their deployments into the DeploymentQueue.
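
As a rough illustration of that hand-off, here is a minimal sketch of a queue keyed by deployment id, so that a newer revision of the same deployment replaces one that is still waiting. The class and method names are simplified stand-ins, not the real DeploymentQueue API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: a queue keyed by deployment id so a newer revision of the same
// deployment replaces the one already waiting. Illustrative only.
final class SimpleDeploymentQueue {
    static final class Deployment {
        final String id;
        final String document; // simplified stand-in for the deployment document

        Deployment(String id, String document) {
            this.id = id;
            this.document = document;
        }
    }

    private final Map<String, Deployment> pending = new LinkedHashMap<>();

    synchronized void offer(Deployment d) {
        pending.put(d.id, d); // replaces an older queued revision with the same id
    }

    synchronized Deployment poll() {
        var it = pending.values().iterator();
        if (!it.hasNext()) {
            return null;
        }
        Deployment next = it.next();
        it.remove();
        return next;
    }
}
```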

Shadow listener

Implemented by ShadowDeploymentListener. It first subscribes to the shadow topics, retrying forever in a while loop with a backoff of 2 minutes plus [0, 10) seconds of jitter. Once subscribed, it requests the current shadow state. When it receives the shadow state (or an updated state arrives), it determines whether it should ignore that state. It then cancels the current deployment if necessary and inserts the new deployment into the queue.
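
The retry loop is simple but worth seeing in shape. The sketch below uses a hypothetical subscribeToShadowTopics() stand-in for the real MQTT subscription call; only the backoff arithmetic (2 minutes plus up to 10 seconds of jitter) comes from the behavior described above.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the "retry forever with 2 minutes + [0, 10) seconds jitter" subscribe loop.
public class SubscribeRetryExample {
    public void subscribeWithRetry() throws InterruptedException {
        while (true) {
            try {
                subscribeToShadowTopics();
                return; // subscribed; next step is requesting the current shadow state
            } catch (Exception e) {
                long jitterMillis = ThreadLocalRandom.current().nextLong(10_000);
                Thread.sleep(Duration.ofMinutes(2).toMillis() + jitterMillis);
            }
        }
    }

    private void subscribeToShadowTopics() throws Exception {
        // hypothetical placeholder for the actual MQTT subscription request
    }
}
```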

Jobs listener

Implemented by IotJobsHelper, which begins by trying to subscribe to the jobs topics. Each subscription is made in a while loop and retries on errors with a backoff of 2 minutes plus [0, 10) seconds of jitter. Once it has subscribed to the appropriate topics, it checks for any missed jobs. If it receives a notification about a job, it cancels the current deployment if needed or requests details about the newly queued job. When it receives the details of the next job to run, it checks that the job isn't a duplicate, possibly cancels the current deployment, and finally queues the new deployment job.

DeploymentService contains the logic to wait for deployment jobs from any source by polling a queue of deployments. The loop first checks whether the currently executing deployment has finished, in which case it persists the result and notifies any listeners. When it receives a deployment from the queue, it checks whether the deployment has been cancelled and takes action if needed. If not cancelled, it creates a new DeploymentTask.
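
Roughly, the polling loop has the shape sketched below. All types and helper methods are simplified stand-ins for illustration, not the actual DeploymentService internals.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Rough shape of the DeploymentService polling loop; names are illustrative.
public class DeploymentPollingLoopSketch {
    static class Deployment {
        boolean cancelled;
    }

    private final BlockingQueue<Deployment> deploymentQueue = new LinkedBlockingQueue<>();
    private Future<?> currentDeploymentResult;
    private volatile boolean shutdownRequested;

    public void run() throws InterruptedException {
        while (!shutdownRequested) {
            // 1. If the deployment we started earlier has finished, persist and notify.
            if (currentDeploymentResult != null && currentDeploymentResult.isDone()) {
                persistAndNotify(currentDeploymentResult);
                currentDeploymentResult = null;
            }
            // 2. Pull the next deployment (if any) from the queue.
            Deployment next = deploymentQueue.poll(1, TimeUnit.SECONDS);
            if (next == null) {
                continue;
            }
            // 3. A cancellation interrupts the running task instead of starting a new one.
            if (next.cancelled) {
                cancelCurrentDeployment();
                continue;
            }
            // 4. Otherwise create a new DeploymentTask and run it on a separate thread.
            currentDeploymentResult = startDeploymentTask(next);
        }
    }

    private void persistAndNotify(Future<?> result) { /* persist status, notify listeners */ }
    private void cancelCurrentDeployment() { /* interrupt the running task if it matches */ }
    private Future<?> startDeploymentTask(Deployment d) { return null; /* submit to an executor */ }
}
```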

A deployment initially starts in the DEFAULT stage, so we create a new default deployment task. The stage will not be DEFAULT when Nucleus restarts while executing a deployment; instead, Nucleus reads the deployment information from disk and inserts the deployment into the queue with the correct stage (either KERNEL_ACTIVATION or KERNEL_ROLLBACK). This way, Greengrass can continue the deployment after it restarts.
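
For orientation, the stages referenced on this page can be pictured as a small enum. This is a sketch of the shape only; the real stage enum in the Nucleus may carry additional values, and the comments paraphrase the behavior described here.

```java
// Sketch of the deployment stages discussed on this page (not the exact Nucleus enum).
public enum DeploymentStageSketch {
    DEFAULT,            // fresh deployment: start from dependency resolution
    BOOTSTRAP,          // restarted before all bootstrap steps finished; run the rest
    KERNEL_ACTIVATION,  // restarted mid-deployment; finish activating the new config
    KERNEL_ROLLBACK     // activation failed; finish rolling back after the restart
}
```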

Greengrass marks the new deployment (or resumed deployment) as in progress, persists the deployment information to disk so that it can be resumed later, checks that the current Greengrass version is able to execute the deployment based on its required capabilities, copies artifacts and recipes for local deployments, and then begins executing the deployment task on a separate thread. DeploymentTask is implemented by DefaultDeploymentTask and by KernelUpdateDeploymentTask, which is used after restarting Nucleus to complete the deployment or roll it back.

Default deployment execution

Deployments begin by resolving which components will be deployed. This is done based on the "root" components provided directly in the deployment, plus all the root components deployed to the device by other "groups". We use the term "group" here because this is primarily about combining deployments from multiple IoT Thing Groups (a device can be a member of multiple groups), but it is more generically useful: we also treat deployments to the individual device (via Shadow) as a separate group, as well as local deployments. Deployments to the same physical device via different groups must not have version conflicts, or else the deployment that would cause a conflict is rejected. To put it differently: I can create a deployment to my individual device that deploys component A=1. I can also create a deployment to my device via a thing group that tries to deploy component A=2. But since both deployments target the same physical device, one of them is going to fail with a version conflict, because we cannot have both A=1 and A=2 on the same device at the same time. So while I was able to create those deployments, they will not both succeed. To resolve this, the conflict needs to be addressed either by deploying compatible versions (A=2 and A=2) or by removing the component version requirement from one of the groups (e.g. create a deployment to the individual device that does not contain component A at all, freeing the device to accept the thing group deployment with A=2).
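
The conflict can be reduced to a very small worked example. The sketch below pins exact versions for clarity; real deployments use semver requirement ranges, but the principle is the same: every group targeting the device must agree on a satisfiable version of each shared component.

```java
import java.util.HashMap;
import java.util.Map;

// Cross-group version conflict from the A=1 vs A=2 example, reduced to exact pins.
public class GroupConflictExample {
    public static void main(String[] args) {
        Map<String, String> thingDeployment = Map.of("A", "1.0.0"); // individual device (Shadow)
        Map<String, String> groupDeployment = Map.of("A", "2.0.0"); // thing group (Jobs)

        Map<String, String> merged = new HashMap<>(thingDeployment);
        groupDeployment.forEach((component, version) -> {
            String existing = merged.putIfAbsent(component, version);
            if (existing != null && !existing.equals(version)) {
                // This is the situation where one of the two deployments fails on the device.
                System.out.printf("Conflict on %s: %s vs %s%n", component, existing, version);
            }
        });
    }
}
```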

Dependency resolution

Dependencies are resolved such that all component version requirements from all groups on the device are satisfied. If this is impossible, the deployment fails with a dependency conflict error which will need to be fixed. Dependency resolution is implemented by DependencyResolver, which resolves the version of each component to deploy as well as the versions of its dependencies. It does this one component at a time. Because only one component is considered at a time, resolution may fail even though some combination of versions exists that could satisfy all requirements; we accept this to avoid complex and costly backtracking. Resolution starts by checking whether the currently installed version or a version in the local component store satisfies the requirements; otherwise it uses the cloud API ResolveComponentCandidates to resolve the version based on which versions exist. If the cloud API returns a valid response then that version is always used, even if the local version is "better" (this isn't ideal and should be configurable to allow for better local development). Assuming some version of the component satisfies the version and platform requirements, we persist some metadata including the recipe and continue resolving the rest of the components.
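
A heavily simplified sketch of that greedy, no-backtracking strategy is shown below. Version requirements are flattened to a list of acceptable candidates per component; the real DependencyResolver works with semver ranges, prefers a suitable installed or locally stored version, and otherwise calls ResolveComponentCandidates.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Greedy, one-component-at-a-time resolution sketch (no backtracking).
public class GreedyResolverSketch {
    // candidateVersions: for each component, the versions acceptable to every group
    // that requires it, best candidate first.
    public Map<String, String> resolve(Map<String, List<String>> candidateVersions) {
        Map<String, String> resolved = new HashMap<>();
        Deque<String> toResolve = new ArrayDeque<>(candidateVersions.keySet());
        while (!toResolve.isEmpty()) {
            String component = toResolve.poll();
            List<String> candidates = candidateVersions.get(component);
            if (candidates == null || candidates.isEmpty()) {
                // No backtracking: fail immediately, even if a different choice for an
                // earlier component might have made this one solvable.
                throw new IllegalStateException("No version satisfies requirements for " + component);
            }
            resolved.put(component, candidates.get(0));
            // A real resolver would also read the chosen version's recipe here and push
            // its dependencies onto toResolve with their requirements merged in.
        }
        return resolved;
    }
}
```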

Download config

After resolving components, the deployment downloads the full deployment configuration if it was larger than 7KB (Shadow) or 31KB (Jobs). Next, the deployment checks that all prerequisites are satisfied; currently this only applies when a component specifies a Docker artifact, in which case it checks that Docker is installed on the device. The deployment then moves to the "prepare packages" step, which downloads all artifacts for each resolved component version.

Prepare packages

While "preparing packages", we check that the artifact will not exceed the minimum disk space (20MB) or the user configured maximum size. It then sets the permissions correctly based on the artifact permission configuration from the recipe, unarchives the artifact if specified, and sets permissions again on the unarchived files.

Resolve configuration

The deployment now enters a critical phase: it "resolves" the new configuration to be applied. Recall from the configuration documentation that configuration is the source of truth for absolutely everything Greengrass is doing on the device, so resolving the correct configuration is essential. Using KernelConfigResolver we build a completely new configuration map which is later merged with the configuration the device is currently using. This is where all the interpolation happens to replace placeholders such as {iot:thingName} or {<component name>:configuration:<path>}. Between resolving the configuration and applying it, the "active" configuration may change, and those changes may be lost depending on which part of the configuration they are in. Suffice it to say that this is rather tricky to do right.
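
To make the interpolation step concrete, here is a toy placeholder replacement pass. The real KernelConfigResolver supports many more namespaces (component configuration, artifact paths, cross-component references) plus recursion limits; this only shows the basic idea using the {iot:thingName} placeholder mentioned above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy interpolation pass for placeholders like {iot:thingName}.
public class PlaceholderInterpolation {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^{}]+)}");

    public static String interpolate(String value, Map<String, String> lookups) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Unknown placeholders are left as-is instead of failing.
            String replacement = lookups.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> lookups = Map.of("iot:thingName", "MyCoreDevice");
        System.out.println(interpolate("--thing {iot:thingName}", lookups));
        // prints: --thing MyCoreDevice
    }
}
```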

resolveConfigurationToApply is the method which determines the final state of a component's configuration based on the component's defaults, its current state (if any), and the "MERGE" and "RESET" options. It works by creating a new Configuration object initialized with the component's current state, if the component is already deployed to the device. Any portions of the configuration covered by "RESET" are then removed, so that their current values will be reset to the defaults (or to "MERGE" values) in the next step. Component default values are merged into the configuration with a timestamp of 1, so that they are added if not present but never override any current value that is present. Then the "MERGE" configuration is merged in using the timestamp of when the deployment was created (not when the deployment reaches the device, but when it was created in AWS). This timestamp-based merge ensures that no matter what order deployments reach the device, the end state is always the same. If deployments A and B are created in that order but B reaches the device first for whatever reason, deployment A will not remove any config added by deployment B, because A's timestamp is older than the timestamp of the config applied by B.
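
Below is a minimal sketch of that timestamped merge, using the A/B ordering example from the paragraph above. The Entry/merge/reset names are illustrative; the real logic lives in the Nucleus configuration classes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a timestamped merge: a write only lands if its timestamp is not older than
// what is already stored. Defaults use timestamp 1; MERGE uses the deployment creation
// time, so out-of-order deployments converge to the same end state.
public class TimestampedConfigSketch {
    static final class Entry {
        final Object value;
        final long timestamp;
        Entry(Object value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    private final Map<String, Entry> config = new HashMap<>();

    void merge(String key, Object value, long timestamp) {
        Entry existing = config.get(key);
        if (existing == null || timestamp >= existing.timestamp) {
            config.put(key, new Entry(value, timestamp));
        }
    }

    void reset(String key) { config.remove(key); } // next merge of the default re-populates it

    public static void main(String[] args) {
        TimestampedConfigSketch c = new TimestampedConfigSketch();
        long deploymentACreated = 1_000, deploymentBCreated = 2_000; // A created before B
        c.merge("interval", 30, deploymentBCreated); // B happens to reach the device first
        c.merge("interval", 10, deploymentACreated); // A arrives later but is older: ignored
        System.out.println(c.config.get("interval").value); // prints 30
    }
}
```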

Merge in new configuration

Once we've resolved the new configuration, we need to apply that config to the device. DeploymentConfigMerger is responsible for doing this work correctly. It starts by getting a "deployment activator", which is either the default activator or the kernel update activator if bootstrapping is required. If the deployment is configured to notify components about deployments (and give them the opportunity to delay the deployment), we use the UpdateSystemPolicyService. The deployment then checks with components that have registered config validators, performs some basic sanity checks on the AWS region and endpoints, and calls activate() on the deployment activator.

Default activate

The default activator begins by taking a snapshot of the current config, to be used if rollback is requested. Then it calls updateConfiguration to apply the new configuration. The new configuration is merged with the existing configuration using some rules for what to merge and what to replace. After merging, we start up new services, replace "unloadable" services, and reinstall broken services. Then we wait for all services to be running, finished, or broken. This wait can take a long time and cannot currently be cancelled (it ought to be cancellable). For example, if a component's install blocks forever and the customer has set the install timeout to one hour, it will take at least 3 hours for the deployment to fail, because we retry the install 3 times before considering the component broken and the "wait for services to start" step has no timeout of its own.

Finally, we remove any services which this deployment removed, and complete the deployment successfully.

If any step along the way fails, it will roll back if rollback was requested. Rollback looks exactly like a deployment, except that we apply the "old" config saved before executing instead of the "new" config. Rollback similarly uses waitForServicesToStart, which has the same problems as roll-forward regarding timeouts and cancellation.

Kernel update activate

If a deployment requires a bootstrap, it goes through this path. Rollback is not configurable for kernel update deployments since they are considered high risk. This is not ideal; we should be able to offer rollback or no rollback depending on what broke (i.e. if Nucleus breaks, then roll back, but for anything else, do nothing).

Since rollback is not configurable, we start by taking a snapshot of the current config to roll back to. We then validate the "launch" directory setup so that we can be reasonably sure we can shut down and be restarted correctly. We then do a "soft shutdown" with a 30 second timeout. A soft shutdown closes all services and closes the configuration transaction log; we do this so that no more changes are written to the transaction log. We then update the configuration just as in the default activation, but since we closed the transaction log, the configuration changes only exist in memory and not on disk. If Greengrass died at this point and restarted, the changes would not be there, it would pick up as if this deployment had never happened, and it would attempt to execute the deployment again.

Now that the config changes are in memory, we write them to disk as a "target" configuration. This target will be loaded when we restart in the KERNEL_ACTIVATION phase. We then persist a list of bootstrap tasks to execute and set up the launch directory for bootstrap. With everything persisted properly, we execute each bootstrap step one at a time. If any bootstrap step requests a reboot or a Nucleus restart, we exit with the appropriate exit code to make that happen.

If anything went wrong along the way, it rolls back. Since we did a soft shutdown early on, we must restart in order to roll back and get everything running again.

Once Nucleus restarts, it determines which deployment stage it is in. This works by looking at the launch directory and checking which links exist and which files exist within the directories. If old is present then the stage is either KERNEL_ACTIVATION or BOOTSTRAP. It is BOOTSTRAP if Nucleus died before it could execute all bootstrap steps, or if any bootstrap step exited with a non-zero exit code other than 100 or 101, which have assigned meanings.

If it is in the BOOTSTRAP phase, it will again execute the bootstrap steps which were missed or errored and then restart. If bootstrapping fails again, it will roll back.

If it is instead in KERNEL_ACTIVATION or KERNEL_ROLLBACK, it adds the deployment to the deployment queue, where it will be executed by KernelUpdateDeploymentTask, described below.

"Launch" directory

The "launch" directory is /greengrass/v2/alts by default. It contains one or more symlinks which are called "old", "new", "current", "broken", and "init". SystemD (or equivalent) is configured to start Greengrass by using the "current" symlink. Greengrass can then point this symlink to different directories to easily perform upgrades and rollbacks simply by writing files to a new directory and then changing the link. This way, updates are safe because all the files are still present on disk; it is only that "current" may not point to them if a deployment is in progress.

On install, alts contains init, which is a directory, and current -> init. Inside init is distro, which is a link to the Nucleus artifact.

To do a kernel update, the launch directory is modified so that <deployment id> is a directory whose contents are copied from current, old -> <previous successful deployment id>, and current -> <deployment id>.

After a successful kernel update, the launch directory is modified so that old is deleted along with the directory it pointed to.

To prepare for a kernel update rollback, the launch directory is modified so that broken -> <deployment id> and current -> <previous deployment id>.

After a successful kernel update rollback, the directory that broken pointed at is deleted, and then broken itself is deleted.
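
The symlink flips described above can be sketched with plain java.nio calls. The paths and link names mirror the description; the copy of the current contents into the new deployment directory, error handling, and atomicity concerns are omitted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of flipping the launch directory symlinks for a kernel update and its rollback.
public class LaunchDirectoryFlip {
    static void pointSymlink(Path link, Path target) throws IOException {
        Files.deleteIfExists(link);            // remove the old link, not its target
        Files.createSymbolicLink(link, target);
    }

    public static void main(String[] args) throws IOException {
        Path alts = Path.of("/greengrass/v2/alts");
        Path newDeployment = alts.resolve("deployment-123");          // contents copied from current
        Path previous = Files.readSymbolicLink(alts.resolve("current"));

        // Kernel update: old -> previous successful deployment, current -> new deployment.
        pointSymlink(alts.resolve("old"), previous);
        pointSymlink(alts.resolve("current"), newDeployment);

        // Rollback preparation would instead do:
        // pointSymlink(alts.resolve("broken"), newDeployment);
        // pointSymlink(alts.resolve("current"), previous);
    }
}
```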

Kernel update deployment execution

KernelUpdateDeploymentTask executes during KERNEL_ACTIVATION and KERNEL_ROLLBACK. It calls waitForServicesToStart just like the default deployment task and then returns an appropriate result. If the task fails during KERNEL_ACTIVATION, it rolls back. If it fails during KERNEL_ROLLBACK, we can only log the failure and return a failure result.

Deployment status keeper

DeploymentStatusKeeper enables registering callbacks for deployment status events based on the deployment source (local, shadow, or jobs). It ensures that data about the deployment is properly persisted to disk and reported back to the source of the deployment (e.g. updating IoT Jobs so it knows the deployment failed with a specific error message).
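
As a rough sketch of the callback contract (not the exact Nucleus signature), a consumer receives the persisted details of a deployment and returns true once it has successfully reported them, so the keeper knows it does not need to retry.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of a status consumer: report the persisted deployment
// details to the deployment's source and return true on success.
public class StatusCallbackSketch {
    public static void main(String[] args) {
        Function<Map<String, Object>, Boolean> consumer = details -> {
            System.out.println("Deployment " + details.get("DeploymentId")
                    + " is now " + details.get("DeploymentStatus"));
            return true; // reported successfully; no need for the keeper to retry
        };
        consumer.apply(Map.of("DeploymentId", "deployment-123", "DeploymentStatus", "SUCCEEDED"));
    }
}
```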