Services

patrzhan edited this page Feb 23, 2024 · 10 revisions

Inside Greengrass Nucleus, each Greengrass component is implemented by the GreengrassService class, so internally components are known as "services". I'll try to be consistent and clear when talking about components versus services.

GreengrassService

The GreengrassService class is the base class for all services, including external components, internal services (e.g. DeploymentService), and lambdas.

GreengrassService does not implement the state machine that moves a service between lifecycle states. Instead, it is the interface for requesting state changes, and it implements the per-state logic such as install or shutdown. The state machine itself is separated out into the Lifecycle class.

GreengrassService details

Dependencies

When created, a GreengrassService begins by identifying its dependencies and configuring a callback for when the list of dependencies changes. The service then identifies any change to the list of dependencies. For any removed dependency, the dependency state listener is removed. All remaining existing or new dependencies are then configured with a dependency state listener. A dependency state listener receives state change events for all services and checks whether the event belongs to the dependency we're interested in. If that dependency is not in a good state and this service is currently running or starting up, the listener restarts this service. If all dependencies are in a healthy state, the listener also sends a notification to the dependencyReadyLock, which unblocks the lifecycle thread if it is waiting for dependencies to be ready before starting this service.
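The listener logic above can be sketched roughly as follows. This is a simplified model, not the Nucleus code: the class DependencySketch, its counters, and the choice of RUNNING and FINISHED as the "good" states are assumptions made for illustration.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a dependency state listener.
class DependencySketch {
    enum State { NEW, INSTALLED, STARTING, RUNNING, STOPPING, ERRORED, BROKEN, FINISHED }

    // Assumption: these states count as "good" for a dependency.
    static final Set<State> GOOD = EnumSet.of(State.RUNNING, State.FINISHED);

    private final Map<String, State> dependencies = new LinkedHashMap<>();
    private State ownState = State.INSTALLED;
    int restartRequests = 0;     // stands in for "restart this service"
    int readyNotifications = 0;  // stands in for notifying dependencyReadyLock

    DependencySketch(String... dependencyNames) {
        for (String name : dependencyNames) {
            dependencies.put(name, State.NEW);
        }
    }

    void setOwnState(State state) {
        ownState = state;
    }

    // Receives state changes for every service and ignores the ones we do not depend on.
    void onStateChange(String serviceName, State newState) {
        if (!dependencies.containsKey(serviceName)) {
            return;
        }
        dependencies.put(serviceName, newState);
        boolean active = ownState == State.RUNNING || ownState == State.STARTING;
        if (!GOOD.contains(newState) && active) {
            restartRequests++;
        }
        if (dependencies.values().stream().allMatch(GOOD::contains)) {
            readyNotifications++;
        }
    }
}
```

In this sketch, an event for an unrelated service is dropped, a healthy dependency produces a ready notification, and an unhealthy dependency only triggers a restart while the dependent service is STARTING or RUNNING.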

Lifecycle startup

After the dependencies are loaded in the constructor, the dependency injector calls postInject, where we initialize the lifecycle. This starts the lifecycle thread in preparation for getting the service running.

Lifecycle commands

Each service has a set of lifecycle commands: bootstrap(), install(), startup(), handleError(), shutdown(), and close().

bootstrap() is executed when the Nucleus performs a deployment, isBootstrapRequired(Map<String, Object>) is true for this service, and the Nucleus enters the BOOTSTRAP phase, in which no other services are running. Bootstrap is used to make significant changes to the OS and system-level packages. Two things make bootstrap special: (1) no other services are running at the same time, and (2) the bootstrap step can request either a Nucleus restart or a system reboot after executing.

install() is used to install any requirements, such as Python packages, before executing the main service lifecycle. Greengrass does not persistently track what state services are in, so a service's install() executes every time Greengrass starts up, even if everything was previously installed and running. install() runs in parallel for all services, disregarding any dependencies. Dependencies are only used to control when a service is allowed to enter the STARTING state. There is a default 120s timeout for install() to complete and transition to STARTING. If the timeout expires, Greengrass moves the service into the ERRORED state and retries the install.

startup() is used to get the service into the RUNNING state. A service should only report that it is in the RUNNING state when it is truly ready and dependencies could successfully use it at that point in time. There is a default 120s timeout to move to the RUNNING state. If the timeout expires then Greengrass will move the service into the ERRORED state and retry the startup.
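As a rough illustration of the timeout behavior described for install() and startup(), the sketch below runs a step with a deadline and reports ERRORED when it expires. TimeoutSketch, Outcome, and runWithTimeout are hypothetical names; the real Nucleus wiring is different and also handles the retry.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: run one lifecycle step with a deadline.
class TimeoutSketch {
    enum Outcome { COMPLETED, ERRORED }

    static Outcome runWithTimeout(Runnable step, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(step);
        try {
            future.get(timeout, unit);      // wait for the step, but not forever
            return Outcome.COMPLETED;
        } catch (TimeoutException e) {
            future.cancel(true);            // interrupt the step that ran too long
            return Outcome.ERRORED;
        } catch (ExecutionException e) {
            return Outcome.ERRORED;         // the step itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Outcome.ERRORED;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would pass the default of 120 seconds for install or startup, for example runWithTimeout(step, 120, TimeUnit.SECONDS), and retry the step when the result is ERRORED.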

handleError() is called when the service enters the ERRORED state. This method may be used to attempt some sort of error recovery procedure to prevent the error from reoccurring.

shutdown() is used to stop the service when the service is in the STOPPING state. There is a default 15s timeout to move to the FINISHED state from STOPPING. If the timeout expires then Greengrass will move the service into the ERRORED state and then move to the desired state or FINISHED if nothing else was desired.

close() is used to completely shut down a service when Nucleus is shutting down or when the component implemented by this service is being removed during a deployment. close() returns a future which completes when this service and all services that depend on it are closed.

Each service also has some query methods which services may choose to override: isBootstrapRequired(Map<String, Object>), shouldAutoStart(), and isBuiltin().

isBootstrapRequired(Map<String, Object>) executes during a deployment to determine, given a new service definition, if a bootstrap is needed.

shouldAutoStart() is used during deployments and Nucleus startup to avoid starting services that do not need to start. Its only use is for on-demand lambdas, which should only start when there is work for them to do.

isBuiltin() determines whether this implementor of GreengrassService is built into Nucleus. This is used to skip some logic, particularly during deployments, which would not apply to builtin services such as DeploymentService.
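To make the override points concrete, here is a minimal sketch of a service that overrides two of the query methods. ServiceSketch is a hypothetical stand-in for the real base class, its signatures are simplified assumptions, and the "kernelModuleChanged" key is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the base class; signatures are simplified assumptions.
abstract class ServiceSketch {
    final List<String> calls = new ArrayList<>();

    void install() { calls.add("install"); }
    void startup() { calls.add("startup"); }
    void shutdown() { calls.add("shutdown"); }
    void handleError() { calls.add("handleError"); }

    // Defaults chosen for the sketch: no bootstrap, start automatically, not builtin.
    boolean isBootstrapRequired(Map<String, Object> newServiceConfig) { return false; }
    boolean shouldAutoStart() { return true; }
    boolean isBuiltin() { return false; }
}

// An on-demand style service: it should not be started until there is work to do.
class OnDemandWorker extends ServiceSketch {
    @Override
    boolean shouldAutoStart() {
        return false;
    }

    @Override
    boolean isBootstrapRequired(Map<String, Object> newServiceConfig) {
        // "kernelModuleChanged" is a made-up key, used only to show a decision
        // based on the new service definition.
        return Boolean.TRUE.equals(newServiceConfig.get("kernelModuleChanged"));
    }
}
```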

Lifecycle

Lifecycle implements the state machine for services. There is one instance per service and each instance runs its own thread. This lifecycle thread executes the state machine by identifying what state the service is currently in, and what state it wants to get to, if it isn't already in the desired state.

To move between states, lifecycle tracks a list of desired states so that it can perform more complicated actions, like restarting, which require going through multiple states. In the restart case, the desired states would be FINISHED followed by RUNNING, so that the service first gets to the finished state, where it is no longer running, and then gets into the running state again.
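The restart case can be sketched with a small queue of desired states. This is a simplified model under stated assumptions: DesiredStateSketch is a hypothetical class, and each step jumps straight to the next desired state, collapsing the intermediate states (such as STOPPING) that the real state machine passes through.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical model of consuming a list of desired states in order.
class DesiredStateSketch {
    enum State { NEW, INSTALLED, STARTING, RUNNING, STOPPING, FINISHED }

    private State current;
    private final Deque<State> desiredStateList = new ArrayDeque<>();
    final List<State> reached = new ArrayList<>();

    DesiredStateSketch(State initial) {
        current = initial;
    }

    // A restart is expressed as two desired states: first stop, then run again.
    void requestRestart() {
        desiredStateList.clear();
        desiredStateList.addAll(Arrays.asList(State.FINISHED, State.RUNNING));
    }

    // One pass of the loop: drop the head if already reached, then move toward the next one.
    boolean step() {
        if (!desiredStateList.isEmpty() && desiredStateList.peekFirst() == current) {
            desiredStateList.pollFirst();
        }
        State target = desiredStateList.peekFirst();
        if (target == null) {
            return false; // nothing left to do
        }
        current = target; // assumption: intermediate states are skipped in this sketch
        reached.add(current);
        return true;
    }

    State current() {
        return current;
    }
}
```

Starting from RUNNING, calling requestRestart() and then step() until it returns false visits FINISHED and then RUNNING, leaving the desired list empty.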

The lifecycle thread will block on the stateEventQueue waiting for events such as a notification that we'd like to move the service to a different state.

Lifecycle details

initLifecycleThread begins the state machine by submitting a task to the executor. This task will run forever until the service is closed.

The task executes startStateTransition in a loop. startStateTransition itself is also a loop. It works by getting the current state of the service and then checking for any new desired state from the desiredStateList. If the service is currently in the desired state, then that state is removed from the desiredStateList so that the service can move to the next desired state (if any). Now that the service knows what state it is currently in and what state it desires to be in, it enters a switch based on the current state. For each possible current state, the service calls into handleCurrentState<X> or handleCurrentState<X>Async. That method is then responsible for taking the required action to get the service from the current state into the desired state.

After "handling" the current state, lifecycle enters another loop which determines when the lifecycle loop can execute again to handle the next state transition. To simplify a bit, this loop is what actually moves the service into the new state, but only when it is ready to do so, since some of the current state handlers are asynchronous. So that this isn't just a busy loop, lifecycle has a stateEventQueue on which it blocks while waiting for events. Some current state handlers set an asyncFinishAction, which is an AtomicReference<Predicate<Object>>. This is used to execute a task while the state transition is happening, for example a cleanup action such as canceling a timeout task.
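A minimal sketch of that waiting loop is shown below. EventQueueSketch and awaitTransition are hypothetical names, and the sketch assumes the predicate returns true when the transition may proceed; the real loop in Lifecycle does more than this.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

// Hypothetical model of blocking on an event queue instead of busy-waiting.
class EventQueueSketch {
    final BlockingQueue<Object> stateEventQueue = new LinkedBlockingQueue<>();

    // Assumption: the predicate returns true when the pending transition may finish.
    final AtomicReference<Predicate<Object>> asyncFinishAction =
            new AtomicReference<>(event -> true);

    // Blocks until an event arrives that the finish action accepts.
    // Returns null if the waiting thread is interrupted.
    Object awaitTransition() {
        try {
            while (true) {
                Object event = stateEventQueue.take(); // blocks while the queue is empty
                if (asyncFinishAction.get().test(event)) {
                    return event;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

A handler that needs cleanup (for example canceling a timeout task) could install a predicate that performs the cleanup and then returns true.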

State generation is an AtomicLong which is used to prevent race conditions in asynchronous tasks. The most important usage is for process execution timeouts. For example, when starting up, a service has 120s to get from STARTING to RUNNING and when this timeout expires, it needs to interrupt the service. This timeout is asynchronous, so it is possible that the timeout fires after the service is already in the RUNNING state. This is easy enough to handle by just checking the current state before interrupting any processes, but that is not sufficient. The service may have actually timed out and restarted by the time the timeout fires. This would then mean that we're interrupting the service because the first attempt timed out, but it is already onto its second attempt. To solve this problem, we use the state generation as a counter which is incremented every time the service enters NEW or RUNNING. To properly use the state generation, read the value before registering the asynchronous task and then compare that value to the current value when the task is executed. If they are different, then do not continue to execute the task as it is being run too late and the service has already restarted. See an example of correct usage in GenericExternalService.
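The read-then-compare pattern can be sketched as follows. This is an illustration only, not the code in GenericExternalService: StateGenerationSketch, beginAttempt, and newTimeoutTask are hypothetical names, and the interrupt is reduced to setting a flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of guarding an asynchronous timeout with a generation counter.
class StateGenerationSketch {
    private final AtomicLong stateGeneration = new AtomicLong();
    final AtomicBoolean interrupted = new AtomicBoolean(false);

    // Called each time the service begins a new attempt.
    long beginAttempt() {
        return stateGeneration.incrementAndGet();
    }

    // Reads the generation when the task is registered and compares it when the task runs.
    Runnable newTimeoutTask() {
        final long capturedGeneration = stateGeneration.get();
        return () -> {
            if (capturedGeneration != stateGeneration.get()) {
                return; // stale: the service has already moved on to another attempt
            }
            interrupted.set(true); // stands in for interrupting the service process
        };
    }
}
```

A timeout task registered during the first attempt does nothing if it fires after beginAttempt() has been called again, while a task registered during the current attempt still takes effect.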

States

Services can have the following states: STATELESS (this state is never used), NEW, INSTALLED, STARTING, RUNNING, STOPPING, ERRORED, BROKEN, FINISHED. A normal service would start as NEW, then INSTALLED, STARTING, RUNNING, FINISHED.

The BROKEN state means that Greengrass is giving up on restarting the service due to the service erroring 3 times within 1 hour. There is no way to opt out of this behavior. A service will get out of BROKEN if Greengrass restarts or the service moves itself to NEW (reinstallation). An external service will reinstall itself when the version, install script, runwith, or resource limits change or when any other lifecycle changes and the service is in BROKEN state.
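The "3 errors within 1 hour" rule can be modeled as a sliding window over error timestamps. The sketch below is one possible way to express it, not the Nucleus implementation: ErrorWindowSketch and recordError are hypothetical, and treating an error exactly one hour old as still inside the window is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window counter for deciding when to give up on a service.
class ErrorWindowSketch {
    static final int MAX_ERRORS = 3;
    static final Duration WINDOW = Duration.ofHours(1);

    private final Deque<Instant> errors = new ArrayDeque<>();

    // Records an error at the given time; returns true if the service should become BROKEN.
    boolean recordError(Instant now) {
        errors.addLast(now);
        Instant cutoff = now.minus(WINDOW);
        // Forget errors that happened before the start of the window.
        while (!errors.isEmpty() && errors.peekFirst().isBefore(cutoff)) {
            errors.pollFirst();
        }
        return errors.size() >= MAX_ERRORS;
    }
}
```

Three errors spaced ten minutes apart would trip the limit on the third call, while errors spread over more than an hour would not.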
