Make it possible to reactivate a draining worker node by brybacki · Pull Request #24444 · trinodb/trino

brybacki · 2024-12-11T11:06:44Z

Description

Adds new node states to enable full control over shutdown and reactivation of workers.

state: DRAINING - a reversible shutdown,
state: DRAINED - all tasks are finished, server can be safely and quickly stopped. Can still go back to ACTIVE.

For backwards compatibility reasons and to cooperate with shutdown initiated manually or by kubernetes two new states are added DRAINING and DRAINED. The state SHUTTING_DOWN is left as is, without changes. With two new states it is now possible to have a fine grained control over worker shutdown. Trino worker can go from DRAINING or DRAINED to ACTIVE.

Any tool (for example a kubernetes operator) can now initiate the draining phase before executing a shutdown. Worker immediately goes to a DRAINING phase and goes through almost all the steps that the graceful shutdown requires. During the DRAINNIG state the worker can transition back to ACTIVE if requested to do so. Upon finishing all the tasks it does not terminate the thread pools and connections (like SHUTTING_DOWN), but transitions to a DRAINED state. In this state it can be safely and quickly terminated by being requested to go to SHUTTING_DOWN or it can transition back to ACTIVE (also by request).

Below is a state transition diagram (with sources in plantuml):

@startuml
    [*] --> ACTIVE
    note "state INACTIVE is not used internally\nis only used when the service is unavailable " as a
    ACTIVE --> SHUTTING_DOWN : shutdown
    ACTIVE --> DRAINING : drain
    DRAINING --> ACTIVE: reactivate
    DRAINING --> DRAINED
    DRAINING --> SHUTTING_DOWN : gracefulShutdown
    DRAINED --> ACTIVE: reactivate
    DRAINED --> SHUTTING_DOWN : terminate
    SHUTTING_DOWN --> [*]
@enduml

Additional context and related issues

It solves the issue: #9976, and is similar to previous PRs (but has a smaller scoper and is simpler, does not add new logic to manage workers to coordinator):

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Add new node states(`DRAINING`, `DRAINED`) to make it possible to reactivate a draining worker node ({issue}`24444 `)

losipiuk · 2024-12-11T17:48:51Z

Did you think of using StateMachine as implementation block here. Like we do in many places. E.g QueryStateMachine

I would low level method as this one as private and expose specific transitionTo... methods publically

Regarding the low level method method and exposing specific transitionTo... methods publicly. This will require adding the switch ...case in the io.trino.server.ServerInfoResource#updateState. Do you think it brings us any value?

Regarding the StateMachine - I have not seen it before. Interesting idea, but would require me learning how to use it, and rewriting NodeStateManager.

Is it much better or easier to use StateMachine instead of hand written code in NodeStateManager? Can it be done later as a refactoring task?

The thing is, the code presented here is already in production environments and is well tested and battle-proven. Any changes would require careful analysis and testing.

StateMachine is kinda trivial. It is just a building block wrapping atomic with a state with nicer interface. It also adds tooling for attaching state change listeners, but you do not need that here I guess

A refactor - rename to prepare for adding new logic.

Adds new node states to enable full control over shutdown and reactivation of workers. - state: DRAINING - a reversible shutdown, - state: DRAINED - all tasks are finished, server can be safely and quickly stopped. Can still go back to ACTIVE.

mosabua · 2024-12-16T22:19:53Z

Couple of questions:

Does this show up in the UI or CLI or somewhere?
Is this a breaking change?
Do we need to add it as a state to any docs?

mosabua · 2024-12-16T22:21:34Z

Personally I think we should add for these states somewhere..

losipiuk · 2024-12-17T09:16:10Z

I do not think it is visible anywhere outside of internal communication between coordinator and workers (/v1/info/state). So far we were not documenting those.
And should not be breaking anything - was built with that in mind.

brybacki · 2024-12-17T09:47:59Z

Couple of questions:

Thank you for your questions. These are all valid concerns.

Does this show up in the UI or CLI or somewhere?

Are we showing SHUTTING_DOWN anywhere in UI or CLI? I can verify that. I know that it can show in SELECT * FROM system.runtime.nodes;. But this is just a string/varchar.

Is this a breaking change?

It is not a breaking change. To observe new STATE you need to explicitly force the transition by calling a worker API /v1/info/state. The standard shutdown hook only executes transition to SHUTTING_DOWN.

If someone executes the transition to a new DRAINING state, then he/she should be aware that a new state can be observed by any tool that uses coordinator or worker API's for observing node state or lifecycle management.

Do we need to add it as a state to any docs?

No idea, do we have such docs? So far @losipiuk thinks we do not need to update any docs.

mosabua · 2024-12-17T17:44:07Z

Thanks @losipiuk and @brybacki - in the longer run I want to add this to the docs but for now we can leave it out. Have a look at the release notes entry .. and I will also get review from @martint on that part

cla-bot bot added the cla-signed label Dec 11, 2024

brybacki requested review from losipiuk and wendigo December 11, 2024 11:07