Skip to content

Make it possible to reactivate a draining worker node#24444

Merged
losipiuk merged 2 commits intotrinodb:masterfrom
brybacki:brybacki/worker-drain-reactivation
Dec 16, 2024
Merged

Make it possible to reactivate a draining worker node#24444
losipiuk merged 2 commits intotrinodb:masterfrom
brybacki:brybacki/worker-drain-reactivation

Conversation

@brybacki
Copy link
Copy Markdown
Contributor

@brybacki brybacki commented Dec 11, 2024

Description

Adds new node states to enable full control over shutdown and reactivation of workers.

  • state: DRAINING - a reversible shutdown,
  • state: DRAINED - all tasks are finished, server can be safely and quickly stopped. Can still go back to ACTIVE.

For backwards compatibility reasons and to cooperate with shutdown initiated manually or by kubernetes two new states are added DRAINING and DRAINED. The state SHUTTING_DOWN is left as is, without changes. With two new states it is now possible to have a fine grained control over worker shutdown. Trino worker can go from DRAINING or DRAINED to ACTIVE.

Any tool (for example a kubernetes operator) can now initiate the draining phase before executing a shutdown. Worker immediately goes to a DRAINING phase and goes through almost all the steps that the graceful shutdown requires. During the DRAINNIG state the worker can transition back to ACTIVE if requested to do so. Upon finishing all the tasks it does not terminate the thread pools and connections (like SHUTTING_DOWN), but transitions to a DRAINED state. In this state it can be safely and quickly terminated by being requested to go to SHUTTING_DOWN or it can transition back to ACTIVE (also by request).

Below is a state transition diagram (with sources in plantuml):
NodeState

@startuml
    [*] --> ACTIVE
    note "state INACTIVE is not used internally\nis only used when the service is unavailable " as a
    ACTIVE --> SHUTTING_DOWN : shutdown
    ACTIVE --> DRAINING : drain
    DRAINING --> ACTIVE: reactivate
    DRAINING --> DRAINED
    DRAINING --> SHUTTING_DOWN : gracefulShutdown
    DRAINED --> ACTIVE: reactivate
    DRAINED --> SHUTTING_DOWN : terminate
    SHUTTING_DOWN --> [*]
@enduml

Additional context and related issues

It solves the issue: #9976, and is similar to previous PRs (but has a smaller scoper and is simpler, does not add new logic to manage workers to coordinator):

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Add new node states(`DRAINING`, `DRAINED`) to make it possible to reactivate a draining worker node ({issue}`24444 `)

@cla-bot cla-bot bot added the cla-signed label Dec 11, 2024
@brybacki brybacki requested review from losipiuk and wendigo December 11, 2024 11:07
Comment thread core/trino-main/src/main/java/io/trino/metadata/DiscoveryNodeManager.java Outdated
Comment thread core/trino-main/src/main/java/io/trino/metadata/NodeState.java Outdated
Comment thread core/trino-main/src/main/java/io/trino/server/NodeStateManager.java Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you think of using StateMachine as implementation block here. Like we do in many places. E.g QueryStateMachine

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would low level method as this one as private and expose specific transitionTo... methods publically

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the low level method method and exposing specific transitionTo... methods publicly. This will require adding the switch ...case in the io.trino.server.ServerInfoResource#updateState. Do you think it brings us any value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the StateMachine - I have not seen it before. Interesting idea, but would require me learning how to use it, and rewriting NodeStateManager.

Is it much better or easier to use StateMachine instead of hand written code in NodeStateManager? Can it be done later as a refactoring task?

The thing is, the code presented here is already in production environments and is well tested and battle-proven. Any changes would require careful analysis and testing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StateMachine is kinda trivial. It is just a building block wrapping atomic with a state with nicer interface. It also adds tooling for attaching state change listeners, but you do not need that here I guess

Comment thread core/trino-main/src/main/java/io/trino/server/NodeStateManager.java Outdated
Comment thread core/trino-main/src/main/java/io/trino/server/NodeStateManager.java Outdated
Comment thread core/trino-main/src/main/java/io/trino/server/NodeStateManager.java Outdated
@brybacki brybacki requested a review from sopel39 December 12, 2024 14:10
A refactor - rename to prepare for adding new logic.
Adds new node states to enable full control over shutdown and reactivation of workers.
- state: DRAINING - a reversible shutdown,
- state: DRAINED - all tasks are finished, server can be safely and quickly stopped. Can still go back to ACTIVE.
@brybacki brybacki force-pushed the brybacki/worker-drain-reactivation branch from 92b0165 to 80fee59 Compare December 13, 2024 14:01
@losipiuk losipiuk merged commit e369f66 into trinodb:master Dec 16, 2024
@github-actions github-actions bot added this to the 468 milestone Dec 16, 2024
@mosabua
Copy link
Copy Markdown
Member

mosabua commented Dec 16, 2024

Couple of questions:

Does this show up in the UI or CLI or somewhere?
Is this a breaking change?
Do we need to add it as a state to any docs?

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Dec 16, 2024

Personally I think we should add for these states somewhere..

@losipiuk
Copy link
Copy Markdown
Member

I do not think it is visible anywhere outside of internal communication between coordinator and workers (/v1/info/state). So far we were not documenting those.
And should not be breaking anything - was built with that in mind.

@brybacki
Copy link
Copy Markdown
Contributor Author

Couple of questions:

Thank you for your questions. These are all valid concerns.

Does this show up in the UI or CLI or somewhere?

Are we showing SHUTTING_DOWN anywhere in UI or CLI? I can verify that. I know that it can show in SELECT * FROM system.runtime.nodes;. But this is just a string/varchar.

Is this a breaking change?

It is not a breaking change. To observe new STATE you need to explicitly force the transition by calling a worker API /v1/info/state. The standard shutdown hook only executes transition to SHUTTING_DOWN.

If someone executes the transition to a new DRAINING state, then he/she should be aware that a new state can be observed by any tool that uses coordinator or worker API's for observing node state or lifecycle management.

Do we need to add it as a state to any docs?

No idea, do we have such docs? So far @losipiuk thinks we do not need to update any docs.

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Dec 17, 2024

Thanks @losipiuk and @brybacki - in the longer run I want to add this to the docs but for now we can leave it out. Have a look at the release notes entry .. and I will also get review from @martint on that part

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants