
[native] Add watchdog to detach the worker if an operator call is stuck for too long #21783

Merged

pranjalssh merged 2 commits into prestodb:master from Yuhta:tasks/T175424805/D52921877 on Jan 29, 2024

Conversation

Contributor

@Yuhta Yuhta commented Jan 25, 2024

Also detect deadlock or starving and signal alerts if these happen.
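
For intuition, a minimal sketch of the general idea (not the actual code added by this PR): a background watchdog thread periodically checks how long each running operator call has been on a driver thread and raises an alert once a call exceeds a limit; in the real feature this is also where the worker would be detached. All names, types, and thresholds below are illustrative assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Illustrative sketch only: names, structure, and thresholds are assumptions,
// not the code added by this PR. Each operator call records when it entered a
// driver thread; a background watchdog periodically looks for calls that have
// been running past a limit.
struct RunningCall {
  std::atomic<std::chrono::steady_clock::time_point> enteredAt{
      std::chrono::steady_clock::now()};
};

class OperatorCallWatchdog {
 public:
  OperatorCallWatchdog(
      std::vector<const RunningCall*> calls,
      std::chrono::seconds stuckLimit)
      : calls_(std::move(calls)),
        stuckLimit_(stuckLimit),
        thread_([this] { run(); }) {}

  ~OperatorCallWatchdog() {
    stop_ = true;
    thread_.join();
  }

 private:
  void run() {
    while (!stop_) {
      const auto now = std::chrono::steady_clock::now();
      for (const auto* call : calls_) {
        if (now - call->enteredAt.load() > stuckLimit_) {
          // The real feature would signal an alert here and detach the worker
          // so the coordinator stops sending it new work.
          std::cerr << "Stuck operator call detected\n";
        }
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  }

  const std::vector<const RunningCall*> calls_;  // registration/locking omitted
  const std::chrono::seconds stuckLimit_;
  std::atomic<bool> stop_{false};
  std::thread thread_;
};
```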

@Yuhta Yuhta requested a review from a team as a code owner January 25, 2024 17:00
Contributor

I know we discussed this internally. But the shutdown state is specifically for when the server gets a SIGTERM. This breaks the semantics of NodeState and can confuse the coordinator with a wrong state. Other options could be marking the worker unhealthy or disabling its announcement. Did we observe some queries ending successfully when the worker got stuck?

Contributor

@amitkdutta

We didn't have enough data to observe whether some queries would finish successfully, as we have only seen this phenomenon 5 times, and each time late enough that everything was already blocked.
However, I don't see why this couldn't be the case.

Since INACTIVE has that special meaning/limitation, I'm actually starting to see the SHUTTING_DOWN state as a good candidate to put the worker into, in order to isolate the rest of the cluster from the problem.
I think about it like this:

  1. We detected a service-breaking issue.
  2. Our first reflex was to restart and continue.
  3. But then we thought: what about debugging? And what about some queries that can still finish successfully?

So putting ourselves in the SHUTTING_DOWN state and waiting for an engineer to debug looks very logical in this light.
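
To make the option being discussed concrete, a purely hypothetical sketch of the detach action; the NodeState values and the setNodeState() hook are assumptions, not the worker's real API or the code in this PR.

```cpp
#include <glog/logging.h>

// Purely hypothetical illustration of the choice discussed above; these enum
// values and the setNodeState() hook are assumptions.
enum class NodeState { kActive, kInActive, kShuttingDown };

struct WorkerControl {
  // Assumed hook into the worker's announcement/state logic.
  void setNodeState(NodeState state) {
    state_ = state;
  }
  NodeState state_{NodeState::kActive};
};

// On a detected stuck operator call, move the worker to the shutting-down
// state instead of restarting it: the coordinator stops assigning new work,
// queries that can still finish get the chance to, and the process stays
// alive so an engineer can attach a debugger.
void detachStuckWorker(WorkerControl& worker) {
  LOG(ERROR) << "Operator call stuck past the deadline; detaching worker.";
  worker.setNodeState(NodeState::kShuttingDown);
  // Intentionally no exit(): keep the process around for debugging.
}
```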

@amitkdutta
Contributor

Thanks @Yuhta for working on this critical feature.

@Yuhta Yuhta force-pushed the tasks/T175424805/D52921877 branch 2 times, most recently from 188f275 to 9f91fe4 on January 25, 2024 23:32
@Yuhta Yuhta requested a review from a team as a code owner January 25, 2024 23:32
@Yuhta Yuhta requested a review from presto-oss January 25, 2024 23:32
@Yuhta Yuhta force-pushed the tasks/T175424805/D52921877 branch 2 times, most recently from 0e906bb to c9ffc61 on January 26, 2024 16:47
Contributor

@spershin spershin left a comment

Looks good!
Thanks!

@Yuhta Yuhta force-pushed the tasks/T175424805/D52921877 branch 2 times, most recently from 61e7957 to 8006e39 on January 26, 2024 22:40
Jimmy Lu added 2 commits January 29, 2024 07:18
[native] Add watchdog to detach the worker if an operator call is stuck for too long

Also detect deadlock or starving and signal alerts if these happen.
@Yuhta Yuhta force-pushed the tasks/T175424805/D52921877 branch from 8006e39 to 097e970 on January 29, 2024 15:18
@pranjalssh pranjalssh self-requested a review January 29, 2024 18:25
Contributor

@pranjalssh pranjalssh left a comment

LGTM

@pranjalssh pranjalssh merged commit 0ddb2c1 into prestodb:master Jan 29, 2024
Contributor

@xiaoxmeng xiaoxmeng left a comment

@Yuhta some minors. Thanks!

      asyncDataCache,
-     velox::connector::getAllConnectors());
+     velox::connector::getAllConnectors(),
+     this,
Contributor

If we provide the Presto server here, do we still need to provide the cache and driver executor? Can we get those members from the Presto server object?

Contributor Author

These are not exposed as public. Do we want to expose them?
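
For illustration, a hypothetical sketch of the alternative raised here: exposing the cache and driver executor through public accessors on PrestoServer, so a callee that already receives the server would not need them passed separately. The accessor names and member types are assumptions, not the current API.

```cpp
#include <memory>

#include <folly/executors/CPUThreadPoolExecutor.h>

#include "velox/common/caching/AsyncDataCache.h"

// Hypothetical accessors only; names and member types are assumptions, not
// the actual PrestoServer interface.
class PrestoServer {
 public:
  facebook::velox::cache::AsyncDataCache* asyncDataCache() const {
    return asyncDataCache_.get();
  }

  folly::CPUThreadPoolExecutor* driverExecutor() const {
    return driverExecutor_.get();
  }

 private:
  std::shared_ptr<facebook::velox::cache::AsyncDataCache> asyncDataCache_;
  std::unique_ptr<folly::CPUThreadPoolExecutor> driverExecutor_;
};
```

The trade-off under discussion is then between widening PrestoServer's public surface and continuing to pass each dependency explicitly.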
