[native] Add watchdog to detach the worker if an operator call is stuck for too long #21783
I know we discussed this internally, but the shutdown state is specifically for when the server gets a SIGTERM. This breaks the semantics of NodeState and can confuse the coordinator with a wrong state. Other options are marking the worker unhealthy or disabling announcement. Did we observe some queries ending successfully when a worker got stuck?
We didn't have enough data to observe whether some queries would finish successfully: we have seen this phenomenon exactly 5 times, and each time late enough that everything was already blocked.
However, I don't see why this might not be the case.
Because INACTIVE has that special, weird meaning/limitation, I am actually starting to see SHUTTING_DOWN as a good state to put the worker into, to isolate the whole cluster from the problem.
I think about it like this:
- We detected a service-breaking issue.
- Our first reflex was 'restart' and continue.
- But then we thought: what about debugging? And what about some queries that can still finish successfully?
So putting ourselves in the SHUTTING_DOWN state and waiting for an engineer to debug looks very logical in this light.
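To make the proposal concrete, here is a minimal sketch of one watchdog tick, assuming the worker tracks the start time of each in-flight operator call. All names (`NodeState`, `OperatorCall`, `countStuckCalls`, `watchdogTick`) are hypothetical illustrations, not the Presto or Velox API, and the threshold is an arbitrary parameter.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the worker's node state; not the real enum.
enum class NodeState { kActive, kInactive, kShuttingDown };

// Hypothetical record of one in-flight operator call.
struct OperatorCall {
  std::chrono::steady_clock::time_point start;  // when the call began
};

// Counts in-flight operator calls that have run longer than `threshold`.
inline std::size_t countStuckCalls(
    const std::vector<OperatorCall>& calls,
    std::chrono::steady_clock::time_point now,
    std::chrono::milliseconds threshold) {
  std::size_t stuck = 0;
  for (const auto& call : calls) {
    if (now - call.start > threshold) {
      ++stuck;
    }
  }
  return stuck;
}

// One watchdog tick: an active worker with at least one stuck call moves to
// the shutting-down state so it stops receiving new work. The process is not
// restarted, so it can still be debugged and running queries may still finish.
inline NodeState watchdogTick(
    NodeState current,
    const std::vector<OperatorCall>& calls,
    std::chrono::steady_clock::time_point now,
    std::chrono::milliseconds threshold) {
  if (current == NodeState::kActive &&
      countStuckCalls(calls, now, threshold) > 0) {
    return NodeState::kShuttingDown;
  }
  return current;
}
```

The sketch only changes the advertised state; whether the coordinator treats that state correctly is exactly the concern raised in the review comment above.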
Thanks @Yuhta for working on this critical feature.
...e-execution/src/test/java/com/facebook/presto/nativeworker/PrestoNativeQueryRunnerUtils.java
presto-native-execution/presto_cpp/main/PeriodicTaskManager.cpp
…ck for too long Also detect deadlock or starving and signal alerts if these happen.
  asyncDataCache,
- velox::connector::getAllConnectors());
+ velox::connector::getAllConnectors(),
+ this,
If we pass the Presto server here, do we still need to pass the cache and the driver executor? Can't we get those members from the Presto server object?
These are not exposed as public. Do we want to expose them?
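For reference, a rough sketch of what exposing those members could look like, assuming simple read-only accessors. `Cache`, `Executor`, `Server`, and `TaskManager` are hypothetical stand-ins, not the real classes, and the actual members and ownership model may differ.

```cpp
#include <string>

// Hypothetical stand-ins for the cache and the driver executor.
struct Cache {
  std::string name;
};
struct Executor {
  int numThreads;
};

// Hypothetical server that owns nothing here and only exposes its members.
class Server {
 public:
  Server(Cache* cache, Executor* executor)
      : cache_(cache), executor_(executor) {}

  // Read-only accessors that the suggestion would require making public.
  Cache* cache() const { return cache_; }
  Executor* driverExecutor() const { return executor_; }

 private:
  Cache* cache_;
  Executor* executor_;
};

// Hypothetical manager that takes only the server instead of each member.
class TaskManager {
 public:
  explicit TaskManager(Server* server) : server_(server) {}

  int driverThreads() const { return server_->driverExecutor()->numThreads; }
  const std::string& cacheName() const { return server_->cache()->name; }

 private:
  Server* server_;
};
```

The trade-off is a shorter constructor signature against a wider public surface on the server and a tighter coupling between the two classes.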