Enable Trino to detect and fail queries that have tasks stuck in long running JONI parsing. #12392
phd3 merged 2 commits into trinodb:master
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
losipiuk
left a comment
A few editorial comments, but my major concern is that there is no synchronization between the code that interrupts split runner threads and the threads themselves. It looks very much like we may interrupt a thread that has already moved on to another split.
We also have a warning about runaway splits, which is currently hardcoded. IMO we should validate that the two thresholds are coherent (i.e. warning <= timeout), which would require making the warning configurable too.
Got it, I will make "LONG_SPLIT_WARNING_THRESHOLD" configurable.
Agreed, we should enforce warning <= timeout: if a split's thread runtime exceeds the timeout, the system has already interrupted the split, so a warning after that point would be pointless.
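The coherence check suggested in this thread could look roughly like the sketch below. `SplitThresholds` and its field names are hypothetical and are not Trino's actual configuration class or property names; the point is only that a warning threshold above the timeout is rejected at construction time.

```java
import java.time.Duration;

// Hypothetical holder for the two settings discussed in this thread.
// The names are illustrative only and do not mirror Trino's config classes.
public final class SplitThresholds
{
    private final Duration warningThreshold;
    private final Duration timeout;

    public SplitThresholds(Duration warningThreshold, Duration timeout)
    {
        // A warning that would only fire after the split has already been
        // timed out can never be useful, so reject that combination.
        if (warningThreshold.compareTo(timeout) > 0) {
            throw new IllegalArgumentException(
                    "warning threshold " + warningThreshold + " must be <= timeout " + timeout);
        }
        this.warningThreshold = warningThreshold;
        this.timeout = timeout;
    }

    public Duration getWarningThreshold()
    {
        return warningThreshold;
    }

    public Duration getTimeout()
    {
        return timeout;
    }
}
```

Validating in the constructor means an incoherent pair fails at startup rather than silently producing a warning that is never logged.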
I also feel "/v1/maxActiveSplits" is unnecessary, since this PR will print the runaway splits in the log.
Is the 1s based on SPLIT_RUN_QUANTA? If so, let's just use that.
Hey, I found that this issue is very similar to #7213, where we want to limit the total time for SQLQueryExecution. In that case the thread is reused between queries; in our case the thread is reused between tasks. In either case, I abstracted the following programming pattern:
From my personal understanding, the AtomicReference could also be replaced by the volatile keyword.
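A minimal sketch of one way to guard such an interrupt, offered as an illustration and not as the code from this PR or from #7213: the worker registers the unit of work it is running, and the watchdog interrupts only if that same unit is still registered. The class `GuardedInterrupt` and its method names are hypothetical. The sketch uses a monitor rather than an AtomicReference, because the identity check and the interrupt have to happen as a single step for the thread not to slip onto new work in between.

```java
// Hypothetical guard shared by one worker thread and a watchdog thread.
public final class GuardedInterrupt
{
    private Thread runner;
    private Object work;

    // Called by the worker thread just before it starts a unit of work.
    public synchronized void begin(Object newWork)
    {
        runner = Thread.currentThread();
        work = newWork;
    }

    // Called by the worker thread once the unit of work is done.
    public synchronized void end()
    {
        runner = null;
        work = null;
    }

    // Called by the watchdog. Interrupts the worker only if it is still
    // running the unit of work the watchdog observed earlier. Because this
    // method and end() share one lock, the interrupt cannot be delivered
    // after the worker has already deregistered that work.
    public synchronized boolean interruptIfStillRunning(Object observedWork)
    {
        if (runner != null && work == observedWork) {
            runner.interrupt();
            return true;
        }
        return false;
    }
}
```

A worker that finishes normally just after being interrupted would still need to clear its interrupt flag, for example with `Thread.interrupted()`, before picking up new work.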
jhlodin
left a comment
Docs LGTM, they just need to be word-wrapped at 80 characters.
arhimondr
left a comment
Generally looks good to me % nits.
Please update the commit message according to the guideline: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages
nit: Store the task id and split info directly.
Please respond (let me know if you missed it or are not going to address it)
Discussed offline. The split info cannot be cached as it changes as the split execution progresses.
Hi, I have addressed all the comments. I also changed the minimum value for the configs. Because the check runs asynchronously, a wall-time-based check is inaccurate when the timeout is on a similar scale to the time quota. It can only detect splits that run significantly longer than a split's 1 second time quota.
Please respond (let me know if you missed it or are not going to address it)
Merged, thanks!
Description
Enable Trino to detect and fail tasks that are stuck in long running JONI parsing.
Fix
Query Engine
Trino is a time-shared, multi-tenant system. For context switching, Trino relies on thread pool workers cooperatively yielding to the scheduling logic. Unlike an operating system, it does not use a hard interrupt signal to force a context switch.
Most split processing finishes within a relatively short interval; JONI processing is an exception.
Furthermore, some Trino features, such as gathering statistics, rely on a callback (splitFinished(), executed after the yield) that runs after the context switch. Since no context switch happens in this case, Trino cannot gather accurate CPU usage for that split.
This PR allows Trino to detect and fail tasks that are stuck in long-running JONI parsing.
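As a rough illustration of the detection idea described above, and not the code merged in this PR, the sketch below records when each split started running on a worker thread and reports those that have not yielded within a timeout. The class `StuckSplitDetector`, its method names, and the use of string split ids are all assumptions made for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical wall-time watchdog for cooperatively scheduled splits.
public final class StuckSplitDetector
{
    // Split id -> timestamp (in nanoseconds) at which it started running.
    private final Map<String, Long> startNanosBySplit = new ConcurrentHashMap<>();
    private final long timeoutNanos;

    public StuckSplitDetector(Duration timeout)
    {
        this.timeoutNanos = timeout.toNanos();
    }

    // Worker thread: the split was handed a thread and began processing.
    public void splitStarted(String splitId, long nowNanos)
    {
        startNanosBySplit.put(splitId, nowNanos);
    }

    // Worker thread: the split yielded back to the scheduler or finished.
    public void splitYielded(String splitId)
    {
        startNanosBySplit.remove(splitId);
    }

    // Watchdog thread: splits that have held a thread longer than the timeout.
    public List<String> findStuck(long nowNanos)
    {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Long> entry : startNanosBySplit.entrySet()) {
            if (nowNanos - entry.getValue() > timeoutNanos) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real watchdog would more likely read `System.nanoTime()` on a periodic schedule and then fail the tasks that own the reported splits.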
Documentation
( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:
Enable Trino to detect and fail queries that have tasks stuck in long running JONI parsing.