Fix infinite loop in non-legacy SqlQueryScheduler#14879
rschlussel merged 2 commits into prestodb:master
Queries that don't have table scan inputs could enter an infinite loop in the scheduler if they hit the queryStateMachine.isDone() check and had any sections ready for execution. If the query is done at this point (e.g. canceled or abandoned), we abort the section. Since "aborted" is a failed state, the section would still be considered "ready for execution" the next time through the loop, and the loop would continue indefinitely. We need to explicitly exit the scheduling loop if the query is finished. Queries with inputs didn't have this issue because the split scheduler would be closed when the query finished, so the scheduler would hit an error when it tried to create the scheduler for the scan stages.
            .collect(toImmutableList()));
    if (queryStateMachine.isDone()) {
        sectionExecutions.forEach(SectionExecution::abort);
        break;
I assume if we clean up the executionSchedules list here, we should also be able to break the loop via the condition from lines 276-278? :)
No, the problem is that the next time through the loop we'll consider this section "ready for execution" again, because the most recent attempt is in a failed state and this might be a retry.
wenleix left a comment
LGTM. Thanks for the fix!
What about other failures?
Yes, all failures are considered "ready for execution". The infinite loop arises because, if the query is in a finished state and any sections are considered ready for execution (e.g. their most recent attempt failed or was aborted, or they haven't been scheduled yet), those sections will continue to be created and immediately aborted in a loop.
The test failure is related to #14882.
Queries that don't have table scan inputs could enter an infinite loop in the scheduler if they hit the queryStateMachine.isDone() check and had any sections ready for execution. If the query is done at this point (e.g. canceled or abandoned), we abort the section. Since "aborted" is a failed state, the section would still be considered "ready for execution" the next time through the loop, and the loop would continue indefinitely. We need to explicitly exit the scheduling loop if the query is finished.
Queries with inputs didn't have this issue because the split scheduler would be closed when the query finished, so the scheduler would hit an error when it tried to create the scheduler for the scan stages.
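To make the failure mode concrete, here is a toy model of the loop described above. It is only a sketch and not the actual SqlQueryScheduler code: the class SchedulerLoopSketch, the Section and AttemptState types, and the schedule method with its breakWhenDone and maxIterations parameters are all hypothetical names invented for illustration. The iteration cap exists only so the unfixed variant terminates in the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the scheduling loop; all names here are hypothetical.
final class SchedulerLoopSketch {
    // Simplified stand-in for the state of a section's most recent attempt.
    enum AttemptState { NOT_STARTED, FINISHED, ABORTED }

    static final class Section {
        AttemptState lastAttempt = AttemptState.NOT_STARTED;

        // Never scheduled, or last attempt failed/aborted (so it may be retried).
        boolean isReadyForExecution() {
            return lastAttempt == AttemptState.NOT_STARTED
                    || lastAttempt == AttemptState.ABORTED;
        }
    }

    // Returns how many times the loop body ran, capped at maxIterations.
    static int schedule(List<Section> sections, boolean queryDone,
                        boolean breakWhenDone, int maxIterations) {
        int iterations = 0;
        while (iterations < maxIterations) {
            List<Section> ready = new ArrayList<>();
            for (Section section : sections) {
                if (section.isReadyForExecution()) {
                    ready.add(section);
                }
            }
            if (ready.isEmpty()) {
                break; // nothing left to schedule
            }
            iterations++;
            if (queryDone) {
                // Aborting leaves each section in a failed state, so it is
                // "ready for execution" again on the next pass.
                ready.forEach(section -> section.lastAttempt = AttemptState.ABORTED);
                if (breakWhenDone) {
                    break; // the explicit exit discussed in this PR
                }
                continue;
            }
            ready.forEach(section -> section.lastAttempt = AttemptState.FINISHED);
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Without the break, the loop only stops at the artificial cap.
        System.out.println(schedule(Arrays.asList(new Section()), true, false, 1000));
        // With the break, it exits after a single pass.
        System.out.println(schedule(Arrays.asList(new Section()), true, true, 1000));
    }
}
```

In this model, a finished query with one unscheduled section spins until the cap when breakWhenDone is false, and exits after one pass when it is true.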
I don't have a test because I was only able to reproduce the issue by explicitly calling queryStateMachine.transitionToFailed() here so we would enter the if statement. Recommendations about how to test this are appreciated.