Skip to content

Fix infinite loop in non-legacy SqlQueryScheduler#14879

Merged
rschlussel merged 2 commits intoprestodb:masterfrom
rschlussel:fix-scheduler-loop
Jul 27, 2020
Merged

Fix infinite loop in non-legacy SqlQueryScheduler#14879
rschlussel merged 2 commits intoprestodb:masterfrom
rschlussel:fix-scheduler-loop

Conversation

@rschlussel
Copy link
Contributor

Queries that don't have table scan inputs could enter an infinite loop inthe scheduler if they hit the queryStateMachine.isDone() check and had any sections ready for execution. If the query is done (e.g. canceled or abandoned at this point), we abort the section. Since "aborted" is a failed state, the section would continue to be considered "ready for execution" in our next time through, and the loop would continue indefinitely. We need to explicitly exit the scheduling loop if the query is finished.

Queries with inputs didn't have this issue because the split scheduler would be closed when the query finished, so the scheduler would hit an error when it tried to create the scheduler for the scan stages.

I don't have a test because I was only able to reproduce the issue by explicitly calling queryStateMachine.transitionToFailed() here so we would enter the if statement. Recommendations about how to test this are appreciated.

== RELEASE NOTES ==
General Changes
* Fix potential infinite loop when the setting ``use_legacy_scheduler`` is set to false.

Queries that don't have table scan inputs could enter an infinite loop in
the scheduler if they hit the queryStateMachine.isDone() check and had any
sections ready for execution. If the query is done (e.g. canceled or
abandoned at this point), we abort the section.  Since "aborted" is
a failed state, the section would continue to be considered "ready for
execution"  in our next time through, and the loop would continue indefitely.
We need to explicitly exit the scheduling loop if the query is finished.

Queries with inputs didn't have this issue because the split scheduler
would be closed when the query finished, so the scheduler would hit an
error when it tried to create the scheduler for the scan stages.
.collect(toImmutableList()));
if (queryStateMachine.isDone()) {
sectionExecutions.forEach(SectionExecution::abort);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume if we clean up executionSchedules list here, we should also be able to break the loop via the condition form 276-278? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the problem is that next time through the loop we'll consider this section "ready for execution" again because the most recent attempt is in a failed state and this might be a retry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see

Copy link
Contributor

@wenleix wenleix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix!

@wenleix
Copy link
Contributor

wenleix commented Jul 24, 2020

Since "aborted" is a failed state, the section would continue to be considered "ready for execution" in our next time through, and the loop would continue indefinitely. We need to explicitly exit the scheduling loop if the query is finished.

What about other failures?

@rschlussel
Copy link
Contributor Author

Since "aborted" is a failed state, the section would continue to be considered "ready for execution" in our next time through, and the loop would continue indefinitely. We need to explicitly exit the scheduling loop if the query is finished.

What about other failures?

Yes, all failures are considered "ready for execution", but the infinite loop comes because if the query is in a finished state and there are any sections considered ready for execution (e.g. their most recent attempt failed or was aborted, or they haven't been scheduled yet), then those sections will continue to be created and immediately aborted in a loop.

@rschlussel
Copy link
Contributor Author

test failure is related to #14882.

@rschlussel rschlussel merged commit 23082e8 into prestodb:master Jul 27, 2020
@caithagoras caithagoras mentioned this pull request Jul 28, 2020
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants