Interrupt runaway splits #16111

Merged
highker merged 1 commit into prestodb:master from aweisberg:interrupt_runaway_splits
May 26, 2021

@aweisberg
Contributor

@aweisberg aweisberg commented May 17, 2021

Test plan - Included unit test, deploy to a cluster and test with a known bad query

We do use thread interruption as part of the query cancellation mechanism, so it is probably safer than I thought it was in some earlier discussions.

What this is really implementing is eager split termination, where we don't honor the user-specified query timeout. Because of that, I am limiting this to only killing a split that looks like it is blocked in Joni. We can add more if we find other common culprits.

@arhimondr brought up the possibility of edge cases like JDBC, where an expensive query might not yield a page for 10 minutes as it tries to generate a small number of rows from a system that isn't very fast. Even with Hive + HDFS this could occur for needle-in-a-haystack queries. I think a targeted approach of interrupting allow-listed things is a good place to start.
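
The targeted approach described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `RunawaySplitWatchdog`, `splitStarted`, `splitFinished`, and `scanOnce` are hypothetical names, and the allow-list is passed in as a plain predicate over the thread's stack trace.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch: interrupt a split thread only when it has run past the
// timeout AND its current stack trace matches an allow-listed condition.
class RunawaySplitWatchdog {
    private final long timeoutNanos;
    private final Predicate<StackTraceElement[]> allowListed;
    // Thread running a split -> System.nanoTime() when the split started.
    private final Map<Thread, Long> startedAt = new ConcurrentHashMap<>();

    RunawaySplitWatchdog(long timeoutNanos, Predicate<StackTraceElement[]> allowListed) {
        this.timeoutNanos = timeoutNanos;
        this.allowListed = allowListed;
    }

    void splitStarted(Thread thread) {
        startedAt.put(thread, System.nanoTime());
    }

    void splitFinished(Thread thread) {
        startedAt.remove(thread);
    }

    // One monitoring pass; returns how many threads were interrupted.
    int scanOnce() {
        long now = System.nanoTime();
        int interrupted = 0;
        for (Map.Entry<Thread, Long> entry : startedAt.entrySet()) {
            boolean overdue = now - entry.getValue() >= timeoutNanos;
            // Splits that are slow for other reasons (e.g. a slow JDBC source)
            // do not match the predicate and are left alone.
            if (overdue && allowListed.test(entry.getKey().getStackTrace())) {
                entry.getKey().interrupt();
                interrupted++;
            }
        }
        return interrupted;
    }
}
```

In a real deployment `scanOnce` would presumably be driven by a scheduled executor; that wiring is omitted here.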

== RELEASE NOTES ==

General Changes
* Add interruption for runaway splits blocked in known situations controlled by ``task.interrupt-runaway-splits-timeout`` property which defaults to ``600s``.

@aweisberg aweisberg force-pushed the interrupt_runaway_splits branch from cee4828 to d3978a4 Compare May 17, 2021 23:12
Member

@arhimondr arhimondr left a comment

Some comments

@aweisberg
Contributor Author

aweisberg commented May 18, 2021

I didn't add reviewers to this because it isn't ready.

@aweisberg
Contributor Author

This does work in practice 20210518_201931_00000_y9r9j

@aweisberg aweisberg force-pushed the interrupt_runaway_splits branch 5 times, most recently from 0235e3f to 4c3cab2 Compare May 18, 2021 23:10
@aweisberg aweisberg requested review from arhimondr and highker May 19, 2021 00:21
@aweisberg
Contributor Author

OK, open season!

@aweisberg aweisberg force-pushed the interrupt_runaway_splits branch 2 times, most recently from e734eff to 4782de6 Compare May 19, 2021 02:04

We can remove option in the param and put Option.DEFAULT directly in the body.

Contributor Author

This commit is going to be landed by @kaikalur in #16109
I'll ask for it there.

why not config.getInterruptRunawaySplitsTimeout() directly?

Contributor Author

This is a test constructor and there is no accessible TaskManagerConfig I think? Or you mean instantiate the config and then request the value? I can do that if it is what you are thinking.

Yes, given this is for test only, we can do that

Instead of a blacklist, shall we use TaskManager::getAllTaskInfo to decide if a task is still alive and tracked to decide if we want to preempt the current thread?

Contributor Author

If the task is no longer alive then the threads have already been interrupted so there is maybe no point in interrupting them again. So to some extent just making the Joni regex functions interruptible is already a win because it will make query cancellation and timeout work better.

What this brings to the table is that in specific allow-listed cases we will interrupt the splits with a timeout that is different from (shorter than) the full query timeout, which may be in the many-hours range. The idea is that this is a step towards more automatically and aggressively remediating runaway splits.
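
For context, `Thread.interrupt()` only sets a flag on a thread that is busy computing, so CPU-bound code has to poll that flag to be interruptible. A minimal sketch of the idea, assuming a stand-in loop rather than Joni's actual matcher (`InterruptibleWork` and `spin` are hypothetical names):

```java
// Hypothetical sketch: a CPU-bound loop that periodically checks the
// interrupt flag so that cancellation, timeouts, or a watchdog can stop it.
class InterruptibleWork {
    // Power of two so the check below can use a cheap bit mask.
    static final int CHECK_INTERVAL = 1024;

    static long spin(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += i ^ (acc >>> 3); // stand-in for real matching work
            // Poll only every CHECK_INTERVAL iterations to keep overhead low.
            if ((i & (CHECK_INTERVAL - 1)) == 0 && Thread.currentThread().isInterrupted()) {
                // isInterrupted() leaves the flag set, so callers can still observe it.
                throw new IllegalStateException("interrupted after " + i + " iterations");
            }
        }
        return acc;
    }
}
```

Without a check like this, interrupting the thread has no effect until the loop finishes on its own.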

@aweisberg aweisberg force-pushed the interrupt_runaway_splits branch from 4782de6 to 473288a Compare May 19, 2021 11:22
@highker highker self-requested a review May 20, 2021 07:39

@highker highker left a comment

nits only

Comment on lines +219 to +216

If these defaults are not going to be changed, I guess we don't need to pass them in through parameters? Just directly use them in the body of the class?

Contributor Author

Different ones are passed in for unit tests so the tests can complete quickly.
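
The pattern under discussion looks roughly like the sketch below: production code uses the defaults, while a second constructor lets tests inject short values. All names here are hypothetical, and the 60-second check interval is an assumed value for illustration; only the 600s timeout default comes from the release note.

```java
import java.time.Duration;

// Hypothetical sketch of default values that tests can override.
class SplitMonitor {
    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(600);
    static final Duration DEFAULT_CHECK_INTERVAL = Duration.ofSeconds(60); // assumed value

    private final Duration timeout;
    private final Duration checkInterval;

    // Production path: defaults only.
    SplitMonitor() {
        this(DEFAULT_TIMEOUT, DEFAULT_CHECK_INTERVAL);
    }

    // Test path: short durations so the unit test completes quickly.
    SplitMonitor(Duration timeout, Duration checkInterval) {
        this.timeout = timeout;
        this.checkInterval = checkInterval;
    }

    Duration getCheckInterval() {
        return checkInterval;
    }

    boolean isRunaway(Duration elapsed) {
        return elapsed.compareTo(timeout) >= 0;
    }
}
```

A test could then construct `new SplitMonitor(Duration.ofMillis(10), Duration.ofMillis(1))` instead of waiting minutes for the default timeout to elapse.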

Yes, given this is for test only, we can do that

Comment on lines 241 to 239

same here; we could avoid all defaults through parameters

Contributor Author

Same thing RE making the unit test run quickly

@aweisberg aweisberg force-pushed the interrupt_runaway_splits branch from 473288a to a607931 Compare May 24, 2021 14:21
@aweisberg aweisberg requested a review from highker May 24, 2021 14:22
@aweisberg
Contributor Author

@highker I am not 100% clear on what you are asking for with removing default parameters that aren't in TaskConfig? If it's not parameterized how should the unit test change the value so the test completes quickly?

Contributor

@viczhang861 viczhang861 left a comment

typo in release note task.task.interrupt-runaway-splits-timeout

@aweisberg aweisberg force-pushed the interrupt_runaway_splits branch 2 times, most recently from fce61da to ac0e0e7 Compare May 24, 2021 21:16
@highker highker self-assigned this May 25, 2021
@highker

highker commented May 25, 2021

There is a conflict; maybe rebase and we can merge this PR?

@aweisberg aweisberg force-pushed the interrupt_runaway_splits branch from ac0e0e7 to e7c97d7 Compare May 25, 2021 18:53
@aweisberg aweisberg requested a review from viczhang861 May 25, 2021 22:28
Contributor

@viczhang861 viczhang861 left a comment

Check property name in release note.

@highker highker merged commit 8e8a546 into prestodb:master May 26, 2021
@jainxrohit jainxrohit mentioned this pull request Jun 5, 2021
4 tasks
@ajaygeorge ajaygeorge mentioned this pull request Jun 9, 2021
4 tasks
BlueStalker pushed a commit to BlueStalker/presto that referenced this pull request Oct 20, 2021
Summary:
OS link for Ref: prestodb#16111

The OS change interrupts long-running splits stuck in JoniRegexpFunctions. Internally we encounter the same issue in HashBuilderOperator.buildLookupSource, so it is added as an additional interrupt condition.
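
An allow-list with more than one interrupt condition could be expressed as a check over stack frames, as in this sketch. It is illustrative only: `InterruptAllowList`, `Entry`, and `shouldInterrupt` are hypothetical names, and matching on the simple class name (rather than the fully qualified name) is an assumption made to keep the example short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: decide whether a thread's stack trace matches any
// allow-listed class/method pair.
class InterruptAllowList {
    static final class Entry {
        final String simpleClassName;
        final String methodName; // null matches any method of the class

        Entry(String simpleClassName, String methodName) {
            this.simpleClassName = simpleClassName;
            this.methodName = methodName;
        }

        boolean matches(StackTraceElement frame) {
            String className = frame.getClassName();
            // Strip the package; nested or anonymous classes are not handled here.
            String simple = className.substring(className.lastIndexOf('.') + 1);
            return simple.equals(simpleClassName)
                    && (methodName == null || methodName.equals(frame.getMethodName()));
        }
    }

    static final List<Entry> ENTRIES = Arrays.asList(
            new Entry("JoniRegexpFunctions", null),
            new Entry("HashBuilderOperator", "buildLookupSource"));

    static boolean shouldInterrupt(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            for (Entry entry : ENTRIES) {
                if (entry.matches(frame)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```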

Reviewers: #ldap_presto-core, sgurmeet

Reviewed By: #ldap_presto-core, sgurmeet

Subscribers: sgurmeet, O4263 subscribe to presto changes

JIRA Issues: PRESTO-3695

Differential Revision: https://code.uberinternal.com/D6427367