Enforce task count limit by viczhang861 · Pull Request #13228 · prestodb/presto

viczhang861 · 2019-08-14T19:28:00Z

Too many tasks cause stability issues such as frequent GC. When a query exceeds an unreasonably high task count threshold, fail that single query to keep the cluster in a healthy state.

Concern: the task count check is performed with a delay of one second. Is it possible for multiple queries to generate a lot of tasks (e.g., more than 100K) in less than one second?
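To make the timing concern concrete, here is a minimal sketch of a periodic check, assuming it is scheduled with a fixed one-second delay on a ScheduledExecutorService. PeriodicTaskLimitCheck, its supplier, and the violation counter are hypothetical names for illustration, not the code in this PR. Each pass only sees the task count at the moment it runs, so anything created between two passes goes unnoticed until the next one.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical sketch, not the PR's implementation.
final class PeriodicTaskLimitCheck
{
    private final IntSupplier totalRunningTasks;
    private final int maxTotalRunningTaskCount;
    private final AtomicInteger violations = new AtomicInteger();

    PeriodicTaskLimitCheck(IntSupplier totalRunningTasks, int maxTotalRunningTaskCount)
    {
        this.totalRunningTasks = totalRunningTasks;
        this.maxTotalRunningTaskCount = maxTotalRunningTaskCount;
    }

    // One enforcement pass. It observes a snapshot of the task count, so
    // tasks created after this call are invisible until the next pass.
    void checkOnce()
    {
        if (totalRunningTasks.getAsInt() > maxTotalRunningTaskCount) {
            // A real implementation would fail a query here; this sketch only counts.
            violations.incrementAndGet();
        }
    }

    int getViolations()
    {
        return violations.get();
    }

    // Runs checkOnce() repeatedly, waiting one second between passes (assumed interval).
    // The caller is responsible for shutting the executor down.
    ScheduledExecutorService start()
    {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(this::checkOnce, 1, 1, TimeUnit.SECONDS);
        return executor;
    }
}
```

With this shape, the overshoot is bounded only by how many tasks the scheduler can create within one interval, which is the risk the concern describes.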

Test: tested in production, and the expensive query was successfully killed.

== RELEASE NOTES ==

General Changes
* Add configuration parameters experimental.max-total-running-task-count and experimental.max-query-running-task-count to control the maximum number of running tasks across all queries and for a single query, respectively

aweisberg

This is a good idea, thanks for making it happen.

This is missing a unit test that demonstrates the policy behaviors of the configuration options are implemented correctly and propagate the desired error message. It would have found the format string issue ;-)

aweisberg · 2019-08-15T22:46:19Z

presto-main/src/main/java/com/facebook/presto/execution/QueryManagerConfig.java

The config description mentions this one is soft, so presumably the other is a hard limit? What do soft and hard mean?

aweisberg · 2019-08-15T22:57:18Z

presto-main/src/main/java/com/facebook/presto/execution/QueryTracker.java

This won't kill a query when maxTotalRunningTaskCount is violated if highestRunningTaskCount <= maxQueryRunningTaskCount, which doesn't fit what I would expect the contract for those configuration values to be.

Is this what you mean by soft limit? They are allowed to exceed maxQueryRunningTaskCount as long as the cluster as a whole doesn't exceed maxTotalRunningTaskCount? This behavior could be clearer from the description.

Correct, description is updated.
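The behavior described in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in QueryTracker: TaskLimitPolicy, selectQueryToKill, and the map of query IDs to running task counts are hypothetical. A query is chosen only when the cluster-wide total exceeds maxTotalRunningTaskCount and the query with the most running tasks also exceeds maxQueryRunningTaskCount.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the policy discussed in this thread.
final class TaskLimitPolicy
{
    private TaskLimitPolicy() {}

    // Returns the ID of the query to fail, or empty if nothing should be killed.
    static Optional<String> selectQueryToKill(
            Map<String, Integer> runningTasksByQuery,
            int maxTotalRunningTaskCount,
            int maxQueryRunningTaskCount)
    {
        int totalRunningTaskCount = 0;
        int highestRunningTaskCount = 0;
        String queryWithHighestTaskCount = null;

        for (Map.Entry<String, Integer> entry : runningTasksByQuery.entrySet()) {
            int count = entry.getValue();
            totalRunningTaskCount += count;
            if (count > highestRunningTaskCount) {
                highestRunningTaskCount = count;
                queryWithHighestTaskCount = entry.getKey();
            }
        }

        // Soft per-query limit: exceeding it alone is tolerated.
        // The query is failed only when the cluster as a whole is also over its limit.
        if (totalRunningTaskCount > maxTotalRunningTaskCount
                && highestRunningTaskCount > maxQueryRunningTaskCount) {
            return Optional.ofNullable(queryWithHighestTaskCount);
        }
        return Optional.empty();
    }
}
```

For example, with a total limit of 1000 and a per-query limit of 500, a query running 600 tasks on an otherwise quiet cluster is left alone, and three queries running 400 tasks each are also left alone even though the total is 1200.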

what's the purpose of the soft limit? If we're out of resources won't we always want to kill something?

presto-main/src/main/java/com/facebook/presto/execution/QueryTracker.java

presto-main/src/main/java/com/facebook/presto/execution/SqlQueryManager.java

aweisberg · 2019-08-15T23:11:46Z

presto-main/src/main/java/com/facebook/presto/execution/TrackingRemoteTaskFactory.java

Can this go negative on some state transitions, such as when it goes from PLANNED to anything other than RUNNING?

Very good point, updated this logic to be safe
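One way to keep such a counter from going negative, shown here as a sketch rather than as the fix that was actually applied, is to remember per task whether it was ever counted as running and to decrement only in that case. RunningTaskCounter and TaskListener are hypothetical names, and the TaskState enum is a local stand-in declared for this example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the logic in TrackingRemoteTaskFactory.
final class RunningTaskCounter
{
    // Local stand-in for the task states mentioned in the review.
    enum TaskState { PLANNED, RUNNING, FINISHED, CANCELED, ABORTED, FAILED }

    private final AtomicLong runningTasks = new AtomicLong();

    long getRunningTasks()
    {
        return runningTasks.get();
    }

    TaskListener newTaskListener()
    {
        return new TaskListener();
    }

    // One listener per task; it tracks whether this task contributed to the count.
    final class TaskListener
    {
        private final AtomicBoolean countedAsRunning = new AtomicBoolean();

        void stateChanged(TaskState newState)
        {
            if (newState == TaskState.RUNNING) {
                // Increment at most once per task.
                if (countedAsRunning.compareAndSet(false, true)) {
                    runningTasks.incrementAndGet();
                }
            }
            else if (newState != TaskState.PLANNED) {
                // Terminal state: decrement only if this task was counted,
                // so PLANNED -> FAILED leaves the counter untouched.
                if (countedAsRunning.compareAndSet(true, false)) {
                    runningTasks.decrementAndGet();
                }
            }
        }
    }
}
```

The compare-and-set on the per-task flag also makes repeated notifications of the same state harmless.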

presto-main/src/main/java/com/facebook/presto/execution/SqlQueryExecution.java

Tracking task count to identify expensive query

rschlussel · 2019-08-19T14:50:19Z

presto-main/src/main/java/com/facebook/presto/execution/QueryTracker.java

Change the message to be more about the query that was killed, similar to the cluster oom error. Something like:

"Query killed because the cluster is overloaded with too many tasks and this query was running the highest number of tasks. Please try again in a few minutes."

aweisberg

LGTM.

rschlussel · 2019-08-20T14:44:59Z

presto-main/src/main/java/com/facebook/presto/execution/QueryTracker.java

what's the purpose of the soft limit? If we're out of resources won't we always want to kill something?

rschlussel · 2019-08-20T14:51:12Z

presto-main/src/main/java/com/facebook/presto/execution/QueryTracker.java

nit: typo. excceds -> exceeds

Yes, if the cluster is not busy, let the query run.

rschlussel · 2019-08-20T15:26:48Z

presto-main/src/main/java/com/facebook/presto/execution/QueryTracker.java

could we do this in the SqlQueryManager where all the other limits (memory, cpu) are enforced?

alternatively, could we do it during scheduling before exceeding the task limit in the first place?

Currently both QueryTracker and SqlQueryManager enforce some limits (time, memory, cpu).
QueryTracker.java was added later, in October 2018:
https://github.com/prestodb/presto/commits/fe743c0b1836343d428a8d07fba11691cbf29541/presto-main/src/main/java/com/facebook/presto/execution/QueryTracker.java
They both use the same Executor, so I assume I should use the new one.

@arhimondr mentioned that some tasks are dynamically generated. A follow-up task I am investigating is to not create new query executions when the cluster is overloaded. This needs more caution, so I want to collect enough production statistics before deciding on a threshold.

It sounds like the abstraction between SqlQueryManager and QueryTracker is a bit fuzzy. You can leave it as is for now, and we can look at refactoring to make the abstraction clearer.

@rschlussel, @viczhang861

"QueryTracker.java is added later in October 2018"

Correct. Query timeout tracking was refactored into QueryTracker through #11518, but memory and CPU enforcement is still in SqlQueryManager.

@wenleix, yeah, I saw that, but it's not clear to me what unites the things that go in QueryTracker vs. SqlQueryManager, or what the purpose of each is.

QueryTracker right now does a few things:

* keeps track of query info and history and removes queries when they expire
* kills queries that exceed a time limit
* kills abandoned queries

SqlQueryManager does some other stuff:

* creates queries and other lifecycle operations on the query
* kills queries that exceed a memory limit
* kills queries that exceed a cpu limit
* is the thing that calls the query tracker to do its things

@rschlussel: Yeah, unfortunately resource management in Presto is a bit scattered right now. It might be worth a refactor to clean up the code and to get a better understanding of how it works. cc @oerling, @mbasmanova, @viczhang861, @bhhari

BTW: isn't memory limit enforcement done in ClusterMemoryManager?

Yes, but that's called from SqlQueryManager.enforceMemoryLimits().

viczhang861 requested review from arhimondr, aweisberg, cemcayiroglu, rschlussel and wenleix August 14, 2019 19:28

facebook-github-bot added the CLA Signed label Aug 14, 2019

aweisberg suggested changes Aug 15, 2019

Add current and peak running tasks to query state machine

654e733

Tracking task count to identify expensive query

viczhang861 force-pushed the task-limit branch 3 times, most recently from e44e36d to f83e337 August 16, 2019 16:26

viczhang861 requested a review from aweisberg August 16, 2019 18:51

rschlussel reviewed Aug 19, 2019

aweisberg approved these changes Aug 19, 2019

viczhang861 force-pushed the task-limit branch from f83e337 to c8583ed August 20, 2019 00:32

rschlussel reviewed Aug 20, 2019

rschlussel approved these changes Aug 20, 2019

Fail query that has highest number of tasks

ff49687

When cluster is overloaded with too many running tasks, kills the most expensive query whose task count exceeds a custom-configured threshold

viczhang861 force-pushed the task-limit branch from c8583ed to ff49687 August 20, 2019 21:24

rschlussel merged commit fbf9777 into prestodb:master Aug 21, 2019

caithagoras mentioned this pull request Aug 29, 2019

Release notes for 0.225 #13212

Closed
