Add listener model for TaskThresholdMemoryRevokingScheduler #15505

Merged
highker merged 1 commit into prestodb:master from sachdevs:listener-TaskThresholdMemoryRevokingScheduler
Jan 7, 2021

Conversation

@sachdevs
Contributor

@sachdevs sachdevs commented Dec 7, 2020

Test plan: deployed to a cluster, ran Travis CI, wrote tests.

Previously, TaskThresholdMemoryRevokingScheduler had only a polling-based model for detecting when a task allocates too much memory. This change introduces a listener-based model, which removes the wait for the polling thread to be scheduled: revoking is scheduled immediately.
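The difference between the two models can be sketched roughly as follows. This is a hedged, self-contained illustration with hypothetical names (`TaskRevocationListener`, `MemoryPoolSketch`, `onTaskMemoryReserved`), not the actual Presto API: the pool notifies a listener on every reservation, so the threshold check runs inline with the allocation instead of waiting for a background polling thread to wake up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListenerSketch
{
    // hypothetical listener interface, not Presto's actual API
    interface TaskRevocationListener
    {
        void onTaskMemoryReserved(String taskId, long taskReservedBytes);
    }

    static class MemoryPoolSketch
    {
        private final Map<String, Long> reservedByTask = new HashMap<>();
        private final List<TaskRevocationListener> listeners = new ArrayList<>();

        void addTaskRevocationListener(TaskRevocationListener listener)
        {
            listeners.add(listener);
        }

        void reserve(String taskId, long bytes)
        {
            long total = reservedByTask.merge(taskId, bytes, Long::sum);
            // notify immediately; no waiting for a polling interval to elapse
            for (TaskRevocationListener listener : listeners) {
                listener.onTaskMemoryReserved(taskId, total);
            }
        }
    }

    public static void main(String[] args)
    {
        long maxRevocableMemoryPerTask = 100;
        List<String> revocationsScheduled = new ArrayList<>();

        MemoryPoolSketch pool = new MemoryPoolSketch();
        pool.addTaskRevocationListener((taskId, bytes) -> {
            if (bytes > maxRevocableMemoryPerTask) {
                revocationsScheduled.add(taskId); // a real scheduler would start revoking here
            }
        });

        pool.reserve("task-1", 60); // 60 <= 100: nothing happens
        pool.reserve("task-1", 60); // 120 > 100: revoking is scheduled at once
        System.out.println(revocationsScheduled); // prints [task-1]
    }
}
```

A polling model would instead scan `reservedByTask` on a timer, so a task could sit over its threshold for up to a full polling interval before revoking starts.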

== RELEASE NOTES ==

General Changes
* Add listener-based revocation model for spilling strategy PER_TASK_MEMORY_THRESHOLD

@sachdevs sachdevs requested a review from highker December 7, 2020 21:33
Contributor Author

Not too sure what we think of this way of alerting the TaskThresholdMemoryRevokingScheduler. It's not ideal, but it is the cleanest way I could think of; the alternative is passing the listener around somehow instead of piggybacking on MemoryPool. What I don't like about this is that MemoryPool now has two disparate sets of listeners, and technically this is a task/operator-context allocation.

@sachdevs
Contributor Author

sachdevs commented Dec 7, 2020

I also thought about adding the listener to SqlTaskManager but that required a pretty large refactor since OperatorContext does not have access to SqlTask/SqlTaskExecution/SqlTaskManager so I didn't end up taking that route.


@highker highker left a comment


Maybe add a simple application of this use case so we can tell whether the design makes sense. So far I think it's mostly legit, but I would love to see it running with some operators.

@sachdevs
Contributor Author

sachdevs commented Dec 8, 2020

Okay cool, I will go forward with adding tests that demonstrate this then.

@sachdevs sachdevs force-pushed the listener-TaskThresholdMemoryRevokingScheduler branch 2 times, most recently from fb9494c to f1ddbd5 Compare December 16, 2020 23:32
@sachdevs sachdevs requested a review from highker December 16, 2020 23:33
@sachdevs sachdevs changed the title [WIP] Add listener model for TaskThresholdMemoryRevokingScheduler Add listener model for TaskThresholdMemoryRevokingScheduler Dec 16, 2020

@highker highker left a comment


Overall LGTM

Comment on lines 43 to 50

Why not just keep SqlTaskManager here directly?

Contributor Author


Testing: without parameterizing it, it is difficult to make this part of the code modular and testable.

Contributor Author


See @VisibleForTesting constructor for reference

@highker highker self-assigned this Dec 21, 2020
@sachdevs sachdevs force-pushed the listener-TaskThresholdMemoryRevokingScheduler branch 2 times, most recently from bc11850 to c6bb33a Compare January 6, 2021 01:52
@pettyjamesm
Contributor

Related pending changes: #15569

Be careful about changing memory-revoking behaviors, in light of the fact that memory revoking has mostly been disabled by accident since its inception.

@sachdevs
Contributor Author

sachdevs commented Jan 6, 2021

I saw your PR earlier @pettyjamesm - can you further explain what you mean by "mostly been disabled"? I figured your PR mostly removes the over-counting caused by a query repeatedly counting memory from all of its other tasks. Wouldn't that mean revoking is triggered more frequently, rather than being "disabled"?

@pettyjamesm
Contributor

pettyjamesm commented Jan 6, 2021

> I saw your PR earlier @pettyjamesm - can you further explain what you mean by "mostly been disabled"? I figured your PR mostly removes the over-counting caused by a query repeatedly counting memory from all of its other tasks. Wouldn't that mean revoking is triggered more frequently, rather than being "disabled"?

The trigger to attempt revoking has been hyperactive, but actual revoking has been underactive because of the over-counting. Very little revoking has ever occurred in typical production environments, because it was effectively short-circuited by the over-counting and early-termination logic.

@sachdevs
Contributor Author

sachdevs commented Jan 6, 2021

I see what is going on. For each task, we first get its root QueryContext and visit its children, so every task belonging to the same QueryContext visits and triggers revoking on the same tasks each time. This explains why this works mostly fine in prod/testing when only aggregation spilling is enabled, but is definitely a problem with join spilling especially. It is worth testing without this PR to see whether there is still any merit to the listener model after incorporating #15569.
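The over-counting described above can be made concrete with a toy traversal. This is a hedged illustration with made-up numbers, not actual Presto code: the outer loop runs once per task, but each iteration walks the task's root QueryContext and re-visits every sibling task of the query, so with N tasks per query the same memory is examined N times per polling pass.

```java
import java.util.List;

public class OverCountingSketch
{
    public static void main(String[] args)
    {
        // one query with 4 tasks, each holding 10 bytes of revocable memory
        List<Long> queryTasksBytes = List.of(10L, 10L, 10L, 10L);

        long visits = 0;
        long bytesExamined = 0;
        // one outer iteration per task, each re-visiting every task of the query
        for (int task = 0; task < queryTasksBytes.size(); task++) {
            for (long bytes : queryTasksBytes) {
                visits++;
                bytesExamined += bytes;
            }
        }
        // 4 tasks -> 16 visits examining 160 bytes, though the query holds only 40
        System.out.println(visits + " visits, " + bytesExamined + " bytes examined");
    }
}
```

This quadratic visiting is why the revoking trigger fired hyperactively while each individual pass short-circuited before revoking much.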

@sachdevs sachdevs force-pushed the listener-TaskThresholdMemoryRevokingScheduler branch from c6bb33a to 4f386c2 Compare January 6, 2021 02:31
@highker

highker commented Jan 6, 2021

@pettyjamesm, is spilling working well at Amazon? For FB, this is the first time we will turn on spilling in prod. Though we haven't yet, there are quite a few bugs we are fixing.

@pettyjamesm
Contributor

> @pettyjamesm, is spilling working well at Amazon? For FB, this is the first time we will turn on spilling in prod. Though we haven't yet, there are quite a few bugs we are fixing.

@highker The biggest problem we've had with spilling has always been the limitations on when spilling can occur. In Athena v1, the scenarios that could spill were extremely narrow, so spilling to disk was pretty rare. In v2, the potential spilling opportunities are broader, but still somewhat limited in the version of Presto it's based on.

I can't personally recall seeing spill-to-disk ever cause a query failure, but if there were subtle behavioral issues they may have just gone unnoticed. What bugs have you run into?

@sachdevs
Contributor Author

sachdevs commented Jan 6, 2021

@pettyjamesm Many of the bugs we have fixed were primarily about memory reporting - some operators under-reported the memory they used, causing spilling not to be triggered when it should have been. Most of the spilling bugs we have fixed are unnoticeable in environments where Presto has a sufficiently large heap, but they are much more serious in low-memory use cases, since poor spilling can easily lead to worker heap OOM.

@highker highker merged commit ca94665 into prestodb:master Jan 7, 2021
@caithagoras caithagoras mentioned this pull request Jan 11, 2021