Limit maximum size of dynamic filter collected by coordinator #12963
Merged
Commits (7)
- b5c48cc Refactor DynamicFilterService (arhimondr)
- 96c41cc Limit maximum size of a single dynamic filter (arhimondr)
- fdc9b1f Use roaring bitmap to store collected tasks (arhimondr)
- e1bd0d2 Optimize union of domains in DynamicFilterService (arhimondr)
- 07fce36 Clear collected domains if the collection is finished (arhimondr)
- 852448e Extract getRetainedSizeInBytes method in LocalDynamicFilterConsumer (arhimondr)
- ad5b2c0 Optimize union of domains in LocalDynamicFilterConsumer (arhimondr)
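The "roaring bitmap" commit suggests tracking which tasks have already reported their domains as a compact bitmap keyed by task partition id, rather than a set of task objects. The sketch below is illustrative only: it uses `java.util.BitSet` from the standard library as a stand-in for the RoaringBitmap used in the actual change, and the class and method names are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch of the "use roaring bitmap to store collected tasks"
// idea: set bits keyed by task partition id instead of a Set<TaskId>.
// java.util.BitSet stands in for RoaringBitmap; names are illustrative.
class CollectedTaskTracker
{
    private final BitSet collected = new BitSet();
    private final int expectedTaskCount;

    CollectedTaskTracker(int expectedTaskCount)
    {
        this.expectedTaskCount = expectedTaskCount;
    }

    // returns true the first time a task reports, false on duplicates
    // (e.g. when a retried task reports the same domain again)
    boolean markCollected(int taskPartitionId)
    {
        if (collected.get(taskPartitionId)) {
            return false;
        }
        collected.set(taskPartitionId);
        return true;
    }

    boolean isComplete()
    {
        return collected.cardinality() == expectedTaskCount;
    }
}
```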
I think this refactor is probably overkill for capping max DF size. I think it actually introduces quadratic computational complexity too (merging multiple domains at once is more efficient).
I would rather just keep track of overall DF size.
Do we really need to cap DF size since tasks already cap DF and number of tasks is limited?
For partitioned join, each domain is separate so union is not needed. For broadcast join, we simply skip collecting subsequent domains once we get a first one.
The ValueSet is currently being merged iteratively anyway: https://github.com/trinodb/trino/blob/master/core/trino-spi/src/main/java/io/trino/spi/predicate/ValueSet.java#L135, https://github.com/trinodb/trino/blob/master/core/trino-spi/src/main/java/io/trino/spi/predicate/SortedRangeSet.java#L580, https://github.com/trinodb/trino/blob/master/core/trino-spi/src/main/java/io/trino/spi/predicate/EquatableValueSet.java#L277. It may still result in additional object allocations along the way, but the overhead shouldn't be significant. It may result in hitting the limit prematurely.
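The cost difference under discussion can be illustrated with a toy model, a sketch only: plain sorted sets of longs stand in for Trino's ValueSet/SortedRangeSet, and both class and method names here are made up. Unioning N accumulated domains pairwise copies the accumulated result on every step (quadratic total work), while a single multi-way merge inserts each value once.

```java
import java.util.List;
import java.util.TreeSet;

// Toy model of the union-cost discussion: pairwise union copies the
// accumulated set on every step (O(N^2) total work over N domains),
// while a single multi-way merge touches each value once. Trino's real
// ValueSet types are more involved; this is illustrative only.
class DomainUnion
{
    // pairwise: copy(accumulated) + add(next), repeated for each domain
    static TreeSet<Long> unionIteratively(List<TreeSet<Long>> domains)
    {
        TreeSet<Long> result = new TreeSet<>();
        for (TreeSet<Long> domain : domains) {
            TreeSet<Long> merged = new TreeSet<>(result); // O(size) copy per step
            merged.addAll(domain);
            result = merged;
        }
        return result;
    }

    // multi-way: accumulate into one set, each value inserted once
    static TreeSet<Long> unionAll(List<TreeSet<Long>> domains)
    {
        TreeSet<Long> result = new TreeSet<>();
        for (TreeSet<Long> domain : domains) {
            result.addAll(domain);
        }
        return result;
    }
}
```

Both variants produce the same set; only the amount of intermediate copying differs.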
In fault-tolerant execution we may create thousands of tasks, as the total number of tasks is no longer limited by the cluster size.
This is no longer true for fault tolerant execution, as the filters are collected before the exchange.
According to 4fb9bb7, union is now more efficient than O(N^2), but there is still room for improvement. Is that really that big of a problem?
That really stresses the issue with quadratic computation, if every new task adds a tiny bit of information that the coordinator needs to union over and over again.
I'm worried that we are repurposing the DF mechanism for use cases that won't really be relevant for Tardigrade, and that the proper long-term solution is adaptive planning. DF in Tardigrade was supposed to be easy, but it exploded into a rather significant effort. It might be best to just continue, but the complexity will remain even if the new code isn't used much by the community.
cc @martint
For fault-tolerant execution it is. There's a high chance the domains collected by different tasks contain almost identical sets of values, as the values are distributed uniformly across the tasks.
I can add an optimization and only union when the size limit is reached.
For now we see this as a long-term solution. For adaptive re-planning we are thinking about starting with something simple and relying only on the "data size" metric for each partition, which is "free". Collecting more advanced statistics (such as NDV) at the shuffle boundary is expensive, and we have to be careful to make sure that the extra optimizations made possible by the advanced statistics pay off the stats-collection cost.
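The deferred-union optimization mentioned above can be sketched as follows. This is a hypothetical illustration, not the PR's implementation: plain sorted sets of longs stand in for Trino's Domain, the size estimate is just element count, and all names are invented. Per-task domains are buffered, and the expensive union runs only when the pessimistic (pre-deduplication) estimate crosses the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of "only union when the size limit is reached":
// buffer per-task domains and defer the union until the buffered
// estimate crosses the limit, so near-identical domains from many
// tasks collapse in one merge pass instead of N incremental unions.
class SizeLimitedCollector
{
    private final long maxSize;
    private final List<TreeSet<Long>> pending = new ArrayList<>();
    private long pendingSizeEstimate;
    private boolean overflowed;

    SizeLimitedCollector(long maxSize)
    {
        this.maxSize = maxSize;
    }

    void collect(TreeSet<Long> domain)
    {
        if (overflowed) {
            return; // limit already exceeded; drop further domains
        }
        pending.add(domain);
        pendingSizeEstimate += domain.size();
        if (pendingSizeEstimate > maxSize) {
            // pessimistic estimate crossed the limit: union once so
            // duplicates collapse, then re-check the true size
            TreeSet<Long> merged = new TreeSet<>();
            pending.forEach(merged::addAll);
            pending.clear();
            pending.add(merged);
            pendingSizeEstimate = merged.size();
            overflowed = pendingSizeEstimate > maxSize;
        }
    }

    boolean isOverflowed()
    {
        return overflowed;
    }
}
```

Because tasks in the same stage tend to see overlapping values, the post-union size is often far below the sum of the per-task sizes, which is what makes deferring the union worthwhile.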
The long term goal is to unify both execution models. We should be able to take advantage of fault-tolerant execution while still being able to do speculative or quasi-pipelined execution where appropriate. Unless we're planning to get rid of DF, it does make sense to make it work with fault tolerant execution.
Also, keep in mind that fault-tolerant execution is not just for long-running batch queries. It can be useful for interactive queries when running clusters on unreliable hardware or ephemeral instances (e.g., spot instances in AWS).
Unification can also mean a hybrid model, where short queries (a few minutes) execute with query-level restarts (maybe speculatively), while long-running queries execute with task-level retries.
I don't think it's that simple. IMO the complexity of Tardigrade + "new" DFs is not worth it for interactive queries.
I think we should try to make Tardigrade the default execution mode for long-running or memory-intensive queries first (that is actually the intended Tardigrade use case). However, I'm not sure this can be done without having a better (non-S3) shuffle service in OSS.
Why can't we start by making Tardigrade the default execution mode for large/long queries in OSS? I don't think Tardigrade is that beneficial for interactive queries. The time gap for failures is much shorter, so it seems that query restarts are sufficient.
I would also add that using a shuffle service for every query is a big paradigm shift in how Trino runs queries and manages resources. A shuffle service is not native to Trino. There are setups where deploying a shuffle service won't be possible (e.g. native deployments). It's also something that Trino users don't have operational experience with.
DF is not only beneficial for interactive queries; it helps cut computational costs for long-running queries as well. While we don't care too much about low-latency queries yet, we do care about overall system efficiency for fault-tolerant execution (currently there's less than a 15% regression in CPU efficiency with fault-tolerant execution enabled; the current regression comes from applying encryption and compression to data exchanges, which in theory could be disabled).