Partitioning reasoning in DataFusion and Ballista #284
Comments
👍 I would add a couple of things to this discussion. Currently, partitioning is a bit restrictive in that either the client has to specify the partitioning or we fall back to a single default partitioning. This has several limitations.
There may not be an optimal general solution for all of this, so I think it might be useful to make this functionality pluggable. Something I've been tinkering with is to put the planning behind a …
The existing …
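To make the idea of a pluggable partitioning decision concrete, here is a minimal, self-contained sketch. The `PartitioningStrategy` trait, `DefaultStrategy` struct, and the simplified `Partitioning` enum are all hypothetical names invented for illustration; they are not the actual DataFusion or Ballista types. The point is only that the "which partitioning do I produce here?" decision could live behind a trait object supplied by the client:

```rust
/// Simplified stand-in for a physical partitioning scheme (hypothetical,
/// loosely modeled on DataFusion's Partitioning enum).
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    Single,
    RoundRobin(usize),
    Hash(Vec<String>, usize),
}

/// Hypothetical pluggable strategy: given the child's output partitioning
/// and the keys an operator requires, decide what partitioning to use.
trait PartitioningStrategy {
    fn choose(&self, input: &Partitioning, required_keys: &[String]) -> Partitioning;
}

/// One possible default: reuse an existing compatible hash partitioning,
/// otherwise repartition by hash on the required keys.
struct DefaultStrategy {
    target_partitions: usize,
}

impl PartitioningStrategy for DefaultStrategy {
    fn choose(&self, input: &Partitioning, required_keys: &[String]) -> Partitioning {
        match input {
            // Child is already hash-partitioned on exactly the keys we need:
            // keep it and avoid an unnecessary exchange.
            Partitioning::Hash(keys, n) if keys.as_slice() == required_keys => {
                Partitioning::Hash(keys.clone(), *n)
            }
            // Otherwise fall back to a hash repartition.
            _ => Partitioning::Hash(required_keys.to_vec(), self.target_partitions),
        }
    }
}
```

A planner could then accept a `Box<dyn PartitioningStrategy>` from the client, so different deployments can trade off exchange cost against parallelism without changing the planner itself.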
I'll also add range partitioning to this list: currently, normal sorts are not run in a parallel/distributed fashion.
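The reason range partitioning enables distributed sorts is that each row is routed to a partition by comparing its sort key against precomputed range boundaries; each partition can then be sorted independently and the results concatenated in order. A minimal sketch of the routing step (the boundary values would in practice come from sampling the data; `range_partition` is a hypothetical helper name):

```rust
/// Route a sort-key value to a partition using sorted range boundaries.
/// Values below boundaries[0] go to partition 0, values in
/// [boundaries[i-1], boundaries[i]) go to partition i, and values at or
/// above the last boundary go to the final partition.
fn range_partition(value: i64, boundaries: &[i64]) -> usize {
    // partition_point returns the count of boundaries <= value, which is
    // exactly the index of the range bucket the value falls into.
    boundaries.partition_point(|b| *b <= value)
}
```

With boundaries `[10, 20]` this yields three partitions, and sorting each partition locally then reading partitions 0, 1, 2 in order produces a globally sorted output.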
Yes, there are currently a couple of gaps in the physical planning phase, and the ExecutionPlan trait needs to be enhanced as well.
I'm working on an experimental rule for this now and am also trying to verify a new optimization process; if it proves out, it will be much easier to write new optimization rules, and the same methods can be applied to logical optimization rules as well.
Another recent paper; see the EXCHANGE PLACEMENT sections.
This issue is partially addressed by the new Enforcement rule in DataFusion, so I am closing it.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In the current Ballista code base, when the distributed plan is generated, any non-hash repartition is removed from it.
And in DataFusion, the physical planner adds RepartitionExec nodes blindly whenever it sees a hash join or aggregation, without considering the children's output partitioning.
Looking into Presto's source code, Presto's distributed plan can include both remote exchanges and local exchanges.
Local exchanges improve parallelism within a stage, and Presto adds remote and local exchanges only when necessary. I think it is time to introduce more advanced methods to reason about partitioning in a distributed plan, something more powerful than Spark SQL's EnsureRequirements rule.
Incorporating Partitioning and Parallel Plans into the SCOPE Optimizer
http://www.cs.albany.edu/~jhh/courses/readings/zhou10.pdf
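The core fix described above, inserting an exchange only when the child's output partitioning does not already satisfy the operator's requirement, can be sketched in a few lines. This is an illustrative model, not DataFusion's actual code: `Partitioning`, `satisfies`, and `maybe_repartition` are hypothetical names, and real enforcement must also handle distribution requirements, ordering, and partition counts:

```rust
/// Simplified stand-in for a physical partitioning scheme (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    UnknownPartitioning(usize),
    Hash(Vec<String>, usize),
}

/// Does the child's output partitioning already satisfy a hash
/// requirement on `required_keys` with `n` partitions?
fn satisfies(child: &Partitioning, required_keys: &[String], n: usize) -> bool {
    matches!(
        child,
        Partitioning::Hash(keys, parts)
            if keys.as_slice() == required_keys && *parts == n
    )
}

/// Return the partitioning to use for the operator's input, plus a flag
/// indicating whether a repartition (exchange) node had to be inserted.
fn maybe_repartition(
    child: Partitioning,
    keys: &[String],
    n: usize,
) -> (Partitioning, bool) {
    if satisfies(&child, keys, n) {
        // Child already matches: no RepartitionExec needed.
        (child, false)
    } else {
        // Insert a hash repartition on the required keys.
        (Partitioning::Hash(keys.to_vec(), n), true)
    }
}
```

In a distributed setting, the same check would further decide between a local exchange (within a stage) and a remote exchange (a stage boundary), which is exactly the exchange-placement question the cited papers address.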