2024 Q3-Q4 Roadmap? #11442
It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance. That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.
I agree with this sentiment. One thing about performance improvements is that I think they take sustained engineering investment and significant existing engine expertise (thus it is hard for newcomers to the project to make significant performance improvements). I will try to find time from myself and InfluxData these next two quarters to meaningfully invest in improvements in this area. However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo), so I need to make some hard choices there.
Thank you. Your help (and everyone else's help) with documentation and reviews I think would also be tremendously beneficial |
As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views. That said, I fully agree with focusing on performance, and I would be happy to rescope my proposal to make it easier to manage. |
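To make the REE point concrete, here is a minimal sketch of evaluating a predicate directly on a run-end encoded column, once per run instead of once per row. The `ReeColumn` type and `filter_ranges` method are hypothetical names used only for illustration; they are not the arrow-rs `RunArray` API, and a real implementation would have to handle nulls and offsets.

```rust
/// Hypothetical run-end encoded column (illustrative only, not the arrow-rs type):
/// `values[i]` repeats for the logical rows `run_ends[i - 1]..run_ends[i]`.
struct ReeColumn {
    run_ends: Vec<usize>,
    values: Vec<String>,
}

impl ReeColumn {
    /// Total number of logical rows represented by the runs.
    fn len(&self) -> usize {
        self.run_ends.last().copied().unwrap_or(0)
    }

    /// Evaluate `pred` once per run and return the matching half-open row
    /// ranges, without expanding the column to one value per row.
    fn filter_ranges<F: Fn(&str) -> bool>(&self, pred: F) -> Vec<(usize, usize)> {
        let mut ranges = Vec::new();
        let mut start = 0;
        for (end, value) in self.run_ends.iter().zip(&self.values) {
            if pred(value.as_str()) {
                ranges.push((start, *end));
            }
            start = *end;
        }
        ranges
    }
}
```

The resulting row ranges could then be used to decode only the selected rows of the other columns, which is the late materialization idea: the expensive per-row work is deferred until after the cheap per-run filter has run.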
Do you have a list in mind of the areas that are worth targeting for performance improvement? Some things I know of that are still active in my head
Anything else? There's always room for improvement, particularly in terms of performance. Regularly updating the active lists could provide valuable pointers for the community.
In my mind, here are some "obvious" performance projects (the ones I have the most confidence would make a meaningful difference on ClickBench or TPCH queries). I can maybe put this in the documentation.
Integrate StringView into Parquet / Filtering / Grouping
@XiangpengHao is doing this as his summer project and doing an amazing job. I also think this is a great example of the level of effort required to drive one of these performance projects. It requires implementing the features, then analyzing / profiling, identifying the bottlenecks, and then making PRs to remove the bottlenecks. #10918 and apache/arrow-rs#5374 have the entire list. Some of my favorites:
What: Use newly added |
Complete Parquet Filter Performance
What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion |
Improve Aggregate performance for multi-column grouping when at least one column is variable length
What: Queries like |
Aggregate performance / memory use for high cardinality aggregates
What: Improve queries when the number of groups is very high (1 million+)
Join performance with dynamic join filters / Sideways Information Passing
What: Introduce filters that apply join filtering during the scan, in addition to during the actual join.
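As an illustration of the last item, the sketch below shows one simple form of a dynamic join filter: while the build side of a hash join is loaded, the min/max of its join keys is recorded, and the probe-side scan then drops rows outside that range before they reach the join. All names here (`KeyBounds`, `build_side`, `scan_with_filter`, `inner_join`) are hypothetical and exist only for this example; this is not DataFusion's implementation, and a real engine might use richer filters such as Bloom filters or push the bounds into Parquet pruning.

```rust
use std::collections::HashMap;

/// Min/max of the join keys seen on the build side (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct KeyBounds {
    min: i64,
    max: i64,
}

/// Build the hash table and, as a by-product, the key bounds for the dynamic filter.
fn build_side<'a>(rows: &[(i64, &'a str)]) -> (HashMap<i64, Vec<&'a str>>, Option<KeyBounds>) {
    let mut table: HashMap<i64, Vec<&'a str>> = HashMap::new();
    let mut bounds: Option<KeyBounds> = None;
    for &(key, payload) in rows {
        table.entry(key).or_default().push(payload);
        bounds = Some(match bounds {
            None => KeyBounds { min: key, max: key },
            Some(b) => KeyBounds { min: b.min.min(key), max: b.max.max(key) },
        });
    }
    (table, bounds)
}

/// Probe-side "scan" that applies the dynamic filter before rows reach the join.
fn scan_with_filter(rows: &[(i64, f64)], bounds: Option<KeyBounds>) -> Vec<(i64, f64)> {
    match bounds {
        // Empty build side: an inner join cannot produce any rows.
        None => Vec::new(),
        Some(b) => rows
            .iter()
            .copied()
            .filter(|(k, _)| *k >= b.min && *k <= b.max)
            .collect(),
    }
}

/// Inner hash join; also returns how many probe rows survived the scan filter.
fn inner_join<'a>(
    build: &[(i64, &'a str)],
    probe: &[(i64, f64)],
) -> (Vec<(i64, &'a str, f64)>, usize) {
    let (table, bounds) = build_side(build);
    let scanned = scan_with_filter(probe, bounds);
    let mut out = Vec::new();
    for (key, amount) in &scanned {
        if let Some(payloads) = table.get(key) {
            for p in payloads {
                out.push((*key, *p, *amount));
            }
        }
    }
    (out, scanned.len())
}
```

The range filter is only a necessary condition (a key inside the bounds may still have no match), so the join itself still does the exact lookup; the benefit is that fewer rows are decoded and hashed on the probe side.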
@notfilippo -- I think |
Yeah, we use it as well. We have some custom code to decide when to push down predicates and, in general, it's a pretty tricky thing to get right.
My (perhaps unrealistic) hope is that we could find additional improvements (like #4028 or other optimizations) that would make up for any performance that was lost, so that we wouldn't need code to choose.
I agree! 😄 I was just highlighting how the logical/physical separation could support performance improvements while simplifying things, such as handling custom code for dictionaries. That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold? |
I think further improvements to the planning code structure will also help w.r.t. performance. We used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do. Also, we can finalize our previous discussion on a better statistics infrastructure and start using this information in better ways during planning.
@notfilippo, I think we should complete the exploratory work. Even if we don't get to full-focus on it, this is something we likely want to do at some point (gradually or otherwise). For example, it was through your draft I started to think about what would happen to |
I think we are at a much better place now for logical planning -- I don't think we have done anything similar for the various physical optimizer passes for |
I agree with this basic approach -- my feeling is that introducing logical types successfully will require some concerted effort from existing committers and people with expertise with the current code to help it along. @notfilippo is doing a great job but unless we find them help / support I think the project would struggle to be successful |
Is your feature request related to a problem or challenge?
@comphead asked #11426 (comment)
Which I think is an excellent question.
In general, since this project isn't really coordinated centrally, the roadmap typically follows what people are working on / want to invest time in.
However, it is a neat idea to collect any thoughts people have / want to share about what they might work on.
Describe the solution you'd like
Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!
Then we can add it to the roadmap on the doc site https://datafusion.apache.org/contributor-guide/roadmap.html
Describe alternatives you've considered
No response
Additional context
No response