2024 Q3-Q4 Roadmap? #11442

Open
alamb opened this issue Jul 12, 2024 · 19 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jul 12, 2024

Is your feature request related to a problem or challenge?

@comphead asked #11426 (comment)

do we have a roadmap for 2024?

Which I think is an excellent question.

In general, since this project isn't really coordinated centrally, the roadmap typically follows what people are working on / want to invest time in.

However, it is a neat idea to collect any thoughts people have / want to share about what they might work on.

Describe the solution you'd like

Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!

Then we can add it to the roadmap on the doc site https://datafusion.apache.org/contributor-guide/roadmap.html

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jul 12, 2024
@alamb
Contributor Author

alamb commented Jul 12, 2024

🤔 I have stuff I would like to do. But that doesn't really count as a roadmap for the project 😆

Here are some things I might guess:

  1. streaming stuff [DISCUSSION] Support for Streaming in DataFusion #11404 (@ozankabak and @ameyc perhaps)
  2. ASOF / range joins ASOF join support / Specialize Range Joins #318 (InfluxData might do something here)
  3. Performance improvements (what I would personally like to spend more time on): Enable parquet filter pushdown by default #3463 Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937 Improve performance for grouping by variable length columns (strings) #9403
  4. @XiangpengHao 's work on StringView: [Epic] Implement support for StringView in DataFusion #10918 / use StringViewArray when reading String columns from Parquet #10921

Possibilities:

  1. Logical / user-defined types: [draft] Add LogicalType, try to support user-defined types #11160 @notfilippo
  2. Make window functions user defined [Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunction s) #8709
  3. Split DataSource / catalogs from the core: Break datafusion-catalog code into its own crate #11182

@ozankabak
Contributor

ozankabak commented Jul 13, 2024

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

@alamb
Contributor Author

alamb commented Jul 14, 2024

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

I agree with this sentiment.

One thing about performance improvements is that I think they take sustained engineering investment and significant existing engine expertise (thus it is hard for newcomers to the project to make significant performance improvements)

I will try to find time for myself and InfluxData over the next two quarters to meaningfully invest in improvements in this area

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Thank you. Your help (and everyone else's) with documentation and reviews would, I think, also be tremendously beneficial

@notfilippo

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

That said, I fully agree with focusing on performance, and I would be happy to rescope my proposal to make it easier to manage.

@lewiszlw
Member

lewiszlw commented Jul 15, 2024

I would like to see more progress on logical types #11160 and index support #9963.

@jayzhan211
Contributor

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Do you have a list in mind of the areas worth improving for performance? Some things I know are still active:

  • Aggregate Group by
  • Joins
  • Apply StringView
  • Planning
  • Parquet

Anything else?

There's always room for improvement, particularly in terms of performance. Regularly updating the active lists could provide valuable pointers for the community

@alamb
Contributor Author

alamb commented Jul 15, 2024

Do you have a list in mind of the areas worth improving for performance? Some things I know are still active:

In my mind, the "obvious" performance projects (the ones I have the most confidence would make a meaningful difference on ClickBench or TPCH queries) are as follows (I can maybe put this in the documentation):

Integrate StringView into Parquet / Filtering / Grouping

@XiangpengHao is doing this as his summer project and doing an amazing job. I also think this is a great example of the level of effort required to drive one of these performance projects. It requires implementing the features, then analyzing / profiling, identifying the bottlenecks, and then making PRs to remove the bottlenecks. See #10918 and apache/arrow-rs#5374 for the entire list. Some of my favorites:

What: Use the newly added StringView from arrow to improve performance (by avoiding variable length / string data copies)
Why: For queries that deal with string data in ClickBench or TPCH, a large amount of time is spent in parquet decoding as well as filtering and grouping.
What is left: See #10918 and apache/arrow-rs#5374
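
For a concrete sense of what StringView changes, here is a minimal sketch of building StringView data with arrow-rs (assuming the `StringViewBuilder` / `StringViewArray` API from recent arrow-rs releases; illustrative only, not part of the work items above):

```rust
use arrow::array::{Array, StringViewArray, StringViewBuilder};

fn main() {
    // Each "view" stores strings up to 12 bytes inline; longer values keep a
    // short prefix inline and point into a shared data buffer, so comparisons
    // for filtering / grouping can often short-circuit without copying strings.
    let mut builder = StringViewBuilder::new();
    builder.append_value("short");
    builder.append_value("a much longer string value that does not fit inline");
    builder.append_null();
    let array: StringViewArray = builder.finish();

    assert_eq!(array.len(), 3);
    assert_eq!(array.value(0), "short");
    assert!(array.is_null(2));
}
```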

@alamb
Contributor Author

alamb commented Jul 15, 2024

Complete Parquet Filter Performance

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion supports
Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me)
What is left: The actual code is straightforward (change a default config value). The hard part is that last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why and then what changes are needed to ensure performance doesn't slow down. There are some ideas.
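
For reference, opting in to the option today looks roughly like the sketch below (assuming the `pushdown_filters` / `reorder_filters` config keys; `hits.parquet` is a placeholder path):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Opt in to Parquet predicate pushdown / late materialization
    // (currently off by default).
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true)
        .set_bool("datafusion.execution.parquet.reorder_filters", true);
    let ctx = SessionContext::new_with_config(config);

    ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
        .await?;

    // The URL predicate can now be evaluated while decoding Parquet pages, so
    // rows that fail the filter are never fully materialized.
    ctx.sql(r#"SELECT "URL" FROM hits WHERE "URL" LIKE '%google%'"#)
        .await?
        .show_limit(10)
        .await?;
    Ok(())
}
```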

@alamb
Contributor Author

alamb commented Jul 15, 2024

Improve Aggregate performance for multi-column grouping when at least one column is variable length

What: Queries like GROUP BY url, code (where url is a string) are significantly slower than GROUP BY url. We already have the single string column case handled with #7064
Why: There are several queries like this in ClickBench where copying string data to form group keys consumes significant time
What is left: @jayzhan211 has already shown that the basic idea of #9430 works in #10976. What is left is to figure out how to get the types to work out in the plans and to ensure it doesn't cause performance regressions
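
To make the query shapes concrete, an illustrative sketch (the `hits` table and the `url` / `code` columns are placeholders registered from a hypothetical `hits.parquet`):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
        .await?;

    // Single string group key: uses the specialized path from #7064.
    ctx.sql("SELECT url, COUNT(*) FROM hits GROUP BY url")
        .await?
        .show()
        .await?;

    // String key plus another column: falls back to the general row-format
    // group keys, which copy the string data for every input row.
    ctx.sql("SELECT url, code, COUNT(*) FROM hits GROUP BY url, code")
        .await?
        .show()
        .await?;
    Ok(())
}
```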

@alamb
Contributor Author

alamb commented Jul 15, 2024

Aggregate performance / memory use for high cardinality aggregates

What: Improve queries where the number of groups is very high (1 million+)
Why: Queries where the number of groups is high are significantly slower than DuckDB and use substantially more memory. I think there is at least a factor of 2 of performance to be had here
What is left: There are ideas on #6937, but someone has to try them out, prototype / see if they would work, and then productionize them

@alamb
Contributor Author

alamb commented Jul 15, 2024

Join performance with dynamic join filters / Sideways Information Passing

What: Introduce filters that apply join filtering during the scan in addition to during the actual join.
Why: Joins in general (and TPCH definitely) end up being very selective (they filter many rows). However, join operators are complex and often much slower than filters (this applies to DataFusion too). By pushing much of the filtering work that would be done in the join down into the scan, plans can be made to go much faster
What is left: There has been no work yet on this. The first thing to do would be to prototype some ideas and see if we can make the TPCH queries much faster, and then figure out how to structure the code
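
A rough sketch of the idea in plain Rust (illustrative only, not DataFusion's actual join code): derive a cheap bound from the build side's join keys and apply it while scanning the probe side, so most non-matching rows never reach the join operator.

```rust
fn main() {
    // Build side of a hash join (e.g. join keys from a small dimension table).
    let build_keys: Vec<i64> = vec![17, 42, 99];
    let min_key = *build_keys.iter().min().unwrap();
    let max_key = *build_keys.iter().max().unwrap();

    // Probe side (e.g. a large fact table). The dynamic filter derived from the
    // build side is applied during the scan, before the join sees any rows.
    let probe_keys: Vec<i64> = (0..1_000).collect();
    let survivors: Vec<i64> = probe_keys
        .into_iter()
        .filter(|k| (min_key..=max_key).contains(k))
        .collect();

    // Only the surviving rows are handed to the (more expensive) join operator.
    println!("{} of 1000 probe rows reach the join", survivors.len());
}
```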

@alamb
Contributor Author

alamb commented Jul 15, 2024

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

@thinkharderdev
Contributor

Complete Parquet Filter Performance

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion supports. Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me). What is left: The actual code is straightforward (change a default config value). The hard part is that last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why and then what changes are needed to ensure performance doesn't slow down. There are some ideas.

Yeah, we use it as well. We have some custom code to decide when to push down predicates, and in general it's a pretty tricky thing to get right.

@alamb
Contributor Author

alamb commented Jul 15, 2024

Yeah, we use it as well. We have some custom code to decide when to push down predicates, and in general it's a pretty tricky thing to get right.

My (perhaps unrealistic) hope is that we could find additional improvements (like #4028 or other optimizations) that could make up for any performance that was lost, so that we didn't have to have code to choose.

@notfilippo

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

I agree! 😄 I was just highlighting how the logical/physical separation could support performance improvements while simplifying things, such as handling custom code for dictionaries.

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

@ozankabak
Contributor

I think further improvements to planning code structure will also help w.r.t. performance; we used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do.

Also we can finalize our previous discussion on a better statistics infrastructure and start using this information in better ways during planning.

@ozankabak
Contributor

ozankabak commented Jul 15, 2024

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

@notfilippo, I think we should complete the exploratory work. Even if we don't get to fully focus on it, this is something we likely want to do at some point (gradually or otherwise).

For example, it was through your draft that I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, or maybe we will think of something else. But all this thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

@alamb
Contributor Author

alamb commented Jul 15, 2024

I think further improvements to planning code structure will also help w.r.t. performance; we used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do.

I think we are in a much better place now for logical planning -- I don't think we have done anything similar for the various physical optimizer passes for ExecutionPlan, and I suspect there is quite a bit of improvement to be had there

@alamb
Contributor Author

alamb commented Jul 15, 2024

For example, it was through your draft that I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, or maybe we will think of something else. But all this thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

I agree with this basic approach -- my feeling is that introducing logical types successfully will require some concerted effort from existing committers and people with expertise in the current code to help it along. @notfilippo is doing a great job, but unless we find help / support for them, I think the project would struggle to be successful
