2024 Q3-Q4 Roadmap? #11442

Open
alamb opened this issue Jul 12, 2024 · 19 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jul 12, 2024

Is your feature request related to a problem or challenge?

@comphead asked #11426 (comment)

do we have a roadmap for 2024?

Which I think is an excellent question.

In general, since this project isn't really coordinated centrally, the roadmap typically follows what people are working on / want to invest time in.

However, it is a neat idea to collect any thoughts people have / want to share about what they might work on.

Describe the solution you'd like

Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!

Then we can add it to the roadmap on the doc site https://datafusion.apache.org/contributor-guide/roadmap.html

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jul 12, 2024
@alamb
Contributor Author

alamb commented Jul 12, 2024

🤔 I have stuff I would like to do. But that doesn't really count as a roadmap for the project 😆

Here are some things I might guess:

  1. streaming stuff [DISCUSSION] Support for Streaming in DataFusion #11404 (@ozankabak and @ameyc perhaps)
  2. ASOF / range joins ASOF join support / Specialize Range Joins #318 (InfluxData might do something here)
  3. Performance improvements (what I would personally like to spend more time on): Enable parquet filter pushdown by default #3463 Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937 Improve performance for grouping by variable length columns (strings) #9403
  4. @XiangpengHao 's work on StringView: [Epic] Implement support for StringView in DataFusion #10918 / use StringViewArray when reading String columns from Parquet #10921

Possibilities:

  1. Logical / user-defined types: [draft] Add LogicalType, try to support user-defined types #11160 @notfilippo
  2. Make window functions user defined [Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunction s) #8709
  3. Split DataSource / catalogs from the core: Break datafusion-catalog code into its own crate #11182

@ozankabak
Contributor

ozankabak commented Jul 13, 2024

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

@alamb
Contributor Author

alamb commented Jul 14, 2024

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

I agree with this sentiment.

One thing about performance improvements is that I think they take sustained engineering investment and significant existing engine expertise (thus it is hard for newcomers to the project to make significant performance improvements)

I will try to find time for myself and InfluxData over the next two quarters to meaningfully invest in improvements in this area

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Thank you. Your help (and everyone else's) with documentation and reviews would, I think, also be tremendously beneficial

@notfilippo

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

That said, I fully agree with focusing on performance, and I would be happy to rescope my proposal to make it easier to manage.

@lewiszlw
Member

lewiszlw commented Jul 15, 2024

I would like to see more progress on logical types #11160 and index support #9963.

@jayzhan211
Contributor

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Do you have a list in mind of the areas worth improving for performance? Some things I know are still active:

  • Aggregate Group by
  • Joins
  • Apply StringView
  • Planning
  • Parquet

Anything else?

There's always room for improvement, particularly in terms of performance. Regularly updating the active lists could provide valuable pointers for the community

@alamb
Contributor Author

alamb commented Jul 15, 2024

Do you have a list in mind of the areas worth improving for performance? Some things I know are still active:

In my mind, the "obvious" performance projects (the ones I have the most confidence would make a meaningful difference on ClickBench or TPCH queries) are as follows (I can maybe put this in the documentation):

Integrate StringView into Parquet / Filtering / Grouping

@XiangpengHao is doing this as his summer project and doing an amazing job. I also think this is a great example of the level of effort required to drive one of these performance projects. It requires implementing the features, then analyzing / profiling, identifying the bottlenecks, and then making PRs to remove the bottlenecks. See #10918 and apache/arrow-rs#5374 for the entire list. Some of my favorites:

What: Use the newly added StringView from arrow to improve performance (by avoiding variable length / string data copies)
Why: For queries that deal with string data in ClickBench or TPCH, a large amount of time is spent in parquet decoding as well as filtering and grouping.
What is left: See #10918 and apache/arrow-rs#5374
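
For a concrete sense of what StringView changes, here is a minimal sketch of building StringView data with arrow-rs (assuming the `StringViewBuilder` / `StringViewArray` API from recent arrow-rs releases; illustrative only, not part of the work items above):

```rust
use arrow::array::{Array, StringViewArray, StringViewBuilder};

fn main() {
    // Each "view" stores strings up to 12 bytes inline; longer values keep a
    // short prefix inline and point into a shared data buffer, so comparisons
    // for filtering / grouping can often short-circuit without copying strings.
    let mut builder = StringViewBuilder::new();
    builder.append_value("short");
    builder.append_value("a much longer string value that does not fit inline");
    builder.append_null();
    let array: StringViewArray = builder.finish();

    assert_eq!(array.len(), 3);
    assert_eq!(array.value(0), "short");
    assert!(array.is_null(2));
}
```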

@alamb
Contributor Author

alamb commented Jul 15, 2024

Complete Parquet Filter Performance

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion supports
Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me)
What is left: The actual code is straightforward (change a default config value). The hard part is that last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why and then what changes are needed to ensure performance doesn't slow down. There are some ideas.
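
For reference, opting in to the option today looks roughly like the sketch below (assuming the `pushdown_filters` / `reorder_filters` config keys; `hits.parquet` is a placeholder path):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Opt in to Parquet predicate pushdown / late materialization
    // (currently off by default).
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true)
        .set_bool("datafusion.execution.parquet.reorder_filters", true);
    let ctx = SessionContext::new_with_config(config);

    ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
        .await?;

    // The URL predicate can now be evaluated while decoding Parquet pages, so
    // rows that fail the filter are never fully materialized.
    ctx.sql(r#"SELECT "URL" FROM hits WHERE "URL" LIKE '%google%'"#)
        .await?
        .show_limit(10)
        .await?;
    Ok(())
}
```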

@alamb
Contributor Author

alamb commented Jul 15, 2024

Improve Aggregate performance for multi-column grouping when at least one column is variable length

What: Queries like GROUP BY url, code (where url is a string) are significantly slower than GROUP BY url. We already have the single string column case handled with #7064
Why: There are several queries like this in ClickBench where copying string data to form group keys consumes significant time
What is left: @jayzhan211 has already shown that the basic idea of #9430 works in #10976. What is left is to figure out how to get the types to work out in the plans and to ensure it doesn't cause performance regressions
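
To make the query shapes concrete, an illustrative sketch (the `hits` table and the `url` / `code` columns are placeholders registered from a hypothetical `hits.parquet`):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
        .await?;

    // Single string group key: uses the specialized path from #7064.
    ctx.sql("SELECT url, COUNT(*) FROM hits GROUP BY url")
        .await?
        .show()
        .await?;

    // String key plus another column: falls back to the general row-format
    // group keys, which copy the string data for every input row.
    ctx.sql("SELECT url, code, COUNT(*) FROM hits GROUP BY url, code")
        .await?
        .show()
        .await?;
    Ok(())
}
```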

@alamb
Contributor Author

alamb commented Jul 15, 2024

Aggregate performance / memory use for high cardinality aggregates

What: Improve queries where the number of groups is very high (1 million+)
Why: Queries where the number of groups is high are significantly slower than DuckDB and use substantially more memory. I think there is at least a factor of 2 of performance to be had here
What is left: There are ideas on #6937, but someone has to try them out, prototype / see if they would work, and then productionize them

@alamb
Contributor Author

alamb commented Jul 15, 2024

Join performance with dynamic join filters / Sideways Information Passing

What: Introduce filters that apply join filtering during the scan in addition to during the actual join.
Why: Joins in general (and TPCH definitely) end up being very selective (they filter many rows). However, join operators are complex and often much slower than filters (this applies to DataFusion too). By pushing much of the filtering work that would be done in the join down into the scan, plans can be made to go much faster
What is left: There has been no work yet on this. The first thing to do would be to prototype some ideas and see if we can make the TPCH queries much faster, and then figure out how to structure the code
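
A rough sketch of the idea in plain Rust (illustrative only, not DataFusion's actual join code): derive a cheap bound from the build side's join keys and apply it while scanning the probe side, so most non-matching rows never reach the join operator.

```rust
fn main() {
    // Build side of a hash join (e.g. join keys from a small dimension table).
    let build_keys: Vec<i64> = vec![17, 42, 99];
    let min_key = *build_keys.iter().min().unwrap();
    let max_key = *build_keys.iter().max().unwrap();

    // Probe side (e.g. a large fact table). The dynamic filter derived from the
    // build side is applied during the scan, before the join sees any rows.
    let probe_keys: Vec<i64> = (0..1_000).collect();
    let survivors: Vec<i64> = probe_keys
        .into_iter()
        .filter(|k| (min_key..=max_key).contains(k))
        .collect();

    // Only the surviving rows are handed to the (more expensive) join operator.
    println!("{} of 1000 probe rows reach the join", survivors.len());
}
```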

@alamb
Contributor Author

alamb commented Jul 15, 2024

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

@thinkharderdev
Contributor

Complete Parquet Filter Performance

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion supports. Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me). What is left: The actual code is straightforward (change a default config value). The hard part is that last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why and then what changes are needed to ensure performance doesn't slow down. There are some ideas.

Yeah, we use it as well. We have some custom code to decide when to push down predicates, and in general it's a pretty tricky thing to get right.

@alamb
Contributor Author

alamb commented Jul 15, 2024

Yeah, we use it as well. We have some custom code to decide when to push down predicates, and in general it's a pretty tricky thing to get right.

My (perhaps unrealistic) hope is that we could find additional improvements (like #4028 or other optimizations) that could make up for any performance that was lost, so that we didn't have to have code to choose.

@notfilippo

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

I agree! 😄 I was just highlighting how the logical/physical separation could support performance improvements while simplifying things, such as handling custom code for dictionaries.

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

@ozankabak
Contributor

I think further improvements to planning code structure will also help w.r.t. performance; we used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do.

Also we can finalize our previous discussion on a better statistics infrastructure and start using this information in better ways during planning.

@ozankabak
Contributor

ozankabak commented Jul 15, 2024

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

@notfilippo, I think we should complete the exploratory work. Even if we don't get to fully focus on it, this is something we likely want to do at some point (gradually or otherwise).

For example, it was through your draft that I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, or maybe we will think of something else. But all this thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

@alamb
Contributor Author

alamb commented Jul 15, 2024

I think further improvements to planning code structure will also help w.r.t. performance; we used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do.

I think we are in a much better place now for logical planning -- I don't think we have done anything similar for the various physical optimizer passes for ExecutionPlan, and I suspect there is quite a bit of improvement to be had there

@alamb
Contributor Author

alamb commented Jul 15, 2024

For example, it was through your draft that I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, or maybe we will think of something else. But all this thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

I agree with this basic approach -- my feeling is that introducing logical types successfully will require some concerted effort from existing committers and people with expertise in the current code to help it along. @notfilippo is doing a great job, but unless we find help / support for them, I think the project would struggle to be successful
