Blog post about CASE optimization #122
Current state is a very crude first draft written by Claude AI
@alamb I finally found some time to get through what the bot produced. I think this is now in a good enough shape for a first review.
Thank you -- I will try and review it over the next few days
After reading the excellent consecutive repartitioning post I think there might be some more polishing work to do on this one. |
Co-authored-by: Andrew Lamb <[email protected]>
alamb left a comment
First of all, thank you @pepijnve -- this is great and a really strong piece.
I left a bunch of polish suggestions, any/all of which I am more than happy to help implement.
Another optimization that I think would fit well into this blog would be the optimization for constant tables @rluvaton added in apache/datafusion#18183 (will be released in DataFusion 52)
| } | ||
| </style> | ||
| # Optimizing CASE Expression Evaluation in DataFusion |
I think it would be good to start this post off with some sort of quantification / visual of "how much faster is CASE after these optimizations"
Maybe either a chart like on https://datafusion.apache.org/blog/output/2025/09/29/datafusion-50.0.0/
Or a table
It would also be cool to have an "ablation" version (aka measure the performance after each additional optimization was added -- like a chart that shows progressive improvement).
Maybe we could use the "average runtime of the case benchmark"? I can look into generating this if you like
| SQL's `CASE` expression is one of the few constructs the language provides to perform conditional logic. |
I think it might also make sense to mention that DataFusion (now) also rewrites all other conditional expressions to CASE (like COALESCE, IF, etc) so that these optimizations are now widely used
| Its deceptively simple syntax hides significant implementation complexity. | ||
| Over the past few weeks, we've landed a series of improvements to DataFusion's `CASE` expression evaluator that reduce both CPU time and memory allocations. | ||
| This post walks through the original implementation, its performance bottlenecks, and how we addressed them step by step. | ||
| Finally we'll also take a look at some future improvements to `CASE` that are in the works. |
I didn't see any section about future improvements 🤔
That's what I get for letting AI draft stuff. I think I can fill that in with a description of @rluvaton's upcoming work.
| 2. Build a projection vector containing only those column indices | ||
| 3. Derive new versions of the expressions with updated column references | ||
| For example, if the original CASE references columns at indices `[1, 5, 8]` in a 20-column batch: |
I can imagine a nice diagram for this section too where it shows rewriting the expression to only refer to the three columns, and then rewriting the input to a new three column form. Not necessary I am just thinking
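To make the projection steps quoted from the draft concrete (build a projection vector of the referenced column indices, then rewrite the column references), here is a minimal Python sketch. It is not DataFusion code: `project_columns` and the list-of-lists "batch" are hypothetical stand-ins chosen for illustration, and the real implementation may differ in ordering and representation.

```python
def project_columns(batch, referenced):
    """Keep only the referenced columns and remap old indices to new ones.

    `batch` is a list of columns (a hypothetical stand-in for a record batch);
    `referenced` is an iterable of column indices used by the expression.
    """
    # Sorted, de-duplicated projection vector of original column indices.
    projection = sorted(set(referenced))
    # Map each original index to its position in the projected batch.
    remap = {old: new for new, old in enumerate(projection)}
    # The narrower batch that later filter steps would operate on.
    projected = [batch[i] for i in projection]
    return projection, remap, projected


# A 20-column batch where column i holds the single value i * 10.
batch = [[i * 10] for i in range(20)]
projection, remap, projected = project_columns(batch, [5, 1, 8, 5])
```

With the `[1, 5, 8]` example from the draft, an expression that referred to column 5 would refer to column `remap[5]` of the three-column batch after the rewrite, so filtering touches three columns instead of twenty.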
| 3. **Projects columns** to avoid filtering unused data | ||
| 4. **Eliminates scatter** operations in common patterns | ||
| These improvements compound: a CASE expression on a wide table with multiple branches and early matches benefits from all four optimizations simultaneously. |
I think it would be stronger here to refer to the actual performance numbers (see suggestion on introduction) rather than generalizations.
| For the rest of this post we'll be looking at 'searched case' evaluation. | ||
| 'Simple case' uses a distinct, but very similar implementation. | ||
| The same set of improvements has been applied to both. |
I think this section would be easier to follow if it had a diagram showing the steps -- I am not sure if you have a visual in your mind but we might be able to come up with something that visually shows the improvements the blogs describes
Yes, exactly -- showing how the data flows from input to output, with any important intermediate results along the way ❤️
With the impending release of DataFusion 52.0.0, I am hoping we can publish the blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog. I think this one is looking pretty good; though I had been dreaming about more diagrams, I don't think they are required. Do you think this is ready to go from your perspective, @pepijnve? Would you mind if I took a pass through to clean up some formatting (like the title)?
I've been meaning to add those diagrams and do another round of editing. Not much progress due to the holiday break and other priorities.
Me too! No worries. I'll check back in a few days -- anything I can do to help?



Covers the work done as part of apache/datafusion#18075