@pepijnve
Contributor

Covers the work done as part of apache/datafusion#18075

@pepijnve
Contributor Author

The current state is a very crude first draft written by Claude AI.

@pepijnve marked this pull request as ready for review on December 20, 2025 at 11:10
@pepijnve
Contributor Author

@alamb I finally found some time to get through what the bot produced. I think this is now in a good enough shape for a first review.

@alamb
Contributor

alamb commented Dec 20, 2025

> @alamb I finally found some time to get through what the bot produced. I think this is now in a good enough shape for a first review.

Thank you -- I will try to review it over the next few days.

@pepijnve
Contributor Author

After reading the excellent consecutive repartitioning post, I think there might be some more polishing work to do on this one.

Contributor

@alamb left a comment

First of all, thank you @pepijnve -- this is great and a really strong piece.

I left a bunch of polish suggestions, any/all of which I am more than happy to help implement.

Another optimization that I think would fit well in this post is the one for constant tables that @rluvaton added in apache/datafusion#18183 (to be released in DataFusion 52).

}
</style>

# Optimizing CASE Expression Evaluation in DataFusion
Contributor

I think it would be good to start this post off with some sort of quantification / visual of "how much faster is CASE after these optimizations"

Maybe either a chart like the one on https://datafusion.apache.org/blog/output/2025/09/29/datafusion-50.0.0/ (image attached), or a table.

It would also be cool to have an "ablation" version (aka measure the performance after each additional optimization was added -- like a chart that shows progressive improvement).

Maybe we could use the "average runtime of the case benchmark"? I can look into generating this if you like


# Optimizing CASE Expression Evaluation in DataFusion

SQL's `CASE` expression is one of the few constructs the language provides to perform conditional logic.
Contributor

I think it might also make sense to mention that DataFusion now also rewrites all other conditional expressions (like `COALESCE`, `IF`, etc.) to `CASE`, so these optimizations apply widely.

Its deceptively simple syntax hides significant implementation complexity.
Over the past few weeks, we've landed a series of improvements to DataFusion's `CASE` expression evaluator that reduce both CPU time and memory allocations.
This post walks through the original implementation, its performance bottlenecks, and how we addressed them step by step.
Finally, we'll also take a look at some future improvements to `CASE` that are in the works.
Contributor

I didn't see any section about future improvements 🤔

Contributor Author

That's what I get for letting AI draft stuff. I think I can fill that in with a description of @rluvaton's upcoming work.

2. Build a projection vector containing only those column indices
3. Derive new versions of the expressions with updated column references

For example, if the original CASE references columns at indices `[1, 5, 8]` in a 20-column batch:
Contributor

I can imagine a nice diagram for this section too, showing the expression being rewritten to refer to only the three columns and the input being rewritten into a new three-column form. Not necessary, just a thought.
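
For reference while sketching such a diagram, here is a minimal Rust illustration of the remapping idea in the excerpt above. It is not taken from the post or from DataFusion's code: `build_projection` and `project` are hypothetical helper names, and a batch is modelled as a plain `Vec` of columns rather than an Arrow `RecordBatch`.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not a DataFusion API): from the column indices an
/// expression references, build the projection vector (sorted, deduplicated)
/// and a list of (old index, new index) pairs for rewriting column references.
fn build_projection(referenced: &[usize]) -> (Vec<usize>, Vec<(usize, usize)>) {
    // Collect into an ordered set to drop duplicates and sort the indices.
    let unique: BTreeSet<usize> = referenced.iter().copied().collect();
    let projection: Vec<usize> = unique.into_iter().collect();
    // The position of each index in the projection is its new column index.
    let remap: Vec<(usize, usize)> = projection
        .iter()
        .enumerate()
        .map(|(new, &old)| (old, new))
        .collect();
    (projection, remap)
}

/// Hypothetical helper: keep only the projected columns of a "batch",
/// modelled here as a slice of columns. Panics if an index is out of range.
fn project<T: Clone>(batch: &[Vec<T>], projection: &[usize]) -> Vec<Vec<T>> {
    projection.iter().map(|&i| batch[i].clone()).collect()
}
```

With references to columns `[1, 5, 8]`, the projection vector is `[1, 5, 8]` and the rewritten expressions would refer to columns `0`, `1`, and `2` of the narrower three-column batch, so later filtering only touches those three columns.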

3. **Projects columns** to avoid filtering unused data
4. **Eliminates scatter** operations in common patterns

These improvements compound: a CASE expression on a wide table with multiple branches and early matches benefits from all four optimizations simultaneously.
Contributor

I think it would be stronger here to refer to the actual performance numbers (see suggestion on introduction) rather than generalizations.


For the rest of this post we'll be looking at 'searched case' evaluation.
'Simple case' uses a distinct, but very similar implementation.
The same set of improvements has been applied to both.
Contributor

I think this section would be easier to follow if it had a diagram showing the steps -- I am not sure if you have a visual in mind, but we might be able to come up with something that visually shows the improvements the blog describes.

Contributor Author

I'll see if I can polish the quick sketches I made when I was working on this. Copying them here as reference.
Is something along these lines what you have in mind?

[two sketch images attached]

Contributor

@alamb commented Dec 24, 2025

Yes, exactly -- showing how the data flows from input to output, with any important intermediate results along the way ❤️

@alamb changed the title from "Add blog post regarding CASE work" to "Blog post about CASE optimization" on Jan 10, 2026
@alamb
Contributor

alamb commented Jan 10, 2026

With the impending release of DataFusion 52.0.0, I am hoping we can publish the blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog.

I think this one is looking pretty good. I had been dreaming about more diagrams, but I don't think they are required.

Do you think this is ready to go from your perspective, @pepijnve? Would you mind if I took a pass through to clean up some formatting (like the title)?

@pepijnve
Contributor Author

I've been meaning to add those diagrams and do another round of editing. Not much progress due to the holiday break and other priorities.

@alamb
Contributor

alamb commented Jan 11, 2026

> I've been meaning to add those diagrams and do another round of editing. Not much progress due to the holiday break and other priorities.

Me too! No worries. I'll check back in a few days -- anything I can do to help?
