Blog post about CASE optimization #122

pepijnve · 2025-11-11T15:30:17Z

Covers the work done as part of apache/datafusion#18075

pepijnve · 2025-11-11T15:30:41Z

Current state is a very crude first draft written by Claude AI

pepijnve · 2025-12-20T11:11:39Z

@alamb I finally found some time to get through what the bot produced. I think this is now in a good enough shape for a first review.

alamb · 2025-12-20T14:47:17Z

> @alamb I finally found some time to get through what the bot produced. I think this is now in a good enough shape for a first review.

Thank you -- I will try and review it over the next few days

pepijnve · 2025-12-20T16:49:05Z

After reading the excellent consecutive repartitioning post I think there might be some more polishing work to do on this one.

content/blog/2025-11-11-datafusion_case.md

Co-authored-by: Andrew Lamb <[email protected]>

alamb

First of all, thank you @pepijnve -- this is great and a really strong piece.

I left a bunch of polish suggestions, any/all of which I am more than happy to help implement.

Another optimization that I think would fit well into this blog would be the optimization for constant tables @rluvaton added in apache/datafusion#18183 (will be released in DataFusion 52)

alamb · 2025-12-22T11:43:52Z

content/blog/2025-11-11-datafusion_case.md

+}
+</style>
+
+# Optimizing CASE Expression Evaluation in DataFusion


I think it would be good to start this post off with some sort of quantification / visual of "how much faster is CASE after these optimizations"

Maybe either a chart like the one on https://datafusion.apache.org/blog/output/2025/09/29/datafusion-50.0.0/

Or a table

It would also be cool to have an "ablation" version (aka measure the performance after each additional optimization was added -- like a chart that shows progressive improvement).

Maybe we could use the "average runtime of the case benchmark"? I can look into generating this if you like

alamb · 2025-12-22T11:44:57Z

content/blog/2025-11-11-datafusion_case.md

+
+# Optimizing CASE Expression Evaluation in DataFusion
+
+SQL's `CASE` expression is one of the few constructs the language provides to perform conditional logic.


I think it might also make sense to mention that DataFusion (now) also rewrites all other conditional expressions to CASE (like COALESCE, IF, etc) so that these optimizations are now widely used
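To make the rewrite idea concrete, here is a minimal sketch of how `COALESCE` can be expressed as a searched `CASE`. It relies only on the standard SQL equivalence `COALESCE(a, b, c)` = `CASE WHEN a IS NOT NULL THEN a WHEN b IS NOT NULL THEN b ELSE c END`. The `Expr` enum and the `coalesce_to_case` function are hypothetical names for illustration; they are not DataFusion's types, and this thread does not show how DataFusion itself performs the rewrite.

```rust
/// Toy expression tree. Hypothetical: this is NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    IsNotNull(Box<Expr>),
    Case {
        /// (WHEN condition, THEN result) pairs, evaluated in order.
        when_then: Vec<(Expr, Expr)>,
        /// Optional ELSE result.
        else_expr: Option<Box<Expr>>,
    },
}

/// Rewrites COALESCE(args...) into an equivalent searched CASE.
/// Every argument except the last becomes `WHEN arg IS NOT NULL THEN arg`;
/// the last argument becomes the ELSE branch.
fn coalesce_to_case(mut args: Vec<Expr>) -> Expr {
    let last = args.pop().expect("COALESCE needs at least one argument");
    if args.is_empty() {
        // COALESCE(x) is just x.
        return last;
    }
    let when_then = args
        .into_iter()
        .map(|arg| (Expr::IsNotNull(Box::new(arg.clone())), arg))
        .collect();
    Expr::Case {
        when_then,
        else_expr: Some(Box::new(last)),
    }
}
```

With a rewrite of this shape, any speedup in the `CASE` evaluator also applies to the rewritten conditional expressions.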

alamb · 2025-12-22T11:45:26Z

content/blog/2025-11-11-datafusion_case.md

+Its deceptively simple syntax hides significant implementation complexity.
+Over the past few weeks, we've landed a series of improvements to DataFusion's `CASE` expression evaluator that reduce both CPU time and memory allocations.
+This post walks through the original implementation, its performance bottlenecks, and how we addressed them step by step.
+Finally we'll also take a look at some future improvements to `CASE` that are in the works.


I didn't see any section about future improvements 🤔

That's what I get for letting AI draft stuff. I think I can fill that in with a description of @rluvaton's upcoming work.

alamb · 2025-12-22T12:27:06Z

content/blog/2025-11-11-datafusion_case.md

+2. Build a projection vector containing only those column indices
+3. Derive new versions of the expressions with updated column references
+
+For example, if the original CASE references columns at indices `[1, 5, 8]` in a 20-column batch:


I can imagine a nice diagram for this section too, one that shows rewriting the expression to refer only to the three columns and then rewriting the input into a new three-column form. Not necessary, I am just thinking out loud.
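In lieu of a diagram, the index bookkeeping from steps 2 and 3 of the quoted passage can be sketched in a few lines. This is a standalone illustration using only the standard library; `build_projection` is a hypothetical helper name, not a DataFusion function, and the sketch assumes the referenced columns are kept in ascending order.

```rust
use std::collections::{BTreeSet, HashMap};

/// Given the column indices an expression references, returns:
/// - the projection vector: the sorted, de-duplicated indices to keep, and
/// - a remapping from each original index to its position in the projected batch.
/// Hypothetical helper for illustration only.
fn build_projection(referenced: &[usize]) -> (Vec<usize>, HashMap<usize, usize>) {
    // BTreeSet removes duplicates and yields indices in ascending order.
    let projection: Vec<usize> = referenced
        .iter()
        .copied()
        .collect::<BTreeSet<usize>>()
        .into_iter()
        .collect();
    // Position in the projection becomes the new column index.
    let remap: HashMap<usize, usize> = projection
        .iter()
        .enumerate()
        .map(|(new_idx, &old_idx)| (old_idx, new_idx))
        .collect();
    (projection, remap)
}
```

For the example in the quoted text, references to columns `[1, 5, 8]` of a 20-column batch give the projection `[1, 5, 8]` and the remapping `1 -> 0`, `5 -> 1`, `8 -> 2`, so the derived expressions refer to columns 0, 1 and 2 of a three-column batch.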

alamb · 2025-12-22T12:29:18Z

content/blog/2025-11-11-datafusion_case.md

+3. **Projects columns** to avoid filtering unused data
+4. **Eliminates scatter** operations in common patterns
+
+These improvements compound: a CASE expression on a wide table with multiple branches and early matches benefits from all four optimizations simultaneously.


I think it would be stronger here to refer to the actual performance numbers (see suggestion on introduction) rather than generalizations.

alamb · 2025-12-22T12:36:36Z

content/blog/2025-11-11-datafusion_case.md

+
+For the rest of this post we'll be looking at 'searched case' evaluation.
+'Simple case' uses a distinct, but very similar implementation.
+The same set of improvements has been applied to both.


I think this section would be easier to follow if it had a diagram showing the steps -- I am not sure if you have a visual in mind, but we might be able to come up with something that visually shows the improvements the blog describes.

I'll see if I can polish the quick sketches I made when I was working on this. Copying them here as reference.
Is something along these lines what you have in mind?

Yes, exactly -- showing how the data flows from input to output, with any important intermediate results along the way ❤️

alamb · 2026-01-10T19:58:28Z

With the impending release of DataFusion 52.0.0

Release DataFusion 52.0.0 (Dec 2025 / Jan 2026) datafusion#18566

I am hoping we can publish the blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog

Blog post for the DataFusion 52.0.0 release datafusion#19691

I think this one is looking pretty good. Though I had been dreaming about more diagrams, I don't think they are required.

Do you think this is ready to go from your perspective, @pepijnve? Would you mind if I took a pass through to clean up some formatting (like the title)?

pepijnve · 2026-01-10T19:59:56Z

I've been meaning to add those diagrams and do another round of editing. Not much progress due to the holiday break and other priorities.

alamb · 2026-01-11T12:25:31Z

> I've been meaning to add those diagrams and do another round of editing. Not much progress due to the holiday break and other priorities.

Me too! No worries. I'll check back in a few days -- anything I can do to help?

First AI-generated draft

b4322af

pepijnve mentioned this pull request Nov 11, 2025

Blog post for the DataFusion 51.0.0 release apache/datafusion#18548

Closed

pepijnve added 2 commits November 28, 2025 14:16

Editing

1cc248e

More editing

a2fe26e

pepijnve marked this pull request as ready for review December 20, 2025 11:10

alamb reviewed Dec 22, 2025

content/blog/2025-11-11-datafusion_case.md (outdated)

Update 2025-11-11-datafusion_case.md

3da4212

Co-authored-by: Andrew Lamb <[email protected]>

alamb approved these changes Dec 22, 2025

pepijnve added 4 commits December 22, 2025 15:12

Remove AI cruft

c414da7

Adjust title

b8fd84a

Remove commit hashes

30a115c

Misc. edits

de5f874

alamb changed the title ~~Add blog post regarding CASE work~~ Blog post about CASE optimization Jan 10, 2026

