337 changes: 337 additions & 0 deletions content/blog/2026-01-08-datafusion-52.0.0.md
@@ -0,0 +1,337 @@
---
layout: post
title: Apache DataFusion 52.0.0 Released
date: 2026-01-08
author: pmc
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

[TOC]

## Introduction

We are proud to announce the release of [DataFusion 52.0.0]. This post highlights
some of the major improvements since [DataFusion 51.0.0]. The complete list of
changes is available in the [changelog]. Thanks to the [120 contributors] for
making this release possible.

TODO: confirm the release date for 52.0.0 and update the front matter if needed.

[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0
[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/
[changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md
[120 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits

## Performance Improvements 🚀

We continue to make significant performance improvements in DataFusion, both in
the core engine and in the Parquet reader. This release includes faster `CASE`
expressions, better hash performance for string types, and continued string
function optimizations.

### Performance Chart (TODO)

TODO: add the 52.0.0 performance chart and update the caption.

<img
src="/blog/images/datafusion-52.0.0/performance_over_time_clickbench.png"
width="100%"
class="img-responsive"
alt="Performance over time"
/>

**Figure 1**: TODO: update caption for 52.0.0 benchmarking results.

## Major Features ✨

### Arrow IPC Stream file support

DataFusion can now read Arrow IPC stream files ([#18457]). This expands
interoperability with systems that emit Arrow streams directly, making it
simpler to ingest Arrow-native data without conversion.

Example (TODO: confirm exact syntax for IPC stream format selection):

```sql
-- TODO: confirm whether the format name is `arrow`, `ipc_stream`, or implicit.
CREATE EXTERNAL TABLE ipc_events
STORED AS ARROW
LOCATION 's3://bucket/events.arrow';
```

Related PRs: [#18457]

[#18457]: https://github.com/apache/datafusion/pull/18457

### Faster `CASE` expression evaluation

DataFusion 52 completes major work from the CASE performance epic ([#18075]).
Lookup-table based evaluation avoids repeated expression evaluation and reduces
branching overhead, accelerating common ETL patterns.

Example:

```sql
SELECT
  CASE
    WHEN status IN ('NEW', 'READY', 'STAGED') THEN 'PENDING'
    WHEN status IN ('DONE', 'COMPLETE') THEN 'FINISHED'
    ELSE 'OTHER'
  END AS status_bucket,
  count(*)
FROM jobs
GROUP BY 1;
```
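
To illustrate the idea behind lookup-table evaluation, here is a minimal Python
sketch. It is a conceptual model, not DataFusion's implementation: it assumes
every branch compares the same column against literals, so the branches can be
flattened into one hash map before any rows are processed. The names
`build_case_lookup` and `eval_case` are hypothetical.

```python
# Conceptual sketch only: models lookup-table CASE evaluation in plain Python.
# `build_case_lookup` and `eval_case` are hypothetical names, not DataFusion APIs.


def build_case_lookup(branches):
    """Flatten (literals, result) branches into one dict, built once per query."""
    lookup = {}
    for literals, result in branches:
        for literal in literals:
            # setdefault keeps the first matching branch, as CASE semantics require.
            lookup.setdefault(literal, result)
    return lookup


def eval_case(values, lookup, else_result):
    """Evaluate the CASE for a whole column with one hash probe per row."""
    return [lookup.get(value, else_result) for value in values]


branches = [
    (("NEW", "READY", "STAGED"), "PENDING"),
    (("DONE", "COMPLETE"), "FINISHED"),
]
lookup = build_case_lookup(branches)
status_bucket = eval_case(["NEW", "DONE", "FAILED", "STAGED"], lookup, "OTHER")
print(status_bucket)
```

Each row costs a single hash probe regardless of how many `WHEN` branches the
expression has, which is where the saving over branch-by-branch evaluation
comes from.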

Related PRs: [#18183]

[#18075]: https://github.com/apache/datafusion/issues/18075
[#18183]: https://github.com/apache/datafusion/pull/18183

### Extensible SQL planning with relation planner extensions

DataFusion now supports relation planner extensions for custom SQL syntax and
planning logic ([#17824], [#17843]). This lets downstream projects inject their
own planning behavior without forking the SQL planner, which is critical for
dialect extensions and custom table references.

Diagram:

```
SQL text
| (custom relation planner extension)
v
Logical plan
| (DataFusion optimizers)
v
Physical plan
```
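
The dispatch in the diagram above can be pictured as a chain of planners that
each either handle a relation or decline it. The Python sketch below is a
conceptual model only and does not mirror the DataFusion Rust API. The
dictionary encoding of relations, the `range` relation, and the functions
`plan_relation`, `range_planner`, and `default_planner` are all hypothetical.

```python
# Conceptual sketch only: extension planners are tried first, and the built-in
# planner handles anything they decline. All names here are hypothetical.


def default_planner(relation):
    """Stand-in for the built-in planner: plan a plain table reference."""
    return ("TableScan", relation["name"])


def range_planner(relation):
    """Hypothetical extension that plans a custom `range` relation."""
    if relation.get("kind") == "range":
        return ("RangeScan", relation["start"], relation["stop"])
    return None  # decline, so the next planner gets a chance


def plan_relation(relation, extensions):
    """Return the first plan produced by an extension, else the default plan."""
    for extension in extensions:
        plan = extension(relation)
        if plan is not None:
            return plan
    return default_planner(relation)


custom_plan = plan_relation({"kind": "range", "start": 0, "stop": 10}, [range_planner])
table_plan = plan_relation({"kind": "table", "name": "jobs"}, [range_planner])
print(custom_plan, table_plan)
```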

TODO: include a short Rust snippet showing how to register a relation planner
extension once the final API example is confirmed.

Related PRs: [#17843]

[#17824]: https://github.com/apache/datafusion/issues/17824
[#17843]: https://github.com/apache/datafusion/pull/17843

### ListingTable object store usage improvements

ListingTable improvements continue to reduce object store I/O and planning
latency for partitioned datasets ([#17214]). DataFusion now normalizes
partitioned and flat listings, enables a memory-bounded list-files cache by
default, and makes the cache prefix-aware for partition pruning.

Diagram:

```
Object store LIST
| (normalized listing + cache)
v
Partitioned files
| (planner)
v
Execution plan
```
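
To make "prefix-aware" and "memory-bounded" concrete, here is a small Python
sketch of such a cache. It is a conceptual model under stated assumptions, not
DataFusion's cache: the class `ListFilesCache`, its methods, and the
`max_paths` bound (the number of cached paths, used as a stand-in for memory)
are hypothetical, and prefixes are assumed to end with `/`.

```python
# Conceptual sketch only: a bounded, prefix-aware cache of object store listings.
# `ListFilesCache` and `max_paths` are hypothetical, not DataFusion APIs.
from collections import OrderedDict


class ListFilesCache:
    def __init__(self, max_paths):
        self.max_paths = max_paths          # proxy for a memory limit
        self._listings = OrderedDict()      # prefix -> list of file paths

    def _total_paths(self):
        return sum(len(paths) for paths in self._listings.values())

    def put(self, prefix, paths):
        self._listings[prefix] = list(paths)
        self._listings.move_to_end(prefix)
        # Evict least recently used listings until the bound is respected.
        # A single oversized listing is kept in this sketch.
        while self._total_paths() > self.max_paths and len(self._listings) > 1:
            self._listings.popitem(last=False)

    def get(self, prefix):
        """Return cached paths under `prefix`, or None if a LIST call is needed."""
        for cached_prefix in list(self._listings):
            if prefix.startswith(cached_prefix):
                self._listings.move_to_end(cached_prefix)
                paths = self._listings[cached_prefix]
                # Serve a partition lookup from the broader cached listing.
                return [path for path in paths if path.startswith(prefix)]
        return None


cache = ListFilesCache(max_paths=100)
cache.put("sales/", [
    "sales/year=2024/a.parquet",
    "sales/year=2024/b.parquet",
    "sales/year=2025/c.parquet",
])
pruned = cache.get("sales/year=2025/")   # answered without a new LIST
miss = cache.get("orders/")              # not cached: caller must LIST
print(pruned, miss)
```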

Related PRs: [#18146], [#18855], [#19366], [#19298], [#18971]

[#17214]: https://github.com/apache/datafusion/issues/17214
[#18146]: https://github.com/apache/datafusion/pull/18146
[#18855]: https://github.com/apache/datafusion/pull/18855
[#19366]: https://github.com/apache/datafusion/pull/19366
[#19298]: https://github.com/apache/datafusion/pull/19298
[#18971]: https://github.com/apache/datafusion/pull/18971

### Statistics cache improvements

The statistics cache has been improved to make pruning and planning more
reliable in repeated workloads ([#19051]). DataFusion now exposes a
`statistics_cache` function and improves cache memory behavior for listing
workflows, making it easier to diagnose cache contents and reduce repeated I/O.

Example (TODO: confirm the function signature and output schema):

```sql
-- TODO: confirm the function name and arguments.
SELECT * FROM statistics_cache('my_table');
```

Related PRs: [#19054], [#18855], [#18971]

[#19051]: https://github.com/apache/datafusion/issues/19051
[#19054]: https://github.com/apache/datafusion/pull/19054

### Pushdown expression evaluation via PhysicalExprAdapter

DataFusion now pushes down expression evaluation into TableProviders using the
PhysicalExprAdapter, replacing the older SchemaAdapter approach ([#14993],
[#16800]). This enables richer pushdown (expressions and projections) and
improves consistency between logical and physical planning.

Diagram:

```
SQL filter/projection
| (PhysicalExprAdapter)
v
TableProvider pushdown
| (scan)
v
Reduced data
```
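
One way to picture expression-level adaptation is as a rewrite of the pushed-down
expression against each file's own schema, rather than a reshaping of every
batch to the table schema. The Python sketch below is a conceptual model under
that assumption and is not the `PhysicalExprAdapter` API. The tuple encoding of
expressions and the function `adapt_expr` are hypothetical.

```python
# Conceptual sketch only: rewrite an expression written against the table schema
# so it can be evaluated against one file's schema. All names are hypothetical.


def adapt_expr(expr, file_schema, table_schema):
    kind = expr[0]
    if kind == "lit":
        return expr
    if kind == "col":
        name = expr[1]
        if name not in file_schema:
            # Column missing from this file: substitute a NULL literal.
            return ("lit", None)
        if file_schema[name] != table_schema[name]:
            # Column stored with a different type: cast to the table's type.
            return ("cast", expr, table_schema[name])
        return expr
    # Binary operator: (op, left, right). Rewrite both sides recursively.
    op, left, right = expr
    return (
        op,
        adapt_expr(left, file_schema, table_schema),
        adapt_expr(right, file_schema, table_schema),
    )


table_schema = {"id": "int64", "region": "utf8"}
file_schema = {"id": "int32"}  # older file: narrower id, no region column
predicate = (
    "and",
    ("gt", ("col", "id"), ("lit", 10)),
    ("eq", ("col", "region"), ("lit", "EU")),
)
adapted = adapt_expr(predicate, file_schema, table_schema)
print(adapted)
```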

Related PRs: [#18998], [#19345]

[#14993]: https://github.com/apache/datafusion/issues/14993
[#16800]: https://github.com/apache/datafusion/issues/16800
[#18998]: https://github.com/apache/datafusion/pull/18998
[#19345]: https://github.com/apache/datafusion/pull/19345

### Hash join build-side pushdown

DataFusion can now push down the build-side hash table of a `HashJoinExec` into
the probe-side scan ([#17171]). When the build side is small, DataFusion
converts the hash table to an `IN` list or hash lookup that can be evaluated
during the scan, so rows with no join match are dropped before they reach the
join.

Example:

```sql
SELECT *
FROM orders o
JOIN small_dim d
ON o.dim_id = d.id;
```
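
The effect on a query like this can be sketched in Python. The code is a
conceptual model, not DataFusion's implementation: the functions
`make_pushdown_filter`, `scan_with_filter`, and `hash_join`, and the
`max_in_list_size` threshold, are hypothetical names chosen for illustration.

```python
# Conceptual sketch only: derive a scan filter from the join's build side.
# All function names and the `max_in_list_size` threshold are hypothetical.


def make_pushdown_filter(build_rows, key, max_in_list_size):
    """Small key sets become an IN list; larger ones stay a hash lookup."""
    keys = {row[key] for row in build_rows}
    if len(keys) <= max_in_list_size:
        return ("in_list", tuple(sorted(keys)))
    return ("hash_lookup", frozenset(keys))


def scan_with_filter(rows, column, pushdown):
    """Apply the pushed-down filter while scanning the probe side."""
    _, allowed = pushdown
    return [row for row in rows if row[column] in allowed]


def hash_join(probe_rows, probe_key, build_rows, build_key):
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    return [
        {**probe, **build}
        for probe in probe_rows
        for build in table.get(probe[probe_key], [])
    ]


small_dim = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [
    {"order_id": 10, "dim_id": 1},
    {"order_id": 11, "dim_id": 7},
    {"order_id": 12, "dim_id": 2},
    {"order_id": 13, "dim_id": 9},
]
pushdown = make_pushdown_filter(small_dim, "id", max_in_list_size=3)
scanned = scan_with_filter(orders, "dim_id", pushdown)
joined = hash_join(scanned, "dim_id", small_dim, "id")
print(pushdown, len(scanned), joined)
```

The join result is unchanged, but only the rows that can match ever leave the
scan.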

TODO: include a physical plan snippet that shows the pushdown filter once a
canonical example is selected.

Related PRs: [#18393]

[#17171]: https://github.com/apache/datafusion/issues/17171
[#18393]: https://github.com/apache/datafusion/pull/18393

### Sort pushdown to sources

DataFusion now supports sort pushdown into data sources, allowing scans to
return sorted data or leverage reversed row groups when possible ([#10433],
[#19064]). This reduces memory pressure and can eliminate explicit sort stages
for partitioned or pre-sorted data.

Example:

```sql
SELECT *
FROM parquet_table
ORDER BY event_time DESC;
```
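
A simplified Python model shows why reversing row groups can satisfy a query
like this without a sort. It assumes the file was written in ascending
`event_time` order, so each row group is sorted and row groups do not overlap.
This is a sketch of the idea, not DataFusion's implementation, and
`scan_ordered` and `reverse_order` are hypothetical names.

```python
# Conceptual sketch only: satisfy ORDER BY from the file's existing ordering.
# `scan_ordered` and `reverse_order` are hypothetical, not DataFusion APIs.


def reverse_order(order):
    column, direction = order
    return (column, "desc" if direction == "asc" else "asc")


def scan_ordered(row_groups, file_order, requested_order):
    """Return (rows, needs_sort) for a scan over sorted, non-overlapping row groups."""
    if requested_order == file_order:
        # Already in the requested order: read row groups front to back.
        return [value for group in row_groups for value in group], False
    if requested_order == reverse_order(file_order):
        # Opposite order: read row groups back to front, reversing each one.
        rows = [value for group in reversed(row_groups) for value in reversed(group)]
        return rows, False
    # Unrelated ordering: the scan cannot help, so a sort stage is still needed.
    return [value for group in row_groups for value in group], True


row_groups = [[1, 2, 3], [4, 5], [6, 7, 8]]  # event_time values, ascending
rows, needs_sort = scan_ordered(
    row_groups, ("event_time", "asc"), ("event_time", "desc")
)
print(rows, needs_sort)
```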

Related PRs: [#19064]

[#10433]: https://github.com/apache/datafusion/issues/10433
[#19064]: https://github.com/apache/datafusion/pull/19064

### DELETE/UPDATE hooks in TableProvider

TableProvider now includes DELETE and UPDATE hooks, with MemTable providing the
first implementation ([#19142]). This is an important step toward fully
featured DML support and enables downstream storage engines to plug in their
own mutation logic.

Example:

```sql
DELETE FROM mem_table WHERE status = 'obsolete';
```
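
The hook pattern can be modeled in a few lines of Python. This is a conceptual
sketch rather than the Rust trait: the method names `delete_from` and `update`,
their signatures, and the row-dictionary storage are assumptions made for
illustration.

```python
# Conceptual sketch only: default hooks reject DML, and a provider opts in by
# overriding them. Method names and signatures here are assumptions.


class TableProvider:
    def delete_from(self, predicate):
        raise NotImplementedError("DELETE is not supported by this table")

    def update(self, assignments, predicate):
        raise NotImplementedError("UPDATE is not supported by this table")


class MemTable(TableProvider):
    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]

    def delete_from(self, predicate):
        """Remove rows matching `predicate`; return the number of rows deleted."""
        before = len(self.rows)
        self.rows = [row for row in self.rows if not predicate(row)]
        return before - len(self.rows)

    def update(self, assignments, predicate):
        """Apply `assignments` to matching rows; return the number of rows updated."""
        count = 0
        for row in self.rows:
            if predicate(row):
                row.update(assignments)
                count += 1
        return count


mem_table = MemTable([
    {"id": 1, "status": "active"},
    {"id": 2, "status": "obsolete"},
    {"id": 3, "status": "active"},
])
deleted = mem_table.delete_from(lambda row: row["status"] == "obsolete")
updated = mem_table.update({"status": "archived"}, lambda row: row["id"] == 3)
print(deleted, updated, mem_table.rows)
```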

Related PRs: [#19142]

[#19142]: https://github.com/apache/datafusion/pull/19142

### CoalesceBatchesExec removal and integrated batch coalescing

DataFusion continues the work to remove the standalone CoalesceBatchesExec
operator ([#18779]). Batch coalescing is now integrated into multiple operators,
reducing plan complexity and avoiding unnecessary batch materialization.

Diagram:

```
Before:
Scan -> CoalesceBatches -> Filter -> CoalesceBatches -> Join

After:
Scan -> Filter (coalesce inline) -> Join (coalesce inline)
```
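
The "coalesce inline" step can be sketched as a filter that buffers its
surviving rows and only emits a batch once it reaches a target size. The Python
below is a conceptual model, not DataFusion's operator code, and the function
`filter_with_coalescing` and its `target_batch_size` parameter are hypothetical.

```python
# Conceptual sketch only: a filter that coalesces its own output batches.
# `filter_with_coalescing` is a hypothetical name, not a DataFusion API.


def filter_with_coalescing(batches, predicate, target_batch_size):
    """Filter each input batch and yield output batches of the target size."""
    buffer = []
    for batch in batches:
        buffer.extend(row for row in batch if predicate(row))
        # Emit full batches as soon as enough rows have accumulated.
        while len(buffer) >= target_batch_size:
            yield buffer[:target_batch_size]
            buffer = buffer[target_batch_size:]
    if buffer:
        # Flush the final, possibly smaller, batch at end of input.
        yield buffer


batches = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
coalesced = list(
    filter_with_coalescing(batches, lambda value: value % 2 == 0, target_batch_size=4)
)
print(coalesced)
```

Without coalescing, this filter would pass three two-row batches downstream.
With it, the next operator sees one full batch and one remainder, and no
separate operator is needed in the plan.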

Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239]

[#18779]: https://github.com/apache/datafusion/issues/18779
[#18540]: https://github.com/apache/datafusion/pull/18540
[#18604]: https://github.com/apache/datafusion/pull/18604
[#18630]: https://github.com/apache/datafusion/pull/18630
[#18972]: https://github.com/apache/datafusion/pull/18972
[#19002]: https://github.com/apache/datafusion/pull/19002
[#19342]: https://github.com/apache/datafusion/pull/19342
[#19239]: https://github.com/apache/datafusion/pull/19239

## Upgrade Guide and Changelog

Upgrading to 52.0.0 should be straightforward for most users. Please review the
[Upgrade Guide] for details on breaking changes and code snippets to help with
the transition. For a comprehensive list of all changes, please refer to the
[changelog].

## About DataFusion

[Apache DataFusion] is an extensible query engine, written in [Rust], that uses
[Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While [DataFusion's primary
design goal] is to accelerate the creation of other data-centric systems, it
provides a reasonable experience directly out of the box as a [dataframe
library], [Python library], and [command-line SQL tool].

[apache datafusion]: https://datafusion.apache.org/
[rust]: https://www.rust-lang.org/
[apache arrow]: https://arrow.apache.org
[DataFusion's primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
[python library]: https://datafusion.apache.org/python/
[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/
[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html

## How to Get Involved

DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors works together to
build a shared technology that none of us could have built alone.

If you are interested in joining us, we would love to have you. You can try out
DataFusion on your own data and projects and let us know how it goes, or
contribute suggestions, documentation, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is [here], and you
can find out how to reach us in the [communication doc].

[here]: https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html