diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md new file mode 100644 index 00000000..52ff7fa7 --- /dev/null +++ b/content/blog/2026-01-08-datafusion-52.0.0.md @@ -0,0 +1,337 @@ +--- +layout: post +title: Apache DataFusion 52.0.0 Released +date: 2026-01-08 +author: pmc +categories: [release] +--- + + + +[TOC] + +## Introduction + +We are proud to announce the release of [DataFusion 52.0.0]. This post highlights +some of the major improvements since [DataFusion 51.0.0]. The complete list of +changes is available in the [changelog]. Thanks to the [120 contributors] for +making this release possible. + +TODO: confirm the release date for 52.0.0 and update the front matter if needed. + +[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0 +[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md +[120 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits + +## Performance Improvements 🚀 + +We continue to make significant performance improvements in DataFusion, both in +the core engine and in the Parquet reader. This release includes faster `CASE` +expressions, better hash performance for string types, and continued string +function optimizations. + +### Performance Chart (TODO) + +TODO: add the 52.0.0 performance chart and update the caption. + + + +**Figure 1**: TODO: update caption for 52.0.0 benchmarking results. + +## Major Features ✨ + +### Arrow IPC Stream file support + +DataFusion can now read Arrow IPC stream files ([#18457]). This expands +interoperability with systems that emit Arrow streams directly, making it +simpler to ingest Arrow-native data without conversion. + +Example (TODO: confirm exact syntax for IPC stream format selection): + +```sql +-- TODO: confirm whether the format name is `arrow`, `ipc_stream`, or implicit. 
+CREATE EXTERNAL TABLE ipc_events +STORED AS ARROW +LOCATION 's3://bucket/events.arrow'; +``` + +Related PRs: [#18457] + +[#18457]: https://github.com/apache/datafusion/pull/18457 + +### Faster `CASE` expression evaluation + +DataFusion 52 completes major work from the CASE performance epic ([#18075]). +Lookup-table based evaluation avoids repeated expression evaluation and reduces +branching overhead, accelerating common ETL patterns. + +Example: + +```sql +SELECT + CASE + WHEN status IN ('NEW', 'READY', 'STAGED') THEN 'PENDING' + WHEN status IN ('DONE', 'COMPLETE') THEN 'FINISHED' + ELSE 'OTHER' + END AS status_bucket, + count(*) +FROM jobs +GROUP BY 1; +``` + +Related PRs: [#18183] + +[#18075]: https://github.com/apache/datafusion/issues/18075 +[#18183]: https://github.com/apache/datafusion/pull/18183 + +### Extensible SQL planning with relation planner extensions + +DataFusion now supports relation planner extensions for custom SQL syntax and +planning logic ([#17824], [#17843]). This lets downstream projects inject their +own planning behavior without forking the SQL planner, which is critical for +dialect extensions and custom table references. + +Diagram: + +``` +SQL text + | (custom relation planner extension) + v +Logical plan + | (DataFusion optimizers) + v +Physical plan +``` + +TODO: include a short Rust snippet showing how to register a relation planner +extension once the final API example is confirmed. + +Related PRs: [#17843] + +[#17824]: https://github.com/apache/datafusion/issues/17824 +[#17843]: https://github.com/apache/datafusion/pull/17843 + +### ListingTable object store usage improvements + +ListingTable improvements continue to reduce object store I/O and planning +latency for partitioned datasets ([#17214]). DataFusion now normalizes partition +and flat listings, enables a memory-bound list-files cache by default, and +makes the cache prefix-aware for partition pruning. 
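To make the "prefix-aware" idea concrete, here is a minimal sketch in plain Rust (standard library only). It is not DataFusion's implementation: `ListFilesCache`, `list`, and the simulated object store are hypothetical names used for illustration, and the memory bound is omitted. It assumes that a listing cached for a table root can answer later requests for a partition sub-prefix by filtering the cached paths instead of issuing another LIST call.

```rust
use std::collections::BTreeMap;

/// Hypothetical listing cache, keyed by the prefix that was listed.
/// Illustrative only; this is not a DataFusion type.
#[derive(Default)]
struct ListFilesCache {
    /// Listed prefix -> every object path found under it.
    entries: BTreeMap<String, Vec<String>>,
    /// Number of simulated LIST requests sent to the object store.
    list_calls: usize,
}

impl ListFilesCache {
    /// Return the paths under `prefix`. A cached listing whose key is an
    /// ancestor of (or equal to) `prefix` is reused and filtered in memory.
    fn list(&mut self, prefix: &str, store: &[&str]) -> Vec<String> {
        let cached = self
            .entries
            .iter()
            .find(|(key, _)| prefix.starts_with(key.as_str()))
            .map(|(_, paths)| paths.clone());
        if let Some(paths) = cached {
            return paths.into_iter().filter(|p| p.starts_with(prefix)).collect();
        }
        // Cache miss: simulate one LIST request against the object store.
        self.list_calls += 1;
        let paths: Vec<String> = store
            .iter()
            .filter(|p| p.starts_with(prefix))
            .map(|p| p.to_string())
            .collect();
        self.entries.insert(prefix.to_string(), paths.clone());
        paths
    }
}

fn main() {
    // Stand-in for an object store holding a partitioned table.
    let store = [
        "t/year=2024/a.parquet",
        "t/year=2024/b.parquet",
        "t/year=2025/c.parquet",
    ];
    let mut cache = ListFilesCache::default();
    let all = cache.list("t/", &store);
    // Partition pruning asks for a sub-prefix; the cached root listing serves it.
    let pruned = cache.list("t/year=2025/", &store);
    println!("all={:?} pruned={:?} list_calls={}", all, pruned, cache.list_calls);
}
```

In this sketch the second request is served without touching the store, which is the kind of saving the prefix-aware cache targets for partitioned datasets.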
+ +Diagram: + +``` +Object store LIST + | (normalized listing + cache) + v +Partitioned files + | (planner) + v +Execution plan +``` + +Related PRs: [#18146], [#18855], [#19366], [#19298], [#18971] + +[#17214]: https://github.com/apache/datafusion/issues/17214 +[#18146]: https://github.com/apache/datafusion/pull/18146 +[#18855]: https://github.com/apache/datafusion/pull/18855 +[#19366]: https://github.com/apache/datafusion/pull/19366 +[#19298]: https://github.com/apache/datafusion/pull/19298 +[#18971]: https://github.com/apache/datafusion/pull/18971 + +### Statistics cache improvements + +The statistics cache has been improved to make pruning and planning more +reliable in repeated workloads ([#19051]). DataFusion now exposes a +`statistics_cache` function and improves cache memory behavior for listing +workflows, making it easier to diagnose cache contents and reduce repeated I/O. + +Example (TODO: confirm the function signature and output schema): + +```sql +-- TODO: confirm the function name and arguments. +SELECT * FROM statistics_cache('my_table'); +``` + +Related PRs: [#19054], [#18855], [#18971] + +[#19051]: https://github.com/apache/datafusion/issues/19051 +[#19054]: https://github.com/apache/datafusion/pull/19054 + +### Pushdown expression evaluation via PhysicalExprAdapter + +DataFusion now pushes down expression evaluation into TableProviders using the +PhysicalExprAdapter, replacing the older SchemaAdapter approach ([#14993], +[#16800]). This enables richer pushdown (expressions and projections) and +improves consistency between logical and physical planning. 
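As a rough illustration of adapting expressions rather than data, the toy model below rewrites a predicate against the columns a particular file actually contains. It is a sketch in plain Rust, not the `PhysicalExprAdapter` API: the `Expr` enum and `adapt_to_file_schema` are hypothetical, and the rule shown (a column missing from the file becomes a NULL literal) is an assumption about one kind of rewrite such an adapter can perform.

```rust
/// Toy expression tree; illustrative only, not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    /// `None` models a NULL literal.
    Literal(Option<i64>),
    Gt(Box<Expr>, Box<Expr>),
}

/// Rewrite `expr` so it only references columns present in the file.
/// Columns the file lacks are replaced by a NULL literal, so the predicate
/// can be evaluated at the scan without first padding every batch.
fn adapt_to_file_schema(expr: &Expr, file_columns: &[&str]) -> Expr {
    match expr {
        Expr::Column(name) if !file_columns.contains(&name.as_str()) => Expr::Literal(None),
        Expr::Gt(left, right) => Expr::Gt(
            Box::new(adapt_to_file_schema(left, file_columns)),
            Box::new(adapt_to_file_schema(right, file_columns)),
        ),
        other => other.clone(),
    }
}

fn main() {
    // Table-level predicate: added_later > 10
    let predicate = Expr::Gt(
        Box::new(Expr::Column("added_later".to_string())),
        Box::new(Expr::Literal(Some(10))),
    );
    // An older file that was written before `added_later` existed.
    let adapted = adapt_to_file_schema(&predicate, &["id"]);
    println!("{:?}", adapted);
}
```

The design point is that the rewrite happens once per file on a small expression tree, instead of reshaping every record batch to match the table schema.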
+ +Diagram: + +``` +SQL filter/projection + | (PhysicalExprAdapter) + v +TableProvider pushdown + | (scan) + v +Reduced data +``` + +Related PRs: [#18998], [#19345] + +[#14993]: https://github.com/apache/datafusion/issues/14993 +[#16800]: https://github.com/apache/datafusion/issues/16800 +[#18998]: https://github.com/apache/datafusion/pull/18998 +[#19345]: https://github.com/apache/datafusion/pull/19345 + +### Hash join build-side pushdown + +DataFusion can now push down build-side hash tables from HashJoinExec into scans +([#17171]). When the build side is small, DataFusion converts the hash table to +an `IN` list or hash lookup that can be evaluated during scans, reducing the +join input size early. + +Example: + +```sql +SELECT * +FROM orders o +JOIN small_dim d +ON o.dim_id = d.id; +``` + +TODO: include a physical plan snippet that shows the pushdown filter once a +canonical example is selected. + +Related PRs: [#18393] + +[#17171]: https://github.com/apache/datafusion/issues/17171 +[#18393]: https://github.com/apache/datafusion/pull/18393 + +### Sort pushdown to sources + +DataFusion now supports sort pushdown into data sources, allowing scans to +return sorted data or leverage reversed row groups when possible ([#10433], +[#19064]). This reduces memory pressure and can eliminate explicit sort stages +for partitioned or pre-sorted data. + +Example: + +```sql +SELECT * +FROM parquet_table +ORDER BY event_time DESC; +``` + +Related PRs: [#19064] + +[#10433]: https://github.com/apache/datafusion/issues/10433 +[#19064]: https://github.com/apache/datafusion/pull/19064 + +### DELETE/UPDATE hooks in TableProvider + +TableProvider now includes DELETE and UPDATE hooks, with MemTable providing the +first implementation ([#19142]). This is an important step toward fully +featured DML support and enables downstream storage engines to plug in their +own mutation logic. 
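The hook pattern can be sketched in plain Rust as a trait method with an "unsupported" default that storage engines override. This is not the `TableProvider` signature: `MutableTable`, `delete_where`, `InMemoryTable`, and `ReadOnlyTable` are hypothetical names, and the real hooks may differ in arguments and return type.

```rust
/// Illustrative row type for the sketch.
struct Row {
    id: u32,
    status: &'static str,
}

/// Hypothetical mutation hook; engines that support DELETE override it.
trait MutableTable {
    /// Delete rows matching `pred` and return how many were removed.
    /// The default mirrors a provider that has no mutation support.
    fn delete_where(&mut self, _pred: &dyn Fn(&Row) -> bool) -> Result<usize, String> {
        Err("DELETE is not supported by this table".to_string())
    }
}

/// In-memory table that implements the hook.
struct InMemoryTable {
    rows: Vec<Row>,
}

impl MutableTable for InMemoryTable {
    fn delete_where(&mut self, pred: &dyn Fn(&Row) -> bool) -> Result<usize, String> {
        let before = self.rows.len();
        self.rows.retain(|row| !pred(row));
        Ok(before - self.rows.len())
    }
}

/// Table that keeps the default and therefore rejects DELETE.
struct ReadOnlyTable;

impl MutableTable for ReadOnlyTable {}

fn main() {
    let mut table = InMemoryTable {
        rows: vec![
            Row { id: 1, status: "obsolete" },
            Row { id: 2, status: "active" },
        ],
    };
    let deleted = table.delete_where(&|row: &Row| row.status == "obsolete");
    let remaining: Vec<u32> = table.rows.iter().map(|row| row.id).collect();
    println!("deleted={:?} remaining ids={:?}", deleted, remaining);
    println!("read-only: {:?}", ReadOnlyTable.delete_where(&|_: &Row| true));
}
```

Keeping the default as an error means existing providers continue to compile and simply report that mutation is unavailable until they opt in.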
+ +Example: + +```sql +DELETE FROM mem_table WHERE status = 'obsolete'; +``` + +Related PRs: [#19142] + +[#19142]: https://github.com/apache/datafusion/pull/19142 + +### CoalesceBatchesExec removal and integrated batch coalescing + +DataFusion continues the work to remove the standalone CoalesceBatchesExec +operator ([#18779]). Batch coalescing is now integrated into multiple operators, +reducing plan complexity and avoiding unnecessary batch materialization. + +Diagram: + +``` +Before: + Scan -> CoalesceBatches -> Filter -> CoalesceBatches -> Join + +After: + Scan -> Filter (coalesce inline) -> Join (coalesce inline) +``` + +Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239] + +[#18779]: https://github.com/apache/datafusion/issues/18779 +[#18540]: https://github.com/apache/datafusion/pull/18540 +[#18604]: https://github.com/apache/datafusion/pull/18604 +[#18630]: https://github.com/apache/datafusion/pull/18630 +[#18972]: https://github.com/apache/datafusion/pull/18972 +[#19002]: https://github.com/apache/datafusion/pull/19002 +[#19342]: https://github.com/apache/datafusion/pull/19342 +[#19239]: https://github.com/apache/datafusion/pull/19239 + +## Upgrade Guide and Changelog + +Upgrading to 52.0.0 should be straightforward for most users. Please review the +[Upgrade Guide] +for details on breaking changes and code snippets to help with the transition. +For a comprehensive list of all changes, please refer to the [changelog]. + +## About DataFusion + +[Apache DataFusion] is an extensible query engine, written in [Rust], that uses +[Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast, data-centric systems such as databases, dataframe libraries, +and machine learning and streaming applications. 
While [DataFusion's primary +design goal] is to accelerate the creation of other data-centric systems, it +provides a reasonable experience directly out of the box as a [dataframe +library], [Python library], and [command-line SQL tool]. + +[apache datafusion]: https://datafusion.apache.org/ +[rust]: https://www.rust-lang.org/ +[apache arrow]: https://arrow.apache.org +[DataFusion's primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html +[python library]: https://datafusion.apache.org/python/ +[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ +[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html + +## How to Get Involved + +DataFusion is not a project built or driven by a single person, company, or +foundation. Rather, our community of users and contributors works together to +build a shared technology that none of us could have built alone. + +If you are interested in joining us, we would love to have you. You can try out +DataFusion on some of your own data and projects and let us know how it goes, +contribute suggestions, documentation, bug reports, or a PR with documentation, +tests, or code. A list of open issues suitable for beginners is [here], and you +can find out how to reach us on the [communication doc]. + +[here]: https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html