Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
353 changes: 353 additions & 0 deletions _posts/2025-07-17-21.0.0-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,353 @@
---
layout: post
title: "Apache Arrow 21.0.0 Release"
date: "2025-07-17 00:00:00"
author: pmc
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

The Apache Arrow team is pleased to announce the 21.0.0 release. This release
covers over 2 months of development work and includes [**339 resolved
issues**][1] on [**400 distinct commits**][2] from [**82 distinct
contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) to
learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights
of the release. Many other bugfixes and improvements have been made: we refer
you to the [complete changelog][3].

## Community

Since the 20.0.0 release, Alenka Frim has been invited to join the Project
Management Committee (PMC).

Thanks for your contributions and participation in the project!

The [Call for Speakers](https://sessionize.com/arrow-summit-2025/) for the
Apache Arrow Summit 2025 is now open! The Summit will take place on October 2nd,
2025 in Paris, France as part of [PyData Paris](https://pydata.org/paris2025).
The call will be open until July 26, 2025. Please see the [Call for
Speakers](https://sessionize.com/arrow-summit-2025/) link to submit a talk or
the [developer mailing
list](https://lists.apache.org/thread/f0vcbtpzg6rntzbvmjstyd27bd9qfhl0) for more
information.

## Arrow Flight RPC Notes

In C++ and Python, a new IPC reader option was added to force data buffers to be
aligned based on the data type, making it easier to work with systems that
expected alignment ([GH-32276](https://github.com/apache/arrow/issues/32276)).
While this is not a Flight-specific option, it tended to occur with Flight due
to implementation details. Also, C++ and Python are now consistent with other
Flight implementations in allowing the schema of a `FlightInfo` to be omitted
([GH-37677](https://github.com/apache/arrow/issues/37677)).

We have accepted a donation of an ODBC driver for Flight SQL from Dremio
([GH-46522](https://github.com/apache/arrow/issues/46522)). Note that the driver
is not usable in its current state and contributors are working on implementing
the rest of the driver.

## C++ Notes

### Compute

The Cast function is now able to reorder fields when casting from one struct
type to another; the fields are matched by name, not by index
([GH-45028](https://github.com/apache/arrow/issues/45028)).

Many compute kernels have been moved into a separate, optional, shared library
([GH-25025](https://github.com/apache/arrow/issues/25025)). This improves
modularity for dependency management in the application and reduces the Arrow
C++ distribution size when the compute functionality is not being used. Note
that some compute functions, such as the Cast function, will still be built for
internal use in various Arrow components.

Better half-float support has been added to some compute functions: `is_nan`,
`is_inf`, `is_finite`, `negate`, `negate_checked`, `sign`
([GH-45083](https://github.com/apache/arrow/issues/45083)); `if_else`,
`case_when`, `coalesce`, `choose`, `replace_with_mask`, `fill_null_forward`,
`fill_null_backward` ([GH-37027](https://github.com/apache/arrow/issues/37027));
`run_end_encode`, `run_end_decode`
([GH-46285](https://github.com/apache/arrow/issues/46285)).

Better decimal32 and decimal64 support has been added to some compute functions:
`run_end_encode`, `run_end_decode`
([GH-46285](https://github.com/apache/arrow/issues/46285)).

A new function `utf8_zero_fill` acts like Python's `str.zfill` method by
providing a left-padding function that preserves the optional leading plus/minus
sign ([GH-46683](https://github.com/apache/arrow/issues/46683)).

Decimal sum aggregation now produces a decimal result with an increased
precision in order to reduce the risk of overflowing the result type
([GH-35166](https://github.com/apache/arrow/issues/35166)).

### CSV

Reading Duration columns is now supported
([GH-40278](https://github.com/apache/arrow/issues/40278)).

### Dataset

It is now possible to preserve order when writing a dataset multi-threaded. The
feature is disabled by default
([GH-26818](https://github.com/apache/arrow/issues/26818)).

### Filesystems

The S3 filesystem can optionally be built into a separate DLL
([GH-40343](https://github.com/apache/arrow/issues/40343)).

### Parquet

#### Encryption

A new `SecureString` class must now be used to communicate sensitive data (such
as secret keys) with Parquet encryption APIs. This class automatically wipes its
contents from memory when destroyed, unlike regular `std::string`
([GH-31603](https://github.com/apache/arrow/issues/31603)).

#### Type support

The new VARIANT logical type is supported at a low level, and an extension type
`parquet.variant` is added to reflect such columns when reading them to Arrow
([GH-45937](https://github.com/apache/arrow/issues/45937)).

The UUID logical type is automatically converted to/from the `arrow.uuid`
canonical extension type when reading or writing Parquet data, respectively.

The GEOMETRY and GEOGRAPHY logical types are supported
([GH-45522](https://github.com/apache/arrow/issues/45522)). They are
automatically converted to/from the corresponding GeoArrow extension type, if it
has been registered by GeoArrow. Geospatial column statistics are also
supported.

It is now possible to read BYTE_ARRAY columns directly as LargeBinary or
BinaryView, without any intermediate conversion from Binary. Similarly, those
types can be written directly to Parquet
([GH-43041](https://github.com/apache/arrow/issues/43041)). This allows
bypassing the 2 GiB data per chunk limitation of the Binary type, and can also
improve performance. This also applies to String types when a Parquet column has
the STRING logical type.

Similarly, LIST columns can now be read directly as LargeList rather than List.
This allows bypassing the 2^31 values per chunk limitation of regular List types
([GH-46676](https://github.com/apache/arrow/issues/46676)).

#### Other Parquet improvements

A new feature named Content-Defined Chunking improves deduplication of Parquet
files with mostly identical contents, by choosing data page boundaries based on
actual contents rather than a number of values. For that, it uses a rolling hash
function, and the min and max chunk size can be chosen. The feature is disabled
by default and can be enabled on a per-file basis in the Parquet
`WriterProperties` ([GH-45750](https://github.com/apache/arrow/issues/45750)).

The `EncodedStatistics` of a column chunk are publicly exposed in
`ColumnChunkMetaData` and can be read faster than if decoded as `Statistics`
([GH-46462](https://github.com/apache/arrow/issues/46462)).

SIMD optimizations for the BYTE_STREAM_SPLIT have been improved
([GH-46788](https://github.com/apache/arrow/issues/46788)).

Reading FIXED_LEN_BYTE_ARRAY data has been made significantly faster (up to 3x
faster on some benchmarks). This benefits logical types such as FLOAT16
([GH-43891](https://github.com/apache/arrow/issues/43891)).

### Miscellaneous C++ changes

The `ARROW_USE_PRECOMPILED_HEADERS` build option was removed, as
`CMAKE_UNITY_BUILD` usually provides more benefits while requiring less
maintenance.

New data creation helpers `ArrayFromJSONString`, `ChunkedArrayFromJSONString`,
`DictArrayFromJSONString`, `ScalarFromJSONString` and `DictScalarFromJSONString`
are now exposed publicly. While not as high-performing as `BufferBuilder` and
the concrete `ArrayBuilder` subclasses, they allow easy creation of test or
example data, for example:

```c++
ARROW_ASSIGN_OR_RAISE(
auto string_array,
arrow::ArrayFromJSONString(arrow::utf8(), R"(["Hello", "World", null])"));
ARROW_ASSIGN_OR_RAISE(
auto list_array,
arrow::ArrayFromJSONString(arrow::list(arrow::int32()),
"[[1, null, 2], [], [3]]"));
```

Some APIs were changed to accept `std::string_view` instead of `const
std::string&`. Most uses of those APIs should not be affected
([GH-46551](https://github.com/apache/arrow/issues/46551)).

A new pretty-print option allows limiting element size when printing string or
binary data ([GH-46403](https://github.com/apache/arrow/issues/46403)).

It is now possible to export `Tensor` data using
[DLPack](https://dmlc.github.io/dlpack/latest/)
([GH-39294](https://github.com/apache/arrow/issues/39294)).

Half-float arrays can be properly diff'ed and pretty-printed
([GH-36753](https://github.com/apache/arrow/issues/36753)).

Some header files in `arrow/util` that were not supposed to be exposed are
now made internal ([GH-46459](https://github.com/apache/arrow/issues/46459)).

## C# Notes

The C# Arrow implementation is being extracted from the [Arrow
monorepo](https://github.com/apache/arrow) into a standalone repository to allow
it to have its own release cadence. See [the mailing
list](https://lists.apache.org/thread/0vj7hlzbzrv0lcrm92tgtfdh9gsj4dqb) for more
information. This is the final release of the Arrow monorepo that will include
the the C# implementation.

## Java, JavaScript, Go, and Rust Notes

The Java, JavaScript, Go, and Rust Go projects have moved to separate
repositories outside the main Arrow [monorepo](https://github.com/apache/arrow).

- For notes on the latest release of the [Java
implementation](https://github.com/apache/arrow-java), see the latest [Arrow
Java changelog][7].
- For notes on the latest release of the [JavaScript
implementation](https://github.com/apache/arrow-js), see the latest [Arrow
JavaScript changelog][8].
- For notes on the latest release of the [Rust
implementation](https://github.com/apache/arrow-rs) see the latest [Arrow Rust
changelog][5].
- For notes on the latest release of the [Go
implementation](https://github.com/apache/arrow-go), see the latest [Arrow Go
changelog][6].

## Linux Packaging Notes

We added support for AlmaLinux 10. You can use AlmaLinux 10 packages on Red Hat
Enterprise Linux 10 like distributions too.

We dropped support for CentOS Stream 8 because it reached EOL on 2024-05-31.

## MATLAB Notes

### New Features

Added support for creating an `arrow.tabular.Table` from a list of
`arrow.tabular.RecordBatch` instances
([GH-46877](https://github.com/apache/arrow/issues/46877))

### Packaging

The MLTBX available in apache/arrow's GitHub Releases area was built against
MATLAB R2025a.

## Python Notes

Compatibility notes:

- Deprecated ``PyExtensionType`` has been removed
([GH-46198](https://github.com/apache/arrow/issues/46198)).
- Deprecated ``use_legacy_format``has been removed in favour of setting
``IpcWriteOptions`` ([GH-46130](https://github.com/apache/arrow/issues/46130)).
- Due to SciPy 1.15's stricter sparse code changes are made to
``pa.SparseCXXMatrix`` constructors and ``pa.SparseCXXMatrix.to_scipy`` methods
with migrating from ``scipy.spmatrix`` to ``scipy.sparray``
([GH-45229](https://github.com/apache/arrow/issues/45229)).

New features:

- PyArrow does not require NumPy anymore to generate ``float16`` scalars and
arrays ([GH-46611](https://github.com/apache/arrow/issues/46611)).
- ``pc.utf8_zero_fill`` is now available in the compute module imitating
Python’s `str.zfill`` ([GH-46683](https://github.com/apache/arrow/issues/46683)).
- ``pa.arange`` utility function is now available which creates an array of
evenly spaced values within a given interval
([GH-46771](https://github.com/apache/arrow/issues/46771)).
- Scalar subclasses are now implementing Python protocols
([GH-45653](https://github.com/apache/arrow/issues/45653)).
- It is possible now to specify footer metadata when opening IPC file for
writing using metadata keyword in ``pa.ipc.new_file()``
([GH-46222](https://github.com/apache/arrow/issues/46222)).
- DLPack is now implemented (export) on the Tensor class in C++ and available in
Python ([GH-39294](https://github.com/apache/arrow/issues/39294)).

Other improvements:

- Couple of improvements have been included in the Filesystems module:
Filesystem operations are more convenient to users by supporting explicit
``fsspec+{protocol}`` and ``hf://`` filesystem URIs
([GH-44900](https://github.com/apache/arrow/issues/44900)),
``ConfigureManagedIdentityCredential`` and ``ConfigureClientSecretCredential``
have been exposed to ``AzureFileSystem``
([GH-46833](https://github.com/apache/arrow/issues/46833)), ``allow_delayed_open``
([GH-45957](https://github.com/apache/arrow/issues/45957)) and
``tls_ca_file_path`` ([GH-40754](https://github.com/apache/arrow/issues/40754))
have been exposed to ``S3FileSystem``.
- Parquet module improvements include: mapping of logical types to Arrow
extension types by default
([GH-44500](https://github.com/apache/arrow/issues/44500)), UUID extension type
conversion support when writing or reading to/from Parquet
([GH-43807](https://github.com/apache/arrow/issues/43807)) and support for uniform
encryption when writing parquet files by exposing
``EncryptionConfiguration.uniform_encryption``
([GH-38914](https://github.com/apache/arrow/issues/38914)).
- ``filter_expression`` is exposed in ``Table.join`` and ``Dataset.join`` to
support filtering rows when performing hash joins with Acero
([GH-46572](https://github.com/apache/arrow/issues/46572)).
- ``dim_names`` argument can now be passed to ``from_numpy_ndarray`` constructor
([GH-45531](https://github.com/apache/arrow/issues/45531)).

Relevant bug fixes:

- ``pyarrow.Table.to_struct_array`` failure when the table is empty has been
fixed ([GH-46355](https://github.com/apache/arrow/issues/46355)).
- Filtering all rows with ``RecordBatch.filter`` using an expression now returns
empty table with same schema instead of erroring
([GH-44366](https://github.com/apache/arrow/issues/44366)).

## Ruby and C GLib Notes

A number of changes were made in the 21.0.0 release which affect both Ruby and C GLib:

- Added support for fixed shape tensor extension data type.
- Added support for UUID extension data type.
- Added support for fixed size list data type.
- Added support for [the Arrow C data
interface](https://arrow.apache.org/docs/format/CDataInterface.html) for
chunked array.
- Added support for distinct count in array statistics.

### Ruby

There were no update only for Ruby.

### C GLib

You must call `garrow_compute_initialize()` explicitly before you use
computation related features.

[1]: https://github.com/apache/arrow/milestone/69?closed=1
[2]: {{ site.baseurl }}/release/21.0.0.html#contributors
[3]: {{ site.baseurl }}/release/21.0.0.html#changelog
[4]: {{ site.baseurl }}/docs/r/news/
[5]: <https://github.com/apache/arrow-rs/blob/main/CHANGELOG.md>
[6]: <https://github.com/apache/arrow-go/releases>
[7]: <https://github.com/apache/arrow-java/releases>
[8]: <https://github.com/apache/arrow-js/releases>