apache · amoeba · Jul 21, 2025 · Jul 15, 2025 · Jul 16, 2025 · Jul 17, 2025
diff --git a/_posts/2025-07-17-21.0.0-release.md b/_posts/2025-07-17-21.0.0-release.md
@@ -0,0 +1,353 @@
+---
+layout: post
+title: "Apache Arrow 21.0.0 Release"
+date: "2025-07-17 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 21.0.0 release. This release
+covers over 2 months of development work and includes [**339 resolved
+issues**][1] on [**400 distinct commits**][2] from [**82 distinct
+contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) to
+learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 20.0.0 release, Alenka Frim has been invited to join the Project
+Management Committee (PMC).
+
+Thanks for your contributions and participation in the project!
+
+The [Call for Speakers](https://sessionize.com/arrow-summit-2025/) for the
+Apache Arrow Summit 2025 is now open! The Summit will take place on October 2nd,
+2025 in Paris, France as part of [PyData Paris](https://pydata.org/paris2025).
+The call will be open until July 26, 2025. Please see the [Call for
+Speakers](https://sessionize.com/arrow-summit-2025/) link to submit a talk or
+the [developer mailing
+list](https://lists.apache.org/thread/f0vcbtpzg6rntzbvmjstyd27bd9qfhl0) for more
+information.
+
+## Arrow Flight RPC Notes
+
+In C++ and Python, a new IPC reader option was added to force data buffers to be
+aligned based on the data type, making it easier to work with systems that
+expected alignment ([GH-32276](https://github.com/apache/arrow/issues/32276)).
+While this is not a Flight-specific option, it tended to occur with Flight due
+to implementation details. Also, C++ and Python are now consistent with other
+Flight implementations in allowing the schema of a `FlightInfo` to be omitted
+([GH-37677](https://github.com/apache/arrow/issues/37677)).
+
+We have accepted a donation of an ODBC driver for Flight SQL from Dremio
+([GH-46522](https://github.com/apache/arrow/issues/46522)). Note that the driver
+is not usable in its current state and contributors are working on implementing
+the rest of the driver.
+
+## C++ Notes
+
+### Compute
+
+The Cast function is now able to reorder fields when casting from one struct
+type to another; the fields are matched by name, not by index
+([GH-45028](https://github.com/apache/arrow/issues/45028)).
+
+Many compute kernels have been moved into a separate, optional, shared library
+([GH-25025](https://github.com/apache/arrow/issues/25025)). This improves
+modularity for dependency management in the application and reduces the Arrow
+C++ distribution size when the compute functionality is not being used. Note
+that some compute functions, such as the Cast function, will still be built for
+internal use in various Arrow components.
+
+Better half-float support has been added to some compute functions: `is_nan`,
+`is_inf`, `is_finite`, `negate`, `negate_checked`, `sign`
+([GH-45083](https://github.com/apache/arrow/issues/45083)); `if_else`,
+`case_when`, `coalesce`, `choose`, `replace_with_mask`, `fill_null_forward`,
+`fill_null_backward` ([GH-37027](https://github.com/apache/arrow/issues/37027));
+`run_end_encode`, `run_end_decode`
+([GH-46285](https://github.com/apache/arrow/issues/46285)).
+
+Better decimal32 and decimal64 support has been added to some compute functions:
+`run_end_encode`, `run_end_decode`
+([GH-46285](https://github.com/apache/arrow/issues/46285)).
+
+A new function `utf8_zero_fill` acts like Python's `str.zfill` method by
+providing a left-padding function that preserves the optional leading plus/minus
+sign ([GH-46683](https://github.com/apache/arrow/issues/46683)).
+
+Decimal sum aggregation now produces a decimal result with an increased
+precision in order to reduce the risk of overflowing the result type
+([GH-35166](https://github.com/apache/arrow/issues/35166)).
+
+### CSV
+
+Reading Duration columns is now supported
+([GH-40278](https://github.com/apache/arrow/issues/40278)).
+
+### Dataset
+
+It is now possible to preserve order when writing a dataset multi-threaded. The
+feature is disabled by default
+([GH-26818](https://github.com/apache/arrow/issues/26818)).
+
+### Filesystems
+
+The S3 filesystem can optionally be built into a separate DLL
+([GH-40343](https://github.com/apache/arrow/issues/40343)).
+
+### Parquet
+
+#### Encryption
+
+A new `SecureString` class must now be used to communicate sensitive data (such
+as secret keys) with Parquet encryption APIs. This class automatically wipes its
+contents from memory when destroyed, unlike regular `std::string`
+([GH-31603](https://github.com/apache/arrow/issues/31603)).
+
+#### Type support
+
+The new VARIANT logical type is supported at a low level, and an extension type
+`parquet.variant` is added to reflect such columns when reading them to Arrow
+([GH-45937](https://github.com/apache/arrow/issues/45937)).
+
+The UUID logical type is automatically converted to/from the `arrow.uuid`
+canonical extension type when reading or writing Parquet data, respectively.
+
+The GEOMETRY and GEOGRAPHY logical types are supported
+([GH-45522](https://github.com/apache/arrow/issues/45522)). They are
+automatically converted to/from the corresponding GeoArrow extension type, if it
+has been registered by GeoArrow. Geospatial column statistics are also
+supported.
+
+It is now possible to read BYTE_ARRAY columns directly as LargeBinary or
+BinaryView, without any intermediate conversion from Binary. Similarly, those
+types can be written directly to Parquet
+([GH-43041](https://github.com/apache/arrow/issues/43041)). This allows
+bypassing the 2 GiB data per chunk limitation of the Binary type, and can also
+improve performance. This also applies to String types when a Parquet column has
+the STRING logical type.
+
+Similarly, LIST columns can now be read directly as LargeList rather than List.
+This allows bypassing the 2^31 values per chunk limitation of regular List types
+([GH-46676](https://github.com/apache/arrow/issues/46676)).
+
+#### Other Parquet improvements
+
+A new feature named Content-Defined Chunking improves deduplication of Parquet
+files with mostly identical contents, by choosing data page boundaries based on
+actual contents rather than a number of values. For that, it uses a rolling hash
+function, and the min and max chunk size can be chosen. The feature is disabled
+by default and can be enabled on a per-file basis in the Parquet
+`WriterProperties` ([GH-45750](https://github.com/apache/arrow/issues/45750)).
+
+The `EncodedStatistics` of a column chunk are publicly exposed in
+`ColumnChunkMetaData` and can be read faster than if decoded as `Statistics`
+([GH-46462](https://github.com/apache/arrow/issues/46462)).
+
+SIMD optimizations for the BYTE_STREAM_SPLIT have been improved
+([GH-46788](https://github.com/apache/arrow/issues/46788)).
+
+Reading FIXED_LEN_BYTE_ARRAY data has been made significantly faster (up to 3x
+faster on some benchmarks). This benefits logical types such as FLOAT16
+([GH-43891](https://github.com/apache/arrow/issues/43891)).
+
+### Miscellaneous C++ changes
+
+The `ARROW_USE_PRECOMPILED_HEADERS` build option was removed, as
+`CMAKE_UNITY_BUILD` usually provides more benefits while requiring less
+maintenance.
+
+New data creation helpers `ArrayFromJSONString`, `ChunkedArrayFromJSONString`,
+`DictArrayFromJSONString`, `ScalarFromJSONString` and `DictScalarFromJSONString`
+are now exposed publicly. While not as high-performing as `BufferBuilder` and
+the concrete `ArrayBuilder` subclasses, they allow easy creation of test or
+example data, for example:
+
+```c++
+  ARROW_ASSIGN_OR_RAISE(
+      auto string_array,
+      arrow::ArrayFromJSONString(arrow::utf8(), R"(["Hello", "World", null])"));
+  ARROW_ASSIGN_OR_RAISE(
+      auto list_array,
+      arrow::ArrayFromJSONString(arrow::list(arrow::int32()),
+                                 "[[1, null, 2], [], [3]]"));
+```
+
+Some APIs were changed to accept `std::string_view` instead of `const
+std::string&`. Most uses of those APIs should not be affected
+([GH-46551](https://github.com/apache/arrow/issues/46551)).
+
+A new pretty-print option allows limiting element size when printing string or
+binary data ([GH-46403](https://github.com/apache/arrow/issues/46403)).
+
+It is now possible to export `Tensor` data using
+[DLPack](https://dmlc.github.io/dlpack/latest/)
+([GH-39294](https://github.com/apache/arrow/issues/39294)).
+
+Half-float arrays can be properly diff'ed and pretty-printed
+([GH-36753](https://github.com/apache/arrow/issues/36753)).
+
+Some header files in `arrow/util` that were not supposed to be exposed are
+now made internal ([GH-46459](https://github.com/apache/arrow/issues/46459)).
+
+## C# Notes
+
+The C# Arrow implementation is being extracted from the [Arrow
+monorepo](https://github.com/apache/arrow) into a standalone repository to allow
+it to have its own release cadence. See [the mailing
+list](https://lists.apache.org/thread/0vj7hlzbzrv0lcrm92tgtfdh9gsj4dqb) for more
+information. This is the final release of the Arrow monorepo that will include
+the the C# implementation.
+
+## Java, JavaScript, Go, and Rust Notes
+
+The Java, JavaScript, Go, and Rust Go projects have moved to separate
+repositories outside the main Arrow [monorepo](https://github.com/apache/arrow).
+
+- For notes on the latest release of the [Java
+implementation](https://github.com/apache/arrow-java), see the latest [Arrow
+Java changelog][7].
+- For notes on the latest release of the [JavaScript
+implementation](https://github.com/apache/arrow-js), see the latest [Arrow
+JavaScript changelog][8].
+- For notes on the latest release of the [Rust
+  implementation](https://github.com/apache/arrow-rs) see the latest [Arrow Rust
+  changelog][5].
+- For notes on the latest release of the [Go
+implementation](https://github.com/apache/arrow-go), see the latest [Arrow Go
+changelog][6].
+
+## Linux Packaging Notes
+
+We added support for AlmaLinux 10. You can use AlmaLinux 10 packages on Red Hat
+Enterprise Linux 10 like distributions too.
+
+We dropped support for CentOS Stream 8 because it reached EOL on 2024-05-31.
+
+## MATLAB Notes
+
+### New Features
+
+Added support for creating an `arrow.tabular.Table` from a list of
+`arrow.tabular.RecordBatch` instances
+([GH-46877](https://github.com/apache/arrow/issues/46877))
+
+### Packaging
+
+The MLTBX available in apache/arrow's GitHub Releases area was built against
+MATLAB R2025a.
+
+## Python Notes
+
+Compatibility notes:
+
+- Deprecated ``PyExtensionType`` has been removed
+([GH-46198](https://github.com/apache/arrow/issues/46198)).
+- Deprecated ``use_legacy_format``has been removed in favour of setting
+``IpcWriteOptions`` ([GH-46130](https://github.com/apache/arrow/issues/46130)).
+- Due to SciPy 1.15's stricter sparse code changes are made to
+``pa.SparseCXXMatrix`` constructors and ``pa.SparseCXXMatrix.to_scipy`` methods
+with migrating from ``scipy.spmatrix`` to ``scipy.sparray``
+([GH-45229](https://github.com/apache/arrow/issues/45229)).
+
+New features:
+
+- PyArrow does not require NumPy anymore to generate ``float16`` scalars and
+arrays ([GH-46611](https://github.com/apache/arrow/issues/46611)).
+- ``pc.utf8_zero_fill`` is now available in the compute module imitating
+Python’s `str.zfill`` ([GH-46683](https://github.com/apache/arrow/issues/46683)).
+- ``pa.arange`` utility function is now available which creates an array of
+evenly spaced values within a given interval
+([GH-46771](https://github.com/apache/arrow/issues/46771)).
+- Scalar subclasses are now implementing Python protocols
+([GH-45653](https://github.com/apache/arrow/issues/45653)).
+- It is possible now to specify footer metadata when opening IPC file for
+writing using metadata keyword in ``pa.ipc.new_file()``
+([GH-46222](https://github.com/apache/arrow/issues/46222)).
+- DLPack is now implemented (export) on the Tensor class in C++ and available in
+Python ([GH-39294](https://github.com/apache/arrow/issues/39294)).
+
+Other improvements:
+
+- Couple of improvements have been included in the Filesystems module:
+Filesystem operations are more convenient to users by supporting explicit
+``fsspec+{protocol}`` and ``hf://`` filesystem URIs
+([GH-44900](https://github.com/apache/arrow/issues/44900)),
+``ConfigureManagedIdentityCredential`` and ``ConfigureClientSecretCredential``
+have been exposed to ``AzureFileSystem``
+([GH-46833](https://github.com/apache/arrow/issues/46833)), ``allow_delayed_open``
+([GH-45957](https://github.com/apache/arrow/issues/45957)) and
+``tls_ca_file_path`` ([GH-40754](https://github.com/apache/arrow/issues/40754))
+have been exposed to ``S3FileSystem``.
+- Parquet module improvements include: mapping of logical types to Arrow
+extension types by default
+([GH-44500](https://github.com/apache/arrow/issues/44500)), UUID extension type
+conversion support when writing or reading to/from Parquet
+([GH-43807](https://github.com/apache/arrow/issues/43807)) and support for uniform
+encryption when writing parquet files by exposing
+``EncryptionConfiguration.uniform_encryption``
+([GH-38914](https://github.com/apache/arrow/issues/38914)).
+- ``filter_expression`` is exposed in ``Table.join`` and ``Dataset.join`` to
+support filtering rows when performing hash joins with Acero
+([GH-46572](https://github.com/apache/arrow/issues/46572)).
+- ``dim_names`` argument can now be passed to ``from_numpy_ndarray`` constructor
+([GH-45531](https://github.com/apache/arrow/issues/45531)).
+
+Relevant bug fixes:
+
+- ``pyarrow.Table.to_struct_array`` failure when the table is empty has been
+fixed ([GH-46355](https://github.com/apache/arrow/issues/46355)).
+- Filtering all rows with ``RecordBatch.filter`` using an expression now returns
+empty table with same schema instead of erroring
+([GH-44366](https://github.com/apache/arrow/issues/44366)).
+
+## Ruby and C GLib Notes
+
+A number of changes were made in the 21.0.0 release which affect both Ruby and C GLib:
+
+- Added support for fixed shape tensor extension data type.
+- Added support for UUID extension data type.
+- Added support for fixed size list data type.
+- Added support for [the Arrow C data
+  interface](https://arrow.apache.org/docs/format/CDataInterface.html) for
+  chunked array.
+- Added support for distinct count in array statistics.
+
+### Ruby
+
+There were no update only for Ruby.
+
+### C GLib
+
+You must call `garrow_compute_initialize()` explicitly before you use
+computation related features.
+
+[1]: https://github.com/apache/arrow/milestone/69?closed=1
+[2]: {{ site.baseurl }}/release/21.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/21.0.0.html#changelog
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: <https://github.com/apache/arrow-rs/blob/main/CHANGELOG.md>
+[6]: <https://github.com/apache/arrow-go/releases>
+[7]: <https://github.com/apache/arrow-java/releases>
+[8]: <https://github.com/apache/arrow-js/releases>