Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
274 changes: 206 additions & 68 deletions CHANGELOG.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions dev/make_changelog.py
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,7 @@ def format_changelog_website(issues, out):
CATEGORIES = {
'New Feature': NEW_FEATURE,
'Improvement': NEW_FEATURE,
'Wish': NEW_FEATURE,
'Task': NEW_FEATURE,
'Test': NEW_FEATURE,
'Bug': BUGFIX
Expand Down
114 changes: 114 additions & 0 deletions site/_posts/2017-07-24-0.5.0-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
---
layout: post
title: "Apache Arrow 0.5.0 Release"
date: "2017-07-25 00:00:00 -0400"
author: wesm
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

The Apache Arrow team is pleased to announce the 0.5.0 release. It includes
[**130 resolved JIRAs**][1] with some new features, expanded integration
testing between implementations, and bug fixes. The Arrow memory format remains
stable since the 0.3.x and 0.4.x releases.

See the [Install Page][2] to learn how to get the libraries for your
platform. The [complete changelog][5] is also available.

## Expanded Integration Testing

In this release, we added compatibility tests for dictionary-encoded data
between Java and C++. This enables the distinct values (the *dictionary*) in a
vector to be transmitted as part of an Arrow schema while the record batches
contain integers which correspond to the dictionary.

So we might have:

```
data (string): ['foo', 'bar', 'foo', 'bar']
```

In dictionary-encoded form, this could be represented as:

```
indices (int8): [0, 1, 0, 1]
dictionary (string): ['foo', 'bar']
```

In upcoming releases, we plan to complete integration testing for the remaining
data types (including some more complicated types like unions and decimals) on
the road to a 1.0.0 release in the future.

## C++ Activity

We completed a number of significant pieces of work in the C++ part of Apache
Arrow.

### Using jemalloc as default memory allocator

We decided to use [jemalloc][4] as the default memory allocator unless it is
explicitly disabled. This memory allocator has significant performance
advantages in Arrow workloads over the default `malloc` implementation. We will
publish a blog post going into more detail about this and why you might care.

### Sharing more C++ code with Apache Parquet

We imported the compression library interfaces and dictionary encoding
algorithms from the [Apache Parquet C++ library][3]. The Parquet library now
depends on this code in Arrow, and we will be able to use it more easily for
data compression in Arrow use cases.

As part of incorporating Parquet's dictionary encoding utilities, we have
developed an `arrow::DictionaryBuilder` class to enable building
dictionary-encoded arrays iteratively. This can help save memory and yield
better performance when interacting with databases, Parquet files, or other
sources which may have columns having many duplicates.

### Support for LZ4 and ZSTD compressors

We added LZ4 and ZSTD compression library support. In ARROW-300 and other
planned work, we intend to add some compression features for data sent via RPC.

## Python Activity

We fixed many bugs which were affecting Parquet and Feather users and fixed
several other rough edges with normal Arrow use. We also added some additional
Arrow type conversions: structs, lists embedded in pandas objects, and Arrow
time types (which deserialize to the `datetime.time` type).

In upcoming releases we plan to continue to improve [Dask][7] support and
performance for distributed processing of Apache Parquet files with pyarrow.

## The Road Ahead

We have much work ahead of us to build out Arrow integrations in other data
systems to improve their processing performance and interoperability with other
systems.

We are discussing the roadmap to a future 1.0.0 release on the [developer
mailing list][6]. Please join the discussion there.

[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.5.0
[2]: http://arrow.apache.org/install
[3]: https://github.com/apache/parquet-cpp
[4]: https://github.com/jemalloc/jemalloc
[5]: http://arrow.apache.org/release/0.5.0.html
[6]: http://mail-archives.apache.org/mod_mbox/arrow-dev/
[7]: https://github.com/dask/dask
Loading