9 changes: 9 additions & 0 deletions dev/release/rat_exclude_files.txt
@@ -258,3 +258,12 @@ ruby/red-arrow/.yardopts
rust/arrow/test/data/*.csv
rust/rust-toolchain
rust/arrow-flight/src/arrow.flight.protocol.rs
julia/Arrow/Project.toml
julia/Arrow/README.md
julia/Arrow/docs/Manifest.toml
julia/Arrow/docs/Project.toml
julia/Arrow/docs/make.jl
julia/Arrow/docs/mkdocs.yml
julia/Arrow/docs/src/index.md
julia/Arrow/docs/src/manual.md
julia/Arrow/docs/src/reference.md
15 changes: 15 additions & 0 deletions julia/Arrow/LICENSE.md
@@ -0,0 +1,15 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
34 changes: 34 additions & 0 deletions julia/Arrow/Project.toml
@@ -0,0 +1,34 @@
name = "Arrow"
uuid = "69666777-d1a9-59fb-9406-91d4454c9d45"
authors = ["quinnj <[email protected]>"]
version = "0.3.0"

[deps]
CodecLz4 = "5ba52731-8f18-5e0d-9241-30f10d1ec561"
CodecZstd = "6b39b394-51ab-5f42-8807-6242bab2b4c2"
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
Mmap = "a63ad114-7e13-5084-954f-fe012c677804"
PooledArrays = "2dfb63ee-cc39-5dd5-95bd-886bf059d720"
SentinelArrays = "91c51154-3ec4-41a3-a24f-3f23e20d615c"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
TimeZones = "f269a46b-ccf7-5d73-abea-4c690281aa53"

[compat]
CodecLz4 = "0.4"
CodecZstd = "0.7"
julia = "1.3"
DataAPI = "1"
PooledArrays = "0.5"
Tables = "1.1"
SentinelArrays = "1"

[extras]
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
StructTypes = "856f2bd8-1eba-4b0a-8007-ebc267875bd4"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"


[targets]
test = ["Test", "Random", "JSON3", "StructTypes"]
145 changes: 145 additions & 0 deletions julia/Arrow/README.md
@@ -0,0 +1,145 @@
# Arrow

[![Build Status](https://travis-ci.com/JuliaData/Arrow.jl.svg?branch=master)](https://travis-ci.com/JuliaData/Arrow.jl)
[![codecov](https://codecov.io/gh/JuliaData/Arrow.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JuliaData/Arrow.jl)

This is a pure Julia implementation of the [Apache Arrow](https://arrow.apache.org) data standard. This package provides Julia `AbstractVector` objects for
referencing data that conforms to the Arrow standard. This allows users to seamlessly interface Arrow formatted data with a great deal of existing Julia code.

Please see this [document](https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout) for a description of the Arrow memory layout.

## Format Support

This implementation supports the 1.0 version of the specification, including support for:
* All primitive data types
* All nested data types
* Dictionary encodings and messages
* Extension types
* Streaming, file, and record batch messages, including replacement and delta (`isDelta`) dictionary messages

It currently doesn't include support for:
* Tensors or sparse tensors
* Flight RPC
* C data interface

Third-party data formats:
* CSV and Parquet support via the existing CSV.jl and Parquet.jl packages
* Other Tables.jl-compatible packages automatically supported (DataFrames.jl, JSONTables.jl, JuliaDB.jl, SQLite.jl, MySQL.jl, JDBC.jl, ODBC.jl, XLSX.jl, etc.)
* No current Julia packages support ORC or Avro data formats

## Basic usage:

### Installation

```julia
] add Arrow
```

### Reading

#### `Arrow.Table`

    Arrow.Table(io::IO; convert::Bool=true)
    Arrow.Table(file::String; convert::Bool=true)
    Arrow.Table(bytes::Vector{UInt8}, pos=1, len=nothing; convert::Bool=true)

Read an arrow-formatted table from:
* `io`: bytes are read all at once via `read(io)`
* `file`: bytes are read via `Mmap.mmap(file)`
* `bytes`: a byte vector directly, optionally specifying the starting byte position `pos` and length `len`

Returns an `Arrow.Table` object that allows column access via `table.col1`, `table[:col1]`, or `table[1]`.

The Apache Arrow standard is foremost a "columnar" format and saves a variety of metadata with each column (such as column name, type, and length).
A data set with tens of thousands of columns is probably not well suited to the arrow format and may see dramatic file size increases when saved to an arrow file.
If the data can be reshaped so that there are fewer columns, `Arrow.Table` should not have as many problems.
A simple approach Julia provides is `transpose(data)` (or `permutedims(data)`) to switch the rows and columns of your data, if that does not interfere with the analysis; see the sketch below.
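A minimal sketch for matrix-backed data, using `Tables.table` from Tables.jl to wrap the reshaped matrix (the file name and sizes here are illustrative):

```julia
using Arrow, Tables

data = rand(3, 10_000)        # 3 rows, 10,000 columns: a poor fit for arrow
flipped = permutedims(data)   # 10,000 rows, 3 columns
Arrow.write("long.arrow", Tables.table(flipped))
```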

NOTE: the columns in an `Arrow.Table` are views into the original arrow memory, and hence are not easily
modifiable (with e.g. `push!`, `append!`, etc.). To mutate arrow columns, call `copy(x)` to materialize
the arrow data as a normal Julia array.
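For instance, a minimal sketch (the file name and column name `col1` are illustrative):

```julia
using Arrow

tbl = Arrow.Table("data.arrow")
col = copy(tbl.col1)   # materialize the arrow view as a plain Julia array
push!(col, 42)         # now freely mutable
```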

`Arrow.Table` also satisfies the Tables.jl interface, and so can easily be materialized via any supporting
sink function: e.g. `DataFrame(Arrow.Table(file))`, `SQLite.load!(db, "table", Arrow.Table(file))`, etc.
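For instance (assuming DataFrames.jl is installed; the file name is illustrative):

```julia
using Arrow, DataFrames

df = DataFrame(Arrow.Table("data.arrow"))
```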

Supports the `convert` keyword argument which controls whether certain arrow primitive types will be
lazily converted to more friendly Julia defaults; by default, `convert=true`.
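For instance, a sketch of opting out of those conversions:

```julia
using Arrow

# columns keep their arrow-native types instead of friendlier Julia defaults
tbl = Arrow.Table("data.arrow"; convert=false)
```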

##### Examples

```julia
using Arrow

# read arrow table from file format
tbl = Arrow.Table(file)

# read arrow table from IO
tbl = Arrow.Table(io)

# read arrow table directly from bytes, like from an HTTP request
using HTTP
resp = HTTP.get(url)
tbl = Arrow.Table(resp.body)
```

#### `Arrow.Stream`

    Arrow.Stream(io::IO; convert::Bool=true)
    Arrow.Stream(file::String; convert::Bool=true)
    Arrow.Stream(bytes::Vector{UInt8}, pos=1, len=nothing; convert::Bool=true)

Start reading an arrow-formatted table from:
* `io`: bytes are read all at once via `read(io)`
* `file`: bytes are read via `Mmap.mmap(file)`
* `bytes`: a byte vector directly, optionally specifying the starting byte position `pos` and length `len`

Reads the initial schema message from the arrow stream/file, then returns an `Arrow.Stream` object
which will iterate over record batch messages, producing an `Arrow.Table` on each iteration.

Because iteration produces `Arrow.Table` objects, `Arrow.Stream` satisfies the `Tables.partitions` interface, and as such can
be passed to Tables.jl-compatible sink functions.

This allows iterating over extremely large "arrow tables" in chunks represented as record batches.
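For instance, a minimal sketch of chunked processing (the file name and column name `col1` are illustrative):

```julia
using Arrow

# each iteration yields an Arrow.Table for one record batch
for tbl in Arrow.Stream("data.arrow")
    @show length(tbl.col1)
end
```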

Supports the `convert` keyword argument which controls whether certain arrow primitive types will be
lazily converted to more friendly Julia defaults; by default, `convert=true`.

### Writing

#### `Arrow.write`

    Arrow.write(io::IO, tbl)
    Arrow.write(file::String, tbl)

Write any Tables.jl-compatible `tbl` out as arrow formatted data.
Providing an `io::IO` argument will cause the data to be written to it
in the "streaming" format, unless `file=true` keyword argument is passed.
Providing a `file::String` argument will result in the "file" format being written.

Multiple record batches will be written based on the number of
`Tables.partitions(tbl)` that are provided; by default, this is just
one for a given table, but some table sources support automatic
partitioning. You can turn multiple table objects into partitions
via `Tables.partitioner([tbl1, tbl2, ...])`, but note that
each table must have exactly the same `Tables.Schema`.
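For instance, a minimal sketch in which each table becomes its own record batch (the file name and data here are illustrative):

```julia
using Arrow, Tables

tbl1 = (a = [1, 2], b = ["x", "y"])
tbl2 = (a = [3, 4], b = ["z", "w"])

# the tables must share exactly the same Tables.Schema
Arrow.write("partitioned.arrow", Tables.partitioner([tbl1, tbl2]))
```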

By default, `Arrow.write` will use multiple threads to write multiple
record batches simultaneously (e.g. if julia is started with `julia -t 8`).

Supported keyword arguments to `Arrow.write` include:
* `compress`: possible values include `:lz4`, `:zstd`, or your own initialized `LZ4FrameCompressor` or `ZstdCompressor` objects; will cause all buffers in each record batch to use the respective compression encoding
* `alignment::Int=8`: specify the number of bytes to align buffers to when written in messages; strongly recommended to only use alignment values of 8 or 64 for modern memory cache line optimization
* `dictencode::Bool=false`: whether all columns should use dictionary encoding when being written
* `dictencodenested::Bool=false`: whether nested data type columns should also dict encode nested arrays/buffers; many other implementations don't support this
* `denseunions::Bool=true`: whether Julia `Vector{<:Union}` arrays should be written using the dense union layout; passing `false` will result in the sparse union layout
* `largelists::Bool=false`: causes list column types to be written with Int64 offset arrays; mainly for testing purposes; by default, Int64 offsets will be used only if needed
* `file::Bool=false`: if an `io` argument is being written to, passing `file=true` will cause the arrow file format to be written instead of just IPC streaming

##### Examples

```julia
# write directly to any IO in streaming format
Arrow.write(io, tbl)

# write to a file in file format
Arrow.write("data.arrow", tbl)
```
2 changes: 2 additions & 0 deletions julia/Arrow/docs/.gitignore
@@ -0,0 +1,2 @@
build/
site/