This package is deprecated in favor of JuliaData/Arrow.jl.
As of writing, this package is still used by Feather.jl which reads and writes legacy feather v1 files. However, as Feather v1 is deprecated in favor of Feather v2 which is implemented by JuliaData/Arrow.jl, it is unlikely that Feather.jl will be maintained in the future. We recommend using either Feather.jl or pyarrow to convert your data to the latest feather format.
This is a pure Julia implementation of the Apache Arrow data standard. This package provides Julia AbstractVector
objects for
referencing data that conforms to the Arrow standard. This allows users to seamlessly interface Arrow formatted data with a great deal of existing Julia code.
Please see this document for a description of the Arrow memory layout.
WARNING As of right now this package uses Julia Ptr
(pointer) objects and "unsafe" methods. This is for performance reasons. It should in principle be
possible to make this package completely safe with little to no loss in performance, but we are waiting on some performance improvements in Base
. While
Arrow.jl has been tested and should be safe with proper usage, it is up to the user to make sure that their Arrow.jl objects reference the appropriate locations
in data. If the user, for example, uses an Arrow.jl object to reference data past the end of an array, the resulting program will segfault!
Just do
import Pkg; Pkg.add("Arrow")
Arrow only has CategoricalArrays
as a dependency (and Missings
on 0.6).
The Arrow
package exposes several ArrowVector{J} <: AbstractVector{J}
objects. These provide an interface to arrow formatted data as well as providing
methods to convert Julia objects to the Arrow data format. The simplest of these is
Primitive{J} <: ArrowVector{J}
This object maintains a reference to a data buffer (a Vector{UInt8}
) and describes and contiguous subset of it. It will automatically convert the underlying
data to the type J
on demand. The Primitive
type can only describe bits type elements (i.e. types for which isbits
is true, in particular not strings). In the
following example we create a Primitive
to address a subset of a buffer
data = [0, 2, 3, 5, 7, 0] # this will be the underlying data from which we create our buffer
buff = reinterpret(UInt8, data) # in this simple case the Arrow format and Julia's in-memory format coincide
p = Primitive{Int}(buff, 9, 4) # arguments are: buffer, start location, length
p[1] # returns 2
p[2:3] # returns the (non-arrow) Vector [3,5]
p[:] # returns the (non-arrow) Vector [2,3,5,7]
p[2] = 999 # assignment is supported for AbstractPrimitive types. this change is reflected in buff and data
q = Primitive([2,3,5,7]) # if we didn't already have a buffer we needed to reference, we can create one like this
q = arrowformat([2,3,5,7]) # the arrowformat function automatically determines the appropriate ArrowVector for the provided array
rawvalues(q) # this returns the created buffer as a Vector{UInt8}
Here we see that indexing an ArrowVector
returns ordinary Julia arrays containing the data stored in the Arrow buffer. All other ArrowVector
objects are
built out of combinations of Primitive
s.
Enter ?Primitive
in the REPL for a full list of constructors.
Note: In what follows we show explicit methods for constructing each ArrowVector
type from a raw data buffer. This can become a bit confusing where
there are many sub-buffer locations to keep track of, so it is strongly suggested that you make use of the Locate
interface described in the next section.
The Arrow format also supports arrays with bits type elements that may be null. For these we provide the NullablePrimitive{J} <: AbstractVector{Union{J,Missing}}
type. Under the hood the
NullablePrimitive
type is a pair of Primitive
s: one references a Primitive{UInt8}
bit mask describing which elements of the NullablePrimitive
are null and the
other references the underlying data. In the following example we create a NullablePrimitive
from an existing buffer
buff = [[0x0d]; reinterpret(UInt8, [2.0, 3.0, 5.0, 7.0])] # bits(0x0d) == "00001101"
p = NullablePrimitive{Float64}(buff, 1, 2, 4) # arguments are: buffer, bitmask location, values location, length
p[1] # returns 2.0
p[2] # returns missing
p[1:4] # returns [2.0, missing, 5.0, 7.0]
p[2] = 3.0 # assignment also supported for NullablePrimitive, the change will be reflected in buff
q = NullablePrimitive([2.0,missing,5.0,7.0]) # if we didn't already have a buffer we needed to reference, we can create one
# the above will create seperate buffers for the bit mask and values. to create a contiguous buffer containing all we can do
q = NullablePrimitive(Array, [2.0,missing,5.0,7.0])
q = arrowformat([2.0,missing,5.0,7.0]) # you can also use arrowformat to automatically determine the ArrowVector type
rawvalues(bitmask(q)) # returns [0x0d]
Enter ?NullablePrimitive
in the REPL for a full list of constructors.
The underlying dataformat for arbitrary length objects such as strings is more complicated, so these objects require a dedicated type. For these we provide
List{J} <: AbstractVector{J}
. As well as containing the values contained by strings, these objects contain "offsets" for describing how long each string
should be. The arrow format suggests that these offsets are Int32
s and that there are length(l)+1
of them. For example
offs = reinterpret(UInt8, Int32[0,3,5,7])
vals = convert(Vector{UInt8}, "abcdefg")
buff = [offs; vals]
# type parameters: List return type, offsets type (must be <:Integer)
l = List{String,Int32}(buff, 1, length(offs)+1, 3, UInt8, length(vals)) # arguments are: buffer, offsets location, values location, length of List, value type, values length
# alternatively we can construct the values separately
v = Primitive{UInt8}(buff, length(offs)+1, length(vals))
l = List{String,Int32}(buff, 1, 3, v) # arguments are: buffer, offset location, length, values primitive
# or you can create each piece individually
o = Primitive{Int32}(buff, 1, 4) # note that the Int32 type is required for offsets by the arrow format
v = Primitive{UInt8}(buff, length(offs)+1, length(vals))
l = List{String}(o, v)
l[1] # returns "abc"
l[2] # returns "de"
l[3] # returns "fg"
l[1:3] # returns a normal Vector{String} (copies data!)
l[1] = "a" # ERROR: assignments are not currently supported for list types
m = List(["abc", "de", "fg"]) # just as in the other cases, you can create your own data
m = List(Array, ["abc", "de", "fg"]) # you can also require it all to be in a contiguous buffer
m = arrowformat(["abc", "de", "fg"]) # as always arrowformat automatically determines the ArrowVector type
rawvalues(offsets(m)) # returns reinterpret(UInt8, [0,3,5,7])
rawvalues(values(m)) # returns convert(Vector{UInt8}, "abcdefg")
Note that List{J}
and NullableList{J}
use the constructor J(::AbstractVector{C})
where C
is the values type (in the above example UInt8
)
WARNING: Currently the values of the offsets themselves are not bounds-checked for performance reasons. This means you have to be extra sure that your
offsets are properly constructed. It is recommended that you always use arrowformat
, List
, or offsets
to construct offsets, this should not be done
manually.
Enter ?List
in the REPL for a full list of constructors.
Next we have the NullableList{J} <: AbstractVector{Union{J,Missing}}
type. NullableList
is to List
as NullablePrimitive
is to Primitive
. In addition
to offsets and values, it also contains a bit mask describing which elements are null. By now you can probably predict what the example will look like
bmask = [0x05] # bits(0x05) == "00000101"
offs = reinterpret(UInt8, Int32[0,3,5,7])
vals = convert(Vector{UInt8}, "abcdefg")
buff = [bmask; offs; vals]
l = NullableList{String,Int32}(buff, 1, 2, length(offs)+2, 3, UInt8, length(vals))
# arguments above are: buffer, bit mask location, offsets location, values location, list length, values type, values length
# again you can also provide each piece separately
b = Primitive{UInt8}(buff, 1, 1) # required to have eltype UInt8
o = Primitive{Int32}(buff, 2, 4) # required to have eltype Int32
v = Primitive{UInt8}(buff, length(offs)+2, length(vals))
l = NullableList{String,Int32}(b, o, v)
l[1] # returns "abc"
l[2] # returns missing
l[3] # returns "fg"
l[2] = "de" # ERROR assignments not currently supported for list types
# you can also create lists of Primitives, though this may involve copying
l = NullableList{Primitive{UInt8},Int32}(b, o, v)
# by now all the ways of creating this from our own data should be familiar
m = NullableList(["abc", missing, "fg"])
m = NullableList(Array, ["abc", missing, "fg"])
m = arrowformat(["abc", missing, "fg"])
Enter ?NullableList
in the REPL for a full list of constructors.
The arrow format also supports dictionary encoding of arrays. What this means is simply that instead of one array, there are two, a "short" array containing a
view values, and a "long" array which contains pointers to those values (required by the Arrow standard to be Int32
). This provides a way of compressing
arrays in which a relatively small number of values are repeated in large numbers. Arrow.jl uses the Julia package
CategoricalArrays.jl to support this functionality. CategoricalArray
s will be dictionary encoded by
default when converted to Arrow array objects. One aspect of this that may seem confusing is that references are required to be 0-based indices, which is
contrary to the Julia 1-based approach we've used for everything else. In practice this shouldn't matter much: references do not need to be constructed
manually. See the following
# in most real cases these would be constructed from data in one of the ways described above
refs = Primitive{Int32}([0, 1, 2, 0, 1, 3])
vals = List(["fire", "walk", "with", "me"])
A = DictEncoding(refs, vals)
A[1] # returns "fire"
A[5] # return "walk"
A[[1,2,3,6]] # returns ["fire", "walk", "with", "me"]
# you can also create your own from Julia data
B = DictEncoding(["fire", "walk", "with", "me"]) # in this case there is no benefit to DictEncoding over List
# arrowformat will automatically convert any CategoricalArray object to an Arrow formatted DictEncoding
B = arrowformat(categorical(["fire", "walk", "with", "me"]))
Note that indexing a DictEncoding{T}
object will return objects of type T
or Vector{T}
. The only exception is when indexing with a :
, A[:]
, in which
case a CategoricalArray
will be returned (equivalently, this can be done with categorical(A)
. In order to retrieve slices as CategoricalArray
, one should
use the categorical
function, e.g. categorical(A, slice)
.
Because the Arrow format specifies that Bool
s should be stored as single bits, a special type is required to store Arrow formatted Bool
data. These are
analogous to the Julia BitVector
object. Note that there is nothing stopping you from serializing Julia Bool
(which are 8-bit), but these will not in
general be readable outside of Julia. arrowformat
will automatically convert AbstractVector{Bool}
and AbstractVector{Union{Bool,Missing}}
to
BitPrimitive
and NullableBitPrimitive
respectively. These types also provide the usual constructors as seen for the other types above.
Nothing is stopping you from storing Julia bits-type data that is not necessarily specified by the Arrow format. For example, a Primitive{Complex128}
will
work just as expected. ArrowVector
objects were deliberately designed so that the way they construct their output depends only on their type parameter.
While arrowformat
will pick the appropriate ArrowVector
for Arrow formatting data, there are no "hidden conversions" happening under the hood: the type
parameter of your ArrowVector
object is what you get. You can therefore serialize any type for which isbits
is true. In principle you can also serialize
more complicated types using List
s. The only caveat is that any type not explicitly described in the Arrow standard will not in general be readable outside
of Julia.
Given a Vector{UInt8}
locating your data objects can be rather pedantic, and the last thing you want to do is point your ArrowVector
objects to the wrong
memory locations, as this will lead to scary undefined behavior. Arrow provides an interface that will make this significantly easier provided your metadata is
sufficiently well organized (which it always should be). This interface will also check to make sure the locations you specify have proper alignment (still
does not guarantee correctness!!). The idea here is to create Julia struct
s which somehow represent the metadata of the various objects you want to access.
In the following, assume you have defined
struct ObjMetadata
# whatever is needed to locate objects and determine their types goes here. You can also use type parameters if you want
end
You are not limited to only having one such object, you can have arbitrarily many. Once you define the appropriate methods (described below), all you need to do is call
locate(data, T, obj)
# data is the data buffer; T is the return type of the container being constructed; obj is the ObjMetadata
This will automatically create the ArrowVector
object that represents your data.
The type parameter you provide specifies the return type, for example you might construct a List
with locate(data, String, obj)
or a NullableBitPrimitive
with locate(data, Union{Bool,Missing}, obj)
.
The minimal way of implementing the locate interface requires defining some of the following methods
Locate.length(obj::ObjMetadata) = # the length (in number of elements) of the ArrowVector
Locate.values(obj::ObjMetadata) = # location of values (i.e. return value data; char data for Lists)
Locate.valueslength(obj::ObjMetadata) = # the length of the values sub-buffer (in number of elements); not needed for Primitives, only Lists
Locate.bitmask(obj::ObjMetadata) = # location of null bitmask
Locate.offsets(obj::ObjMetadata) = # location of offsets buffer
Of course, you may only need to define a subset of these. For example, if all you want are Primitive
s, you need only define Locate.length
and
Locate.values
. If you never need lists, you needn't define Locate.valueslength
or Locate.offsets
.
The above interface may not be adequate for all purposes. For example, if you only define the methods listed above, the offsets type will always default to
Int32
(the Arrow format standard). Furthermore, the type of ArrowVector
will be determined by the desired return value (i.e. Primitive
for bits-types,
List
for strings, NullablePrimitive
for Union{T,Missing}
where T
is a bits-type, etc.) To override these defaults you can use more detailed methods:
# value data can be specified by defining the Locate.Values methods
# T is the value data type, but you may not need it because the overall container return type will override it
Locate.Values{T}(obj::ObjMetadata) = Locate.Values{T}(Locate.values(obj), Locate.valueslength(obj))
# you need a slightly different Values constructor for List values
# here T is the List return type so you can use it if you need to, but you may not
Locate.Values(::Type{T}, obj::ObjMetadata) where T = Locate.Values{UInt8}(Locate.values(obj), Locate.valueslength(obj))
# there's not really a reason to define Locate.Bitmask if you defined Locate.bitmask, but it's there
Locate.Bitmask(obj::ObjMetadata) = Locate.Bitmask(Locate.bitmask(obj))
# you can use Locate.Offsets to override the default offset type of Int32
Locate.Offsets(obj::ObjMetadata) = Locate.Offsets{Int64}(Locate.offsets(obj))
# as we described, you can also override the default container types, but this is not recommended
# it may be useful for custom types, but remember these won't in general be usable outside of Julia
Locate.containertype(::Type{CustomType}, obj::ObjMetadata) = NullablePrimitive # returned value should have no type paramters
In the above we showed constructors receiving the arguments they would receive if you only defined the Locate
methods listed in the previous section, but
of course you can make these constructors do anything you want, as long as return the proper type as an output.
Writing is somewhat simpler than reading as Arrow will figure out how to convert ordinary Julia data to Arrow formatted data for you. In addition to
arrowformat
the other two most important functions for writing data will be rawpadded
and writepadded
. rawpadded
takes a Primitive
as argument and
returns a properly Arrow padded Vector{UInt8}
appropriate for writing the data directly to an Arrow formatted buffer. writepadded
will write the properly
padded array to an IO
object.
A = NullableList(data)
writepadded(io, A, bitmask, offsets, values) # write bitmask, offsets then values of A, all contiguously, all properly padded
B = DictEncoding(data)
writepadded(io, B, references) # writes references
writepadded(io, levels(B), offsets, bitmask, values) # writes the NullableList in a different order than above
The following table show which sub-buffers are relevant for which ArrowVector
s. All sub-buffers can be accessed as Primitive
s simply by doing, for example
bitmask(l)
where l isa ArrowVector{Union{T,Missing}} where T
returns the primitive representing the null bit mask.
values |
bitmask |
offsets |
|
---|---|---|---|
Primitive |
1 | 0 | 0 |
NullablePrimitive |
1 | 1 | 0 |
List |
1 | 0 | 1 |
NullableList |
1 | 1 | 1 |
BitPrimitive |
1 | 0 | 0 |
NullableBitPrimitive |
1 | 1 | 0 |
DictEncoding
is a bit more complicated as it can contain any of the other types, but its references and data pool can be accessed with references
and pool
respectively.
Arrow.jl provides Arrow formatted date-time objects that have Julia equivalents. These are Arrow.Datestamp=>Dates.Date
, Arrow.Timestamp=>Dates.DateTime
and
Arrow.TimeOfDay=>Dates.Time
. The arrowformat
function will automatically convert objects of the Julia Dates
types to the appropriate Arrow format. When
constructing the various ArrowVector
objects, this conversion must be specified explicitly, e.g. with Primitive{TimeOfDay}(v)
where v::Vector{Dates.Time}
.
There is nothing stopping you from serializing the Julia Dates
objects, but they are not in general readable outside of Julia. The units in which DateTime
and TimeOfDay
are stored can be specified with Dates.TimePeriod
s. For example, to store a DateTime
with resolution of seconds, one should do
convert(Timestamp{Dates.Second}, t)
where t::DateTime
.
For a working (but as of this writing still in-development) example of a package built with Arrow.jl see this fork of Feather.jl.
A lot of work still to be done:
- Performance pass: performance seems ok according to basic sanity checks but it that code has neither been optimized nor thoroughly benchmarked.
- Extensive unit tests needed: hopefully I'll get to more of this soon.
- Support Arrow Structs.