implement reference offset #72
Comments
See #65 for a recent discussion. Having said that, at least DataAPI.jl and in particular DataFrames.jl are designed in a way that does not assume that references start at |
Cc: @quinnj about Arrow. |
As I worked a bit more on this, I realized the problem was a bit deeper. Really, to guarantee no copying we need a generic type that can use arbitrary collections for pool and references. Here is a minimal example I implemented for my parquet project:

using DataAPI

# Wraps an existing pool and an existing vector of 0-based references without copying either.
struct PooledVector{𝒯,ℛ<:AbstractVector{<:Integer},𝒱} <: AbstractVector{𝒯}
    pool::𝒱
    refs::ℛ
end
PooledVector{𝒯}(vs, rs) where {𝒯} = PooledVector{𝒯,typeof(rs),typeof(vs)}(vs, rs)
PooledVector(vs, rs) = PooledVector{eltype(vs)}(vs, rs)
Base.size(v::PooledVector) = size(v.refs)
Base.IndexStyle(::Type{<:PooledVector}) = IndexLinear()
DataAPI.refarray(v::PooledVector) = v.refs
DataAPI.refpool(v::PooledVector) = v.pool
DataAPI.levels(v::PooledVector) = v.pool
Base.@propagate_inbounds function Base.getindex(v::PooledVector, i::Int)
    @boundscheck checkbounds(v, i)
    # References are 0-based, so shift by one to index the 1-based pool.
    @inbounds v.pool[v.refs[i]+1]
end

(Perhaps ironically, I did not bother to include an arbitrary offset.) In the case of lazily loaded parquet files, it is not necessarily guaranteed that the pool and refs are of a particular type. An even more general version could include an arbitrary transformation from reference to index, though perhaps that's a bit excessive. |
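For illustration, the more general variant with an arbitrary reference-to-index transformation might look roughly like the sketch below. The type name MappedPooledVector and the field toindex are hypothetical and are not part of PooledArrays.jl or DataAPI.jl. Only Base is used so that the snippet runs on its own.

```julia
# Hypothetical sketch: a pooled vector whose stored references are mapped to
# pool indices by an arbitrary function. None of these names exist in
# PooledArrays.jl or DataAPI.jl.
struct MappedPooledVector{T,R<:AbstractVector{<:Integer},V<:AbstractVector,F} <: AbstractVector{T}
    pool::V
    refs::R
    toindex::F   # maps a stored reference to an index into `pool`
end

MappedPooledVector(pool, refs, toindex) =
    MappedPooledVector{eltype(pool),typeof(refs),typeof(pool),typeof(toindex)}(pool, refs, toindex)

Base.size(v::MappedPooledVector) = size(v.refs)
Base.IndexStyle(::Type{<:MappedPooledVector}) = IndexLinear()

Base.@propagate_inbounds function Base.getindex(v::MappedPooledVector, i::Int)
    @boundscheck checkbounds(v, i)
    r = @inbounds v.refs[i]
    # The mapped index is not covered by the bounds check above,
    # so the pool lookup is deliberately left checked.
    return v.pool[v.toindex(r)]
end

# 0-based references, as found in many binary formats, wrapped without copying.
pool = ["a", "b", "c"]
refs = UInt32[0, 2, 1, 0]
v = MappedPooledVector(pool, refs, r -> Int(r) + 1)
collect(v)       # ["a", "c", "b", "a"]
v.refs === refs  # true: the reference vector is shared, not copied
```

Because the mapping is stored as a typed field, each distinct closure yields a distinct concrete type. That keeps getindex specialized, at the cost of more compiled types.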
For reference, Arrow.jl has the |
A number of binary formats contain dictionary encoded data with the references as 0-indexed integers. PooledArrays currently can't be used to wrap these because, if I understand correctly, the references are always 1-based. This means that to deserialize a format using PooledArrays you necessarily have to copy all the references.
I suggest we add an offset::Int field so this can be handled more generally. The offset would be added to the reference in getindex. The most obvious difficulty with implementing this is that zeros currently give undefined values. Perhaps this can be circumvented at compile time with a new type parameter.

Anyway, before I try to implement this, has anyone given it consideration? How does Arrow (which I seem to remember is 0-based) deal with this?
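As a rough sketch of the type-parameter idea, assuming a standalone wrapper type rather than a change to PooledArray itself: the name OffsetPooledVector, the parameter O, and the offset keyword are all hypothetical and do not exist in PooledArrays.jl.

```julia
# Hypothetical sketch, not the PooledArrays.jl API: the offset O is a type
# parameter, so `r + O` is a compile-time constant for each concrete type.
struct OffsetPooledVector{T,O,R<:AbstractVector{<:Integer},V<:AbstractVector} <: AbstractVector{T}
    pool::V
    refs::R
end

# `offset` is the amount added to a stored reference to obtain a 1-based pool
# index: 1 for 0-based references, 0 for references that are already 1-based.
function OffsetPooledVector(pool::AbstractVector, refs::AbstractVector{<:Integer}; offset::Int=0)
    return OffsetPooledVector{eltype(pool),offset,typeof(refs),typeof(pool)}(pool, refs)
end

Base.size(v::OffsetPooledVector) = size(v.refs)
Base.IndexStyle(::Type{<:OffsetPooledVector}) = IndexLinear()

Base.@propagate_inbounds function Base.getindex(v::OffsetPooledVector{T,O}, i::Int) where {T,O}
    @boundscheck checkbounds(v, i)
    r = @inbounds v.refs[i]
    # The shifted reference is looked up with bounds checking left on.
    return v.pool[Int(r) + O]
end

pool = ["x", "y"]
refs = UInt8[0, 1, 1, 0]                      # 0-based, e.g. read from a file
zero_based = OffsetPooledVector(pool, refs; offset=1)
one_based  = OffsetPooledVector(pool, UInt8[1, 2, 2, 1])
collect(zero_based) == collect(one_based)     # true
zero_based.refs === refs                      # true: no copy of the references
```

In this sketch a zero reference is only valid when the offset is 1. With an offset of 0 it has no pool entry and the lookup throws a BoundsError, so the question of how to treat zeros would still need a decision. The constructor takes the offset as a runtime value, which makes the constructor call itself type-unstable, while getindex stays specialized.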