Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement reference offset #72

Open
ExpandingMan opened this issue Sep 5, 2021 · 4 comments
Open

implement reference offset #72

ExpandingMan opened this issue Sep 5, 2021 · 4 comments

Comments

@ExpandingMan
Copy link

A number of binary formats contain dictionary encoded data with the references as 0-indexed integers. PooledArrays currently can't be used to wrap these because, if I understand correctly, the references are always 1-based. This means that to deserialize a format using PooledArrays you necessarily have to copy all the references.

I suggest we add an offset::Int field so this can be handled more generally. This would be added getindex. The most obvious difficulty with implementing this is that currently zeros give undefined values. Perhaps this can be circumvented at compile time with a new type parameter.

Anyway, before I try to implement this, has anyone given in consideration? How is arrow (which I seem to remember is 0-based) deal with this?

@bkamins
Copy link
Member

bkamins commented Sep 6, 2021

See #65 for a recent discussion.
We should resolve the encoding of missing in coordination with your request. The problem is that we also need to handle the cases where unsigned integers are used for storing references (so then 0 is an absolute minimum we can use).

Having said that, at least DataAPI.jl and in particular DataFrames.jl are designed in a way that does not assume that references start at 1 (but PooledArrays.jl currently assumes this AFAICT so making this change should be done very carefully).

@nalimilan
Copy link
Member

Cc: @quinnj about Arrow.

@ExpandingMan
Copy link
Author

ExpandingMan commented Sep 11, 2021

As I had worked a bit more on this, I realized the problem was a bit deeper. Really, to guarantee no copying we need a generic type that can use arbitrary collections for pool and references. Here is a minimal example I implemented for my parquet project:

struct PooledVector{𝒯,ℛ<:AbstractVector{<:Integer},𝒱} <: AbstractVector{𝒯}
    pool::𝒱
    refs::ℛ
end

PooledVector{𝒯}(vs, rs) where {𝒯} = PooledVector{𝒯,typeof(rs),typeof(vs)}(vs, rs)     
PooledVector(vs, rs) = PooledVector{eltype(vs)}(vs, rs)

Base.size(v::PooledVector) = size(v.refs)

Base.IndexStyle(::Type{<:PooledVector}) = IndexLinear()

DataAPI.refarray(v::PooledVector) = v.refs
DataAPI.refpool(v::PooledVector) = v.pool
DataAPI.levels(v::PooledVector) = v.pool

Base.@propagate_inbounds function Base.getindex(v::PooledVector, i::Int)
    @boundscheck checkbounds(v, i)
    @inbounds v.pool[v.refs[i]+1]
end

(perhaps ironically, I did not bother to include an arbitrary offset). In the case of lazily loaded parquet files, it is not necessarily guaranteed that the pool and refs are of a particular type. An even more general version could included an arbitrary transformation from reference to index, though perhaps that's a bit excessive.

@quinnj
Copy link
Member

quinnj commented Sep 16, 2021

For reference, Arrow.jl has the Arrow.DictEncoded custom "pooled array" type, implemented here. It allows zero-copy view into the arrow data. When you copy one, we convert to PooledArray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants