implement reference offset #72

ExpandingMan · 2021-09-05T22:44:01Z

A number of binary formats contain dictionary encoded data with the references as 0-indexed integers. PooledArrays currently can't be used to wrap these because, if I understand correctly, the references are always 1-based. This means that to deserialize a format using PooledArrays you necessarily have to copy all the references.

I suggest we add an offset::Int field so this can be handled more generally. This would be added getindex. The most obvious difficulty with implementing this is that currently zeros give undefined values. Perhaps this can be circumvented at compile time with a new type parameter.

Anyway, before I try to implement this, has anyone given in consideration? How is arrow (which I seem to remember is 0-based) deal with this?

The text was updated successfully, but these errors were encountered:

bkamins · 2021-09-06T08:18:01Z

See #65 for a recent discussion.
We should resolve the encoding of missing in coordination with your request. The problem is that we also need to handle the cases where unsigned integers are used for storing references (so then 0 is an absolute minimum we can use).

Having said that, at least DataAPI.jl and in particular DataFrames.jl are designed in a way that does not assume that references start at 1 (but PooledArrays.jl currently assumes this AFAICT so making this change should be done very carefully).

nalimilan · 2021-09-11T14:34:54Z

Cc: @quinnj about Arrow.

ExpandingMan · 2021-09-11T15:52:08Z

As I had worked a bit more on this, I realized the problem was a bit deeper. Really, to guarantee no copying we need a generic type that can use arbitrary collections for pool and references. Here is a minimal example I implemented for my parquet project:

struct PooledVector{𝒯,ℛ<:AbstractVector{<:Integer},𝒱} <: AbstractVector{𝒯}
    pool::𝒱
    refs::ℛ
end

PooledVector{𝒯}(vs, rs) where {𝒯} = PooledVector{𝒯,typeof(rs),typeof(vs)}(vs, rs)     
PooledVector(vs, rs) = PooledVector{eltype(vs)}(vs, rs)

Base.size(v::PooledVector) = size(v.refs)

Base.IndexStyle(::Type{<:PooledVector}) = IndexLinear()

DataAPI.refarray(v::PooledVector) = v.refs
DataAPI.refpool(v::PooledVector) = v.pool
DataAPI.levels(v::PooledVector) = v.pool

Base.@propagate_inbounds function Base.getindex(v::PooledVector, i::Int)
    @boundscheck checkbounds(v, i)
    @inbounds v.pool[v.refs[i]+1]
end

(perhaps ironically, I did not bother to include an arbitrary offset). In the case of lazily loaded parquet files, it is not necessarily guaranteed that the pool and refs are of a particular type. An even more general version could included an arbitrary transformation from reference to index, though perhaps that's a bit excessive.

quinnj · 2021-09-16T05:30:47Z

For reference, Arrow.jl has the Arrow.DictEncoded custom "pooled array" type, implemented here. It allows zero-copy view into the arrow data. When you copy one, we convert to PooledArray.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement reference offset #72

implement reference offset #72

ExpandingMan commented Sep 5, 2021

bkamins commented Sep 6, 2021

nalimilan commented Sep 11, 2021

ExpandingMan commented Sep 11, 2021 •

edited

Loading

quinnj commented Sep 16, 2021

implement reference offset #72

implement reference offset #72

Comments

ExpandingMan commented Sep 5, 2021

bkamins commented Sep 6, 2021

nalimilan commented Sep 11, 2021

ExpandingMan commented Sep 11, 2021 • edited Loading

quinnj commented Sep 16, 2021

ExpandingMan commented Sep 11, 2021 •

edited

Loading