Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map, collect and similar for custom arrays #36106

Open
nalimilan opened this issue Jun 1, 2020 · 11 comments
Open

map, collect and similar for custom arrays #36106

nalimilan opened this issue Jun 1, 2020 · 11 comments
Labels
design Design of APIs or of the language itself domain:arrays [a, r, r, a, y, s] kind:speculative Whether the change will be implemented is speculative

Comments

@nalimilan
Copy link
Member

The current design of map, collect and similar seems problematic for custom arrays which are associated with a custom element type, like CategoricalArray and its CategoricalValue type.

In CategoricalArrays.jl, I would like map(f, ::CategoricalArray) to return a CategoricalArray if f returns only Union{CategoricalValue, Missing} values. This is natural in particular so that map(identity, ::CategoricalArray) returns a CategoricalArray.

To achieve this, I could define Base.map(f, ::CategoricalArray), but that would require duplicating tricky code from Base since it needs to handle eltype widening and so on. So I tried to define similar so that collect, that map uses under the hood, returns a CategoricalArray when appropriate. But collect uses similar(1:1, T, axes(itr)), so I have to override similar(::AbstractRange, ::Type{<:Union{CategoricalValue, Missing}}). For consistency I also have to define similar methods for AbstractArray, Array, Vector and Matrix (due to ambiguities).

Doing that has two consequences:

  • The first is that collect(::CategoricalArray) always returns a CategoricalArray. This makes sense actually since Array{CategoricalValue} is an inefficient type. But that seems to go against the docstring for collect which says that it returns an Array.
  • The second, more serious issue is that getindex(::Array{<:CategoricalValue}, ::Array) also returns a CategoricalArray. This doesn't sound correct.

This leads me to raise two questions/proposals:

  1. Should the collect docstring be made less strict? It sounds useful to be able to collect the contents of an interator into the most natural/efficient array type. If one really wants an Array better do Array(itr) -- otherwise collect is redundant.
  2. Should collect use a new system like similar(AbstractArray, T, axes(itr)) instead of similar(1:1, T, axes(itr))? That would allow specifying that the most appropriate AbstractArray{<:T} is requested rather than Array. That would differ from getindex which really wants the type of the input array. This could be introduced without breakage by having it fall back to similar(Array, ...).
@nalimilan nalimilan added kind:speculative Whether the change will be implemented is speculative domain:arrays [a, r, r, a, y, s] design Design of APIs or of the language itself labels Jun 1, 2020
@mbauman
Copy link
Sponsor Member

mbauman commented Jun 1, 2020

Ref:


  1. Should collect use a new system like similar(AbstractArray, T, axes(itr)) instead of similar(1:1, T, axes(itr))? That would allow specifying that the most appropriate AbstractArray{<:T} is requested rather than Array. That would differ from getindex which really wants the type of the input array. This could be introduced without breakage by having it fall back to similar(Array, ...).

Collect only uses similar(1:1, ...) in cases of generic iterables. Since CategoricalArray <: AbstractArray, it hits this:

collect(A::AbstractArray) = _collect_indices(axes(A), A)

julia/base/array.jl

Lines 643 to 645 in cfb9b55

_collect_indices(::Tuple{}, A) = copyto!(Array{eltype(A),0}(undef), A)
_collect_indices(indsA::Tuple{Vararg{OneTo}}, A) =
copyto!(Array{eltype(A)}(undef, length.(indsA)), A)

That is, we're not using similar at all — it's just creating the Array.

Edit: Oh, except that map collects a generator:

map(f, A::AbstractArray) = collect_similar(A, Generator(f,A))

@mbauman
Copy link
Sponsor Member

mbauman commented Jun 1, 2020

Wait, my edit there isn't quite right either, because it's explicitly using collect_similar, which should use similar to get the behavior you want. That is, the call chain goes:

map(f, A::AbstractArray) = collect_similar(A, Generator(f,A))

collect_similar(cont, itr) = _collect(cont, itr, IteratorEltype(itr), IteratorSize(itr))

julia/base/array.jl

Lines 632 to 633 in cfb9b55

_collect(cont, itr, ::HasEltype, isz::Union{HasLength,HasShape}) =
copyto!(_similar_for(cont, eltype(itr), itr, isz), itr)

julia/base/array.jl

Lines 601 to 602 in cfb9b55

_similar_for(c::AbstractArray, ::Type{T}, itr, ::HasShape) where {T} =
similar(c, T, axes(itr))

Why isn't that hitting similar(::CategoricalArray, ...) like you want?

@nalimilan
Copy link
Member Author

Woops, you're right. So looking at the history I think I actually added these similar methods to get broadcast to work, not map. I probably did it that way because having collect(itr) return a CategoricalArray when values are CategoricalValues was another advantage (cf. issue description).

Maybe I could have defined Base.similar(bc::Broadcasted{DefaultArrayStyle{N}}, ::Type{<:CategoricalValue}) instead. Though it might be not enough (need to check) since when the broadcasted function returns missing for the first value, a Vector{Missing} is (correctly) allocated here:

dest = similar(bc′, typeof(val))

and passed to copyto_nonleaf!, which has to widen it when a CategoricalValue. But it doesn't use the Broadcasted object when calling similar:
newdest = Base.similar(dest, promote_typejoin(T, typeof(val)))

So in that case defining Base.similar(bc::Broadcasted...) isn't enough. Do you think that line could be changed to Base.similar(bc, promote_typejoin(T, typeof(val)), axes(dest))?

@JeffBezanson
Copy link
Sponsor Member

🏴‍☠️

@nalimilan
Copy link
Member Author

Technically it's not type piracy, as CategoricalArrays own CategoricalValue. But yeah it's not really satisfying (which is why I filed this issue). :-)

@tkf
Copy link
Member

tkf commented Jun 1, 2020

I've already mentioned it in #34478 that @mbauman linked, but it'd be nice if Base can have BangBang.append!! and maybe even BangBang.collector as a composable map backend. It is essential for high-performance and parallelizable map implementations that does not rely on return_type (except for strictly optimization purpose). For example, StructArrays.jl already implements StructArrays.append!! as well but it's a bit hard to wire things together without bloating dependencies ATM. So I think it'd be nice to move this to Base.

In an ideal world, map, filter, etc. would be implemented with composable and extensible pieces. Something like

using BangBang: append!!, collector, finish!
using MicroCollections: UndefVector

map(f, xs) = finish!(append!!(materializer(xs), (f(x) for x in xs)))
filter(f, xs) = finish!(append!!(materializer(xs), (x for x in xs if f(x))))

materializer(xs::AbstractVector) = collector(UndefVector(size(xs)), true)  # unsafe = true
materializer(xs) = haslength(xs) ? collector(UndefVector(size(xs)), true) : collector()
materializer(::Tuple) = ()
materializer(::NamedTuple) = NamedTuple()
materializer(::AbstractDict) = error("nope")

CategoricalArray can then overload materializer and return it's own materializer by composing functions like collector and append!!.

@tkf
Copy link
Member

tkf commented Jun 2, 2020

By the way, why would you want to implement map(f, ::CategoricalArray) via generic map fallback? Wouldn't it be better to assume f is reasonably pure (make it an API requirement of CategoricalArray) and map over the pooled values?

It would be nice if we can do it at least with f.(::CategoricalArray) #36081.

For generic cases like map(f, ::CategoricalArray, ::Array), I think the topic in this issue still relevant, though.

@nalimilan
Copy link
Member Author

By the way, why would you want to implement map(f, ::CategoricalArray) via generic map fallback? Wouldn't it be better to assume f is reasonably pure (make it an API requirement of CategoricalArray) and map over the pooled values?

Yeah, that's an optimization that I should definitely implement at some point. But I wanted to raise the general design issue, as it is relevant for mixed-types multiple-input map as you noted. The question of what collect should return is also relevant in general.

The materializer approach sounds interesting. It would be very useful to make it simpler to implement custom map methods without reinventing the wheel. Though I'm not sure it's strictly required to sort out the public API of map/collect/similar.

@tkf
Copy link
Member

tkf commented Jun 3, 2020

It would be very useful to make it simpler to implement custom map methods without reinventing the wheel.

The approach I took in Transducers+BangBang is based on the first principle that you treat everything as monoid (-ish). It's actually very simple and, as a result, it is composable, works with immutables, and is usable as a building block of parallel computation.

There are obviously tons of overlap in terms of functionality to what Base does since they are doing the same thing for basic cases. So, I understand that it looks like it's "reinventing the wheel." However, if you are going to extend and expose low-level collect API, I don't think exposing what Base has as-is is the best choice. Of course, most of the code in Base is probably re-usable so I'm not saying rewrite everything from scratch. I'm by no means complaining about the implementation in Base. But I think it would be highly beneficial to think about what to expose as the interface.

Though I'm not sure it's strictly required to sort out the public API of map/collect/similar.

What about

  • Generator, filter, flatten, and other iterator transformations? Wouldn't it be nice to get a CategoricalArray from collect(f(x) for x in xs if p(x)) if xs isa CategoricalArray?

  • Other composite collections like zip and product?

  • Non-array collections like dictionaries and sets. Yes, I know map(f, ::Dict) does not work but we need something that can work with them.

@nalimilan
Copy link
Member Author

"without reinventing the wheel" wasn't a criticism of BangBang, quite the contrary actually: I mean that it would to write custom implementations with less code duplication.

@tkf
Copy link
Member

tkf commented Jun 3, 2020

Oh, I see. Sorry that I reacted rather strongly based on my miss-interpretation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Design of APIs or of the language itself domain:arrays [a, r, r, a, y, s] kind:speculative Whether the change will be implemented is speculative
Projects
None yet
Development

No branches or pull requests

4 participants