Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map over a KeySet unexpectedly returns a Set #26359

Closed
nalimilan opened this issue Mar 7, 2018 · 22 comments
Closed

map over a KeySet unexpectedly returns a Set #26359

nalimilan opened this issue Mar 7, 2018 · 22 comments
Labels
domain:collections Data structures holding multiple items, e.g. sets
Milestone

Comments

@nalimilan
Copy link
Member

This is a regression from 0.6.2 which affects DataFramesMeta (JuliaData/DataFramesMeta.jl#88). Under some very particular circumstances, expression interpolation reverses the order of keys in a dictionary, compared with the order obtained in other places inside the same function. This is a problem when the function relies on the order of keys matching that of values (as with DataFrames macros).

In the following example, :a appears before :b in the two first lines inside the quoted block, but that's the contrary for the third line.

function with_helper()
    membernames = Dict{Symbol,Int}(:a => 1, :b => 2)
    quote
        $(keys(membernames)...)
        $(map(key -> :($key), keys(membernames))...)
        $(map(key -> :(somedict[$key]), keys(membernames))...)
    end
end

julia> with_helper()
quote
    #= REPL[1]:4 =#
    a
    b
    #= REPL[1]:5 =#
    a
    b
    #= REPL[1]:6 =#
    somedict[b]
    somedict[a]
end
@nalimilan nalimilan added kind:regression Regression in behavior compared to a previous version macros @macros labels Mar 7, 2018
@vtjnash vtjnash added kind:invalid Indicates that an issue or pull request is no longer relevant and removed macros @macros kind:regression Regression in behavior compared to a previous version labels Mar 7, 2018
@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 7, 2018

As with other languages, Dict key iteration order is undefined and unstable (permitted to change from one call to the next).

@vtjnash vtjnash closed this as completed Mar 7, 2018
@nalimilan
Copy link
Member Author

nalimilan commented Mar 7, 2018

Yet:

  values(a::AbstractDict)

  Return an iterator over all values in a collection. collect(values(a))
  returns an array of values. Since the values are stored internally in a hash
  table, the order in which they are returned may vary. But keys(a) and
  values(a) both iterate a and return the elements in the same order.

EDIT: IOW, we're supposed to be able to call keys(d) then values(d) and get entries in the same order. Yet you claim we wouldn't be able to call keys(d) twice and get the same order?

Also FWIW the original code in DataFrames did this:

            function $funname($(values(membernames)...))
                $body
            end
            $funname($(map(keys(membernames)) do key
                :($d[$key])

@nalimilan nalimilan reopened this Mar 7, 2018
@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 7, 2018

The observed result is from map(f, Set) --> Set, so it's the return from map that is being observed as unordered in this example

@vtjnash vtjnash changed the title Inconsistent dict keys order inside complex expression interpolation map over a Set unexpectedly returns a Set Mar 7, 2018
@nalimilan nalimilan changed the title map over a Set unexpectedly returns a Set map over a KeySet unexpectedly returns a Set Mar 7, 2018
@nalimilan
Copy link
Member Author

Ah, good catch, so it's just due to the fact that calling map over a KeySet gives a Set, which loses ordering. I wonder why I couldn't reproduce this with $(map(key -> :($key), keys(membernames))...), though. I guess that's just luck...

So I guess there's nothing to change here and DataFramesMeta should just use a comprehension? It's a bit annoying that map over keys(::Dict) loses the ordering, but there are probably also advantages to returning a Set rather than a vector.

@StefanKarpinski
Copy link
Sponsor Member

It's not entirely clear to me that map over a set should produce a set.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 7, 2018

Possibly introduced by #15146? I don't think there's any advantage to losing the iteration order either, but that's the way it seems to be going for AbstractSet and AbstractDict.

@ararslan ararslan added domain:collections Data structures holding multiple items, e.g. sets and removed kind:invalid Indicates that an issue or pull request is no longer relevant labels Mar 7, 2018
@nalimilan
Copy link
Member Author

Actually part of the issue seems to be that KeySet is not really a standard AbstractSet because the ordering matters. For other AbstractSets that problem doesn't exist. So maybe KeySet shouldn't be an AbstractSet at all? Actually @andyferris had already noticed soon after making this change that inheriting from AbstractSet created problems : #24580 (comment). I guess we should just go back to KeyIterator.

@andyferris
Copy link
Member

andyferris commented Mar 8, 2018

That's basically right; I envisaged the change to KeySet to go together with #24508 which we didn't go ahead with, plus a few other tweaks to dictionaries and sets I was thinking about. I never really communicated the whole picture on github very clearly... so I'll give it a go now (at least for histortical referece, TLDR is we should definitely change KeySet or else consider more drastic things).

To clarify what my intentions were (and the behavior I still strongly desire): I want AbstractDict to behave much more similarly to AbstractArray. I'm imagining dictionaries are an indexable collection that iterates values (this iteration behavior seemed viable to me at the time since I first heard this idea from posts by @JeffBezanson, and also that's how named tuples behave), and supports keys and pairs just like AbstractArray. The keys of an array are ranges (or cartesian ranges) which are idempotent under indexing. E.g. I view the fact that getindex(::OneTo, i) = i as a key (no pun intended) property of abstract array indexing and keys. It's this behavior that guides the return type of non-scalar getindex (as in A[B] should return something with the same shape/keys as B using the individual values of B as indices to A - note that this extends for offset arrays too).

Thus, what I want is that keys(dict) returns a new special kind of dictionary that has the same indexes and keys, something like keys(Dict("key1" => "value1", ...)) == Dict("key1" => "key1", ...). This would be a special type like KeyIterator or KeySet or DictKeys or something. To give the history of #24508, I noticed that the values of this new, indexable keys object were guaranteed to be unique since the keys were unique - so it seemed reasonable to call it a "set". I also didn't see anything that breaks about AbstractSet except what @StefanKarpinski points out above:

It's not entirely clear to me that map over a set should produce a set.

Exactly - how do we know that the mapped values won't overlap? map seems somewhat tied to the iterator API and IMO probably shouldn't mess with the number of elements (discarding duplicates). In this brave new world I was imagining, mapping a set would indeed produce a dictionary, not a set. In fact what I was thinking is that AbstractSet{T} becomes a special, omnipotent idempotent type of AbstractDict{T,T}, and that dictionaries iterate values, and together both changes mean that sets really don't change very much at all, besides some new (admittedly somewhat esoteric!) indexing behavior.

We didn't end up going that way, which is fine, but it would be nice to have a look over this again and figure out what makes sense. For the record, my recommendations:

d = Dict("key1" => "value1", "key2" => "value2")
keys(d) == Dict("key1" => "key1", "key2" => "key2") # Not necessarily an actual `Dict`, of some other dictionary type analogous to how `OneTo` is an `AbstractArray`, etc.
pairs(d) == Dict("key1" => ("key1" => "value1"), "key2" => ("key2" => "value2")) # Not necessarily an actual `Dict`, just some "view" like we could do for arrays

I'll give an example of what this allows us to do, for example for map. The behavior is worked out completely in analogy to arrays (and also happens to be a behavior I think is quite easy to reason about and is very useful in practice):

# A dictionary matching name and occupation
d = Dict("Alice" => "Physicist", "Eve" => "Spy")

# Map the dictionary values and make a new dictionary with the same keys
map(job -> "I am a $job", d) == Dict("Alice" => "I am a Physicist", "Eve" => "I am a Spy")

# Map the dictionary keys and make a new dictionary with the same keys
map(name -> "My name is $name", keys(d)) == Dict("Alice" => "My name is Alice", "Eve" => "My name is Eve")

# Map the dictionary key-value pairs and make a new dictionary with the same keys
map(pair -> "My name is $(pair.first) and I am a $(pair.second)", pairs(d)) == Dict("Alice" => "My name is Alice and I am a Physicist", "Eve" => "My name is Eve and I am a Spy")

A final example of why I like my proposed behavior is grouping (aka "groupby"). Say I have a table, dictionary, or whatever with peoples names, gender and height like this:

name gender height (m)
Alice Female 1.58
Bob Male 1.78
... ... ...

I can perform a grouping dicarding the names and collecting the ages by gender. I played with prototypes in SplitApplyCombine.jl and found I could make a nice function to turn this into something like:

gender_vs_heights == Dict("Female" => [1.58, ...], "Male" => [1.78, ...])

I then realized - what then? I couldn't do non-scalar indexing on the resulting container (hence #24019 and Indexing.jl), I couldn't use map in an ergonomic way (hence #25013 with a view to iterate values in the future), I couldn't use broadcast in an ergonomic way (hence #25904). In this example, what I really, desperately, want to do is this:

gender_vs_mean_height = mean.(gender_vs_heights)

I feel this example is literally screaming out for this syntax and behavior, but maybe that's just me.

I don't really know where we'll go from here, but I do feel AbstractDicts could be made much more useful for "data" people, and like all good Julia people I'm greedy. I'm happy to discuss and help out - even just backing out the KeySet thing might be the most useful thing I can help with at this stage.

@nalimilan
Copy link
Member Author

Thanks for the explanation! I'm not sure I can discuss the overall design right now, but speaking specifically of the present issue, how would you envision fixing the annoying fact that map over keys(d) loses the ordering? AFAICT if

map(name -> "My name is $name", keys(d)) == Dict("Alice" => "My name is Alice", "Eve" => "My name is Eve")

then the resulting Dict would not necessarily iterate in the same order as d and keys(d), right? Or should/can we ensure that by copying d's structure (of course the value type will have to be changed, but keys are by definition the same)?

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 8, 2018

I'm imagining dictionaries ... iterates values

What's missing from values(dict) to do precisely this?

Or should/can we ensure that

For most cases (including Dict) we can do it, the question is really whether it should be mandated of all containers. The downside I see is that it might be less composable, since we'd be requiring that all iterable containers have indices to be used with map, ruling out the ability to map over a Generator / Zip / Product / other lazy structure (as seen with #18618, the counter-proposal to Andy's above issue numbers)

@JeffBezanson
Copy link
Sponsor Member

One nice thing about having keys(dict) return a set is that it signals that in is efficient, so set functions don't need to bother converting it to a set first.

I like the picture painted by @andyferris , but I don't fully understand the thing about index sets mapping values to themselves. For example, an OffsetArray can have -5:5 as its indices, where (-5:5)[1] == -5. Would we wrap the range in another type to make it idempotent? That part hasn't quite clicked for me yet.

@mbauman
Copy link
Sponsor Member

mbauman commented Mar 8, 2018

Would we wrap the range in another type to make it idempotent? That part hasn't quite clicked for me yet.

That's precisely how we represent :.

julia> Base.Slice(-5:5)[1]
1

julia> Base.Slice(-5:5)[-5]
-5

julia> Base.Slice(-5:5)[-6]
ERROR: BoundsError: attempt to access Base.Slice{UnitRange{Int64}} with indices -5:5 at index [-6]

@mbauman
Copy link
Sponsor Member

mbauman commented Mar 8, 2018

We don't do that for axes(::OffsetArray) and keys(::OffsetArray), though. Perhaps we should?

(A::Array)[axes(A)...] == A
(A::Array)[keys(A)] == A
(B::OffsetArray)[axes(B)...] != B # It's an array with OneTo axes!
(B::OffsetArray)[keys(B)] != B # Again an array
(B::OffsetArray)[ntuple(i->:, ndims(B))] == B

We've committed to using the rule that the axes of the result are the concatenation of the axes of the indices for arrays. So now if we extend this to non-scalar indexing of dictionaries:

d::Dict
d.[keys(d)] == d # the same dictionary back if keys(d) == keys(keys(d))
d.[collect(keys(d))] == [d[k] for k in collect(keys(d))] # An array with OneTo axes

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 8, 2018

That's precisely how we represent :.

Isn't Slice representing the inverse operation (remapping 1:N to the parent indices, rather than mapping N:M to 1:(M - N), or are those equivalent since you shouldn't be trying to index into the array axes anyways)?

KeySet is not really a standard AbstractSet because the ordering matters.

I don't think that's the relevant issue here. The gotcha from your OP was that map assumes the output should form an unordered Set because the input was detected to be created from an unordered set. And that there's examples where either interpretation is more consistent / useful / dwim.

So now if we extend this to non-scalar indexing of dictionaries:

Full grid of these at #25904 (comment) – if I missed a case, would be good to add it, or move that to a gist to extend it more. :)

@JeffBezanson
Copy link
Sponsor Member

We should perhaps have a rule that map should be index-preserving for indexed collections, and otherwise be order-preserving. So with that rule map on a set could be undefined, or give an array (or other ordered collection).

@andyferris
Copy link
Member

andyferris commented Mar 8, 2018

I don't fully understand the thing about index sets mapping values to themselves

Like mentioned, this is in analogy to arrays (which are particularly awesome in Julia, BTW). Why is keys(::AbstractArray) an indexable container that has this idempotent property? It could be simply iterable, right? And if keys(::AbstractArray) returns an indexable container, is there a good reason that keys(::AbstractDict) shouldn't return an indexable container?

I would relate these questions directly to #24019, where the proposed semantics for generalized non-scalar getindex is that the output of A[B] is an indexable container that shares the keys of B and the appropriate values of A. It also seems logical to me that A[keys(A)] == A (and also A[keys(A)] == A[:] at least in the one-dimensional case, where I see : as mostly a placeholder for keys(A), in the multidimensional case : is a placeholder for one of the axes). Whenever keys returns an idempotently indexable collection, then A[keys(A)] == A follows naturally. On the other hand, if keys(dict) returned an Array of keys, I'd expect A[keys(A)] to become an Array. If you actually want the array - you can always insert a call to collect. (Note that going the other direction, turning an array into something which can create a dictionary as the indexer of some other collection isn't exactly ergonomic - unless Set has this idempotent indexing property in which case A[Set([2,3,5,7])] is a good way of creating the dictionary mapping 2 => A[2], and 3 => A[3], etc).

Regarding

how would you envision fixing the annoying fact that map over keys(d) loses the ordering?

and

We should perhaps have a rule that map should be index-preserving for indexed collections, and otherwise be order-preserving. So with that rule map on a set could be undefined, or give an array (or other ordered collection).

My opinion on these isn't quite as specific. @nalimilan For the example I gave, it is feasible to copy the internal structure of the input dictionary to preserve the ordering (it's faster than reconstructing a new hash map anyway, for example).

My vague preference is we can enforce that map always preserves ordering and not necessarily indices (e.g. it may produce an array from map(f, set) and that would be fine) and that broadcast of a single container always preserves indices and not necessarily iteration order (so that broadcast(f, keys(dict)) makes a dictionary sharing the keys of dict and the values f(dict[i]) for i in keys(dict)). Whenever practical, map and broadcast may produce the same output. I feel this would let us write more precise semantics for both of these functions, and more formally pin down the relationship between the iteration and indexing APIs and related functions.

What's missing from values(dict) to do precisely this?

Again, this is mostly the analogy to AbstractArray (for which I find the API awesome and convenient for manipulating data). I have personally found it confusing to switch between how I interact with arrays, tuples and named tuples and how I interact with dictionaries. It goes a little beyond just iteration and indexing and what keys or map does - it's also stuff like filter, reduce, mapreduce and mean. It's difficult to make generic code that works on "generic indexable containers" (i.e. Union{AbstractArray, AbstractDict}), and when I try there's usually a little hack somewhere to detect which it is. I also feel I would make use of pairs(dict) less often than values(dict), but I can definitely see this will vary from person-to-person. (One main advantage I see of iterating pairs is that iteration is faster than iterating keys then using getindex to get the value when you need both - but this isn't an argument for it as the default behavior, which IMO should be motivated by letting users create generic, readable, composable code).

@kescobo
Copy link
Contributor

kescobo commented Apr 10, 2018

As I brought up on slack, here's another example where map returning a Set doesn't make a lot of sense:

julia> s = Set(["foo", "fu", "bar", "baz", "blah"])
Set(String["bar", "blah", "fu", "baz", "foo"])

julia> map(x-> startswith(x, "f"), s)
Set(Bool[false, true])

Not sure I fully understood all of the back and forth, but this seems like an argument for map to return an array a la Jeff's suggestion:

We should perhaps have a rule that map should be index-preserving for indexed collections, and otherwise be order-preserving. So with that rule map on a set could be undefined, or give an array (or other ordered collection).

@StefanKarpinski
Copy link
Sponsor Member

I still think we should consider the "map preserves indices for collections with indices and preservers iteration order for collections without indices" rule. It seems like the one that works the best to me.

@nalimilan nalimilan added the status:triage This should be discussed on a triage call label Apr 26, 2018
@Keno Keno added this to the 1.0 milestone Apr 26, 2018
@Keno
Copy link
Member

Keno commented Apr 26, 2018

Triage thinks that at the very least, we should deprecate map on Set, tagging as 1.0.

mbauman added a commit that referenced this issue May 4, 2018
`map` on sets previously returned a `Set`, possibly changing the order or number of elements. This
behavior is deprecated and in the future `map` will preserve order and number of elements.

Fixes #26359.
@JeffBezanson JeffBezanson removed the status:triage This should be discussed on a triage call label May 9, 2018
@hraftery
Copy link

hraftery commented Dec 30, 2022

Fascinating. I came via #42132 through #5794 through a6c4691 to here. I sympathise with the dream of more array like collections. What I fear might have been delayed along the way is the principle of least surprise that makes composability worth using. I say "delayed" because the paths anticipated by this issue might not be undermined by just patching up what we have now.

By that, I mean that currently filter returns an Vector for a Vector, a Set for a Set and a Dict for a Dict. But map returns a Vector for a Vector and an error for a Set or a Dict, and there are proposals to replace that with array-like for all three.

My bias is a long history of bare-metal type C programming, only really getting into collections and iterators with maybe Rust, then seeing the forest for the trees with Haskell, and finally falling in love with multiple dispatch in Julia. So I find it jarring that map on a data structure would return a different data structure.

The Haskell docs on mapping a set may be illuminating here. For those unfamiliar with the highly unintuitive Haskell type signatures, Ord b => (a -> b) -> Set a -> Set b means "for ordinal orderable type b, map takes a function that maps a's to b's, and a set of a's, and returns a set of b's.

So given that history, I've come to expect that iterating a vector/list/array gives elements of the array type in a certain order; iterating a set gives elements of the set type in an unpredictable order, and iterating a dict gives (key, value) pairs in an unpredictable order (although often the key is suppressed, which is consistent with iterating an array not returning index/value pairs).

Starting with the set case, because the implications are much simpler, this would definitely provide the least surprise for me:

map(f, set) = Set([f(e) for e in set])

That this is may result in a smaller number of elements seems entirely normal to me, since I know that when I use Sets, adding an element may not change the Set size (and may re-order the collection!). The implications of uniqueness are front of mind, and desirable.

If I wanted instead the results of the function applied to every element, as described in @kescobo 's example, I would happily make that explicit:

map(x-> length(x), collect(s))

For a dict there's a couple of options, but only those that return a dict seem reasonable to me. Again, it may be useful to check the Haskell docs. Mind the confusion over the fact that the Haskell dict is called a Map! But basically you have three options: map iterates values and returns a dict with the same keys but with the values you provide, while mapWithKey does the same thing, but also provide you the key and the value. Finally, mapKeys does what map does but for keys instead of values (and you can use mapKeysWith to specify what to do if there's a clash).

The examples from @andyferris would then look much the same:

d = Dict("Alice" => "Physicist", "Eve" => "Spy")

# Map the dictionary values and make a new dictionary with the same keys
map(job -> "I am a $job", d) == Dict("Alice" => "I am a Physicist", "Eve" => "I am a Spy")

# Map the dictionary keys and make a new dictionary with the same keys
mapWithKey((name,_) -> "My name is $name", d) == Dict("Alice" => "My name is Alice", "Eve" => "My name is Eve")

# Map the dictionary key-value pairs and make a new dictionary with the same keys
mapWithKey(pair -> "My name is $(pair.first) and I am a $(pair.second)", pairs(d)) == Dict("Alice" => "My name is Alice and I am a Physicist", "Eve" => "My name is Eve and I am a Spy")

gender_vs_mean_height = map(mean, gender_vs_heights))

So I guess what I'm saying is that restoring the equivalent of:

map(f, A::Union{AbstractArray,AbstractSet}) = collect_similar(A, Generator(f,A))

seems to me to be an improvement right now, that doesn't interrupt progress on other ideas in this issue, provided there's agreement that map should preserve collection type, like filter does. I apologise earnestly for being very new to Julia and being very ignorant to other efforts and language design philosophies (eg. dynamic typing!) on this topic.

Edit: argh, just noticed this issue is closed. Happy to pick it up somewhere else?

@andyferris
Copy link
Member

Appologies, I realize this issue is closed but it seemed most straightforward to reply to @hraftery here.

I think it's desirable that users easily have access to both of those semantics (mapping sets to sets, and mapping sets to dictionaries).

We now have Iterators.map being another shorthand for generators. However Iterators.map is just meant to be the lazy version of map. Would it be surprising if for x in map(f, set); ...; end behaved differently to for x in Iterators.map(f, set); ... ; end? I think it could be. Generally you want to use Iterators.map as "I want to map the entries of this thing and then iterate over that, but I just want to save some memory in the meanwhile". (In my mental model, map is a generic function with semantics on generic iterables, and richer yet compatible semantics on richer data structures).

So I guess what I'm saying is that restoring the equivalent of:

map(f, A::Union{AbstractArray,AbstractSet}) = collect_similar(A, Generator(f,A))

Note that the meaning of similar on sets and dictionaries could be defined so that this is preservered (as is already the case in Dictionaries.jl).

@hraftery
Copy link

Heh, interesting @andyferris . Funnily enough I did turn to Iterators for take, flatten, partition (actually in IterTools) and most relevant to your point, cycle, and found myself in familiar (Haskell-ish) territory. So in general I think it's completely expected that a package would have a different opinion on what's important. Eg. Iterators favouring laziness, or say, Base reserving opinion on how to do element-wise operations on vectors of different lengths. But different return semantics for the same method name? I don't know - the name of the package is not enough to convey to me the difference. It all comes down to expectations, which are largely established by documentation. As I'm sure you appreciate, if the opinion at work is clear and consistent, surprise is minimised, but that's hard to achieve in general.

I already find the subtle differences between [f(x) for x in vec] and map(f, vec) confusing.

Note that the meaning of similar

Oh I have to admit this is beyond me - I only pulled the example from the changelog, not fully understanding what it does.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

No branches or pull requests

10 participants