Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing value not working with argmax and findmax #29305

Closed
getzze opened this issue Sep 21, 2018 · 11 comments · Fixed by #31008
Closed

Missing value not working with argmax and findmax #29305

getzze opened this issue Sep 21, 2018 · 11 comments · Fixed by #31008
Labels
missing data Base.missing and related functionality

Comments

@getzze
Copy link
Contributor

getzze commented Sep 21, 2018

My goal is to find the index of the maximum of an array containing missing values :

julia> w = [1.0, missing, 2.0]
3-element Array{Union{Missing, Float64},1}:
 1.0     
  missing
 2.0     

julia> maximum(w)
missing

julia> maximum(skipmissing(w))
2.0

This is working as I expected, however returning the index does not:

julia> findmax(w)
ERROR: TypeError: non-boolean (Missing) used in boolean context

julia> argmax(w)
ERROR: TypeError: non-boolean (Missing) used in boolean context

julia> findmax(skipmissing(w))
ERROR: MethodError: no method matching keys(::Base.SkipMissing{Array{Union{Missing, Float64},1}})

julia> argmax(skipmissing(w))
ERROR: MethodError: no method matching keys(::Base.SkipMissing{Array{Union{Missing, Float64},1}})

I don't think I should get an error in any of these cases, but rather propagate the missing value or skip it.

Notice that it cannot be done with NaNs either because NaNs = ±Inf in this context

@nalimilan nalimilan added the missing data Base.missing and related functionality label Sep 22, 2018
@nalimilan
Copy link
Member

The problem is that skipmissing simply returns an iterator, which has no idea about indices and doesn't support getindex.

I guess one solution would be to define Base.keys(itr::Base.SkipMissing) = (i for i in keys(itr.x) if !ismissing(itr.x[i])). If you run that line, the examples you give work. But then this raises another issue, which is that one should probably be allowed to call getindex with indices returned by keys. I'm not sure whether that's a good idea.

@martinholters
Copy link
Member

What should argmax(w) return, 2 or missing? What should argmax(skipmissing(w)) return, 2 (the second non-missing value) or 3 (the third value in the original array)? As someone not used to working with missing values, that is not quite obvious to me.

@nalimilan
Copy link
Member

What should argmax(w) return, 2 or missing?

Throwing an error as currently is fine IMO, but we could also return missing.

What should argmax(skipmissing(w)) return, 2 (the second non-missing value) or 3 (the third value in the original array)? As someone not used to working with missing values, that is not quite obvious to me.

Yes, that's tricky, which is why it isn't currently implemented. I guess the most useful behavior would be to refer to the keys of the original array (giving 3):

  • if you want indices referring to the position in non-missing values (i.e. 2), you are likely to call collect(skipmissing(w)) at some point, so you'd better work with that array instead
  • an index referring to the position in non-missing values cannot be efficiently used to index skipmissing(w) since that requires an O(N) operation

@getzze
Copy link
Contributor Author

getzze commented Sep 24, 2018

What should argmax(w) return, 2 or missing? What should argmax(skipmissing(w)) return, 2 (the second non-missing value) or 3 (the third value in the original array)? As someone not used to working with missing values, that is not quite obvious to me.

I expect argmax(w) to return 2 (the index of the maximum value given by maximum) and argmax(skipmissing(w)) to return 3, the index of the maximum value given by maximum(skipmissing(w)). Returning 2 in the second case does not make sense, because 2 would not be an index of w but of [i for i in w if !ismissing(i)] . Also, if I need to get the answer 2, I would run argmax([i for i in w if !ismissing(i)])

I guess one solution would be to define Base.keys(itr::Base.SkipMissing) = (i for i in keys(itr.x) if !ismissing(itr.x[i])). If you run that line, the examples you give work. But then this raises another issue, which is that one should probably be allowed to call getindex with indices returned by keys. I'm not sure whether that's a good idea.

getindex for a generator does not seem appropriate as you are suppose to read the generator item by item. However, keys could be defined, if you need to do enumerate(gen) for instance. In this case keys would also return a generator.

@nalimilan
Copy link
Member

getindex for a generator does not seem appropriate as you are suppose to read the generator item by item. However, keys could be defined, if you need to do enumerate(gen) for instance. In this case keys would also return a generator.

I didn't suggest to define getindex on a generator. Instead I suggested to define keys(::SkipMissing) as a generator (but that's an implementation detail, we could also use a special type if we wanted).

@Balinus
Copy link

Balinus commented Nov 23, 2018

In climate science, we work a lot with missing values (weather stations are notorious for being bad at being consistent in their recording).

The use of argmax in climate data would be to find the timestamp of a maximum value. Hence, in this case, argmax(skipmissing(w)) should return 3 in my opinion. This is useful to find the date of extreme events. If it would have returned 2, the event would be labeled on the wrong time.

@kescobo
Copy link
Contributor

kescobo commented Jan 3, 2019

I've just run into this issue as well. Another possible solution (though I don't know if it's better than the suggestions already articulated) would be to have something like findmax(w, skipmissing=true). This is possibly less julian, but it seems to me that

  1. from a data science perspective findmax/findmin should return missing (or (missing, missing) to be consistent with the other returns from those functions).
  2. to @martinholters' question, argmax(skipmissing(w)) should return 2 in the above case (the second non-missing value), but it would be really useful to have a way of getting the index of the original array.

The reason for point 2 IMO is that it would be weird if

x = collect(skipmissing(w))
x[argmax(skipmissing(w))]

could give a bounds error.

Edit to clarify - My proposal would be that argmax(w, skipmissing=true) would return 3 in the case above, argmax(w, skipmissing=false) (the default) would return missing, while argmax(skipmissing(w)) would return 2 (or error as it does currently).

@nalimilan
Copy link
Member

We really can't start adding skipmissing arguments to all find* functions. There's no end to that process and it's not generic. Instead I think argmax(skipmissing(w)) should return 3 and have skipmissing(w)[3] === w[3]. I'll make a detailed proposal soon.

@kescobo
Copy link
Contributor

kescobo commented Jan 4, 2019

Fair enough - if skipmissing() can be directly indexed rather than needing to collect() first, then I think my objection goes away.

@nalimilan
Copy link
Member

See mini Julep at #30606.

@nalimilan
Copy link
Member

#31008 fixes this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
missing data Base.missing and related functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants