Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support mapreduce over dimensions with SkipMissing #28027

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Jul 10, 2018

The first commit fixes things like mapreduce(cos, +, Union{Int,Missing}[1], dims=1). See #26709 and #27457. A deeper revamp is needed, but given that the current code doesn't make sense, that sounds like an improvement. EDIT: first commit moved to #28089.

The second commit adds an implementation of mapreduce over dimensions for SkipMissing{<:AbstractArray}, closely based on the existing AbstractArray methods. This allows doing things which currently don't work like mapreduce(skipmissing([1 2; missing 4]), dims=1) to compute column sums while skipping missing values.

Unfortunately, the simple idea I had of just wrapping the user-provided reduction operation in a function which skips missing values does not work, because _mapreducedim! calls mapreduce_impl for efficiency when possible, and it fails when a slice contains only missing values (cf. #27743). Also performance would probably be worse, at least for now.

I'm not really happy about the amount of duplication this requires, but that's the only solution I could find. Apart from the boilerplate dispatching methods, we could avoid duplicating _mapreducedim! and reducedim_initarray0 but that would introduce several not-so-pretty A isa SkipMissing checks. So maybe maintaining two very similar sets of functions in parallel isn't so bad. It should be quite easy to apply changes to both, as long as one doesn't forget that two methods exist.

Replaces #27818.

@nalimilan nalimilan added arrays [a, r, r, a, y, s] missing data Base.missing and related functionality labels Jul 10, 2018
@andreasnoack
Copy link
Member

#27845 might be touching some of the same code.

@@ -116,7 +116,7 @@ function _reducedim_init(f, op, fv, fop, A, region)
if T !== Any && applicable(zero, T)
x = f(zero(T))
z = op(fv(x), fv(x))
Tr = typeof(z) == typeof(x) && !isbitstype(T) ? T : typeof(z)
Tr = typeof(z) <: T ? T : typeof(z)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

z isa T

@JeffBezanson
Copy link
Member

Separate PRs, please.

@nalimilan
Copy link
Member Author

Moved the first commit to #28089. I'll keep it here until it's merged since tests won't pass without it.

@stillyslalom
Copy link
Contributor

This is a much better implementation & interface than #27818. Cheers!

Allows calling mapreduce and specialized functinos with the dims argument
on SkipMissing objects. The implementation is copied on the generic methods,
but missing values need to be handled directly inside functions for efficiency
and because mapreduce_impl returns a Some object which needs special handling.
@nalimilan
Copy link
Member Author

I've rebased and used the same approach as in #27845. Unfortunately, special code is still needed when the first slice contains missing values. Good to go?

Just to check the optimized plain mapreduce is still used:
@nanosoldier runbenchmarks("union", vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan nalimilan added the triage This should be discussed on a triage call label Jul 19, 2018
@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 19, 2018

Why does this need triage? Isn't it just an optimization? Nobody on the triage call really understands what's going on here...

@nalimilan
Copy link
Member Author

This one is not an optimization, it adds support for something which isn't currently possible.

@StefanKarpinski
Copy link
Member

Ok, that's a feature not a breaking change, so also doesn't need triage.

@nalimilan
Copy link
Member Author

But somebody needs to decide whether it's OK to merge before 0.7. I think it's important given that missing values are a major feature of that release.

@ararslan
Copy link
Member

If it passes tests and someone who knows things approves, I'd say merge whenever. ¯\_(ツ)_/¯

@StefanKarpinski StefanKarpinski removed the triage This should be discussed on a triage call label Jul 26, 2018
@JeffBezanson
Copy link
Member

Since this seems to be non-breaking, it doesn't really require triage. If people generally approve of it it can be merged any time.

@JeffBezanson
Copy link
Member

Though I'm a bit concerned about treating skipmissing(A::AbstractArray) as an n-d array, since it can't really act like one.

@nalimilan
Copy link
Member Author

Yes but we don't really treat it like an AbstractArray, we just support reduction operations. The only alternative I can see is to add a keyword argument to all reduction functions, which requires duplicating a lot of code and is inconsistent with sum(skipmissing(x)). Do you see another solution for this essential feature?

@nalimilan
Copy link
Member Author

Bump. Anybody willing to review this?

@nalimilan
Copy link
Member Author

See mini Julep at #30606.

end
end
else
filled = fill!(similar(R, Bool), false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use an external boolean array to keep the "stage" of reduction, using _InitialValue + BottomRF as now done in foldl might be simpler. However, this would need to use Array{Union{T,Missing,_InitialValue}} as the destination/state. It would be nice if there is an API to get Array{Union{T,Missing}} from Array{Union{T,Missing,_InitialValue}} by only re-writing the type tag part of the array. Is there such an API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we could simply use missing here to indicate a slice for which we haven't found a non-missing value yet. This is because in the end the initialized array should contain missing only if the slice contains only missing values, in which case an error is thrown.

(Note that the current code just returns a single value, which is the smallest/largest value in the whole array. This relies on the assumption that all entries in the array can be compared with <, which isn't necessarily the case. So returning the smallest/largest value for each slice would make sense, in which case the strategy I describe above should work.)

Regarding your question, I don't think such a conversion method exists, but there's a related issue: #26681.

end

# Iterate until we've encountered at least one non-missing value in each slice,
# and return the min/max non-missing value of all clices
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# and return the min/max non-missing value of all clices
# and return the min/max non-missing value of all slices

# non-missing value in each slice
if v0 === missing
v0 = nonmissingval(f, $f2, itr, R)
R = similar(A1, typeof(v0))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove.

end
end
else
filled = fill!(similar(R, Bool), false)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we could simply use missing here to indicate a slice for which we haven't found a non-missing value yet. This is because in the end the initialized array should contain missing only if the slice contains only missing values, in which case an error is thrown.

(Note that the current code just returns a single value, which is the smallest/largest value in the whole array. This relies on the assumption that all entries in the array can be compared with <, which isn't necessarily the case. So returning the smallest/largest value for each slice would make sense, in which case the strategy I describe above should work.)

Regarding your question, I don't think such a conversion method exists, but there's a related issue: #26681.

@pdeffebach
Copy link
Contributor

Bumping this, as someone mentioned it on slack. I would assume this is too big a change to get in by the 1.6 feature freeze?

@briochemc
Copy link
Contributor

Bump as I stumbled on this just now :)

Was this given up on?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] missing data Base.missing and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants