-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support mapreduce over dimensions with SkipMissing #28027
base: master
Are you sure you want to change the base?
Conversation
#27845 might be touching some of the same code. |
base/reducedim.jl
Outdated
@@ -116,7 +116,7 @@ function _reducedim_init(f, op, fv, fop, A, region) | |||
if T !== Any && applicable(zero, T) | |||
x = f(zero(T)) | |||
z = op(fv(x), fv(x)) | |||
Tr = typeof(z) == typeof(x) && !isbitstype(T) ? T : typeof(z) | |||
Tr = typeof(z) <: T ? T : typeof(z) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
z isa T
Separate PRs, please. |
Moved the first commit to #28089. I'll keep it here until it's merged since tests won't pass without it. |
This is a much better implementation & interface than #27818. Cheers! |
Allows calling mapreduce and specialized functinos with the dims argument on SkipMissing objects. The implementation is copied on the generic methods, but missing values need to be handled directly inside functions for efficiency and because mapreduce_impl returns a Some object which needs special handling.
f17f4c0
to
5efb69b
Compare
I've rebased and used the same approach as in #27845. Unfortunately, special code is still needed when the first slice contains missing values. Good to go? Just to check the optimized plain |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
Why does this need triage? Isn't it just an optimization? Nobody on the triage call really understands what's going on here... |
This one is not an optimization, it adds support for something which isn't currently possible. |
Ok, that's a feature not a breaking change, so also doesn't need triage. |
But somebody needs to decide whether it's OK to merge before 0.7. I think it's important given that missing values are a major feature of that release. |
If it passes tests and someone who knows things approves, I'd say merge whenever. ¯\_(ツ)_/¯ |
Since this seems to be non-breaking, it doesn't really require triage. If people generally approve of it it can be merged any time. |
Though I'm a bit concerned about treating |
Yes but we don't really treat it like an |
Bump. Anybody willing to review this? |
See mini Julep at #30606. |
end | ||
end | ||
else | ||
filled = fill!(similar(R, Bool), false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use an external boolean array to keep the "stage" of reduction, using _InitialValue
+ BottomRF
as now done in foldl
might be simpler. However, this would need to use Array{Union{T,Missing,_InitialValue}}
as the destination/state. It would be nice if there is an API to get Array{Union{T,Missing}}
from Array{Union{T,Missing,_InitialValue}}
by only re-writing the type tag part of the array. Is there such an API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we could simply use missing
here to indicate a slice for which we haven't found a non-missing value yet. This is because in the end the initialized array should contain missing
only if the slice contains only missing
values, in which case an error is thrown.
(Note that the current code just returns a single value, which is the smallest/largest value in the whole array. This relies on the assumption that all entries in the array can be compared with <
, which isn't necessarily the case. So returning the smallest/largest value for each slice would make sense, in which case the strategy I describe above should work.)
Regarding your question, I don't think such a conversion method exists, but there's a related issue: #26681.
end | ||
|
||
# Iterate until we've encountered at least one non-missing value in each slice, | ||
# and return the min/max non-missing value of all clices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# and return the min/max non-missing value of all clices | |
# and return the min/max non-missing value of all slices |
# non-missing value in each slice | ||
if v0 === missing | ||
v0 = nonmissingval(f, $f2, itr, R) | ||
R = similar(A1, typeof(v0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove.
end | ||
end | ||
else | ||
filled = fill!(similar(R, Bool), false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we could simply use missing
here to indicate a slice for which we haven't found a non-missing value yet. This is because in the end the initialized array should contain missing
only if the slice contains only missing
values, in which case an error is thrown.
(Note that the current code just returns a single value, which is the smallest/largest value in the whole array. This relies on the assumption that all entries in the array can be compared with <
, which isn't necessarily the case. So returning the smallest/largest value for each slice would make sense, in which case the strategy I describe above should work.)
Regarding your question, I don't think such a conversion method exists, but there's a related issue: #26681.
Bumping this, as someone mentioned it on slack. I would assume this is too big a change to get in by the 1.6 feature freeze? |
Bump as I stumbled on this just now :) Was this given up on? |
The first commit fixes things likeEDIT: first commit moved to #28089.mapreduce(cos, +, Union{Int,Missing}[1], dims=1)
. See #26709 and #27457. A deeper revamp is needed, but given that the current code doesn't make sense, that sounds like an improvement.The second commit adds an implementation of
mapreduce
over dimensions forSkipMissing{<:AbstractArray}
, closely based on the existingAbstractArray
methods. This allows doing things which currently don't work likemapreduce(skipmissing([1 2; missing 4]), dims=1)
to compute column sums while skipping missing values.Unfortunately, the simple idea I had of just wrapping the user-provided reduction operation in a function which skips missing values does not work, because
_mapreducedim!
callsmapreduce_impl
for efficiency when possible, and it fails when a slice contains only missing values (cf. #27743). Also performance would probably be worse, at least for now.I'm not really happy about the amount of duplication this requires, but that's the only solution I could find. Apart from the boilerplate dispatching methods, we could avoid duplicating
_mapreducedim!
andreducedim_initarray0
but that would introduce several not-so-prettyA isa SkipMissing
checks. So maybe maintaining two very similar sets of functions in parallel isn't so bad. It should be quite easy to apply changes to both, as long as one doesn't forget that two methods exist.Replaces #27818.