Add a skipmissing kwarg to select/transform/combine #2258

matthieugomez · 2020-05-15T13:00:26Z

I was thinking that it would be nice to have a skipmissing keyword argument in transform/select/combine

The idea is that it would only use rows where none of the observations for the variables used in the call are missing. (It is possible because user must already specify which variables are used in a call to transform/select/combine).

That would make it easier to skip missing without, and, in particular, to handle the case of several variables (or weights).

transform(df, [:x, :y] => (x,y) -> corr(x,y), skipmissing = true)

bkamins · 2020-05-15T13:14:10Z

This would essentially be a shorthand for transform!(dropmissing(df, cols), ...).

The benefit would be that you would have automatically resolved cols argument to dropmissing.

The problem is that it would break the invariant of transform that we do not change the number of rows in the returned value.

Given the functionality is easy enough to be obtained via dropmissing I am not 100% sure if it is worth to add it. But let us wait for other users to comment on this.

matthieugomez · 2020-05-15T13:34:15Z

In my mind it would not change the number of rows. It would just compute the function using rows where none of the cols is missing (and return a missing value for rows where one of the cols is missing).

pdeffebach · 2020-05-15T16:49:02Z

Something like

function wrapper(fun, x, y)
    sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
    sub_out = fun(sx, sy)
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
    full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases

    return full_out
end

Note that currently corr(sx, sy) doesn't work... see here.

nalimilan · 2020-05-18T10:11:54Z

Yes I guess this would be quite convenient. In practice most people who work with data would almost never have to use the skipmissing function directly.

But to be efficient, the implementation cannot always collect the result. The only way to be really fast in all cases is to wrap the columns in skipmissing or skipmissings (depending on the number of columns). So for cor to work we still need something like https://github.com/JuliaLang/Statistics.jl/pull/34 (which collects the result because it's relatively fast, but that's not the case for all functions).

matthieugomez · 2020-05-18T13:48:23Z

Alternatively, it could apply the function on a view of columns no?

nalimilan · 2020-05-18T15:01:07Z

No, since you would need to go over all values first just to find out whether they are missing or not. That's quite costly if all you want to do is e.g. compute the mean. That's why we have skipmissing/skipmissings: it's basically a lazy view.

pdeffebach · 2020-05-18T16:08:24Z

I just looked over the select and transform code. Thanks to @bkamins's hard work, this would actually not be too hard to implement. It would require more if statements for sure, but it's not crazy.

nalimilan · 2020-07-07T20:01:54Z

Above we took the case of a function returning a scalar, in which case the value can just be broadcasted to match the expected number of rows. But we haven't discussed the situation where you would pass a function returning the same number of rows as the input, like + or ByRow(f). In that case, @matthieugomez noted that we should avoid calling the function on rows with missing values, and use missing for them. This is equivalent to passmissing.

Implementing this is easy in the ByRow case, as we can just wrap the function in passmissing. But it's more tricky for the general case: if we wrap the columns in skipmissings as I suggested above, the function will can then return either a scalar (as in the case discussed above) or a vector with as many elements as there are non-missing values in the input. In the latter case, we have to copy values to a new vector with one entry per row, inserting missing values in the right places. This is trivial to do e.g. using a SubArray view of non-missing rows. But that requires computing the indices of these rows, which skipmissings doesn't do (for performance reasons).

So we would have to lose a bit of performance to be able to support this. Maybe it's not the end of the world given that it would make things simpler: we would be able to pass the user-provided function a SubArray of listwise complete observations instead of a SkipMissings object, so everything would work like for plain Vector (e.g. +). People who use functions that return a scalar (which isn't the most common case for select and transform) and want something faster can use skipmissings manually.

Finally, note that this approach won't work for combine, as we don't know the expected number of rows. There, skipmissing=true would just skip rows with missing, without filling the corresponding entries with missing. That's probably OK but it's worth highlighting. To stress the difference, we could call the argument to select/transform passmissing instead of skipmissing, though that would make things harder to grasp for users.

bkamins · 2020-08-06T17:20:30Z

I answer to #2314 (comment) here, per suggestion of @nalimilan.

With select, skipmissing would apply the functions on row for which none of the argument is missing

This is not what select does by default. It does not apply functions to rows. It passes the whole vectors to the functions. So I am not clear what would be passed to such functions.

With combine, skipmissing would apply the functions on rows for which none of the argument is missing.

Again combine by default passes whole columns. Now what you propose is problematic in e.g. the following case:

julia> df = DataFrame(x=[1,missing,2], y=1:3)
3×2 DataFrame
│ Row │ x       │ y     │
│     │ Int64?  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ 1       │ 1     │
│ 2   │ missing │ 2     │
│ 3   │ 2       │ 3     │

julia> combine(df, :x => identity, :y => identity)
3×2 DataFrame
│ Row │ x_identity │ y_identity │
│     │ Int64?     │ Int64      │
├─────┼────────────┼────────────┤
│ 1   │ 1          │ 1          │
│ 2   │ missing    │ 2          │
│ 3   │ 2          │ 3          │

as with skipmissing it would be not clear how many rows should be retained in such a function.

By the way, I am not sure why combine(df, :x) or combine(df, :x => ByRow(.))

Well they are accepted because they follow standard rules:

combine(df, :x) gets expanded to combine(df, :x => identity => :x) and all is consistent what happens
combine(df, :x => ByRow(f)) is just a shorthand for combine(df, :x => x -> f.(x)) and again all is consistent

(so in short - please remember that we ALWAYS transform any expression to a canonical form src => fun => dst before evaluation, the shorthand forms are helpers that just make things easier to write - but every such expression has a well defined interpretation using src => fun => dst minilanguage)

matthieugomez · 2020-08-06T17:31:51Z

I understood your question as asking the desired behavior, not how to implement it with the current implementation of these functions in DataFrames. As for `combine(df, :x => identity, :y => identity, skipmissing = true)`, it would report "ERROR: ArgumentError: New columns must have the same length as old columns". Exactly the same as what happens when doing `combine(df, :x => collect ∘ skipmissing, :y => collect ∘ skipmissing)`.

bkamins · 2020-08-06T17:55:23Z

Then for combine this can be the rule you propose and it is consistent. I understand that for ByRow you want combine to behave like select described below (i.e. if row contains missing store missing in the result).

However, for select I would need to understand more exactly what you want. So you are saying the following for e.g. select(df, :x => fun => :y, skipmissing=true) should happen (not that it has to be implemented this way but to make it precise what should happen):

allocate a new column y initially holding missing
find all indices of non-missing rows in :x, call it idxs
pass df.x[idxs] to fun and collect the result res
perform pseudo-broadcasting of res into view(y, idxs) (following the standard rules of pseudo-broadcasting in DataFrames.jl)
store y in the variable :y in the target data frame

Did I understand correctly what you would like to have?

matthieugomez · 2020-08-06T18:04:12Z

Then for combine this can be the rule you propose and it is consistent. I understand that for ByRow you want combine to behave like select described below (i.e. if row contains missing store missing in the result).

No, I think combine should just return the result on the function applied on collect(skipmissing(x)). I don't really think that anyone would use combine with ByRow anyway, so it does not really matter. But the rule is just more consistent this way.

However, for select I would need to understand more exactly what you want. So you are saying the following for e.g. select(df, :x => fun => :y, skipmissing=true) should happen (not that it has to be implemented this way but to make it precise what should happen):

allocate a new column y initially holding missing

find all indices of non-missing rows in :x, call it idxs

pass df.x[idxs] to fun and collect the result res

perform pseudo-broadcasting of res into view(y, idxs) (following the standard rules of pseudo-broadcasting in DataFrames.jl)

store y in the variable :y in the target data frame

Yes exactly.

bkamins · 2020-08-06T18:33:18Z

No, I think combine should just return the result on the function applied on collect(skipmissing(x)).

Ah - OK, so then combine and select would be significantly different indeed.

I don't really think that anyone would use combine with ByRow anyway

Actually if we define the rule you propose it will make sense 😄 - depending on what you want in the output you will use combine or select.

pdeffebach · 2020-08-06T22:06:24Z

Ah - OK, so then combine and select would be significantly different indeed.

I think collect(skipmissing(x)) was meant to be an approximation. The skipmissing part of both functions should be the same. But I agree that we need to be stricter about the length of the returned columns in combine because we don't have a reference length we know everything should be and fill in missing accordingly.

However see #2211. Are we over-thinking this? Perhaps we can just operate on a SubDataFrame and overload those functions instead of having a skipmissing keyword argument.

bkamins · 2020-08-06T22:55:35Z

I think the difference is relevant with piping. When we make filter produce views and then allow adding columns to SubDataFrame that is not subsetted on columns you will be able to write something like:

@pipe df |> filter(:x => !ismissing, _, view=true) |> transform(_, :x => abs) |> parent |> ... some more piping ...

and the issue is that you have to use parent to unwrap the filter you have applied.

Having skipmissing argument for e.g. transform would not require this wrapping/unwrappng, and you would just write:

@pipe df |> |> transform(_, :x => abs, skipmissing=true) |> ... some more piping ...

so I would be ok with adding this skipmissing kwarg.

But for me (as you probably know 😄) the crucial thing is to precisely understand the desired functionality and the dimensions here are:

filter (this is a separate case for sure - that is why I opened a separate issue for it)
combine vs select/transform combined with "whole column" vs ByRow transformation (and here we have 4 possibilities and it would be great if we all had a clear vision and consensus how each of them should work)

nalimilan · 2020-08-07T08:46:57Z

I agree with @matthieugomez 's interpretation. select and transform return as many rows as the input so it's natural that we require the user function to return as many (non-missing rows) as we passed to it, and fill others with missing. OTC combine can return anything, so we should just pass the user function a view of complete rows and take whatever it returns. ByRow shouldn't get any special treatment, it's just the application of the general rules to a broadcasted function.

pdeffebach · 2020-08-07T13:08:58Z

Having skipmissing argument for e.g. transform would not require this wrapping/unwrappng, and you would just write:
@pipe df |> |> transform(_, :x => abs, skipmissing=true) |> ... some more piping ...

Sorry to have this discussion across 2 issues, but I think it would be good to "un-filter" automatically in a transform call since that is what we do with a grouped data frame.

What I think you are saying, though, is that this would require a special type because it is a slightly different logic than what one would expect from a normal SubDataFrame

bkamins · 2020-08-07T14:50:12Z

What I think you are saying, though, is that this would require a special type because it is a slightly different logic than what one would expect from a normal SubDataFrame.

Yes

jtrakk · 2021-09-12T20:43:39Z

SkipMissing([:a,:b]) => or => SkipMissing(corr) => would be more flexible than skipmissing=true because it can be applied to individual operations rather than the whole transform call.

bkamins added decision feature labels May 15, 2020

matthieugomez changed the title ~~Add a skipmissing kwarg~~ Add a skipmissing kwarg to select/transform/combine May 18, 2020

nalimilan mentioned this issue Jul 7, 2020

Skipping missing values more easily #2314

Open

pdeffebach mentioned this issue Aug 6, 2020

Creating new columns on a view should fill in missings everywhere else. #2211

Closed

bkamins mentioned this issue Aug 8, 2020

Improwe workflows with filtered DataFrame #2354

Open

nalimilan mentioned this issue Oct 15, 2020

Add a function for skippmissing and then "spread"ing the result JuliaData/Missings.jl#121

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a skipmissing kwarg to select/transform/combine #2258

Add a skipmissing kwarg to select/transform/combine #2258

matthieugomez commented May 15, 2020 •

edited

Loading

bkamins commented May 15, 2020

matthieugomez commented May 15, 2020 •

edited

Loading

pdeffebach commented May 15, 2020 •

edited

Loading

nalimilan commented May 18, 2020

matthieugomez commented May 18, 2020 •

edited

Loading

nalimilan commented May 18, 2020

pdeffebach commented May 18, 2020

nalimilan commented Jul 7, 2020

bkamins commented Aug 6, 2020

matthieugomez commented Aug 6, 2020 via email •

edited by nalimilan

Loading

bkamins commented Aug 6, 2020 •

edited

Loading

matthieugomez commented Aug 6, 2020 •

edited

Loading

bkamins commented Aug 6, 2020

pdeffebach commented Aug 6, 2020

bkamins commented Aug 6, 2020

nalimilan commented Aug 7, 2020 •

edited

Loading

pdeffebach commented Aug 7, 2020

bkamins commented Aug 7, 2020

jtrakk commented Sep 12, 2021

Add a skipmissing kwarg to select/transform/combine #2258

Add a skipmissing kwarg to select/transform/combine #2258

Comments

matthieugomez commented May 15, 2020 • edited Loading

bkamins commented May 15, 2020

matthieugomez commented May 15, 2020 • edited Loading

pdeffebach commented May 15, 2020 • edited Loading

nalimilan commented May 18, 2020

matthieugomez commented May 18, 2020 • edited Loading

nalimilan commented May 18, 2020

pdeffebach commented May 18, 2020

nalimilan commented Jul 7, 2020

bkamins commented Aug 6, 2020

matthieugomez commented Aug 6, 2020 via email • edited by nalimilan Loading

bkamins commented Aug 6, 2020 • edited Loading

matthieugomez commented Aug 6, 2020 • edited Loading

bkamins commented Aug 6, 2020

pdeffebach commented Aug 6, 2020

bkamins commented Aug 6, 2020

nalimilan commented Aug 7, 2020 • edited Loading

pdeffebach commented Aug 7, 2020

bkamins commented Aug 7, 2020

jtrakk commented Sep 12, 2021

matthieugomez commented May 15, 2020 •

edited

Loading

matthieugomez commented May 15, 2020 •

edited

Loading

pdeffebach commented May 15, 2020 •

edited

Loading

matthieugomez commented May 18, 2020 •

edited

Loading

matthieugomez commented Aug 6, 2020 via email •

edited by nalimilan

Loading

bkamins commented Aug 6, 2020 •

edited

Loading

matthieugomez commented Aug 6, 2020 •

edited

Loading

nalimilan commented Aug 7, 2020 •

edited

Loading