@where can't deal with NAs #58

diegozea · 2016-07-20T17:26:35Z

@where(datos, :x_13 .== "#0_PHY") throws the following error because the column has missing values:

LoadError: DataArrays.NAException("cannot index an array with a DataArray containing NA values")
while loading In[8], in expression starting on line 1

 in to_index at /home/dzea/.julia/v0.4/DataArrays/src/indexing.jl:76
 in getindex at /home/dzea/.julia/v0.4/DataArrays/src/indexing.jl:173
 in getindex at /home/dzea/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:281
 in where at /home/dzea/.julia/v0.4/DataFramesMeta/src/DataFramesMeta.jl:164

datos[:x_13] .== "#0_PHY" works and returns a DataArrays.DataArray{Bool,1}, but datos[datos[:x_13] .== "#0_PHY",:] throws the same error.

The text was updated successfully, but these errors were encountered:

nalimilan · 2016-07-20T21:45:49Z

datos[datos[:x_13] .== "#0_PHY",:] throws the same error.

So that's not related to DataFramesMeta. I'm afraid this is a regression I introduced when porting DataArrays to Julia 0.5. Please file an issue there so that we don't lose track of it.

nalimilan · 2016-07-23T10:24:35Z

So that's actually a design decision in DataArrays. Reopening this issue so that we find a convenient syntax to skip missing values in @where. It could even be made the default only in that case.

garborg · 2016-07-23T13:42:54Z

I think the design decision came from a conviction that filtering is precisely where implicitly dropping on null shouldn't happen. Agreed that the mismatch between Julia behavior and the SQL verbs used here is awkward, though, especially with the community's desire to have shared API for functionality common to in-memory tables, SQL, etc.

I've been wondering about opting in/out at the code block level for each of "lifting" and treating nulls as false. That would be one way to handle both paradigms, other than wrapping filtering expressions with some sort of 'ifnull' call.

garborg · 2016-07-23T13:47:08Z

Or maybe the main filtering function for native tables would have a different name (like filter), and where would be (implemented on top, or separately) in the shared API, and behave like you suggested already (like in SQL)?

EDIT: I suppose the filter behavior could be in the shared API, as well -- not entirely sure about the performance of a SQL implementation, though, or whether we should worry about that (as where would be available).

nalimilan · 2016-07-25T12:49:52Z

I tend to think DataFramesMeta convenience macros should perform both automatic lifting and dropping, as in SQL. This is both practical, efficient (as it translates directly to SQL requests for backing data bases) and easy to document (as SQL is well-known).

We can still provide variants with different semantics if needed though, e.g. via filter or a keyword argument.

garborg · 2016-07-26T00:00:22Z

I think I agree about lifting operations within tables (I don't know enough to have an opinion about in the language overall, outside tables).

But I don't think departing from the arguably safer native behavior around filtering nulls should be a default even for a tables-specific API (especially one that's not resigned to targeting only SQL query strings).

Where I've seen filtering out nulls silently being a bad thing: First with teams that are rushed and either don't notice they've introduced nulls or just know the modeling better than the data and underlying processes (i.e. don't know ballpark figures well enough to second guess how many records were cut, or that any were cut). And second, when the source data changes in unexpected ways after initial development.

Generally ,sometimes returning a bad result is worse than indicating a problem and failing to return a result. In those cases, being explicit about dropping nulls can be easier and more efficient than workarounds. Of course things cut both ways -- I just want to emphasize that having to be explicit about where to drop nulls can be more ergonomic than working in a system where dropping null is a default.

In my mind, an ideal solution would make both filtering behaviors explicit and concise (personally, if anything, I'd want the native, non-sql behavior a bit more ergonomic. But mapping to native SQL, where correct, is something to encourage, too).

The two-verbs approach doesn't feel orthogonal to handling null values elsewhere in queries, and having to type a bit more to get that native-Julia-feeling mapping from null to false sounds find personally (if there was developer manpower to translate it to efficient SQL, etc., under the hood), but two verbs seems to strike a decent balance, if it's something like filter being a native verb and where matching SQL.

dmbates · 2016-07-26T18:57:43Z

On a somewhat related note, it would be helpful to allow DataArrays.DataArray{Bool, 1} in the second and subsequent arguments. At least I found that I needed to write an expression like

@where(results, convert(Vector{Bool}, map(x -> ismatch(r"^Dyestuff:", x), :dsmodel)))

to extract the rows of results in which the dsmodel value started with Dyestuff:

It would be convenient if the convert(Vector{Bool} could be avoided.

davidagold · 2016-07-26T19:19:12Z

@dmbates what if you do

@where(results, bitbroadcast(x -> ismatch(r"^Dyestuff:", x), :dsmodel))

?

dmbates · 2016-07-26T21:02:09Z

@davidagold that works, thanks. When CategoricalArrays and NullableArrays are used in DataFrames I think some of these problems will go away. One of the characteristics of the DataArrays package is that if you start with a DataArray or PooledDataArray most operations end up producing another DataArray of some sort.

nalimilan · 2017-10-07T14:54:29Z

Even though I advocated dropping nulls silently above, I now think we should take a different approach. Indeed, we currently propagate nulls everywhere by default. Unless we decide to skip them silently (which would make sense in some specific contexts), it would be more consistent and safer to throw an error by default in the presence of nulls. Indeed the common mistakes done in SQL wouldn't happen with the safe behavior.

A possibility is to add an argument to where/filter which would skip missing values. This would have to be very easy to set since it will be very common, but it will make it more explicit (than e.g. in SQL) that null values are discarded and that this isn't a completely neutral operation. Here's a list of ideas:

where(df, cond, null=false/true): the value of the argument is used as a replacement when null is found.
where(df, cond, skipnull=true/false): simpler, whether to skip nulls or not; could have the advantage of reusing a vocabulary which is familiar (from Nulls.skip and probably similar skipnull arguments to various functions).
where(df, cond, true/false): less explicit, but shorter to type, which is important since it's a very common need.

An argument against this idea is that I couldn't find a language which works that way. For example, both SQL and dplyr drop nulls silently. OTOH the sample size is quite small (AFAICT Pandas doesn't have this problem since it doesn't really have a NA type and uses NaN instead).

Another reasonable approach would be to reject nulls in filter, but drop them silently with where, which would match the SQL operator of the same name. I'm not sure that's a good idea since people will end up using where all the time, defeating the main purpose of this difference (safety).

tshort · 2017-10-07T15:30:05Z

I'm good with the third argument (I think I like one of the last two options). I'd prefer to have it optional in that it'll default to erroring just as datos[datos[:x_13] .== "#0_PHY",:] would.

pdeffebach · 2021-03-07T20:34:42Z

This is an old issue! But still not resolved.

I suspect the solution will be found in Missings.jl, see related discussion here.

nalimilan closed this as completed Jul 20, 2016

diegozea mentioned this issue Jul 20, 2016

Indexing with missing values JuliaStats/DataArrays.jl#207

Closed

nalimilan reopened this Jul 23, 2016

davidagold mentioned this issue Jul 23, 2016

References davidagold/StructuredQueries.jl#4

Open

garborg mentioned this issue Aug 20, 2016

Allow operations mixing Nullable and scalar JuliaStats/NullableArrays.jl#85

Open

ti-s mentioned this issue Oct 28, 2016

Should comparison operators return Bool values? queryverse/DataValues.jl#7

Open

nalimilan mentioned this issue Sep 27, 2017

Change comparison operators to return null when one of the operands is null JuliaData/Missings.jl#33

Merged

pdeffebach added this to the 1.X milestone Mar 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

@where can't deal with NAs #58

@where can't deal with NAs #58

diegozea commented Jul 20, 2016 •

edited

Loading

nalimilan commented Jul 20, 2016

nalimilan commented Jul 23, 2016

garborg commented Jul 23, 2016

garborg commented Jul 23, 2016 •

edited

Loading

nalimilan commented Jul 25, 2016

garborg commented Jul 26, 2016 •

edited

Loading

dmbates commented Jul 26, 2016

davidagold commented Jul 26, 2016

dmbates commented Jul 26, 2016

nalimilan commented Oct 7, 2017

tshort commented Oct 7, 2017

pdeffebach commented Mar 7, 2021

@where can't deal with NAs #58

@where can't deal with NAs #58

Comments

diegozea commented Jul 20, 2016 • edited Loading

nalimilan commented Jul 20, 2016

nalimilan commented Jul 23, 2016

garborg commented Jul 23, 2016

garborg commented Jul 23, 2016 • edited Loading

nalimilan commented Jul 25, 2016

garborg commented Jul 26, 2016 • edited Loading

dmbates commented Jul 26, 2016

davidagold commented Jul 26, 2016

dmbates commented Jul 26, 2016

nalimilan commented Oct 7, 2017

tshort commented Oct 7, 2017

pdeffebach commented Mar 7, 2021

diegozea commented Jul 20, 2016 •

edited

Loading

garborg commented Jul 23, 2016 •

edited

Loading

garborg commented Jul 26, 2016 •

edited

Loading