@where can't deal with NAs #58
Comments
So that's not related to DataFramesMeta. I'm afraid this is a regression I introduced when porting DataArrays to Julia 0.5. Please file an issue there so that we don't lose track of it.
So that's actually a design decision in DataArrays. Reopening this issue so that we find a convenient syntax to skip missing values in |
I think the design decision came from a conviction that filtering is precisely where implicitly dropping on null shouldn't happen. Agreed that the mismatch between Julia behavior and the SQL verbs used here is awkward, though, especially with the community's desire to have shared API for functionality common to in-memory tables, SQL, etc. I've been wondering about opting in/out at the code block level for each of "lifting" and treating nulls as |
Or maybe the main filtering function for native tables would have a different name (like EDIT: I suppose the |
I tend to think DataFramesMeta convenience macros should perform both automatic lifting and dropping, as in SQL. This is both practical, efficient (as it translates directly to SQL requests for backing data bases) and easy to document (as SQL is well-known). We can still provide variants with different semantics if needed though, e.g. via |
I think I agree about lifting operations within tables (I don't know enough to have an opinion about it in the language overall, outside tables). But I don't think departing from the arguably safer native behavior around filtering nulls should be a default, even for a tables-specific API (especially one that's not resigned to targeting only SQL query strings).

Where I've seen silently filtering out nulls be a bad thing: first, with teams that are rushed and either don't notice they've introduced nulls or just know the modeling better than the data and underlying processes (i.e. don't know ballpark figures well enough to second-guess how many records were cut, or that any were cut). And second, when the source data changes in unexpected ways after initial development.

Generally, returning a bad result is sometimes worse than indicating a problem and failing to return a result. In those cases, being explicit about dropping nulls can be easier and more efficient than workarounds. Of course things cut both ways: I just want to emphasize that having to be explicit about where to drop nulls can be more ergonomic than working in a system where dropping nulls is the default.

In my mind, an ideal solution would make both filtering behaviors explicit and concise (personally, if anything, I'd want the native, non-SQL behavior to be a bit more ergonomic. But mapping to native SQL, where correct, is something to encourage, too). The two-verbs approach doesn't feel orthogonal to handling null values elsewhere in queries, and having to type a bit more to get that native-Julia-feeling mapping from null to |
On a somewhat related note, it would be helpful to allow @where(results, convert(Vector{Bool}, map(x -> ismatch(r"^Dyestuff:", x), :dsmodel))) to extract the rows of It would be convenient if the |
@dmbates what if you do @where(results, bitbroadcast(x -> ismatch(r"^Dyestuff:", x), :dsmodel)) ? |
@davidagold that works, thanks. When |
Even though I advocated dropping nulls silently above, I now think we should take a different approach. We currently propagate nulls everywhere by default. Unless we decide to skip them silently (which would make sense in some specific contexts), it would be more consistent and safer to throw an error by default in the presence of nulls: the common mistakes made in SQL wouldn't happen with the safe behavior. A possibility is to add an argument to
An argument against this idea is that I couldn't find a language which works that way. For example, both SQL and dplyr drop nulls silently. OTOH the sample size is quite small (AFAICT Pandas doesn't have this problem since it doesn't really have a Another reasonable approach would be to reject nulls in |
I'm good with the third argument (I think I like one of the last two options). I'd prefer to have it optional in that it'll default to erroring just as |
This is an old issue! But still not resolved. I suspect the solution will be found in Missings.jl, see related discussion here.
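To make the options discussed in this thread concrete (error by default, drop silently as SQL does, or keep), here is a minimal sketch in current Julia, where `missing` is available in Base. It assumes Julia 1.x rather than the DataArrays and Julia 0.5 setup of the original report. `select_rows` and its `on_missing` keyword are hypothetical names used for illustration only; they are not part of DataFramesMeta or Missings.jl.

```julia
# Hypothetical helper (not a DataFramesMeta API): apply a three-valued mask
# to a vector under an explicit policy for `missing` entries in the mask.
function select_rows(values::AbstractVector, mask::AbstractVector;
                     on_missing::Symbol = :error)
    length(values) == length(mask) ||
        throw(DimensionMismatch("values and mask differ in length"))
    if on_missing === :error
        # Safe default: refuse to guess what a missing predicate means.
        any(ismissing, mask) &&
            throw(ArgumentError("predicate returned missing; pass on_missing = :drop or :keep"))
        return values[Bool.(mask)]
    elseif on_missing === :drop
        # SQL-like behavior: a missing predicate counts as false.
        return values[coalesce.(mask, false)]
    elseif on_missing === :keep
        # Opposite choice: a missing predicate counts as true.
        return values[coalesce.(mask, true)]
    else
        throw(ArgumentError("unknown policy: $on_missing"))
    end
end

x = [1, 2, 3, 4]
mask = [true, missing, false, true]

dropped = select_rows(x, mask; on_missing = :drop)   # [1, 4]
kept    = select_rows(x, mask; on_missing = :keep)   # [1, 2, 4]
```

With the default `on_missing = :error`, the same call throws an `ArgumentError`, which is the "fail loudly" behavior argued for above; the caller has to opt in to dropping.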
@where(datos, :x_13 .== "#0_PHY")

throws the following error because the column has missing values:

datos[:x_13] .== "#0_PHY"

works and returns a DataArrays.DataArray{Bool,1}, but

datos[datos[:x_13] .== "#0_PHY",:]

throws the same error.
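For readers on current Julia, the same pattern can be sketched with a plain vector standing in for the column. This is an illustration under assumptions, not the DataArrays code path from the report: it assumes Julia 1.x with `missing` in Base, and the sample values other than "#0_PHY" are invented.

```julia
# Stand-in for the column :x_13; values other than "#0_PHY" are made up.
col = ["#0_PHY", missing, "other", "#0_PHY"]

# `==` propagates `missing`, so the broadcast mask is three-valued
# (element type Union{Missing, Bool}) and is not a plain Bool mask.
mask = col .== "#0_PHY"

# Two ways to get a Bool mask that treats a missing value as "no match":
strict = coalesce.(mask, false)       # replace missing with false
via_eq = isequal.(col, "#0_PHY")      # isequal never returns missing

matches = col[strict]                 # the two "#0_PHY" entries
```

Both `strict` and `via_eq` make the decision to drop missing rows visible at the call site, which is the explicitness several comments in this thread ask for.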