Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add @subset #263

Merged
merged 15 commits into from
Jun 29, 2021
Merged

Add @subset #263

merged 15 commits into from
Jun 29, 2021

Conversation

pdeffebach
Copy link
Collaborator

@pdeffebach pdeffebach commented Jun 25, 2021

This is an initial attempt to add @subset.

I was able to entirely replace @where by calling skipmissing=true as a keyword argument, which is great.

We are in a real bind with regards to keyword arguments. With the move to using :block we can't just support keyword argument handling like @subset(df, ...; skipmissing = true). So I added the flag @skipmissing and re-factored the macro-flags a little bit.

julia> df = DataFrame(a = [1, missing], b = [3, 4]);

julia> @subset df @skipmissing begin 
           :a .== 1
           :b .== 3
       end
1×2 DataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      1      3

But adding tests is still a pain, because the tests have missing so we would have to add @skipmissing everywhere. Is this a time when we should break with the DataFrames API and make skipmissing=true the default? That would make people's lives a bit easier when upgrading from @where to @subset.

@nalimilan This is a design decision, so I would appreciate your input.

@pdeffebach
Copy link
Collaborator Author

Okay, I've made skipmissing=true the default with subset and removed the @skipmissing flag. This seems like the easiest way forward. Unlike DataFrames.jl, we will treat missing as false in @subset.

All docs are added and tests added and pass. So this could be merged.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That makes the API much more consistent with DataFrames.

Regarding skipmissing, I think it would be good to outline a general plan. Should DataFramesMeta automatically skip/propagate missing values everywhere? We discussed adding a keyword argument to do that in DataFrames at JuliaData/DataFrames.jl#2314. It hasn't been implemented at this point, but it would make sense to decide whether we would like to enable it by default eventually in DataFramesMeta.

docs/src/index.md Outdated Show resolved Hide resolved
src/macros.jl Outdated Show resolved Hide resolved
src/macros.jl Outdated Show resolved Hide resolved
src/parsing.jl Outdated Show resolved Hide resolved
src/parsing.jl Outdated
Comment on lines 347 to 348
create_args_vector(arg) -> vec, wrap_byrow

Normalize a single input to a vector of expressions,
with a `wrap_byrow` flag indicating that the
expressions should operate by row.

If `arg` is a single `:block`, it is unnested.
Otherwise, return a single-element array.
Also removes line numbers.

If `arg` is of the form `@byrow ...`, then
`wrap_byrow` is returned as `true`.
create_args_vector(arg) -> vec, outer_flags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the contents of the docstring?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a correct docstring.

@@ -763,13 +705,13 @@ end
@test nrow(d) == 1

d = @where df begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to deprecated.jl?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved.

@@ -0,0 +1,143 @@
module TestSubset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests with GroupedDataFrame?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added, ported from @where.

test/subset.jl Outdated Show resolved Hide resolved
@pdeffebach
Copy link
Collaborator Author

w.r.t. missings.

I think that adding transform! with SubDataFrame goes along way to emulating Stata's if syntax. But you are right it doesn't help with missings.

I think something along the lines of This PR in Missings.jl is the solution. Since we are constructing anonymous functions we can just add a spreadmissing(anon) when we need to. I hope we can make it performant. I don't know if it should be default in case people compare the speed to data.table, but I am open to the idea. It may also help people who don't like row-wise since a lot of the benefit of row-wise functions is dealing with missings.

That's a long term strategy. Maybe in the meantime we should just continue to treat missings as false since it's the default behavior with @where currently.

src/parsing.jl Outdated Show resolved Hide resolved
src/parsing.jl Outdated Show resolved Hide resolved
@testset "@subset with a grouped data frame" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test @subset! with GroupedDataFrame?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

@pdeffebach
Copy link
Collaborator Author

Okay ready for merging.

@pdeffebach
Copy link
Collaborator Author

Thank you!

@pdeffebach pdeffebach merged commit fea3dee into JuliaData:master Jun 29, 2021
@pdeffebach pdeffebach deleted the add_subset branch June 29, 2021 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants