Add `@subset` #263

pdeffebach · 2021-06-25T00:58:37Z

This is an initial attempt to add @subset.

I was able to entirely replace @where by passing skipmissing=true as a keyword argument, which is great.

We are in a real bind with regard to keyword arguments. With the move to :block syntax, we can't simply support keyword arguments as in @subset(df, ...; skipmissing = true). So I added the @skipmissing flag and refactored the macro flags a little.

julia> df = DataFrame(a = [1, missing], b = [3, 4]);

julia> @subset df @skipmissing begin 
           :a .== 1
           :b .== 3
       end
1×2 DataFrame
 Row │ a       b     
     │ Int64?  Int64 
─────┼───────────────
   1 │      1      3

But adding tests is still a pain, because the tests contain missing values, so we would have to add @skipmissing everywhere. Is this a time when we should break with the DataFrames API and make skipmissing=true the default? That would make people's lives a bit easier when upgrading from @where to @subset.

@nalimilan This is a design decision, so I would appreciate your input.

pdeffebach · 2021-06-26T19:50:04Z

Okay, I've made skipmissing=true the default for @subset and removed the @skipmissing flag. This seems like the easiest way forward. Unlike DataFrames.jl, we will treat missing as false in @subset.
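To make the "missing as false" behavior concrete, here is a minimal sketch using only DataFrames.jl's `subset` with its `skipmissing` keyword, next to a hand-written mask built with `coalesce`. It illustrates the semantics discussed above and is not the code added in this PR; the variable names `kept`, `mask`, and `manual` are made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, missing], b = [3, 4])

# With skipmissing = true, a row whose condition evaluates to `missing`
# is dropped instead of raising an error.
kept = subset(df, :a => ByRow(==(1)); skipmissing = true)

# The same semantics written by hand: replace `missing` in the mask with `false`.
mask = coalesce.(df.a .== 1, false)
manual = df[mask, :]
```

Both results contain only the row where `a` is 1. With `skipmissing = false` (the DataFrames.jl default), the same `subset` call would throw because the condition contains `missing`.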

Docs and tests are added, and all tests pass. So this could be merged.

nalimilan

Thanks! That makes the API much more consistent with DataFrames.

Regarding skipmissing, I think it would be good to outline a general plan. Should DataFramesMeta automatically skip/propagate missing values everywhere? We discussed adding a keyword argument to do that in DataFrames at JuliaData/DataFrames.jl#2314. It hasn't been implemented at this point, but it would make sense to decide whether we would like to enable it by default eventually in DataFramesMeta.


nalimilan · 2021-06-27T10:09:46Z

src/parsing.jl

-   create_args_vector(arg) -> vec, wrap_byrow
-
-Normalize a single input to a vector of expressions,
-with a `wrap_byrow` flag indicating that the
-expressions should operate by row.
-
-If `arg` is a single `:block`, it is unnested.
-Otherwise, return a single-element array.
-Also removes line numbers.
-
-If `arg` is of the form `@byrow ...`, then
-`wrap_byrow` is returned as `true`.
+   create_args_vector(arg) -> vec, outer_flags


Why remove the contents of the docstring?

I will add a correct docstring.
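For readers following the thread, the behavior described in the removed docstring can be sketched roughly as below. This is a hypothetical re-implementation written from the docstring text alone, using only Base Julia; the name `sketch_args_vector` is invented here, and the real function in src/parsing.jl (which now returns `outer_flags`) may differ.

```julia
# Hypothetical sketch based on the old docstring; not the package's source.
function sketch_args_vector(arg)
    wrap_byrow = false
    # `@byrow ...` arrives as a :macrocall expression whose arguments are
    # the macro name, a LineNumberNode, and the wrapped expression.
    if arg isa Expr && arg.head == :macrocall && arg.args[1] == Symbol("@byrow")
        wrap_byrow = true
        arg = arg.args[end]
    end
    if arg isa Expr && arg.head == :block
        # Unnest the block and drop line-number nodes.
        vec = Any[x for x in arg.args if !(x isa LineNumberNode)]
    else
        # Anything else becomes a single-element vector.
        vec = Any[arg]
    end
    return vec, wrap_byrow
end

# Quoting does not expand the macro, so no `@byrow` definition is needed here.
ex = :(@byrow begin
    :a == 1
    :b == 3
end)
vec, flag = sketch_args_vector(ex)
```

Here `vec` holds the two comparison expressions and `flag` is `true`; a bare expression such as `:(x + 1)` comes back as a one-element vector with the flag set to `false`.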

nalimilan · 2021-06-27T10:10:47Z

test/dataframes.jl

@@ -763,13 +705,13 @@ end
    @test nrow(d) == 1

    d = @where df begin


Move this to deprecated.jl?

nalimilan · 2021-06-27T10:11:52Z

test/subset.jl

@@ -0,0 +1,143 @@
+module TestSubset


Can you add tests with GroupedDataFrame?

Added, ported from the @where tests.


pdeffebach · 2021-06-27T23:43:21Z

w.r.t. missings.

I think that adding transform! with SubDataFrame goes a long way toward emulating Stata's if syntax. But you are right, it doesn't help with missings.

I think something along the lines of this PR in Missings.jl is the solution. Since we are constructing anonymous functions, we can just add a spreadmissing(anon) when we need to. I hope we can make it performant. I don't know if it should be the default, in case people compare the speed to data.table, but I am open to the idea. It may also help people who don't like row-wise operations, since a lot of the benefit of row-wise functions is dealing with missings.
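As a rough illustration of wrapping an anonymous function so that missings are handled for it, here is a simplified element-wise sketch in Base Julia. It is not the `spreadmissing` proposal itself, which targets whole vectors. The helper name `propagate_missing` is hypothetical; Missings.jl's existing `passmissing` behaves similarly for element-wise calls.

```julia
# Hypothetical helper: return `missing` if any argument is missing,
# otherwise call the wrapped function.
propagate_missing(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

# A function with no method for `Missing`; calling it on `missing` would error.
add_one_strict(x::Int) = x + 1

safe = propagate_missing(add_one_strict)
result = map(safe, [1, missing, 3])
```

The wrapped function never sees `missing`, so `result` keeps a `missing` in the second position and applies `add_one_strict` to the other elements.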

That's a long-term strategy. Maybe in the meantime we should just continue to treat missings as false, since that is currently the default behavior with @where.

nalimilan · 2021-06-28T07:17:52Z

test/subset.jl

Also test @subset! with GroupedDataFrame?


pdeffebach · 2021-06-28T22:51:51Z

Okay, ready for merging.

pdeffebach · 2021-06-29T13:25:05Z

Thank you!

pdeffebach added 5 commits June 24, 2021 16:53

4514eb0 inital commit
2e5de90 fix tests
eca50d4 no more skipmissing
2ddee38 tests
c815d1a update index.md

nalimilan reviewed Jun 27, 2021

pdeffebach and others added 5 commits June 27, 2021 15:32

f2dde1e add docstring
7c14a20 Apply suggestions from code review (Co-authored-by: Milan Bouchet-Valat)
f83161f Apply suggestions from code review (Co-authored-by: Milan Bouchet-Valat)
430d398 Merge branch 'add_subset' of https://github.com/pdeffebach/DataFrames…Meta.jl into add_subset
6f956dc Update test/subset.jl (Co-authored-by: Milan Bouchet-Valat)

nalimilan reviewed Jun 28, 2021

pdeffebach and others added 5 commits June 28, 2021 14:36

a75d817 switching
7991b8c @subset! with gd
fa09626 Update src/parsing.jl (Co-authored-by: Milan Bouchet-Valat)
1c485c8 Update src/parsing.jl (Co-authored-by: Milan Bouchet-Valat)
8d55422 Merge branch 'add_subset' of https://github.com/pdeffebach/DataFrames…Meta.jl into add_subset

nalimilan approved these changes Jun 29, 2021

pdeffebach merged commit fea3dee into JuliaData:master Jun 29, 2021

pdeffebach deleted the add_subset branch June 29, 2021 13:25

etpinard mentioned this pull request Jul 23, 2021

Fix @where deprecation warning #271

Merged
