DataFrameMacros.jl and DataFramesMeta.jl #2793

Closed
bkamins opened this issue Jun 20, 2021 · 25 comments

@bkamins
Member

bkamins commented Jun 20, 2021

@jkrumbiegel and @pdeffebach - I know you are in contact and I really believe that having both https://github.com/jkrumbiegel/DataFrameMacros.jl and https://github.com/JuliaData/DataFramesMeta.jl is a good thing at this stage of ecosystem maturity that can help us all to converge to the most user-friendly designs (even maybe in some scenarios one or the other can be preferred?).

What I feel would be really great is that we could jointly come up with some guidance for the users comparing both and putting them in the documentation of DataFrames.jl in https://dataframes.juliadata.org/latest/man/querying_frameworks/ section (ideally before JuliaCon2021).

I know you both are really committed to supporting JuliaData so I hope this is something that is doable. My mindset is the following:

  • DataFrames.jl syntax is clearly not very user friendly (unfortunately given the design decisions we made it has to be)
  • therefore we need the Meta-package that newcomers to DataFrames.jl can use without having to go through a steep learning curve of DataFrames.jl;
  • having said that - I am afraid that we should give them some guidance of DataFrameMacros.jl vs DataFramesMeta.jl as otherwise they might be confused (I do not say we should recommend one or the other - probably we should point out the differences and help them choose)

I would be obliged if you commented what you think here. Thank you!

@bkamins bkamins added this to the patch milestone Jun 20, 2021
@bkamins bkamins added the doc label Jun 20, 2021
@jkrumbiegel
Contributor

I would currently characterize the difference such that DataFramesMeta is closer to how DataFrames is built. I've made DataFrameMacros so that it's most convenient for me personally, and I don't shy away from shorthands such as @r, @c and @t, but I assume this could overlap with the requirements of a lot of people that don't do the most demanding analyses but who want simple and clear syntax in the most common cases.

Choosing row-based transforms etc. by default brings the syntax a bit closer to what people might be used to from R, as their vectorized primitives always look "scalar-like". I think because we do have fast loops it's actually a boon that we can write row-based logic which is still fast. I especially enjoy that when working with strings etc. I also like the @m flag which makes it easier to write if anything missing then missing else ... kind of logic.

The differences could change a bit in the future, as Peter said that DataFramesMeta plans to make some changes that could bring both packages closer together.

@pdeffebach
Contributor

pdeffebach commented Jun 20, 2021

In terms of concrete changes that should be happening soon

  • :x instead of x.
  • A flag for treating missing as false, preferably recursively, see here
  • A flag for passmissing, like @m
  • An @astable flag, like @t, see an early attempt here
  • $ instead of cols. It seems very feasible, I've explored some niche effects to consider here.

These are on the list, and I understand you have been frustrated with the slow pace of changes. Hopefully I can pick up the pace a bit to get these things merged in.

The main sticking point is row-wise by default. Would you be okay with @rtransform etc? Or do you think that as long as @transform is column-wise by default in DataFramesMeta.jl, you will always want to maintain DataFrameMacros.jl?

EDIT: One other thing. Currently, DataFramesMeta.jl is very hard to make changes to because it was written without MacroTools. Do you think if we re-write some parts of the package using MacroTools.jl, it would lower the barrier of entry to contributing? Then it might be easier for you to help implement the changes you want.

@bkamins
Member Author

bkamins commented Jun 20, 2021

Thank you both for commenting. If I understand things correctly @pdeffebach is planning to get at almost the same design as DataFrameMacros.jl in the long term (except for some details in syntax). Is this correct?

If this is so then the question to @jkrumbiegel is - how would you prefer both packages to be promoted in DataFrames.jl documentation (or maybe you see DataFrameMacros.jl as your personal project for exploring things and of course anyone is welcome to use it, but at the same time you could support development of DataFramesMeta.jl to reach the state of the same usability as DataFrameMacros.jl?).


My personal opinion on what @pdeffebach plans:

:x instead of x.

agreed

A flag for treating missing as false, preferably recursively

agreed

A flag for passmissing

agreed

An @astable flag, like @t

agreed

$ instead of cols.

agreed


A side question to you both. If I want a target column name as e.g. "sum x" (with space), how would that be expressed in both packages in their target design? Thank you!

@jkrumbiegel
Contributor

jkrumbiegel commented Jun 20, 2021

or maybe you see DataFrameMacros.jl as your personal project for exploring things and of course anyone is welcome to use it, but at the same time you could support development of DataFramesMeta.jl to reach the state of the same usability as DataFrameMacros.jl?

Yeah I support that statement, if DataFramesMeta ends up so similar to what I have now that there's almost no point in having two packages, that's fine by me. I can of course always use what I want, which is still nice to have. I often tend towards really short and sweet solutions, and I understand that others don't want to go that way and be more verbose, yet maybe less ambiguous.

I think that the backend design of my package isn't actually too bad and that it supports some things that DataFramesMeta didn't, so if those features are copied, also good.

About the side question, DataFrameMacros supports anything left of the = sign, this is just copied as-is to the target field. So to use a string, just do:

@transform(df, "new column" = :a * :b)
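For comparison, here is a minimal sketch of how the same target name could be written in plain DataFrames.jl. The correspondence to the macro call above is an assumption made for illustration (it is not taken from either package's documentation), and the data frame is a toy example.

```julia
using DataFrames

# Toy data, made up for this illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# ByRow(*) multiplies :a and :b element by element; the string on the
# right is used verbatim as the target column name, spaces included.
out = transform(df, [:a, :b] => ByRow(*) => "new column")

# The new column is addressed by the same string.
out[!, "new column"]
```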

@bkamins
Member Author

bkamins commented Jun 20, 2021

This is fantastic - I really believe that with this approach we can get very far.

I understand that others don't want to go that way and be more verbose, yet maybe less ambiguous.

Indeed in DataFramesMeta.jl I think that we would probably prefer something that is easy to understand (e.g. when selecting names) so that newcomers can easily pick it up, even at the expense of some verbosity (assuming that it is still terse).

so if those features are copied, also good.

I hope that @pdeffebach would be open to both using the API ideas as well as internal design (original DataFramesMeta.jl was written very long ago so redesign of internals is probably needed anyway).

@transform(df, "new column" = :a * :b)

👍 This is fantastic

@vjd

vjd commented Jun 21, 2021

Thank you all for discussing this. Personally, having now tried both of these packages in addition to using DataFrames.jl as is, I feel the DataFrameMacros.jl kind of approach would definitely lower the barrier to entry, as almost everyone migrating to Julia from other ecosystems would want the row-wise thinking. Having said that, I am in complete support of merging the two solutions if feasible, either with a possible rewrite of DataFramesMeta, or by using DataFrameMacros as the starting point and combining it with Peter's ideas, especially if a rewrite of the internals is warranted based on Bogumil's idea. I am happy to contribute in any way I can!

@bkamins bkamins modified the milestones: patch, 1.x Jun 23, 2021
@matthieugomez
Contributor

matthieugomez commented Jul 7, 2021

DataFrameMacros.jl looks really nice (I like the row-wise default as well as the short macro flags). Since it looks like a complete rewrite also allowed @jkrumbiegel to get a much cleaner code, why not use it as a starting point instead? Sorry if I am missing something — but what is the point of doing a series of breaking changes in DataFramesMeta.jl just to end up in the same place, with a more complicated codebase?

@pdeffebach
Contributor

pdeffebach commented Jul 7, 2021

Maybe I am missing something but what is the point of doing a series of breaking changes in DataFramesMeta.jl just to end up with the same place, with a more complicated code?

I think that's exactly the point. We want to keep the user-base of DataFramesMeta happy, meaning we need to have a deprecation period for changes. This means we have to keep the old (complicated) code on hand for a while.

But yeah, we recently added MacroTools as a dependency to DataFramesMeta, so once we deprecate the old stuff a re-write that uses MacroTools.postwalk is in order, which will certainly borrow from DataFrameMacros.

EDIT: DataFramesMeta also has a ton of documentation, which is more work than the code-base itself (see the recent :x on LHS PR). Updating that iteratively seems like an easier task perhaps.

@bkamins
Member Author

bkamins commented Jul 7, 2021

@pdeffebach - based on user feedback. The feature

the row-wise default

is critical. Therefore in target design of DataFramesMeta.jl we should provide this option also (I know it was discussed but I want to stress that this is the key thing based on the feedback).

@matthieugomez The challenging part of DataFrameMacros.jl is:

All macros except @combine work row-wise by default.

i.e. it is not always row-wise and the same expression passed to @combine and @select will behave differently (which seems innocent, but I fear that in the long run it might confuse users - I know the documentation of DataFrameMacros.jl is very clear about this fact, but still I feel a bit uneasy here).
I would prefer in DataFramesMeta.jl not to have such inconsistency if possible (e.g. by using some consistent macro naming scheme).
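To make the concern concrete, here is a small sketch in plain DataFrames.jl (not in either macro package) of how one and the same function, sum, gives three different results depending on whether it is applied to the whole column in combine, to the whole column in select, or row by row. The toy data and the column name :s are made up for illustration; the row-wise line only mimics the row-wise default described above and is not the macro's actual expansion.

```julia
using DataFrames

# Toy data, made up for this illustration.
df = DataFrame(a = [1, 2, 3])

# Column-wise in combine: `sum` sees the whole column, the result has one row.
total = combine(df, :a => sum => :s)

# Column-wise in select: `sum` still sees the whole column, and the scalar
# result is repeated so that the number of rows is preserved.
repeated = select(df, :a => sum => :s)

# Row-wise in select: `sum` is applied to each element separately,
# so every value is returned unchanged.
rowwise = select(df, :a => ByRow(sum) => :s)
```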

@pdeffebach
Contributor

Putting aside the row-wise discussion. Just want to note that I just merged :y = f(:x), so we are making pretty rapid progress towards other desired API changes.

@vjd

vjd commented Jul 8, 2021

it is not always row-wise and the same expression passed to @combine and @select will behave differently (which seems innocent, but I fear that in the long run it might confuse users)

Isn't this the default, and also how dplyr works, in that only for summarize and select are the operations column-wise? I am not sure why it would be confusing.

Therefore in target design of DataFramesMeta.jl we should provide this option also (I know it was discussed but I want to stress that this is the key thing based on the feedback).

Absolutely agree.

but what is the point of doing a series of breaking changes in DataFramesMeta.jl just to end up in the same place, with a more complicated codebase?

I sort of agree with this comment. Just trying to understand. If it is just about documentation, which is essential and the core element of any package, perhaps the work being put into re-writing can be spent on documentation. OTOH, I do see the fact that there may be many existing users who need to be kept in mind. Difficult decision, but I think sometimes starting with a clean slate is not too bad. I have a lot of respect and appreciation for the amount of work that went into DataFramesMeta over all these years.

@pdeffebach
Contributor

OTOH, I do see the fact that there may be many existing users who need to be kept in mind. Difficult decision, but I think sometimes starting with a clean slate is not too bad.

DataFramesMeta appears in many tutorials across the web. It has a large existing user-base that currently uses deprecated functionality. It would be a big deal for JuliaData to put DataFramesMeta.jl in maintenance mode and tell everyone to use a new package. Especially when DataFramesMeta.jl is pre-1.0 and can make the requested changes, with deprecations to help out the existing userbase, while it transitions to an updated API.

@matthieugomez
Contributor

matthieugomez commented Jul 8, 2021 via email

@bkamins
Member Author

bkamins commented Jul 8, 2021

I am not sure why it would be confusing.

and

there is no way to expect a different behavior.

These comments are very interesting. I am most likely biased here because I know how these operations are implemented internally. And internally, select, transform, and combine both for AbstractDataFrame and GroupedDataFrame arguments have exactly the same implementation (with only some minor differences in post-processing).


However, note that in dplyr mutate etc. takes whole columns (i.e. it operates column-wise, not row-wise). The only difference (and this is the Julia design choice) is that MOST (not all) functions in R are vectorized by default, so users do not see the distinction.

So in other words mutate is, in Julia parlance, column-wise, but functions provided by R base are row-wise (not mutate, which is column-wise). The problem we are trying to solve is that in Julia the "base" functions are not row-wise.

Here is an example:

> f <- function(x) {
+   if (x > 0) {
+     return(1)
+   } else {
+     return(0)
+   }
+ }
> df = data.frame(a=c(-1,1))
> mutate(df, sign=f(a))
   a sign
1 -1    0
2  1    0
Warning message:
Problem with `mutate()` column `sign`.
i `sign = f(a)`.
i the condition has length > 1 and only the first element will be used 

And this is clearly not what you expect.

So the tension is that in dplyr the user learns that:

mutate works row-wise

but this is not true. mutate works column-wise, but functions you pass to mutate are mostly vectorized.
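A minimal sketch of the same situation in DataFrames.jl, where the distinction is explicit. The function f mirrors the R function above; the column name :sign and the use of ByRow versus an explicit broadcast are illustrative choices, not a prescription from either macro package.

```julia
using DataFrames

# Scalar function, analogous to the R `f` above: it expects a single number.
f(x) = x > 0 ? 1 : 0

df = DataFrame(a = [-1, 1])

# Row-wise: ByRow applies `f` to each element of column :a.
rowwise = transform(df, :a => ByRow(f) => :sign)

# Column-wise: the function receives the whole column, so it has to
# broadcast `f` explicitly to get the same result.
colwise = transform(df, :a => (col -> f.(col)) => :sign)
```

Passing f directly without ByRow or a broadcast would hand it the whole vector, which is the Julia counterpart of the R warning shown above, except that Julia raises an error instead of silently using the first element.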

I understand that in Julia ecosystem we want to simplify the transition from dplyr, so we want to vectorize functions by default, to make the user's life easier.

The way DataFrameMacros.jl takes is to vectorize by default (to make life easier) and allow disabling vectorization by a switch. This is a very nice design, but note that this is a workaround for the fact that Julia Base is not vectorized by default (which is a deliberate design choice in Julia Base AFAICT).


In summary: my point is that current design of select, transform and combine (all functions are column-wise) is consistent both internally (within the Julia ecosystem) and with dplyr.
It is DataFrameMacros.jl that diverges here by sacrificing consistency to provide more convenience.

In particular, if you take the documentation of mutate from dplyr https://dplyr.tidyverse.org/reference/mutate.html and try to reproduce with a transform-like function what is advertised there, i.e.:

you immediately notice that they are NOT POSSIBLE to be reproduced by an operation that works row-wise. Just one example:

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))

clearly this operation (and this is a FIRST example in the manual explaining how mutate works) requires mutate to work column-wise.
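A rough DataFrames.jl counterpart of that dplyr example, as a sketch only: the toy data below stands in for starwars, and skipmissing plays the role of na.rm = TRUE. The helper name normalize is hypothetical.

```julia
using DataFrames, Statistics

# Toy stand-in for the starwars data; the values are made up.
df = DataFrame(name = ["A", "B", "C"], mass = [10.0, 20.0, missing])

# The function receives the whole :mass column, so it can compute the mean
# once (ignoring missing values) and divide every element by it.
normalize(m) = m ./ mean(skipmissing(m))

out = transform(df, :mass => normalize => :mass_norm)
```

A purely row-wise function never sees the other rows, so it could not compute the mean it divides by.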


I have written this long post to explain the DESIGN considerations. However, I am fully aware that they are not the same as CONVENIENCE considerations. If I understand the current thinking in DataFramesMeta.jl design roadmap (@pdeffebach could you confirm please) is that it is planned (apart from providing a @byrow modifier) to add @rselect and @rtransform macros that would work row-wise by default, while @select and @transform would work column-wise.

Personally (but note that this is opinionated and I do not want to enforce my point of view) I find it a preferable design as:

  • we make sure that @transform and transform correspond and do the same (the same with other functions);
  • we make sure that there is a clear visual signal in @rtransform that we switch to vectorized mode (I understand that for users coming from R vectorized mode is kind of default, but - like it or not - it is not a default in Julia Base by design, so I personally prefer to have an explicit way to signal this)

@vjd and @matthieugomez - to highlight my point please consider what happens if you write the following in R:

df = data.frame(a=c(-1,1))
mutate(df, r=runif(1))
mutate(df, r=runif(2))
mutate(df, r=runif(3))

vs. what DataFrameMacros.jl produces:

df = DataFrame(a=[-1,1])
@select(df, :r=rand())
@select(df, :r=rand(1))
@select(df, :r=rand(2))
@select(df, :r=rand(3))
@select(df, :r=@c rand())
@select(df, :r=@c rand(1))
@select(df, :r=@c rand(2))
@select(df, :r=@c rand(3))

On purpose I am not showing the output in both cases as I think it is better to try to think of the result first and then see it and then consider the issues of how what dplyr does should translate to what we need to do in Julia.

@matthieugomez
Contributor

matthieugomez commented Jul 8, 2021

Thanks. I guess my main point is that it should be (at least) as convenient to do row-wise manipulations as column-wise operations, since it makes the code so much cleaner to write. This is esp. true in Julia, because (i) there is no implicit broadcasting and (ii) support for missing values is really sparse.

Just to give some comparisons, since it is always useful to compare with other frameworks:

  • Query.jl and Volcanito.jl also choose row-wise by default.
  • In Stata, the equivalent of filter (keep) is row-wise. The equivalent of transform is split into two commands: gen for row-wise, egen for column-wise. The equivalent of combine (collapse) is column-wise.
  • In R, as you mention, dplyr (or data.table) is indeed column-wise by default. This is inevitable because everything in R is a vector by default (e.g. c(1) == 1 returns TRUE, while rnorm() returns "Error in rnorm() : argument "n" is missing, with no default"). Interestingly, dplyr also provides rowwise for these rare cases where one truly needs row-wise operations
    library(dplyr)
    df <- tibble(x = list(c(1, 2), c(3, 4)))
    df %>% rowwise() %>% mutate(x_mean = mean(x))

The way DataFrameMacros.jl takes is to vectorize by default (to make life easier) and allow disabling vectorization by a switch. This is a very nice design, but note that this is a workaround for the fact that Julia Base is not vectorized by default (which is a deliberate design choice in Julia Base AFAICT).

My interpretation is that Julia Base did this to clarify the difference between row-wise and column-wise operations. I don't think DataFrameMacros.jl violates this since it still respects this distinction.

vs. what DataFrameMacros.jl produces:

I think what DataFrameMacros produces in this case makes a lot of sense!

If I understand the current thinking in DataFramesMeta.jl design roadmap (@pdeffebach could you confirm please) is that it is planned (apart from providing a @byrow modifier) to add @rselect and @rtransform macros that would work row-wise by default, while @select and @transform would work column-wise.

I think I'd be happy with @rsubset/@rselect/@rtransform too!

@bkamins
Member Author

bkamins commented Jul 8, 2021

I think what DataFrameMacros produces in this case makes a lot of sense!

I agree that what DataFrameMacros.jl produces makes a lot of sense (I could predict the result without running the code, which is a good sign that it has a good design). I wanted to highlight there that this is different from what dplyr does, i.e. one should not think that if someone knows dplyr well then DataFrameMacros.jl does the same thing, as it does something different.

I don't think DataFrameMacros.jl violates this since it still respects this distinction.

What I try to say is that Julia Base requires explicit vectorization, while DataFrameMacros.jl does it by default. And I agree that in many situations it is what is most convenient.


In summary:

  1. I do not want to say that DataFrameMacros.jl is flawed (@jkrumbiegel does a really good job with his packages). I just wanted to highlight that it is not the same as dplyr.
  2. We have spent days on discussions if select/transform/subset in DataFrames.jl should be row-wise by default, or whole-column by default (actually @matthieugomez contributed a lot there 👍). And we took the whole-column path because it is more flexible, accepting that ByRow has to be written in DataFrames.jl and expecting that macro-packages would provide a more convenient interface (and this is what actually happens - which is excellent)
  3. What I believe is that in DataFramesMeta.jl adding @rselect etc. will just resolve the issue we discuss (actually, as @matthieugomez has noted, it is very similar to Stata with egen vs gen). I hope that requiring users to learn that there is @rselect vs @select is not that hard, with the benefit that they will better understand the whole ecosystem as a consequence (i.e. they will be able to better predict and choose in what scenarios they should use which function). We only need to discuss if @rselect etc. is a good name (here probably @pdeffebach is the one to have the last word, but maybe @vjd can comment on this). The benefit of the fact that @select stays whole-column, while @rselect is row-wise, is that if users, for some reason, need to use DataFrames.jl (without macro packages) they will clearly see that @select is the same as select (and @rselect is select with ByRow).

Thank you for all the input in the discussion.

@jkrumbiegel
Contributor

It's true that R doesn't work row-wise, that it just looks that way because of vectorization. For my personal needs, I decided that the gain from omitting broadcasting dots and getting access to ternary operations etc. weighed more than the complications from the rowwise-columnwise difference between combine and the rest.

I think a lot of people, especially those who are not very proficient in other languages either, have an intuitive feeling how much friction they expect when using a certain software tool, and I have a motivation to get them on board as well by lowering the visual complexity of the simple dataframes operations. Hence, Chain.jl and DataFrameMacros.jl. I really wanted the simplest operations to be as simple as possible, so no broadcasting dot for something like :a + :b. As we always deal with same-length columns, I think the power of broadcasting for dealing with different dimensions does not fully come into play here, anyway.

In my opinion, once you get used to how things work in Julia, you can easily switch between a row-wise and a column-wise workflow anyway, as broadcasting doesn't incur any mental overhead anymore (especially if it's only broadcasting of columns that all have the same length, guaranteed). So I'm deliberately making choices geared towards beginners and simpler workflows, for anything unusual you might want default dataframes syntax anyway.

@pdeffebach
Contributor

Thanks for all the discussion, it has been very informative and clarifying.

I still agree with Bogumil that telling users "all macros operate by row, except @combine or when provided a grouped data frame" is not ideal. I don't like different behavior about how functions are generated depending on the input data frame type.

Given that @transform df @byrow did not catch on, I think we should try @rtransform etc.

I was initially going to prioritize moving cols to $, but it looks like @r* should take priority. I will work on it now.

@pdeffebach
Contributor

PR is up here.

@vjd

vjd commented Jul 9, 2021

Super discussion, and sorry for the late response. @bkamins thank you for summarizing the discussion and the details on the why and what; this will be a self-documenting issue on the design choices behind the path that we are taking. I support the use of @rselect (and variants) as alternatives to the existing functions to keep the consistency. If we step back one level, the whole reason we embarked on this discussion, and the reason I am personally invested in it, is that I have trained and still train many students who are entering the data analysis space for the first time and, as @jkrumbiegel pointed out, I would prefer that they are provided an easy on-boarding platform. If we lose these newcomers at hello, we are at risk of not growing enough and most importantly not fast enough. For analysts who get comfortable over time, or for seasoned Julia users, I guess any of the packages in the discussion here should be fine and they will choose the best one based on needs. But for newcomers it is important to provide an opinionated way that helps them kick off quickly.

I believe that along with the flexibility of Julia in general, Chain.jl, DataFramesMeta that @pdeffebach just updated, and obviously the fantastic Makie + AlgebraOfGraphics by @jkrumbiegel, we will have a strong competitor to the tidyverse ecosystem (not saying that is what we are striving for, but at least to win the hearts of those who are switching over)

@bkamins
Member Author

bkamins commented Jul 9, 2021

Agreed that @jkrumbiegel and @pdeffebach are doing a fantastic job for making the Julia ecosystem more accessible.

I fully agree with:

I would prefer that they are provided an easy on-boarding platform. If we lose these newcomers at hello, we are at risk of not growing enough and most importantly not fast enough.

Currently DataFrameMacros.jl fits this purpose perfectly. My question to you is whether you think that with @rtransform etc. DataFramesMeta.jl will also be easy for newcomers to pick up?

@pdeffebach
Contributor

I think I'd be happy with @rsubset/@rselect/@rtransform too!

@rsubset, @rselect, @rtransform etc. just got merged into master!

So I think the next step for feature parity is to add a @passmissing functionality.

This will be done in two ways, a @passmissing flag (@m in DataFrameMacros, probably we will have to choose a somewhat longer name) for row-wise operations and a spreadmissings feature for column-wise operations.
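As a sketch of what the row-wise variant amounts to in plain DataFrames.jl today: passmissing from Missings.jl wraps a function so that it returns missing whenever an argument is missing. This assumes passmissing is in scope after using DataFrames (DataFrames.jl re-exports Missings.jl); the toy data and the column name :upper are made up, and the planned macro flag itself is not shown because its final name and syntax are not settled in this thread.

```julia
using DataFrames  # assumed to re-export `passmissing` from Missings.jl

# Toy data, made up for this illustration.
df = DataFrame(s = ["a", missing, "c"])

# `uppercase(missing)` would throw a MethodError; `passmissing(uppercase)`
# returns `missing` for a missing argument and calls `uppercase` otherwise.
out = transform(df, :s => ByRow(passmissing(uppercase)) => :upper)
```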

@bkamins
Member Author

bkamins commented Nov 22, 2021

Closing this as we have resolved most issues here. Any outstanding things should be handled by issues to the mentioned packages.

@bkamins bkamins closed this as completed Nov 22, 2021
@jeremiahpslewis

@bkamins Quick question on this topic...is there a docs page within DataFrames.jl which shows the two meta/macro packages and explains the pros/cons? AFAICT only Query.jl and DataFramesMeta.jl are mentioned here: https://github.com/JuliaData/DataFrames.jl/blob/5f22e27a281cba95ae93240705d858fcf592b32b/docs/src/man/querying_frameworks.md

@bkamins
Member Author

bkamins commented Nov 22, 2021

No - currently there is none.
In general package maintainers are welcome to make such additions (for Query.jl and DataFramesMeta.jl it was done many years ago).
Also feel free to open a PR if you would like to add/discuss this.
