DataFrameMacros.jl and DataFramesMeta.jl #2793
Comments
I would currently characterize the difference such that DataFramesMeta is closer to how DataFrames is built. I've made DataFrameMacros so that it's most convenient for me personally, and I don't shy away from shorthands such as Choosing row-based transforms etc. by default brings the syntax a bit closer to what people might be used to from R, as their vectorized primitives always look "scalar-like". I think because we do have fast loops it's actually a boon that we can write row-based logic which is still fast. I especially enjoy that when working with strings etc. I also like the The differences could change a bit in the future, as Peter said that DataFramesMeta plans to make some changes that could bring both packages closer together. |
In terms of concrete changes that should be happening soon
These are on the list, and I understand you have been frustrated with the slow pace of changes. Hopefully I can pick up the pace a bit to get these things merged in. The main sticking point is row-wise by default. Would you be okay with EDIT: One other thing. Currently, DataFramesMeta.jl is very hard to make changes to because it was written without MacroTools. Do you think if we re-write some parts of the package using MacroTools.jl, it would lower the barrier of entry to contributing? Then it might be easier for you to help implement the changes you want. |
Thank you both for commenting. If I understand things correctly, @pdeffebach is planning to arrive at almost the same design as DataFrameMacros.jl in the long term (except for some details in syntax). Is this correct? If so, then the question to @jkrumbiegel is: how would you prefer both packages to be promoted in the DataFrames.jl documentation? (Or maybe you see DataFrameMacros.jl as your personal project for exploring things, which anyone is of course welcome to use, while at the same time you could support development of DataFramesMeta.jl so that it reaches the same usability as DataFrameMacros.jl?) My personal opinion on what @pdeffebach plans:
agreed
agreed
agreed agreed
agreed A side question to you both. If I want a target column name as e.g. |
Yeah I support that statement, if DataFramesMeta ends up so similar to what I have now that there's almost no point in having two packages, that's fine by me. I can of course always use what I want, which is still nice to have. I often tend towards really short and sweet solutions, and I understand that others don't want to go that way and be more verbose, yet maybe less ambiguous. I think that the backend design of my package isn't actually too bad and that it supports some things that DataFramesMeta didn't, so if those features are copied, also good. About the side question, DataFrameMacros supports anything left of the @transform(df, "new column" = :a * :b) |
This is fantastic - I really believe that with this approach we can get very far.
Indeed in DataFramesMeta.jl I think that we would probably prefer something that is easy to understand (e.g. when selecting names) so that newcomers can easily pick it up, even at the expense of some verbosity (assuming that it is still terse).
I hope that @pdeffebach would be open to both using the API ideas as well as internal design (original DataFramesMeta.jl was written very long ago so redesign of internals is probably needed anyway).
👍 This is fantastic |
Thank you all for discussing this. Personally, having tried both of these packages now, in addition to using DataFrames.jl as is, I feel the DataFrameMacros.jl approach would definitely lower the barrier to entry, as almost everyone migrating to Julia from other ecosystems would want the row-wise thinking. Having said that, I am in complete support of merging the two solutions if feasible, either with a rewrite of DataFramesMeta or by using DataFrameMacros as the starting point and combining it with Peter's ideas, especially if a rewrite of the internals is warranted based on Bogumil's idea. I am happy to contribute in any way I can! |
DataFrameMacros.jl looks really nice (I like the row-wise default as well as the short macro flags). Since it looks like a complete rewrite also allowed @jkrumbiegel to get much cleaner code, why not use it as a starting point instead? Sorry if I am missing something, but what is the point of doing a series of breaking changes in DataFramesMeta.jl just to end up in the same place, with a more complicated codebase? |
I think that's exactly the point. We want to keep the user-base of DataFramesMeta happy, meaning we need to have a deprecation period for changes. This means we have to keep the old (complicated) code on hand for a while. But yeah, we recently added MacroTools as a dependency to DataFramesMeta, so once we deprecate the old stuff a re-write that uses EDIT: DataFramesMeta also has a ton of documentation, which is more work than the code-base itself (see the recent |
@pdeffebach - based on user feedback. The feature
is critical. Therefore in target design of DataFramesMeta.jl we should provide this option also (I know it was discussed but I want to stress that this is the key thing based on the feedback). @matthieugomez The challenging part of DataFrameMacros.jl is:
i.e. it is not always row-wise and the same expression passed to |
Putting aside the row-wise discussion. Just want to note that I just merged |
Isn't this the default and how dplyr also works that just for
Absolutely agree.
I sort of agree with this comment. Just trying to understand: if it is just about documentation, which is essential and the core element of any package, perhaps the work being put into the re-write could be spent on documentation instead. OTOH, I do see that there may be many existing users who need to be kept in mind. It is a difficult decision, but I think sometimes starting with a clean slate is not too bad. I have a lot of respect and appreciation for the amount of work that has gone into DataFramesMeta over all these years. |
DataFramesMeta appears in many tutorials across the web. It has a large existing user-base that currently uses deprecated functionality. It would be a big deal for JuliaData to put DataFramesMeta.jl in maintenance mode and tell everyone to use a new package. Especially when DataFramesMeta.jl is pre-1.0 and can make the requested changes, with deprecations to help out the existing userbase, while it transitions to an updated API. |
@bkamins While I'm a sucker for consistency, the fact that `combine` operates column-wise should be pretty self-explanatory --- there is no way to expect a different behavior.
|
and
These comments are very interesting. I am most likely biased here because I know how these operations are implemented internally. And internally, However, note that in dplyr So in other words Here is an example:
And this is clearly not what you expect. So the tension is that in dplyr the user learns that:
but this is not true. I understand that in the Julia ecosystem we want to simplify the transition from dplyr, so we want to vectorize functions by default, to make the user's life easier. The route DataFrameMacros.jl takes is to vectorize by default (to make life easier) and to allow disabling vectorization with a switch. This is a very nice design, but note that it is a workaround for the fact that Julia Base is not vectorized by default (which is a deliberate design choice in Julia Base AFAICT). In summary: my point is that current design of In particular if you take a documentation fo
you immediately notice that they are NOT POSSIBLE to be reproduced by an operation that works row-wise. Just one example:
clearly this operation (and this is a FIRST example in the manual explaining how I have written this long post to explain the DESIGN considerations. However, I am fully aware that they are not the same as CONVENIENCE considerations. If I understand correctly, the current thinking in the DataFramesMeta.jl design roadmap (@pdeffebach, could you please confirm) is that it is planned (apart from providing a Personally (but note that this is opinionated and I do not want to enforce my point of view) I find it a preferable design as:
@vjd and @matthieugomez - to highlight my point please consider what happens if you write the following in R:
vs. what DataFrameMacros.jl produces:
On purpose I am not showing the output in both cases as I think it is better to try to think of the result first and then see it and then consider the issues of how what dplyr does should translate to what we need to do in Julia. |
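A minimal sketch of the column-wise point made above, using only Julia Base and the `Statistics` standard library. The vector `x` and the helper `rowwise` are made up for illustration; they are not taken from the dplyr manual or from either package. The idea is that an operation which needs a whole-column summary cannot be expressed by a function that only sees one element at a time.

```julia
using Statistics

x = [1.0, 2.0, 6.0]

# Column-wise: the expression sees the whole vector, so it can use the column mean.
demeaned = x .- mean(x)

# Row-wise: a per-element function only ever sees a single value,
# and the mean of a single number is that number itself.
rowwise(v) = v - mean(v)
rowwise_result = map(rowwise, x)

println(demeaned)        # deviations from the column mean
println(rowwise_result)  # every entry collapses to zero
```

The same expression therefore means two different things depending on whether it is applied to the column or to each row, which is why a switch between the two modes is needed.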
Thanks. I guess my main point is that it should be (at least) as convenient to do row-wise manipulations as column-wise operations, since it makes the code so much cleaner to write. This is especially true in Julia, because (i) there is no implicit broadcasting and (ii) support for missing values is really sparse. Just to give some comparisons, since it is always useful to compare with other frameworks:
My interpretation is that Julia Base did this to clarify the difference between row-wise and column-wise operations. I don't think
I think what
I think I'd be happy with |
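As a rough illustration of the two Julia Base properties mentioned in this comment, here is a small sketch. The helper `safe_upper` and the sample vector `people` are hypothetical. Many ordinary functions have no method for `missing`, and nothing is applied element-wise unless you ask for it, so row-wise logic often has to handle `missing` by hand.

```julia
people = ["ann", missing, "cy"]

# `uppercase` has no method for `missing`, so calling it directly fails.
threw = try
    uppercase(missing)
    false
catch err
    err isa MethodError
end

# A row-wise helper can handle the missing case explicitly.
safe_upper(x) = ismissing(x) ? missing : uppercase(x)

# The helper still has to be applied element by element, here with `map`.
upper_people = map(safe_upper, people)

println(threw)
println(upper_people)
```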
I agree that what DataFrameMacros.jl does makes a lot of sense (I could predict the result without running the code, which is a good sign that it has a good design). I wanted to highlight here that this is different from what dplyr does, i.e. one should not assume that, because someone knows dplyr well, DataFrameMacros.jl will do the same thing; it does something different.
What I am trying to say is that Julia Base requires explicit vectorization, while DataFrameMacros.jl does it by default. And I agree that in many situations it is what is most convenient. In summary:
Thank you for all the input in the discussion. |
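For readers less familiar with the "explicit vectorization" mentioned in this comment, a minimal Julia Base sketch follows. The vectors `a` and `b` are invented for illustration, and no DataFrames package is involved. Element-wise multiplication of two columns needs the broadcasting dot; the undotted call is not defined for two vectors.

```julia
a = [1, 2, 3]
b = [10, 20, 30]

# Explicit broadcasting: the dot applies `*` element by element.
product = a .* b

# Without the dot, `*` is not defined for two plain vectors.
undotted_fails = try
    a * b
    false
catch err
    err isa MethodError
end

# The same element-wise result written as an explicit loop over rows.
by_rows = [x * y for (x, y) in zip(a, b)]

println(product)
println(undotted_fails)
```

A row-wise-by-default macro effectively writes the dotted (or looped) form for you, which is the convenience being discussed.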
It's true that R doesn't work row-wise, that it just looks that way because of vectorization. For my personal needs, I decided that the gain from omitting broadcasting dots and getting access to ternary operations etc. weighed more than the complications from the rowwise-columnwise difference between I think a lot of people, especially those who are not very proficient in other languages either, have an intuitive feeling how much friction they expect when using a certain software tool, and I have a motivation to get them on board as well by lowering the visual complexity of the simple dataframes operations. Hence, Chain.jl and DataFrameMacros.jl. I really wanted the simplest operations to be as simple as possible, so no broadcasting dot for something like In my opinion, once you get used to how things work in Julia, you can easily switch between a row-wise and a column-wise workflow anyway, as broadcasting doesn't incur any mental overhead anymore (especially if it's only broadcasting of columns that all have the same length, guaranteed). So I'm deliberately making choices geared towards beginners and simpler workflows, for anything unusual you might want default dataframes syntax anyway. |
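The remark about ternary operations can be sketched in plain Julia Base. The `ages` data and the `label` function are assumptions made up for this example. The ternary operator needs a single `Bool` condition, so it fits naturally inside a function applied to one row at a time, whereas the column-wise version has to switch to a broadcast `ifelse`.

```julia
ages = [15, 42, 67]

# Row-wise: ordinary scalar logic, including the ternary operator.
label(age) = age < 18 ? "minor" : "adult"
rowwise_labels = map(label, ages)

# Column-wise: the comparison and the selection are both broadcast explicitly.
columnwise_labels = ifelse.(ages .< 18, "minor", "adult")

println(rowwise_labels)
println(columnwise_labels)
```

Both forms give the same labels; the row-wise one simply reads like scalar code.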
Thanks for all the discussion, it has been very informative and clarifying. I still agree with Bogumil that telling users "all macros operate by row, except Given that I was initially going to prioritize moving |
PR is up here. |
Super discussion, and sorry for the late response. @bkamins thank you for summarizing the discussion and the details on the why and what; this will be a self-documenting issue for the design choices of the path that we are taking. I support the use of I believe that along with the flexibility of Julia in general, Chain.jl, DataFramesMeta that @pdeffebach just updated, and obviously the fantastic Makie + AlgebraOfGraphics by @jkrumbiegel, we will have a strong competitor to the tidyverse ecosystem (not saying that is what we are striving for, but at least to win the hearts of those who are switching over). |
Agreed that @jkrumbiegel and @pdeffebach are doing a fantastic job for making the Julia ecosystem more accessible. I fully agree with:
Currently DataFrameMacros.jl fits this purpose perfectly. My question to you is if you think that with |
So I think the next step for feature parity is to add a This will be done in two ways, a |
Closing this as we have resolved most issues here. Any outstanding things should be handled by issues to the mentioned packages. |
@bkamins Quick question on this topic...is there a docs page within DataFrames.jl which shows the two meta/macro packages and explains the pros/cons? AFAICT only Query.jl and DataFramesMeta.jl are mentioned here: https://github.com/JuliaData/DataFrames.jl/blob/5f22e27a281cba95ae93240705d858fcf592b32b/docs/src/man/querying_frameworks.md |
No - currently there is none. |
@jkrumbiegel and @pdeffebach - I know you are in contact and I really believe that having both https://github.com/jkrumbiegel/DataFrameMacros.jl and https://github.com/JuliaData/DataFramesMeta.jl is a good thing at this stage of ecosystem maturity that can help us all to converge to the most user-friendly designs (even maybe in some scenarios one or the other can be preferred?).
It would be really great if we could jointly come up with some guidance for users comparing both packages and put it in the https://dataframes.juliadata.org/latest/man/querying_frameworks/ section of the DataFrames.jl documentation (ideally before JuliaCon 2021).
I know you both are really committed to supporting JuliaData so I hope this is something that is doable. My mindset is the following:
I would be obliged if you could comment here with what you think. Thank you!