mergeduplicates keyword to handle makeunique=false #3366

Draft
wants to merge 2 commits into base: main

@leei leei commented Jul 31, 2023

Add the dupcol keyword in all constructors and joins that allow makeunique=true to generate new column names when there are duplicates among the constructed or joined columns. This allows extensible actions to be taken when these duplicate columns are detected:

  • :error throws an error when duplicate column names are given or created (equivalent to makeunique=false, the default),
  • :makeunique creates new column names for the duplicate columns (using the same approach as makeunique=true), and
  • :update coalesces values by updating the first duplicate column with non-missing values from each subsequent duplicate column.

N.B. This approach is extensible to other possible actions such as :ignore or :overwrite if so desired.
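
The :update action described above can be illustrated on plain vectors. This is only a sketch of the intended semantics using Base `coalesce`; the variable names are made up for the example and nothing here is the PR's implementation.

```julia
# Two "duplicate" columns with the same name, represented as plain vectors.
left  = [1, missing, 3, missing]
right = [10, 20, missing, missing]

# Element-wise coalesce keeps the left value unless it is missing,
# in which case the value from the right column is used.
updated = coalesce.(left, right)

# Rows where both columns are missing stay missing.
println(updated)
```

With these inputs the second row is filled from `right`, and the last row stays `missing` because neither column has a value.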

Also warns that makeunique=true is deprecated but maintains its functionality.

@leei
Author

leei commented Jul 31, 2023

Response to #2243

@bkamins bkamins added this to the 1.7 milestone Aug 5, 2023
@bkamins
Member

bkamins commented Aug 5, 2023

Thank you for the proposal. It requires a design discussion before reviewing the implementation.

The options I see as alternatives to your design proposals are (based on my opinion and some short discussion with @nalimilan):

  • keep makeunique as it is with Bool accepted as a value; for selected functions (that is joins, hcat[!] and insertcols[!]) also allow passing a Function (any function that takes two values and returns what should be produced). For example, makeunique=coalesce would replace missing values in the left column with the values from the right column. (this is my preferred design)
  • The same but with additional kwarg dupcol that would accept a Function, which, if passed, overrides makeunique (this is something I like less, as it is not needed to introduce a new kwarg I think)

@nalimilan - could you please comment on what you prefer? (or maybe some other option). In particular I do not think we need the update! function as I think hcat[!] and insertcols[!] cover all that is needed.

@leei
Author

leei commented Aug 6, 2023 via email

@nalimilan
Member

I agree it seems simpler to use the existing makeunique argument for this. The name isn't ideal, but we have to keep supporting this argument anyway to avoid breakage.

@bkamins
Member

bkamins commented Aug 12, 2023

OK - so the change would be to allow passing a Function as makeunique kwarg in: joins, hcat[!] and insertcols[!]

  • implementation of the functionality.
  • full test coverage
  • docstring update
  • manual update
  • NEWS.md update

Thank you!

@jariji
Contributor

jariji commented Aug 12, 2023

Literally ::Function or is any callable ok?

@bkamins
Member

bkamins commented Aug 12, 2023

I preferred Function to be more strict (and it is easy enough to wrap anything into a function). Allowing any callable is potentially risky: the argument would have to be typed as Any, which would get in the way if in the future we had some value that we wanted to treat in yet another way.

Do you see some concrete example of a useful callable that is not a function? Note that this function requires 2 positional arguments to be passed (and typically, non-function callables accept only one anyway - of course, this is not a 100% correct rule either, but just an intuition)
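
To make the distinction concrete, here is a small plain-Julia sketch of a callable object that is not a `Function`, together with the wrapping mentioned above. The type name `PreferLeft` is hypothetical and chosen only for this illustration.

```julia
# A callable struct: it can be called like a function,
# but its type is not a subtype of Function.
struct PreferLeft end
(::PreferLeft)(x, y) = ismissing(x) ? y : x

merger = PreferLeft()
println(merger isa Function)    # a ::Function annotation would reject this value

# Wrapping the callable in an anonymous function yields a real Function.
wrapped = (x, y) -> merger(x, y)
println(wrapped isa Function)
```

A keyword argument annotated with `::Function` would accept `wrapped` but not `merger`, which is the strictness being discussed.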

@jariji
Contributor

jariji commented Aug 12, 2023

No concrete example, just a general style, but DataFrames.jl has its own style so I'm not too concerned about it. And I guess using Bool is already doing a bit of concrete-type dispatch anyway.

@kescobo
Contributor

kescobo commented Aug 13, 2023

One argument for a new keyword is that currently, makeunique only affects column names, not the columns themselves. If I understand correctly, when passing a function, you're altering the column values, leaving the names alone.

That said, I was recently wishing for this functionality, so 👍 from me!

Side note - maybe needs a different issue, but could also be an argument for reserving makeunique::Function - I also recently had something like the following crop up, and yearned for a way to provide a function for how to modify names when makeunique = true

julia> DataFrame(:x => rand(5), :x=>rand(5), :x_1=>rand(5), :x_1=>rand(5); makeunique=true)
5×4 DataFrame
 Row │ x          x_2          x_1       x_1_1
     │ Float64    Float64      Float64   Float64
─────┼──────────────────────────────────────────────
   1 │ 0.537446   0.704216     0.752834  0.837377
   2 │ 0.554652   0.582872     0.465181  0.00778113
   3 │ 0.231824   0.846138     0.145939  0.606117
   4 │ 0.427691   0.000135526  0.820666  0.430972
   5 │ 0.0992564  0.233478     0.510694  0.187573

It took me a while to sort out which columns were actually duplicated (this was in a table with thousands of columns - I know, not ideal). I would have liked to be able to do something like makeunique = (orig, dup) -> string(dup, "__", randstring(3))
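
A renaming hook of the kind wished for here might look like the following plain-Julia sketch. The helper name `rename_duplicates` and the `(original_name, attempt)` hook signature are assumptions made for illustration; neither exists in DataFrames.jl. A deterministic suffix is used instead of `randstring` so the result is reproducible.

```julia
# Hypothetical helper, not part of DataFrames.jl.
# `rename` is called as rename(original_name, attempt) and must return
# a different candidate for each attempt, otherwise the loop never ends.
function rename_duplicates(names::Vector{Symbol}, rename::Function)
    seen = Set{Symbol}()
    out = Symbol[]
    for nm in names
        candidate = nm
        attempt = 0
        # Keep asking the hook until the candidate collides with nothing seen so far.
        while candidate in seen
            attempt += 1
            candidate = Symbol(rename(nm, attempt))
        end
        push!(seen, candidate)
        push!(out, candidate)
    end
    return out
end

# Example hook: mark duplicates with an explicit, easy to search suffix.
suffix_dup(orig, attempt) = string(orig, "__dup", attempt)

renamed = rename_duplicates([:x, :x, :x_1, :x_1], suffix_dup)
println(renamed)
```

Because every candidate is checked against all names produced so far, a generated name cannot silently collide with another column, which is the difficulty with purely local renaming.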

@bkamins
Member

bkamins commented Aug 13, 2023

I also recently had something like the following crop up, and yearned for a way to provide a function for how to modify names when makeunique = true

This is a good point. Maybe indeed we should add both, i.e. makeunique::Function to allow for modifying the way duplicate column names are generated (I also had this requirement in the past).

And another kwarg for merging columns (which, if set, would disable makeunique). If we go this way then the question is if dupcol is a good name?

@bkamins
Member

bkamins commented Aug 24, 2023

@leei - why have you closed this PR? I thought we had a conclusion that this is a valuable addition (just the design should be a bit different).

@leei
Author

leei commented Aug 24, 2023

Strange. Must have happened automatically when I moved the commit to a new branch.

Not sure how to resolve this since I want to preserve the discussion but that previous pull is a bit wrongheaded. I have a new PR that contains different changes that overload the makeunique kw in the way described above.

I still have that commit around but I like the "overload makeunique" solution best that takes three kinds of args:

  1. true/false for make a new column or raise an error,
  2. a Function of two parameters that determines how to combine two columns to the target, and
  3. one of a set of keywords that handle some of 1 and 2

The currently allowed keywords are:

  • :makeunique – create a new uniquely named column
  • :error – raise an Error
  • :update – update the left-hand column with non-missing values from the right
  • :ignore – ignore the duplicated column

Should I just add a new PR with the updated patch, restore the commit on main that this PR is based on, or both?

… columns in joins and DataFrame constructors
@leei
Author

leei commented Aug 24, 2023

The alternative implementation is #3373

@bkamins
Member

bkamins commented Aug 25, 2023

I have a new PR that contains different changes that overload the makeunique kw in the way described above.

An alternative is to locally REBASE the PR and then force-push it to GitHub.


Having said that, what I would propose (in respect for your time) is that we first discuss the final design and then this design is implemented. This is the standard procedure that we always follow in DataFrames.jl (because there is usually a lot of discussion around every design decision). Also, if we have a design decision it does not mean that it would have to be implemented "in full" in one PR (it could, but it does not have to).

In particular what @kescobo commented on is that it would be good to mentally separate two operations:

  • if we keep all the columns and how we name them (this is currently handled by makeunique)
  • if we merge duplicate columns (this could be handled by a separate kwarg)

On the other hand maybe indeed what you propose is good. How I understand it is the following. We have only makeunique kwarg that accepts:

  • true - the same as :makeunique; (for back compatibility)
  • false - the same as :error; (for back compatibility)
  • :makeunique - generate a new unique name for the column
  • :error - error on duplicate
  • :update - merge columns with coalesce
  • :ignore - ignore duplicate column

If this is the proposal maybe we could extend it with two extra options (non-conflicting with what you proposed):

  • :makeunique => function to rename columns when making them unique (this could be a separate PR as I am not currently clear how it would work in general - i.e. should we rename only duplicate columns or potentially all columns and what should be the input and output, maybe eg the signature (name, dup_count) -> tuple with dup_count values could be expected)
  • :update => function (a more general merging rule could be allowed by function - again - it could be a separate PR)

@leei, @nalimilan, @kescobo, @jariji - can you please comment on this. So that we have a clear vision where we go before moving forward with the implementation?

Thank you!

@jariji
Contributor

jariji commented Aug 25, 2023

Can somebody give a motivating example for when I would want update or ignore? Also concerned that these options seem noncommutative.

@kescobo
Contributor

kescobo commented Aug 25, 2023

This is my opinion, very lightly held. That is, I think both approaches are improvements, and neither is so obviously correct / better that I would argue strongly for it.

To "make (cols) unique" implies that you will end up with multiple columns. makeunique = :update as proposed seems like a logical contradiction.

Also, I can see situations where it might be useful to be able to use a function other than coalesce when merging. To flesh out the proposal a bit:

  • two separate kwargs, :makeunique is for retaining duplicate columns, mergeduplicates is for combining them (I'm not married to the name)
  • both can take a Bool or a function
  • both are false by default, and an error is thrown if duplicates are detected (current behavior)
  • only one can be other than false
  • for :makeunique, true uses current behavior and a function handles how columns are renamed.
  • for :mergeduplicates, true does coalesce and a function handles how items across duplicates are merged

I'm not totally clear on what the arguments and expected outputs for the function versions are, but my first instinct is:

  • :makeunique - colname and n as arguments, vector (or tuple) of length n as output.
  • :mergeduplicates - vector (or tuple) of values as argument, whatever as output (will become the eltype of the new column)
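
The "only one can be other than false" rule from this proposal could be checked up front. The sketch below is plain Julia; the function name `duplicate_policy` and the returned symbols are hypothetical and are not taken from the PR.

```julia
# Hypothetical validation helper for the two-keyword proposal.
# Both keywords accept a Bool or a Function and default to false.
function duplicate_policy(; makeunique::Union{Bool,Function}=false,
                            mergeduplicates::Union{Bool,Function}=false)
    if makeunique !== false && mergeduplicates !== false
        throw(ArgumentError("only one of makeunique and mergeduplicates may be set"))
    end
    makeunique !== false && return :rename        # keep all columns, rename duplicates
    mergeduplicates !== false && return :merge    # combine duplicates into one column
    return :error                                 # current default behavior
end

println(duplicate_policy())
println(duplicate_policy(makeunique=true))
println(duplicate_policy(mergeduplicates=coalesce))
```

Setting both keywords at once raises an `ArgumentError`, so a call site never has to guess which of the two wins.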

@leei
Author

leei commented Aug 25, 2023 via email

@kescobo
Contributor

kescobo commented Aug 25, 2023

And I do apologize for just going ahead and coding this w/o the design discussion.

FWIW - I don't think anyone has a problem with this - it's mostly out of knowledge that your time is precious too, and no one wants you to do a bunch of coding that ends up getting tossed in the bin because folks disagree with the design. If you needed / wanted this anyway, I don't think there's anything wrong with providing a concrete example as an opening salvo, as long as you won't be offended if the consensus lands in a different place (and it doesn't seem like you will be so 👍 )

@nalimilan
Member

It's a good point that we may want to allow passing a function to makeunique in the future to generate custom unique names. Given this, @kescobo's proposal of having two separate arguments which cannot be set at the same time sounds like a good solution. Then dupcols/mergeduplicates can be nothing (default) or any function: passing coalesce is simple enough for the common case.

@bkamins
Member

bkamins commented Aug 29, 2023

it's mostly out of knowledge that your time is precious too, and no one wants you to do a bunch of coding that ends up getting tossed in the bin because folks disagree with the design

100% agreed and this is what I meant.
I want to respect your time as I know (and I really do 😄) how much time coding things in DataFrames.jl takes.


To me having a single kw for “what do I do with duplicates” is a better option and adding a second kw that handles “how do I generate new column names” makes more sense.

Let me summarize the requirements we have. The options we need to support are

  • error on duplicate (current: makeunique=false)
  • autogenerate new column name on duplicate (current: makeunique=true)
  • autogenerate new column name on duplicate while allowing to customize the way the new column is named (current: not supported)
  • merge duplicate column with the existing column (current: not supported), with particular cases:
    • replace only missing, achieved by the coalesce function
    • keep the old column, achieved by the (x, y) -> x function

Now the question is how to handle this with keyword arguments. We have the following options:

  1. squeeze all the options into makeunique (this is doable, I described in #3366 (comment) how this can be done)
  2. use two separate kwargs: a) makeunique with current syntax, potentially in the future allowing function that would generate the name of the new column; b) dupcols/mergeduplicates that would signal that column merging is required (and then makeunique is ignored)
  3. use two separate kwargs: a) makeunique that handles all cases as in point 1, except the function for generating new column names; b) a separate kwarg for that function.

So my opinion is that I would not do point 3. For me doing 1 or 2 is OK. I have a slight preference for point 1 (all in makeunique) as this will be probably simpler for users to learn, but 2 is also OK.

@nalimilan
Member

squeeze all the options into makeunique (this is doable, I described in #3366 (comment) how this can be done)

I'm not a fan of the :makeunique => function and :update => function approach. It works, but it's an unusual pattern which in effect recreates two keyword arguments inside a single keyword argument. So I'm more in favor of 2 (unless we can find another trick to use a single argument).

@kescobo
Contributor

kescobo commented Aug 30, 2023

Let me propose a (4) that is perhaps the worst of all worlds 😅. It's kind of merging @bkamins 1/3, but addresses @nalimilan 's concern

  • as with point 1, decisions about what to do are handled by makeunique, eg rename duplicates or merge them, with the proposed defaults.
  • a new kwarg (handledups?) that determines how that gets done. In the case of renames, it's a function that handles the renaming, and in the case of merging it's a function that handles the column merge

@bkamins
Member

bkamins commented Aug 30, 2023

a new kwarg (handledups?) that determines how that gets done.

This is something I think we should not do as the meaning of this kwarg would change depending on other kwarg. This is something that we should avoid as writing makeunique=x, handledups=y when x and y are variables cannot be interpreted statically (i.e. you need to know the value of x to know what y means)


Given @nalimilan comment let me propose the following rules:

  • makeunique stays as it is for now. It allows only Bool for now. In the future (it can be this PR or some other PR, no need to rush with this) we can add support for passing a Function that would change the way the column names are generated when a duplicate is encountered. I recommend a separate PR, as getting this right is very complex; see the implementation at https://github.com/JuliaData/DataFrames.jl/blob/main/src/other/utils.jl#L77. The point is that a deduplicated name might itself create a new duplicate, so the de-duplicating function would need to take the whole vector of names and deduplicate it as a whole; name deduplication cannot be done locally at the level of a single column name.
  • new mergeduplicates kwarg (I prefer this name as it is explicit what the kwarg does). This kwarg accepts either nothing (when it is ignored and makeunique is respected) or alternatively user can pass a Function. In this case makeunique must be false (i.e. passing makeunique=true will error). The function must take a vararg argument and return a single value. Example implementations would be:
    • coalesce: returns first non-missing value (or missing if all are missing)
    • first∘tuple: the first duplicate column
    • last∘tuple: the last duplicate column
    • mean∘tuple: the mean of the duplicate columns

Note that it is crucial that the function accepts more than 2 arguments as in some cases we can have data that introduce more than 2 duplicate columns.
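
The example merge functions listed above all accept any number of positional arguments. The following plain-Julia sketch shows how each behaves on scalars and how one of them broadcasts over three duplicate columns; the variable names are illustrative only.

```julia
using Statistics  # standard library, provides mean

# Each candidate takes a vararg of values and returns a single value.
take_first = first ∘ tuple
take_last  = last ∘ tuple
take_mean  = mean ∘ tuple

println(coalesce(missing, 2, 3))   # first non-missing value
println(take_first(1, 2, 3))
println(take_last(1, 2, 3))
println(take_mean(1, 2, 3))

# Applied row by row to three duplicate columns via broadcasting.
a = [1, missing]
b = [missing, 5]
c = [7, 8]
merged = coalesce.(a, b, c)
println(merged)
```

A function limited to exactly two arguments would fail on the three-column case, which is why the vararg requirement matters.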

Also when implementing mergeduplicates we need to precisely document how it works (and there will be two modes):

  • process all data at once (e.g. DataFrame(x=1,x=2,x=3, mergeduplicates=mean∘tuple) will work this way)
  • process data pairwise (e.g. currently hcat(df, df, df, mergeduplicates=mean∘tuple) will work this way; the same with joins)

Fortunately, for coalesce, first∘tuple and last∘tuple, which are probably the most common choices, both modes produce the same result.
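
The difference between the two modes can be checked on scalars in plain Julia. In this sketch the pairwise mode is modeled with `foldl`, which is an assumption about how repeated two-column merging composes, not a description of the PR's code.

```julia
using Statistics  # standard library, provides mean

merge_mean = mean ∘ tuple
x, y, z = 1, 2, 3

# All duplicates passed to a single call.
all_at_once = merge_mean(x, y, z)

# Pairwise: merge_mean(merge_mean(x, y), z), i.e. the mean of 1.5 and 3.
pairwise = foldl(merge_mean, (x, y, z))

println(all_at_once)
println(pairwise)

# coalesce is insensitive to the mode.
println(coalesce(missing, 2, 3) == foldl(coalesce, (missing, 2, 3)))
```

The mean differs between the modes because averaging is not associative, while coalesce, first∘tuple and last∘tuple give the same answer either way.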

What do you think?

@kescobo
Contributor

kescobo commented Aug 30, 2023

Makes sense to me 👍

behaviour to match bkamins comment in
JuliaData#3366

Is now used to pass a Function to handle cases where
makeunique=false by combining those values (passed
as parameters) into a returned result.
@leei
Author

leei commented Sep 19, 2023

Just updated this PR to conform to @bkamins' proposal above.


Horizontally concatenate data frames.

If `makeunique=false` (the default) column names of passed objects must be unique.
If `makeunique=true` then duplicate column names will be suffixed
with `_i` (`i` starting at 1 for the first duplicate).

If `makeunique=false` and `mergeduplicates` is a Function then duplicate column names
Member

Suggested change
If `makeunique=false` and `mergeduplicates` is a Function then duplicate column names
If `makeunique=false` and `mergeduplicates` is a `Function` then duplicate column names

Comment on lines +1548 to +1549
will be combined by this function with the column named overwritten by the results of
the function on all values from the duplicated column(s).
Member

@bkamins bkamins Sep 25, 2023

It is not clear what this sentence means. Probably a native speaker would be better to say how to fix it.

What I can say is how this will work: mergeduplicates(mergeduplicates(x,y),z), where x, y and z are consecutive duplicates.

Author

Can I say "ouch". I am very much a native English speaker. That said I've struggled with this wording to make it clearer... more struggle then.

Author

Would the following be clearer?

If makeunique=false and mergeduplicates is a Function then duplicate columns
will be combined by invoking the function with all values from those columns.
e.g. mergeduplicates=coalesce will use the first non-missing value.

Member

The crucial issue is, as I have commented, that the behavior of mergeduplicates will differ depending on the function in which it is used (unless we change the current implementation).

In functions like insertcols or DataFrame it will pass all duplicate columns as consecutive arguments to a SINGLE CALL to mergeduplicates function.

In functions like hcat or *join it will perform a recursive call taking at most two duplicate columns at a time.


julia> df3.A === df1.A
true
```
"""
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, copycols::Bool=true)
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)
Member

Suggested change
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates::Union{Nothing, Function}=nothing, copycols::Bool=true)

Author

I'm taking it that you're suggesting that mergeduplicates be typed everywhere?

Member

yes

insert!(_columns(dfp), col_ind, item_new)
else
# Just update without adding to index
merge = get(mergecolumns, name, (dfp=dfp, cols=[]))
Member

what do you need dfp for here?

@bkamins
Member

bkamins commented Sep 25, 2023

I have started looking at this PR and left some initial general comments (but they would apply to all code not only the parts that I have commented).

However, having seen the amount of code affected I would really prefer if you split the PR into several PRs changing individual functions (e.g. hcat separately, DataFrame separately, insertcols+insertcols! separately).

It would greatly help me with the review (it is really hard to review large PRs). This is especially crucial as I need to make sure that we have an appropriate test coverage for every function changed. Would it be OK with you to make such a change? (we can keep this PR as a working reference, and these new PRs could be done as a cut-out from this one)

@leei
Author

leei commented Sep 25, 2023 via email

makeunique::Bool=false, copycols::Bool=true)
u = add_names(index(df1), index(df2), makeunique=makeunique)
makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true)
u = add_names(index(df1), index(df2), makeunique=true, mergeduplicates=mergeduplicates)
Member

this implementation looks incorrect. Where do you use the mergeduplicates if passed?

Author

Its presence is necessary to prevent the error throw in add_names.

That said, I don't see any use of merge or merge! in the codebase, so the change was for completeness, since add_names gets used in hcat! and needs to be sensitive to mergeduplicates there.

Member

But I do not understand why. add_names works on index so it cannot take into account mergeduplicates anyway. Why and how do you think it should?

Member

To be clear - I saw the implementation of add_names that you have changed. I just feel that add_names should not handle the mergeduplicates kwarg because it makes the design non-obvious.

Member

Maybe to further clarify my thinking. We need to make sure that the design we use is simple. The reason is that this package is large and it is important to make sure that future developers have an easy way to understand how everything works. That is the reason I recommend taking "small steps", as this ensures that these small steps have clean design.

@@ -128,8 +138,8 @@ function Base.push!(x::Index, nm::Symbol)
return x
end

function Base.merge!(x::Index, y::AbstractIndex; makeunique::Bool=false)
adds = add_names(x, y, makeunique=makeunique)
function Base.merge!(x::Index, y::AbstractIndex; makeunique::Bool=false, mergeduplicates=nothing)
Member

this function for sure does not need mergeduplicates kwarg as it does not take a data frame as a source

@bkamins
Member

bkamins commented Sep 25, 2023

I’d like to but... this functionality dives deep down into the basic initialization functionality of DataFrame and Index.

That is one of my points (I have added comments). The functionality should not affect Index at all I think. The reason is that Index is unaware of column values, so it can only handle makeunique as it does currently.

E.g. I think starting with insertcols+insertcols! as a separate PR would be an easy variant to handle. Then we could have DataFrame constructor PR (also easy). Things will be harder then (for hcat or *join) but we can discuss them later.

@leei
Author

leei commented Nov 2, 2023

OK, I've finally found the time to revise the implementation to cover the cases you've raised and break it down into a series of commits that are relatively self-contained. It is on a different branch in my repo though. This PR won't seem to let me change the branch. Is that because it's in review? In any case, the revised PR is on my updateindex branch.

@kescobo
Contributor

kescobo commented Nov 3, 2023

This PR won't seem to let me change the branch. Is that because it's in review?

No, but you can rename and then force push, I think

@leei leei changed the title from "New dupcol keyword to replace makeunique" to "mergeduplicates keyword to handle makeunique=false" Nov 3, 2023
@leei leei marked this pull request as draft November 3, 2023 18:39
@bkamins
Member

bkamins commented Nov 3, 2023

It is OK to open a new PR. Then in this PR we keep history (if needed). And once things are finalized the historical PR can be just closed.

@leei
Author

leei commented Nov 8, 2023

The new PR tracking after all of this discussion is #3401

@leei leei mentioned this pull request Nov 8, 2023
@bkamins bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
5 participants