mergeduplicates keyword to handle makeunique=false #3366

leei · 2023-07-31T17:58:15Z

Add the dupcol keyword in all constructors and joins that allow makeunique=true to generate new column names when there are duplicates among the constructed or joined columns. This allows extensible actions to be taken when these duplicate columns are detected:

:error throws an error when duplicate column names are given or created (equivalent to makeunique=false, the default),
:makeunique creates new column names for the duplicate columns using the same approach as makeunique=true, and
:update which coalesces values by updating the values from the first duplicate column with non-missing values from each subsequent duplicate column name.
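
As a rough illustration of the :update action described above, the sketch below merges two plain vectors so that non-missing values from the later column overwrite those of the first. The helper name `update_merge` is hypothetical and is not part of DataFrames.jl; it only mirrors the wording of this description.

```julia
# Hypothetical helper (not a DataFrames.jl function): a value from `later`
# wins unless it is missing, in which case the value from `first_col` is kept.
update_merge(first_col, later) = coalesce.(later, first_col)

first_col = [1, missing, 3]
later     = [missing, 20, 30]

merged = update_merge(first_col, later)
# merged == [1, 20, 30]
```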

N.B. This approach is extensible to other possible actions such as :ignore or :overwrite if so desired.

Also warns that makeunique=true is deprecated but maintains its functionality.

leei · 2023-07-31T18:00:56Z

Response to #2243

bkamins · 2023-08-05T21:23:16Z

Thank you for the proposal. It requires a design discussion before reviewing the implementation.

The options I see as alternatives to your design proposals are (based on my opinion and some short discussion with @nalimilan):

keep makeunique as it is with Bool accepted as a value; for selected functions (that is joins, hcat[!] and insertcols[!]) allow to pass a Function (any function that takes two values and returns what should be produced). For example makeunique=coalesce would replace missing values in left with value in the right. (this is my preferred design)
The same but with additional kwarg dupcol that would accept a Function, which, if passed, overrides makeunique (this is something I like less, as it is not needed to introduce a new kwarg I think)

@nalimilan - could you please comment on what you prefer? (or maybe some other option). In particular I do not think we need the update! function as I think hcat[!] and insertcols[!] cover all that is needed.

leei · 2023-08-06T00:50:29Z

I like the idea of adding a function option to this, but ‘makeunique’ seemed the wrong keyword to capture this functionality, it’s not explanatory anymore, which is why I moved it to ‘dupcol’. That said I’d be fine with ‘makeunique’ being either Bool or a function. That makes the signature simpler and would be more flexible. My implementation would easily adapt to that if that’s where you want to go.

nalimilan · 2023-08-12T13:47:53Z

I agree it seems simpler to use the existing makeunique argument for this. The name isn't ideal, but we have to keep supporting this argument anyway to avoid breakage.

bkamins · 2023-08-12T20:53:37Z

OK - so the change would be to allow passing a Function as makeunique kwarg in: joins, hcat[!] and insertcols[!]

Thank you!

jariji · 2023-08-12T21:13:16Z

Literally ::Function or is any callable ok?

bkamins · 2023-08-12T21:45:28Z

I preferred Function to be more strict (and it is easy enough to wrap anything into a function). Allowing any callable is potentially risky, as the argument would then have to be typed as Any, which would get in the way if in the future we had some value that we wanted to treat in yet another way.

Do you see some concrete example of a useful callable that is not a function? Note that this function requires 2 positional arguments to be passed (and typically, non-function callables accept only one anyway - of course, this is not a 100% correct rule either, but just an intuition)
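
To make the distinction concrete, here is a minimal sketch of a callable that is not a `Function`, together with the wrapping mentioned above. The type `PreferRight` is made up for this example.

```julia
# A callable struct: it can be called like a function,
# but its type is not a subtype of Function.
struct PreferRight end
(::PreferRight)(x, y) = ismissing(y) ? x : y

merger = PreferRight()

merger(1, missing)     # 1
merger isa Function    # false, so a `::Function` restriction would reject it

# Wrapping it in an anonymous function yields something that is a Function.
wrapped = (x, y) -> merger(x, y)
wrapped isa Function   # true
wrapped(1, 2)          # 2
```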

jariji · 2023-08-12T22:26:58Z

No concrete example, just a general style, but DataFrames.jl has its own style so I'm not too concerned about it. And I guess using Bool is already doing a bit of concrete-type dispatch anyway.

kescobo · 2023-08-13T01:37:28Z

One argument for a new keyword is that currently, makeunique only affects column names, not the columns themselves. If I understand correctly, when passing a function, you're altering the column values, leaving the names alone.

That said, I was recently wishing for this functionality, so 👍 from me!

Side note - maybe needs a different issue, but could also be an argument for reserving makeunique::Function - I also recently had something like the following crop up, and yearned for a way to provide a function for how to modify names when makeunique = true

julia> DataFrame(:x => rand(5), :x=>rand(5), :x_1=>rand(5), :x_1=>rand(5); makeunique=true)
5×4 DataFrame
 Row │ x          x_2          x_1       x_1_1
     │ Float64    Float64      Float64   Float64
─────┼──────────────────────────────────────────────
   1 │ 0.537446   0.704216     0.752834  0.837377
   2 │ 0.554652   0.582872     0.465181  0.00778113
   3 │ 0.231824   0.846138     0.145939  0.606117
   4 │ 0.427691   0.000135526  0.820666  0.430972
   5 │ 0.0992564  0.233478     0.510694  0.187573

It took me a while to sort out which columns were actually duplicated (this was in a table with thousands of columns - I know, not ideal). I would have liked to be able to do something like makeunique = (orig, dup) -> string(dup, "__", randstring(3))
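
The column names in the output above are consistent with a two-pass scheme: first occurrences keep their names, and each repeat then receives the first unused numeric suffix. The sketch below is a simplified illustration of that idea under the hypothetical name `dedup_names`; it is not the DataFrames.jl source and may differ from it in detail.

```julia
# Simplified sketch of a two-pass renaming scheme; not the DataFrames.jl code.
function dedup_names(src::Vector{Symbol})
    names = copy(src)
    seen = Set{Symbol}()
    dups = Int[]
    # Pass 1: keep the first occurrence of every name, remember later repeats.
    for (i, nm) in enumerate(src)
        if nm in seen
            push!(dups, i)
        else
            push!(seen, nm)
        end
    end
    # Pass 2: give each repeat the first free suffix _1, _2, ...
    for i in dups
        k = 1
        while Symbol(src[i], "_", k) in seen
            k += 1
        end
        names[i] = Symbol(src[i], "_", k)
        push!(seen, names[i])
    end
    return names
end

dedup_names([:x, :x, :x_1, :x_1])
# returns [:x, :x_2, :x_1, :x_1_1]
```

In this sketch the second `:x` becomes `:x_2` because `:x_1` is already taken by a real column, which is what makes the result hard to read back.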

bkamins · 2023-08-13T08:43:10Z

> I also recently had something like the following crop up, and yearned for a way to provide a function for how to modify names when makeunique = true

This is a good point. Maybe indeed we should add both, i.e. makeunique::Function to allow for modifying how duplicate column names are generated (I also had this requirement in the past).

And another kwarg for merging columns (which, if set, would disable makeunique). If we go this way then the question is if dupcol is a good name?

bkamins · 2023-08-24T10:12:30Z

@leei - why have you closed this PR? I thought we had a conclusion that this is a valuable addition (just the design should be a bit different).

leei · 2023-08-24T21:18:45Z

Strange. Must have happened automatically when I moved the commit to a new branch.

Not sure how to resolve this since I want to preserve the discussion but that previous pull is a bit wrongheaded. I have a new PR that contains different changes that overload the makeunique kw in the way described above.

I still have that commit around but I like the "overload makeunique" solution best that takes three kinds of args:

1. true/false to make a new column or raise an error,
2. a Function of two parameters that determines how to combine two columns to the target, and
3. one of a set of keywords that handle some of 1 and 2

The currently allowed keywords are:

:makeunique – create a new uniquely named column
:error – raise an Error
:update – update the left-hand column with non-missing values from the right
:ignore – ignore the duplicated column

Should I just add a new PR with the updated patch, restore the commit on main that this PR is based on, or both?

… columns in joins and DataFrame constructors

leei · 2023-08-24T22:29:19Z

The alternative implementation is #3373

bkamins · 2023-08-25T07:10:54Z

> I have a new PR that contains different changes that overload the makeunique kw in the way described above.

An alternative is to locally REBASE the PR and then force-push it to GitHub.

Having said that, what I would propose (in respect for your time) is that we first discuss the final design and then this design is implemented. This is the standard procedure that we always follow in DataFrames.jl (because there is usually a lot of discussion around every design decision). Also, if we have a design decision it does not mean that it would have to be "in full" implemented in one PR (it could, but it does not have to).

In particular what @kescobo commented on is that it would be good to mentally separate two operations:

if we keep all the columns and how we name them (this is currently handled by makeunique)
if we merge duplicate columns (this could be handled by a separate kwarg)

On the other hand maybe indeed what you propose is good. How I understand it is the following. We have only makeunique kwarg that accepts:

true - the same as :makeunique; (for back compatibility)
false - the same as :error; (for back compatibility)
:makeunique - generate a new unique name for the column
:error - error on duplicate
:update - merge columns with coalesce
:ignore - ignore duplicate column
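
One way to read the list above is as a normalization step that maps the back-compatible Bool values onto the symbols. The following is only a sketch of that reading; `normalize_dupaction` and `DUP_ACTIONS` are hypothetical names, not existing DataFrames.jl API.

```julia
# Hypothetical helper sketching the proposed mapping; not part of DataFrames.jl.
DUP_ACTIONS = (:makeunique, :error, :update, :ignore)

# Bool values are kept for back compatibility and mapped onto the symbols.
normalize_dupaction(x::Bool) = x ? :makeunique : :error

function normalize_dupaction(x::Symbol)
    x in DUP_ACTIONS || throw(ArgumentError("unsupported action: $x"))
    return x
end

normalize_dupaction(true)      # :makeunique
normalize_dupaction(false)     # :error
normalize_dupaction(:update)   # :update
```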

If this is the proposal maybe we could extend it with two extra options (non-conflicting with what you proposed):

:makeunique => function to rename columns when making them unique (this could be a separate PR as I am not currently clear how it would work in general - i.e. should we rename only duplicate columns or potentially all columns and what should be the input and output, maybe eg the signature (name, dup_count) -> tuple with dup_count values could be expected)
:update => function (a more general merging rule could be allowed by function - again - it could be a separate PR)

@leei, @nalimilan, @kescobo, @jariji - can you please comment on this. So that we have a clear vision where we go before moving forward with the implementation?

Thank you!

jariji · 2023-08-25T07:22:29Z

Can somebody give a motivating example for when I would want update or ignore? Also concerned that these options seem noncommutative.
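
The order dependence can be seen on plain vectors. This small sketch assumes the merge is done with `coalesce` applied row by row in argument order; it does not involve any DataFrames.jl call.

```julia
left  = [1, missing, 3]
right = [10, 20, missing]

# The first non-missing argument wins, so swapping the arguments changes the result.
coalesce.(left, right)   # [1, 20, 3]
coalesce.(right, left)   # [10, 20, 3]
```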

kescobo · 2023-08-25T11:21:57Z

This is my opinion, very lightly held. That is, I think both approaches are improvements, and neither is so obviously correct / better that I would argue strongly for it.

To "make (cols) unique" implies that you will end up with multiple columns. makeunique = :update as proposed seems like a logical contradiction.

Also, I can see situations where it might be useful to be able to use a function other than coalesce when merging. To flesh out the proposal a bit:

two separate kwargs, :makeunique is for retaining duplicate columns, mergeduplicates is for combining them (I'm not married to the name)
both can take a Bool or a function
both are false by default, and an error is thrown if duplicates are detected (current behavior)
only one can be other than false
for :makeunique, true uses current behavior and a function handles how columns are renamed.
for :mergeduplicates, true does coalesce and a function handles how items across duplicates are merged

I'm not totally clear on what the arguments and expected outputs for the function versions are, but my first instinct is:

:makeunique - colname and n as arguments, vector (or tuple) of length n as output.
:mergeduplicates - vector (or tuple) of values as argument, whatever as output (will become the eltype of the new column)
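
The two function shapes described above could look roughly as follows. Both `suffix_namer` and `first_nonmissing` are hypothetical examples written for this sketch; nothing here is an existing DataFrames.jl API.

```julia
# Naming: takes the column name and the number of duplicates n, returns n names.
suffix_namer(colname::Symbol, n::Integer) = [Symbol(colname, "_", i) for i in 1:n]

# Merging: takes the values found in one row across the duplicate columns
# and returns a single value.
first_nonmissing(values) = coalesce(values...)

suffix_namer(:x, 3)                 # [:x_1, :x_2, :x_3]
first_nonmissing((missing, 2, 3))   # 2
```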

leei · 2023-08-25T16:22:17Z

My original proposal was to replace `makeunique` with `dupcol`, which indicates the action when faced with duplicate columns as a result of the join. Logical, simple and can easily handle both keyword and Function arguments. Big con: Very disruptive of the existing API, since it extends the `makeunique` functionality and until `makeunique` is removed it’s possible to provide contradictory instructions e.g. `makeunique=true` and `dupcol=:update`.

While I agree that retaining `makeunique` and extending it to take `Bool`, `Symbol` and `Function` arguments is aesthetically dodgy, it’s much less disruptive of the existing API. If there was a simple way to deprecate such a universally used kw and replace it with one that’s better named, I’d be all for it.

As for the particular keyword choices: if I’ve got new data coming (e.g. from an external API call, say asking for records changed since the last update time) and I don’t know which are new and which are updates then a join with `makeunique=:update` is exactly what I need. In fact for that case I’d probably also want `outerjoin!`.

`makeunique=:ignore` is also pretty obvious to me in cases where I’m planning to ignore the duplicated columns afterward, esp. when I’m exploring calculations, hit an error from duplicated columns, and want to see my results before deciding what I really want to do.

To me there are a few options. Bite the bullet and introduce a new kw (`dupcol` or other name) with deprecation warnings for `makeunique` (i.e. this PR), or override `makeunique` and tolerate the aesthetics (i.e. the other PR). It is of course possible to introduce the new semantics on `makeunique` and then add a subsequent change to the kw name. Kevin’s is of course another take, since it doubles down on `makeunique`’s current meaning and retains it, while allowing for a `dupcol`-like extension.

To me having a single kw for “what do I do with duplicates” is a better option and adding a second kw that handles “how do I generate new column names” makes more sense.

And I do apologize for just going ahead and coding this w/o the design discussion. Different repos have different cultures. I need the functionality and created the fork for my own purposes (see the `:update` example above).

kescobo · 2023-08-25T20:07:20Z

> And I do apologize for just going ahead and coding this w/o the design discussion.

FWIW - I don't think anyone has a problem with this - it's mostly out of knowledge that your time is precious too, and no one wants you to do a bunch of coding that ends up getting tossed in the bin because folks disagree with the design. If you needed / wanted this anyway, I don't think there's anything wrong with providing a concrete example as an opening salvo, as long as you won't be offended if the consensus lands in a different place (and it doesn't seem like you will be so 👍 )

nalimilan · 2023-08-27T19:59:51Z

It's a good point that we may want to allow passing a function to makeunique in the future to generate custom unique names. Given this, @kescobo's proposal of having two separate arguments which cannot be set at the same time sounds like a good solution. Then dupcols/mergeduplicates can be nothing (default) or any function: passing coalesce is simple enough for the common case.

bkamins · 2023-08-29T18:02:28Z

> it's mostly out of knowledge that your time is precious too, and no one wants you to do a bunch of coding that ends up getting tossed in the bin because folks disagree with the design

100% agreed and this is what I meant.
I want to respect your time as I know (and I really do 😄) how much time coding things in DataFrames.jl takes.

> To me having a single kw for “what do I do with duplicates” is a better option and adding a second kw that handles “how do I generate new column names” makes more sense.

Let me summarize the requirements we have. The options we need to support are

error on duplicate (current: makeunique=false)
autogenerate new column name on duplicate (current: makeunique=true)
autogenerate new column name on duplicate while allowing to customize the way the new column is named (current: not supported)
merge duplicate column with the existing column (current: not supported), with particular cases:
- replace only missing, achieved by the coalesce function
- keep the old column, achieved by the (x, y) -> x function

Now the question is how to handle this with keyword arguments. We have the following options:

1. squeeze all the options into makeunique (this is doable, I described in #3366 (comment) how this can be done)
2. use two separate kwargs: a) makeunique with current syntax, potentially in the future allowing function that would generate the name of the new column; b) dupcols/mergeduplicates that would signal that column merging is required (and then makeunique is ignored)
3. use two separate kwargs: a) makeunique that handles all cases like in point 1, except function for generation of new column names that would be a separate kwarg.

So my opinion is that I would not do point 3. For me doing 1 or 2 is OK. I have a slight preference for point 1 (all in makeunique) as this will be probably simpler for users to learn, but 2 is also OK.

nalimilan · 2023-08-30T07:21:23Z

> squeeze all the options into makeunique (this is doable, I described in #3366 (comment) how this can be done)

I'm not a fan of the :makeunique => function and :update => function approach. It works, but it's an unusual pattern which in effect recreates two keyword arguments inside a single keyword argument. So I'm more in favor of 2 (unless we can find another trick to use a single argument).

kescobo · 2023-08-30T12:55:20Z

Let me propose a (4) that is perhaps the worst of all worlds 😅. It's kind of merging @bkamins 1/3, but addresses @nalimilan 's concern

as with point 1, decisions about what to do are handled by makeunique, eg rename duplicates or merge them, with the proposed defaults.
a new kwarg (handledups?) that determines how that gets done. In the case of renames, it's a function that handles the renaming, and in the case of merging it's a function that handles the column merge

bkamins · 2023-08-30T13:23:25Z

> a new kwarg (handledups?) that determines how that gets done.

This is something I think we should not do as the meaning of this kwarg would change depending on other kwarg. This is something that we should avoid as writing makeunique=x, handledups=y when x and y are variables cannot be interpreted statically (i.e. you need to know the value of x to know what y means)

Given @nalimilan comment let me propose the following rules:

makeunique stays as it is for now. It allows only Bool for now. In the future (it can be this PR or some other PR - no need to rush with this) we can add support for passing a Function that would change the way the column names are generated when a duplicate is encountered. I recommend a separate PR, as getting this right is very complex; see the https://github.com/JuliaData/DataFrames.jl/blob/main/src/other/utils.jl#L77 implementation. The point is that when you generate a deduplicated name it then might itself generate a duplicate, so essentially the de-duplicating function would need to take a whole vector of names and deduplicate it as a whole: name deduplication cannot be done locally on a single column name level.
new mergeduplicates kwarg (I prefer this name as it is explicit what the kwarg does). This kwarg accepts either nothing (when it is ignored and makeunique is respected) or alternatively user can pass a Function. In this case makeunique must be false (i.e. passing makeunique=true will error). The function must take a vararg argument and return a single value. Example implementations would be:
- coalesce: returns first non-missing value (or missing if all are missing)
- first∘tuple: the first duplicate column
- last∘tuple: the last duplicate column
- mean∘tuple: mean of duplicate columns

Note that it is crucial that the function accepts more than 2 arguments as in some cases we can have data that introduce more than 2 duplicate columns.
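
To show what such vararg functions do with more than two duplicates, the sketch below applies the example functions row by row to three plain vectors using broadcasting. It only uses Base and the Statistics standard library; the proposed mergeduplicates kwarg itself is not involved, and how it would call the function is an assumption here.

```julia
using Statistics  # for mean

a = [1, missing, 3]
b = [missing, 2, 30]
c = [100, 200, 300]

# Broadcasting calls each vararg function once per row with three arguments.
coalesce.(a, b, c)          # [1, 2, 3]
(first ∘ tuple).(a, b, c)   # [1, missing, 3]
(last ∘ tuple).(a, b, c)    # [100, 200, 300]

# mean over three duplicate values per row
(mean ∘ tuple).([1, 2], [3, 4], [5, 6])   # [3.0, 4.0]
```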

Also when implementing mergeduplicates we need to precisely document how it works (and there will be two modes):

process all data at once (e.g. DataFrame(x=1,x=2,x=3, makeunique=mean∘tuple) will work this way)
process data pairwise (e.g. currently hcat(df, df, df, makeunique=mean∘tuple) will work this way; the same with joins)

Fortunately for coalesce, first∘tuple and last∘tuple, which are probably the most common, both modes produce the same result.
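
The difference between the two modes can be checked on scalar values. In this sketch, "pairwise" is assumed to mean a left fold over the duplicates; the helper names `all_at_once` and `pairwise` are made up for the illustration.

```julia
using Statistics  # for mean

# All duplicates passed at once versus folding two at a time, left to right.
all_at_once(f, vals...) = f(vals...)
pairwise(f, vals...) = foldl(f, vals)

m = mean ∘ tuple
all_at_once(m, 1, 2, 3)   # 2.0
pairwise(m, 1, 2, 3)      # 2.25, i.e. the mean of (mean of 1 and 2) and 3

# coalesce, first∘tuple and last∘tuple give the same answer in both modes.
all_at_once(coalesce, missing, 2, 3) == pairwise(coalesce, missing, 2, 3)    # true
all_at_once(first ∘ tuple, 1, 2, 3) == pairwise(first ∘ tuple, 1, 2, 3)      # true
all_at_once(last ∘ tuple, 1, 2, 3) == pairwise(last ∘ tuple, 1, 2, 3)        # true
```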

What do you think?

kescobo · 2023-08-30T19:23:20Z

Makes sense to me 👍

behaviour to match bkamins comment in JuliaData#3366 Is now used to pass a Function to handle cases where makeunique=false by combining those values (passed as parameters) into a returned result.

leei · 2023-09-19T17:55:23Z

Just updated this PR to conform to @bkamins proposal above.

bkamins · 2023-09-25T20:39:53Z