Policy regarding in-place operations #1695

Closed
nalimilan opened this issue Jan 24, 2019 · 99 comments

@nalimilan
Member

Some in-place operations like append! operate in-place on column vectors, but others like allowmissing! replace vectors. We should check whether this difference is justified and whether it is the most useful behavior.

For example, it might not be very useful for append! to resize column vectors in-place, since that does not save any memory and prevents it from changing the column eltypes. OTOH that could be a useful difference from vcat if one wants to preserve the original eltypes.
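To make the trade-off concrete, here is a minimal sketch that uses plain Base vectors as stand-ins for column vectors (an assumption for illustration; it does not call DataFrames.jl). Resizing in place keeps the element type, while vcat allocates a new vector and can promote it.

```julia
# A plain vector stands in for a data frame column (illustrative assumption).
col = [1, 2, 3]                 # element type is Int

append!(col, [4, 5])            # resizes `col` in place; the eltype stays Int
# append!(col, [1.5]) would throw an InexactError, since 1.5 cannot be stored as an Int

promoted = vcat(col, [1.5])     # allocates a new vector with a promoted eltype
```

The in-place variant preserves the original eltype at the cost of rejecting values that do not fit, which is the difference from vcat mentioned above.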

@nalimilan nalimilan added this to the 1.0 milestone Jan 24, 2019
@bkamins
Member

bkamins commented Feb 1, 2019

Today I hit a problem with dropmissing! related to this. I created a dataframe with a subset of two columns to perform pairwise deletion of missings and it corrupted the original wide DataFrame. I knew what to do (at least I hope I did 😄), but probably newcomers to DataFrames.jl will be caught off-guard here unless we properly document it.

@nalimilan
Member Author

A possible idea: use a ! suffix for operations that mutate the data frame but not the column vectors (the majority), and a !! suffix for operations that also mutate the vectors (which are not that common, and often trap users who don't want to mutate column vectors). That would be explicit, consistent, and still relatively convenient.

@bkamins
Member

bkamins commented Feb 11, 2019

In general silently mutating vector type except for df.x = something (in this case it is obvious it can be mutating) should be signaled to the user so !! seems reasonable. As I have said in #1716 an implicit type mutation without signaling can lead to nasty bugs IMO.

@pdeffebach
Contributor

Continuing from #1716

So is your proposal to make push! do autopromotion (and create a new vector) and push!! mutate the vector in place?

This kind of confusion shows why this kind of syntax-y solution is less than ideal. Though I do think it's interesting to think of a syntax that clearly shows the nested structure of a DataFrame.

@nalimilan
Member Author

So the idea was that a single ! only mutates the data frame, so it can do anything to the object (including replacing column vectors), but !! can also mutate the column vectors (so it can't change their type). That's not really in contradiction with Base, since the single ! version still doesn't change the type of the data frame, just like push!(Int[], 1.0) converts 1.0 to Int since the type cannot change.
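The Base behavior referred to here can be checked directly. This is a small sketch using only Base; the variable names are illustrative.

```julia
v = Int[]
push!(v, 1.0)        # 1.0 is converted to Int because the container type is fixed

# A value that cannot be represented as an Int is rejected instead of
# changing the vector's type.
threw = try
    push!(Int[], 1.5)
    false
catch err
    err isa InexactError
end
```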

That said, I'm not saying that's the best solution.

@bkamins
Member

bkamins commented Feb 11, 2019

I agree that there are pros and cons so my idea was to list all the affected functions and then see how many changes we would have to make.

@bkamins
Member

bkamins commented Feb 13, 2019

@nalimilan Given the discussion we had yesterday with @oxinabox I think I can explain both arguments behind !! briefly:

  • what I and @oxinabox initially thought is that !! signals a more dangerous thing than !; with this reasoning a more dangerous thing is to mutate vectors within a data frame than to replace them by some other vectors (why: because these vectors can be used somewhere else and mutating them might break something)
  • what @nalimilan proposes is ! mutates only data and !! mutates data and metadata

Probably both reasonings are valid, but I just wanted to spell out the difference between them.

Now - regarding mutating vectors in place. Here we have a double risk after the recent changes that allow propagating @inbounds in getindex. Why? Because we check vector lengths only when a column is added or the data frame is created. Later we only check the length of the first column and assume that all the others match. Without propagating @inbounds this is OK, but once we started propagating it you can get segfaults in strange places (and I guess @oxinabox can confirm this 😄 - although this is probably not the case he has encountered). These are of course due to an error on the programmer's side, but the error might be tricky to spot.

@oxinabox
Contributor

oxinabox commented Feb 13, 2019

I guess really there are 4 kinds of things

  • 0 Operations that just add columns. These are always safe
  • 1 basic mutating options that just change the values in existing columns. These are always safe
  • 2 operations that change the size of existing columns
  • 3 operations that replace columns with new columns at same index

2 and 3 are both dangerous.

I wonder if we don't actually want a new Array type that is guarded against mutation except by a particular set of methods that are not exported by DataFrames.

So no one can actually mutate columns except via intended interfaces.

@nalimilan
Member Author

nalimilan commented Feb 13, 2019

  • what I and @oxinabox initially thought is that !! signals a more dangerous thing than !; with this reasoning a more dangerous thing is to mutate vectors within a data frame than to replace them by some other vectors (why: because these vectors can be used somewhere else and mutating them might break something)
  • what @nalimilan proposes is ! mutates only data and !! mutates data and metadata

@bkamins I don't think that's accurate. At least in my comment above I (meant to) propose the first interpretation. But I think @oxinabox said the reverse on Slack.

I wonder if we don't actually want a new Array type that is guarded against mutation except by a particular set of methods that are not exported by DataFrames.

@oxinabox I'm not sure that would be a good protection nor convenient.

  • Good protection: even if we wrap all columns in a special array type, users can always resize the original array. And actually resizing the vectors isn't the worst possible operation, since at least you're likely to get an error at some point: but if you sort the values and still use that vector in another data frame, the consistency of observations will be silently lost.
  • Convenient: you would never be able to push! or append! to a data frame, except using special functions; but we don't need to use a custom array type for that, we can just do what we want with/in our DataFrame methods. And returning these custom array types from e.g. df.col would be weird for users.
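The silent-corruption scenario from the first bullet can be reproduced without DataFrames.jl. In this sketch two NamedTuples of vectors stand in for two data frames that share a column; this is an illustrative stand-in, not how DataFrames.jl stores its columns.

```julia
id    = [1, 2, 3]
score = [30, 10, 20]

# Two "tables" that share the same `score` vector (no copy is made).
wide   = (id = id, score = score)
narrow = (score = score,)

sort!(narrow.score)     # in-place sort through one table...

# ...silently reorders the column seen by the other one, so the rows of `wide`
# no longer pair each id with its original score.
rows = collect(zip(wide.id, wide.score))
```

No error is raised at any point, which is why this case is harder to notice than a resize that leaves columns with different lengths.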

I've had a look at our current mutating methods. Here are a few thoughts about what would happen if we required a !! for methods which mutate column vectors:

  • For allowmissing!, disallowmissing! and categorical!, this is perfectly fine since by definition these methods replace some columns with new vectors.
  • For deletecols!, insertcols!, permutecols!, names! and rename! it's also fine since these affect the data frame but not the vectors.
  • For push! and append! it would be both a good and a bad thing. Good, since it would allow changing the type of the columns if needed using promote_type, like vcat. Bad, since by default we would make a copy on each addition. Since that's not consistent with what happens for vectors, people won't necessarily think about push!! and append!!.
  • For filter!, deleterows!, unique! and dropmissing! this convention would also be a bit annoying but safer, since they would replace column vectors with copies before mutating them. Variants with !! would be needed to actually mutate vectors.
  • For sort! it would be quite weird to allocate new vectors before sorting them. I guess we could say that ! functions are allowed to mutate column vectors as long as they don't resize them. But even that is very dangerous if these vectors are used in another data frame (and you may well not notice it).

Overall I guess the main argument in favor of changing the current behavior would be to make it less appealing to use in-place operations (in the strong sense of operations that mutate column vectors). Indeed currently it looks nicer to do sort!(df, :x) than df = sort(df, :x). OTOH it's one of the strengths of Julia that you can easily avoid copies (and worse, the compiler isn't as good as R currently if you don't use in-place operations since it doesn't do copy-on-write to avoid copies where possible).

Crazy idea: maybe we could have a global reference count keeping track of the column vectors that are currently used by a data frame (which would be automatically updated by a DataFrame finalizer). When doing operations which would mutate column vectors used elsewhere, we could throw an error, or at least print a warning to protect against data corruption (safety matters a lot, and in R it's kind of implicit due to copying semantics). Of course, that wouldn't protect you from things like sort!(df.x), but it's more obvious that this kind of thing is a bad idea anyway.

@bkamins
Member

bkamins commented Feb 13, 2019

First let me reply to @oxinabox:

  • 0 Operations that just add columns. These are always safe

The problem is that these operations are chiefly setindex! and setproperty!, and they can either add a column or mutate an existing column; you only know which of the two they do at run time.

  • 1 basic mutating options that just change the values in existing columns. These are always safe

They are not always safe if you have created a GroupedDataFrame based on some AbstractDataFrame.

I wonder if we don't actually want a new Array type that is guarded against mutation

Actually it was recently proposed, https://github.com/bkamins/ReadOnlyArrays.jl, but did not get much traction (so for the time being this idea is on hold). Although we intended to use it for a bit different purpose.


Now @nalimilan:

At least in my comment above I (meant to) propose the first interpretation.

I think it is best to discuss concrete methods (what you do below), as this avoids ambiguity.

Looking at your list I think we can stay with only ! (as most of the time the conclusion is that !! would be annoying) and concentrate on improving the documentation and add methods without ! (so that there is always an option to do a super-safe thing). In particular:

  • For allowmissing!, disallowmissing! and categorical!: add information in the documentation that they create new vectors ONLY IF NEEDED; add methods for allowmissing, disallowmissing, categorical that support AbstractDataFrame as argument (and decide if they should do a copy when this operation would be a no-op or not).
  • For deletecols!, insertcols!, permutecols!, names! and rename!: decide if we want to have deletecols, insertcols, permutecols (we already have rename).
  • For push! and append!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; I would promote using vcat for other use-cases (EDIT we should add required functionalities to vcat and reference it in the documentation; the decision is if we want to add push method that is like vcat but adds one row)
  • For filter!, deleterows!, unique! and dropmissing!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; I would add deleterows method for completeness (all other methods we already have)
  • For sort!: I would leave them as is and ADD A STRONG WARNING in the documentation about risk of using them; we have sort variant already

Whenever I say ADD A STRONG WARNING this should include:

  • explaining the risks
  • pointing at methods without ! as safe alternatives (that is why I proposed to add them everywhere where they are missing)

maybe we could have a global reference count keeping track of the column vectors that are currently used by a data frame (which would be automatically updated by a DataFrame finalizer).

I do not think this would work well as there is no guarantee when GC runs the finalizer.

@nalimilan
Member Author

Unfortunately docs can only help mitigate issues, but they are not really a fix (and many users don't read docstrings until they have a problem). The fact that you've been trapped by dropmissing! shows that even knowing the problem doesn't fully protect from it. I'm particularly concerned by sort!, which could give a bad reputation to Julia if some people screw up their analyses with that.

Another crazy idea: change df[cols] to copy the column vectors by default, and require @view/view to get a SubDataFrame. That would be consistent with what getindex does with arrays (remember that the design of DataFrames predates the existence of array views in Julia).
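For reference, this is the array behavior the proposal would mirror, shown with Base only (the variable names are illustrative): indexing with a range copies, while @view returns an object that aliases the parent.

```julia
a = [10, 20, 30, 40]

copied = a[1:2]          # getindex with a range allocates a new vector
viewed = @view a[1:2]    # a SubArray that aliases `a`

a[1] = 99                # later mutation of the parent...
# ...is visible through `viewed` but not through `copied`
```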

@bkamins
Member

bkamins commented Feb 13, 2019

I agree and we have two ways to go here:

  1. what I have proposed above: define functions without ! for every operation and advocate using them; have functions with ! and warn users not to use them unless they know what they are doing
  2. the !! option: make functions without ! and with ! safe (i.e. would not mutate vectors, ! works in the same DataFrame but creates new vectors) and !! variants would mutate vectors

Actually looking at this and thinking about it more - I start to like !! option after all as !! is a really clear sign of something dangerous. So +1 for your proposal, even if it would be unwieldy in some cases.

We still have another issue here (that I have sprinkled in several places in my comments): some functions (e.g. disallowmissing; EDIT: corrected from dropmissing) return the original vector when no action is needed and create a new vector in other cases (this is the convert vs constructor distinction in Base). Any thoughts about this issue (apart from the fact that it should be clearly documented)? If we go the way I feel we started moving towards, they should always create a new vector if they create a new data frame.

Another crazy idea: change df[cols] to copy the column vectors by default

If we wanted to go this way actually I would do the following (we could also skip step 2 and stay at step 1 - but this is a long term decision):

  1. In the first step disallow df[col] and df[cols] indexing altogether (and retain only df.col to get a source vector and df[:, col] and df[:, cols] and @view)
  2. In the second step (probably after at least 1 year) reintroduce them as df[row] or df[rows] (so in the long term single index would select rows not columns which would be consistent with row-based interpretation of a DataFrame)

@bkamins bkamins pinned this issue Feb 13, 2019
@nalimilan
Member Author

I agree and we have two ways to go here:

1. what I have proposed above: define functions without `!` for every operation and advocate using them; have functions with `!` and warn users not to use them unless they know what they are doing

2. the `!!` option: make functions without `!` and with `!` safe (i.e. would not mutate vectors, `!` works in the same `DataFrame` but creates new vectors) and `!!` variants would mutate vectors

Actually looking at this and thinking about it more - I start to like !! option after all as !! is a really clear sign of something dangerous. So +1 for your proposal, even if it would be unwieldy in some cases.

I didn't say I'm completely convinced by my proposal. ;-) Let me think about it...

We still have another issue here (that I have sprinkled in several places in my comments): some functions (like e.g. disallowmissing) return the original vector when no action is needed and create a new vector in other cases (this is a convert vs constructor distinction in Base). Any thoughts about this issue (apart from the fact that it should be clearly documented). If we go the way I feel we started moving towards - they should always create a new vector if they create a new data frame.

Yes, disallowmissing should probably do that by default when we introduce it (#1720).

If we wanted to go this way actually I would do the following (we could also skip step 2 and stay at step 1 - but this is a long term decision):

1. In the first step disallow `df[col]` and `df[cols]` indexing altogether (and retain only `df.col` to get a source vector and `df[:, col]` and `df[:, cols]` and `@view`)

2. In the second step (probably after at least 1 year) reintroduce them as `df[row]` or `df[rows]` (so in the long term single index would select rows not columns which would be consistent with row-based interpretation of a `DataFrame`)

That's indeed appealing. But we need to provide a syntax to access a vector without copy when it's not a literal symbol: is getproperty(x, col) OK? Or should we export another function?

@bkamins
Member

bkamins commented Feb 13, 2019

getproperty(x, col) OK? Or should we export another function?

We could add getcol for example, but I am not sure if actually we need it as we have getproperty as you have noticed.

This is needed also in cases when you want to have a programmatic access to the column (e.g. in a function when column name is passed as a Symbol via a parameter).
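A small sketch of that programmatic case, using a NamedTuple of vectors as a stand-in for a data frame (an assumption for illustration; `first_value` is a hypothetical helper, not part of any package):

```julia
# A NamedTuple of vectors stands in for a data frame (illustrative assumption).
table = (x = [1, 2, 3], y = ["a", "b", "c"])

# Hypothetical helper: the column name arrives as a Symbol at run time,
# so the literal `table.x` syntax is not available inside the function.
first_value(tbl, col::Symbol) = first(getproperty(tbl, col))

# getproperty returns the stored vector itself, not a copy.
same_vector = getproperty(table, :x) === table.x
```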

@bkamins
Member

bkamins commented Feb 13, 2019

If we introduce a rule that transformation of AbstractDataFrame that produces a DataFrame never reuses the column from the source actually the need for !! is a bit lower I think as the probability that two data frames will share the same underlying vectors is low.

Note that you still will be able to reuse columns, but this will require a special treatment, in which case I guess the user will know what one is doing, e.g. this would still reuse the columns from source:

DataFrame(eachcol(df, false), names(df))

@bkamins
Member

bkamins commented Feb 13, 2019

If we introduce a rule that transformation of AbstractDataFrame that produces a DataFrame never reuses the column

In particular we should remember to decide about copy(::DataFrame) and DataFrame(::DataFrame).

@bkamins
Member

bkamins commented Feb 17, 2019

Just to recap my understanding of the desirable policy from Slack:

  1. we introduce the concept of column ownership by a DataFrame which means that DataFrame assumes that it can safely mutate the object as long as it stays internally consistent;
  2. we do not perform copy-on-create of vectors passed to a DataFrame for performance reasons; from the moment of its creation a DataFrame object considers itself the owner of its columns
  3. when SubDataFrame, DataFrameRow or GroupedDataFrame is created they are not owners and may be corrupted if the owner is mutated (conclusions: users should be warned not to mutate parent DataFrame if they use one of these views). In practice this means the views should be used only for short-term operations and not passed around; also view creation is now fully flexible and fast in DataFrames.jl so they are the preferred method to subset a DataFrame if you care about performance for short-term operations.
  4. If one uses eachcol one has to be careful not to mutate the columns one gets (especially not to resize them). The same with direct column access via getproperty
  5. finally if an operation that takes an AbstractDataFrame creates a fresh DataFrame it always allocates fresh columns (so that the newly returned DataFrame can assume that it has the ownership of its columns) (edited)
  6. we could drop the df[col] and df[cols] column getindex/setindex! methods as they are dangerous (and leave only getproperty and setproperty!) - ADDITION after thinking about it I think we can leave them as they might be useful in some cases, as long as they behave as expected (i.e. df[cols] performs a copy of the columns)
  7. ADDITION - if we go this way then we do not need a ! and !! distinction I think.
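A rough sketch of how rules 2, 4 and 5 fit together, using a hypothetical toy type. `ToyFrame`, `column` and `select_columns` are invented here for illustration only and are not the DataFrames.jl implementation or API.

```julia
# Hypothetical toy type, NOT the DataFrames.jl implementation.
struct ToyFrame
    columns::Dict{Symbol,AbstractVector}
end

# Rule 2: the constructor takes ownership of the vectors it is given (no copy).
ToyFrame(; cols...) = ToyFrame(Dict{Symbol,AbstractVector}(cols))

# Rule 4: direct column access hands out the owned vector itself.
column(tf::ToyFrame, name::Symbol) = tf.columns[name]

# Rule 5: an operation that returns a fresh frame allocates fresh columns.
function select_columns(tf::ToyFrame, wanted::Vector{Symbol})
    ToyFrame(Dict{Symbol,AbstractVector}(n => copy(tf.columns[n]) for n in wanted))
end

x = [3, 1, 2]
tf = ToyFrame(x = x, y = [1.0, 2.0, 3.0])
sub = select_columns(tf, [:x])

sort!(column(sub, :x))   # mutating the fresh frame leaves `tf` untouched
```

Under these rules the only way to corrupt `tf` is to mutate a vector obtained directly from it (here `x` or `column(tf, :x)`), which is the situation rule 4 warns about.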

I am summarizing this again, because I think we will not get much new input in this discussion and we should decide if we should go this way (then I will, in particular, update #1646 and a PR cleaning up whole code-base to follow "ownership" rule should be implemented).

@pdeffebach
Contributor

Regarding point 6:

6. we could drop the df[col] and df[cols] column getindex/setindex! methods as they are dangerous (and leave only getproperty and setproperty!) - ADDITION after thinking about it I think we can leave them as they might be useful in some cases, as long as they behave as expected (i.e. df[cols] performs a copy of the columns)

I think we should either deprecate df[cols] (I admit I can get used to df[:, col]) or keep it, but I think having df[col] and df.col behave differently, with one copying and one not, would be confusing for the user.

@bkamins
Member

bkamins commented Feb 17, 2019

If we left df[col] I think it would do the same as df.col - i.e. return the vector without copying it (except that then col in df[col] can be an integer or any symbol). This is consistent with the rules above as df[col] returns a vector, not a DataFrame.

@bkamins
Member

bkamins commented Feb 20, 2019

If we go towards the "ownership" rule we should make sure how we handle stackdf and meltdf.

@bkamins
Member

bkamins commented Feb 20, 2019

Regarding earlier issues, to sum up what I think. We would have:

  1. df[col] that returns a vector without a copy
  2. df[cols] that returns a DataFrame making a copy (because it returns a DataFrame) (change)
  3. sub_df[col] that returns a view
  4. sub_df[cols] that returns a SubDataFrame (so no copy is made as SubDataFrame does not claim to have an ownership)

And we would have an inconsistency between points 2 and 4. If this is OK, then we can leave this syntax. Any thoughts on this?

If in the future GroupedDataFrame would be <:AbstractDataFrame we will have to make similar decisions there.

@pdeffebach
Contributor

We also have some wiggle room with df[:, cols] vs. df[cols]. Perhaps one could be a copy and one would not? Or one be a sub-dataframe and one not?

@bkamins
Member

bkamins commented Feb 20, 2019

Thank you for the comments (this issue has been bugging all of us for some time, with no ideal solution in sight 😄).

Perhaps one could be a copy and one would not?

This is the current design and it is fully mentally consistent. The problem is that it is unsafe, and even I (who implemented this) keep forgetting that df[cols] is dangerous.

Or one be a sub-dataframe and one not?

Making df[cols] return a SubDataFrame is a possible solution to our problem. This would be a second exception in pair with df[row, cols] which also returns a view. Then only df[rows, cols] would return a DataFrame. However, I think that this is more risky than defining that df[cols] is equivalent to df[:, cols]. @nalimilan - what is your opinion?

@nalimilan
Member Author

I'd find it a bit weird to have df[cols] return a SubDataFrame. The df[row, cols] exception is really just because technically we cannot make it fast with named tuple, better not add more exceptions. Also we already have view to get a SubDataFrame.

@bkamins
Member

bkamins commented Feb 20, 2019

So do we make df[cols] make a copy or better deprecate it altogether? (Making a copy is safe and convenient, but deprecating it will give us more freedom for the future after the 1.0 release.) I would vote for making a copy.

@nickeubank
Contributor

Sorry, overlapping comments. But re: @bkamins

Do you have any other relatively simple rule that we could define? (I initially wanted DataFrame constructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did).

As I was trying to say but maybe wasn't coherent in saying, I don't think that a constructor feels like "working with individual columns", so I don't think it's a violation to make the copy by default. One could do a one column constructor (DataFrame(x=1:20)), but in general constructors are for many-column operations, so I don't think they violate the rule.

Or if we framed differently, maybe the rule can be "functions that can only get or set a single column never copy". Since DataFrame can set more than one column at once, we're good.

@pdeffebach
Contributor

Regarding setcol: setcol(df, c::Symbol) = a isn't valid syntax (it defines a setcol method), so we have to use setindex! and setproperty!.

Right, sorry about that. I get that setcol is an unnecessary function. (Though if we eventually add a mutate function we will have to revisit this.)

I think my point still stands, though, about unnecessary intermediate copying. I wonder if

  • select(df, col) should never make a copy.
  • setindex! should always make a copy.

This way, the only time a user would need to worry about messing up a data frame is when they venture out of a data frame. In which case we assume they know what they are doing.

x = df.a
filter!(!ismissing, x)  # the predicate is illustrative; any filter! call may resize df.a in place

@bkamins
Member

bkamins commented Mar 23, 2019

setindex! should always make a copy.

I say no - this would be a performance problem.

select(df, col) should never make a copy.

This is I guess what the consensus is now (i.e. it returns a single vector that is not a copy)

"functions that can only get or set a single column never copy"

this is a good rule.

So I update my summary above to reflect @pdeffebach and @nalimilan recommendations (i.e. DataFrame copies by default, and we drop getcol idea)

@nickeubank
Contributor

setindex! should always make a copy.
I say no - this would be a performance problem.

I agree -- it's a ! function, so I think it's ok to not copy.

@bkamins
Member

bkamins commented Mar 23, 2019

I have updated #1695 (comment).
The only thing we still need to decide is how do we want users to express the operation:

df.x = y

if x is not a valid identifier (so that df.x will not parse properly). The operation should make column :x in df get value y without copying it.

It could be either:

  • setcol!(df, :x, y) (essentially insert_single_column! function exposed - maybe with some tweaks)
  • defer this to a more general mutate! function (that will probably allow to insert multiple columns using e.g. Pairs)

@bkamins
Member

bkamins commented Mar 23, 2019

Just to be clear, we need a substitute for df.x = y in case x is not a valid identifier because we will deprecate df[:x] = y.

@pdeffebach
Contributor

I would really like it to look as close to df.x = y as possible. I don't want users to be like "well, I would put this in a function, but then I have to change and use some different notation". Going from df$col to df[[x]] is a big annoyance for me in R.

I propose df.sym(x) where sym(c::Symbol) makes a custom type we can dispatch on. Or use ^ instead of sym, or some other valid unicode character.

Ultimately this will make the logic a lot easier as well, as you have dot syntax for non-copying and then some other function syntax for copying. And the logic is preserved with or without literals.

@pdeffebach
Contributor

It's also important to realize how much of a pain parentheses are. Something like setcol!(df, x, y) makes lengthy map calls and array comprehensions more annoying.

@nickeubank
Contributor

Ugh bummer. I really hate that the df.x behavior will depend on whether :x is already defined. pandas has that problem and it drives me nuts, though at least here it won’t be a silent failure (in pandas it “works” in that it just sticks y into a property, but doesn’t make a column).

No hopes of preserving that behavior with the df.x=y syntax with some clever syntactic sugar behind the scenes?

@bkamins
Member

bkamins commented Mar 23, 2019

df.x = y will work also when :x is not defined and then the column will be added. Of course df.x will fail if :x is undefined for reading.

We are only talking about the case when you have a column name that is an invalid identifier. So here all stays unchanged, but because we want to deprecate df[:x] syntax for reading and writing we should define some other means to get column and set column using a function.

For getting a column we have already settled for select(df, :x) syntax.

What is left is the syntax for setting the column. To my understanding we have two natural choices:

  • a function working on a single column, like setcol!(df, :x, v)
  • a more general function possibly changing many columns, e.g. mutate!(df, :x=>v1, :y=>v2), where the single column operation is a special case

@nickeubank
Contributor

nickeubank commented Mar 23, 2019

Oh -- ok, sorry, please excuse my ignorance. What is an example where we have an invalid identifier? Do you mean something like Symbol("this has space") where you can't use the df-dot (df.) notation because julia can't parse df.this has a space?
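As a minimal illustration of that case, here is a sketch with a Base NamedTuple standing in for a data frame (an assumption for illustration; the names are made up). The dot syntax is unavailable for such a name, but the functional form still works.

```julia
# A column name that is not a valid identifier.
name = Symbol("this has space")

# A one-column NamedTuple stands in for a data frame (illustrative assumption).
table = NamedTuple{(name,)}(([1, 2, 3],))

# `table.this has space` does not parse as a property access,
# so the column is fetched through getproperty instead.
col = getproperty(table, name)
```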

In that case, I'm fine with setcol!. Seems descriptive and simple.

[edited for clarity]

@bkamins
Member

bkamins commented Mar 23, 2019

Your understanding is correct 😄.

In fact setcol! would be defined very simply (if we decided to add it):

setcol!(df::DataFrame, col, v) = insert_single_column!(df, col, v)

so we are essentially exposing an already existing inner function.

My only hesitation is related to the decision that we might want to introduce mutate! function, in which case setcol! would be redundant (@nalimilan - any opinion here)?

@nalimilan
Member Author

This is exactly the rule I have in mind - and 3 follows this rule. Do you have any other relatively simple rule that we could define? (I initially wanted DataFrame constructor to make a copy by default but it would be against this rule so that is why I have proposed 3 the way I did). Also - the general thinking should be that DataFrame constructor normally does not copy. The constructor DataFrame(::DataFrame) in practice will be almost never used in "standard" data analysis workflows I think.

It would be even simpler to say "constructors do not copy columns". If we don't have common use cases for DataFrame(::DataFrame), maybe that would be better.

There was some discussion to make select(df, col) return a single column DataFrame (so that select always returns a DataFrame and in this way users can safely use it in their data pipelines). That is why I initially proposed a separate function. But if we agree that select(df, col) returns a single column I will update my post above to reflect this. (my only objection was that then user has to learn that select returns different objects depending on the type of its argument and also depending on this type it either makes a copy or not - this might be confusing for newcomers).

Yes, it may be unexpected for select to return a single column given that it's a term taken from SQL. Another argument for select(df, col) to return a DataFrame would be to also allow select(df, col1, col2) and things like select(df, :col1, z = :col2 => -), for consistency with by/combine and a possible mutate!/transform!.

Anyway this issue is relatively self-contained, as it just concerns convenience methods for getproperty and setproperty!. We should probably start without these, and discuss this in a dedicated issue (this one is already quite long), trying to synchronize with other table packages if possible.

@oxinabox
Contributor

If we don't have common use cases for DataFrame(::DataFrame)

DataFrame(adf) is the function one calls, when adf is a SubDataFrame and you want to get back something that you can mutate safely without changing the original.
It would be weird if that was copying for some AbstractDataFrames, and not copying for others.

@bkamins
Member

bkamins commented Mar 24, 2019

Anyway this issue is relatively self-contained

Agreed. I have written down in the "main" post my current thinking, but let us discuss it in a separate thread (I will open it when we have finished underlying infrastructure implementation)

It would be weird if that was copying for some AbstractDataFrames, and not copying for others.

This is the current consensus, so 👍. A nice thing is that the catch-all DataFrame(x) from other/tables.jl will inherit this behavior automatically 😄.

OK - I will start implementing the agreed functionality. The first step is the column-copying behavior (the "data ownership" PR update).

@nalimilan
Member Author

#1772 adds column(df, col) as an equivalent of getproperty(df, col). That would replace getcol(df, col), with the advantage that it's consistent with the JuliaDB API.

If we do that, adding setcol!(df, col, v) wouldn't be very consistent. Should it be setcolumn!(df, col, v) or column!(df, col, v) instead? As noted above, we could also use mutate!(df; col => v) or transform!(df; col => v) for that, though the syntax is a bit subtle (; with pair to generate a keyword argument).

@bkamins
Member

bkamins commented Apr 15, 2019

column(df, col) as an equivalent of getproperty(df, col)

Just to be precise, column also accepts integer indexing (at least under the current design, but I think we should keep it).

I think that, if we add such a function then column! would be a good name. But given we are not sure if we want to deprecate df[col] = v maybe we do not need it. Do you know what name JuliaDB uses for this operation?

@nalimilan
Member Author

I think that, if we add such a function then column! would be a good name. But given we are not sure if we want to deprecate df[col] = v maybe we do not need it.

If we don't deprecate df[col] and df[col] = v, maybe we don't need column either? If that's only for consistency with JuliaDB, we can add this later once we have stabilized our own API.

Do you know what name JuliaDB uses for this operation?

JuliaDB uses setcol(t, col, v) (there's no in-place equivalent). That function also supports a broader interface which is so flexible that it's equivalent to mutate!/transform!. But I don't really like that approach: the name is singular even if you can set multiple columns, it's not consistent with columns, and the function mixes setcol(t, col, v), setcol(t, col => f) and setcol(t, col => f, col2 => f2) which have very different behaviors despite their similarity.

@bkamins
Member

bkamins commented Apr 15, 2019

That is why in #1772 I wanted to give a reference implementation, so that it is clear that column is essentially getindex restricted to a single column. I am OK with reverting the addition of column, as in my view it was added only for consistency with JuliaDB (until we stabilize the API we do not know if it is needed or not).

@piever

piever commented Apr 15, 2019

JuliaDB uses setcol(t, col, v) (there's no in-place equivalent). That function also supports a broader interface which is so flexible that it's equivalent to mutate!/transform!. But I don't really like that approach: the name is singular even if you can set multiple columns, it's not consistent with columns, and the function mixes setcol(t, col, v), setcol(t, col => f) and setcol(t, col => f, col2 => f2) which have very different behaviors despite their similarity.

Yes, I also think that's not ideal. Would a better approach be to deprecate setcol(t, col, v) in favor of setcol(t, col => v) and rename it to mutate? I think originally it only supported setting one column, but then it became more feature-rich to the point that the original name no longer makes so much sense.

From what I understand mutate is the correct term here as it keeps the remaining columns, unlike transform which drops them and is actually a special case of IndexedTables.select (which means I should also rename the JuliaDBMeta macro to @mutate I guess).

@nalimilan
Member Author

Why do you think transform should drop columns? Is that in reference to some implementation? Contrary to "select", nothing in the term "transform" indicates that remaining columns should be dropped (it's essentially a synonym of "mutate" AFAICT).

@piever

piever commented Apr 15, 2019

From this Discourse post, but my intuition agrees with yours, which is why I named the macro @transform in JuliaDBMeta. Maybe a better pair of functions would be:

  • transform (keep columns)
  • select (drop columns)
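
For illustration, the keep/drop distinction can be sketched as follows. This assumes the select and transform functions and the source => function => target pair syntax that DataFrames.jl 1.x ships; that syntax is not what was being proposed in this thread, and the column names are made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# select keeps only what is requested, so :b itself is dropped here.
s = select(df, :a, :b => (x -> -x) => :z)
names(s)   # ["a", "z"]

# transform keeps every existing column and appends the new one.
t = transform(df, :b => (x -> -x) => :z)
names(t)   # ["a", "b", "z"]

# Even a single column selector yields a DataFrame, not a vector.
select(df, :a) isa DataFrame   # true
```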

@bkamins bkamins unpinned this issue May 30, 2019
@bkamins
Member

bkamins commented Sep 2, 2019

I would close this issue. Is there anything left from it on the table? (maybe the transform function, but if we feel it should be added we should open a separate PR/Issue for this)

@nalimilan
Member Author

So our current policy is that ! functions may either replace or mutate the column vectors, depending on what makes sense: when possible, they mutate the vectors. This has the advantage of being efficient (whereas introducing !! variants for mutation could have trapped users into using slow functions, especially for push! and append!). See the list I wrote at #1695 (comment).
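
A small sketch of what this policy means in practice, assuming DataFrames.jl 1.x behavior: push! mutates the existing column vector, while allowmissing! must change the element type and therefore replaces the vector.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])
col = df.a                 # df.a returns the stored vector without copying

# push! grows the existing column vector in place.
push!(df, (a = 4,))
col === df.a               # true
length(col)                # 4

# allowmissing! has to change the eltype, so it swaps in a new vector.
allowmissing!(df)
col === df.a               # false
eltype(df.a) == Union{Missing, Int}   # true
eltype(col) == Int                    # true; the old vector is left as it was
```

Code that holds a reference to a column (like col here) sees the effect of push! but not of allowmissing!, which is the difference to keep in mind.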
