-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Policy regarding in-place operations #1695
Comments
Today I hit a problem with |
A possible idea: use a |
In general silently mutating vector type except for |
Continuing from #1716
This kind of confusion shows why this kind of syntax-y solution is less than ideal. Though I do think it's interesting to think of a syntax that shows clearly the nested structure of a DataFrame |
So the idea was that a single That said, I'm not saying that's the best solution. |
I agree that there are pros and cons so my idea was to list all the affected functions and then see how many changes we would have to make. |
@nalimilan Given the discussion we had yesterday with @oxinabox I think I can explain both arguments behind
Probably both reasonings are valid, but I just wanted to spell out the difference between them. Now - regarding mutating vectors in place. Here we have a double-risk after the recent changes allowing to propagate |
I guess really there are 4 kinds of things
2 and 3 are both dangerous. I wonder if we don't actually want a new Array type that is guarded against mutation except by a particular set of methods that are not exported by DataFrames. So noone can actually mutate columns except via intended interfaces. |
@bkamins I don't think that's accurate. At least in my comment above I (meant to) propose the first interpretation. But I think @oxinabox said the reverse on Slack.
@oxinabox I'm not sure that would be a good protection nor convenient.
I've had a look at our current mutating methods. Here are a few thoughts about what would happen if we required a
Overall I guess the main argument in favor of changing the current behavior would be to make it less appealing to use in-place operations (in the strong sense of operations that mutate column vectors). Indeed currently it looks nicer to do Crazy idea: maybe we could have a global reference count keeping track of the column vectors that are currently used by a data frame (which would be automatically updated by a |
First let me reply to @oxinabox:
The problem is that these operations are chiefly
They are not always safe if you have created as
Actually it was recently proposed, https://github.com/bkamins/ReadOnlyArrays.jl, but did not get much traction (so for the time being this idea is on hold). Although we intended to use it for a bit different purpose. Now @nalimilan:
I think it is best to discuss concrete methods (what you do below), as this avoids ambiguity. Looking at your list I think we can stay with only
Whenever I say ADD A STRONG WARNING this should include:
I do not think this would work well as there is no guarantee when GC runs the finalizer. |
Unfortunately docs can only help mitigating issues, but they are not really a fix (and many users don't read docstrings until they have a problem). The fact that you've been trapped by Another crazy idea: change |
I agree and we have two ways to go here:
Actually looking at this and thinking about it more - I start to like We still have another issue here (that I have sprinkled in several places in my comments): some functions (like e.g. EDIT:
If we wanted to go this way actually I would do the following (we could also skip step 2 and stay at step 1 - but this is a long term decision):
|
I didn't say I'm completely convinced by my proposal. ;-) Let me think about it...
Yes,
That's indeed appealing. But we need to provide a syntax to access a vector without copy when it's not a literal symbol: is |
We could add This is needed also in cases when you want to have a programmatic access to the column (e.g. in a function when column name is passed as a |
If we introduce a rule that transformation of Note that you still will be able to reuse columns, but this will require a special treatment, in which case I guess the user will know what one is doing, e.g. this would still reuse the columns from source:
|
In particular we should remember to decide about |
Just to recap my understanding of the desirable policy from Slack:
I am summarizing this again, because I think we will not get much new input in this discussion and we should decide if we should go this way (then I will, in particular, update #1646 and a PR cleaning up whole code-base to follow "ownership" rule should be implemented). |
Regarding point 6:
I think we should either deprecate |
If we left |
If we go towards the "ownership" rule we should make sure how we handle |
Regarding earlier issues to sum up wat I think. We would have:
And we would have an inconsistency between points 2 and 4. If this is OK, then we can leave this syntax. Any thoughts on this? If in the future |
We also have some wiggle room with |
Thank you for the comments (this issue is bugging all of us for some time with no ideal idea what to do 😄).
This is the current design and it is fully mentally consistent. The problem is that it is safe and even me (who has implemented this) keep forgetting that
Making |
I'd find it a bit weird to have |
So do we make |
Sorry, overlapping comments. But re: @bkamins
As I was trying to say but maybe wasn't coherent in saying, I don't think that a constructor feels like "working with individual columns", so I don't think it's a violation to make the copy by default. One could do a one column constructor ( Or if we framed differently, maybe the rule can be "functions that can only get or set a single column never copy". Since |
Right, sorry about that. I get that I think my point still stands, though, about unnecessary intermediate copying. I wonder if
This way , the only time a user would need to worry about messing up a data frame is when they venture out of a data frame. In which case we assume they know what they are doing.
|
I say no - this would be a performance problem ()
This is I guess what the consensus is now (i.e. it returns a single vector that is not a copy)
this is a good rule. So I update my summary above to reflect @pdeffebach and @nalimilan recommendations (i.e. |
I agree -- it's a |
I have updated #1695 (comment).
if It could be either:
|
Just to be clear, we need a substitute for |
I would really like it to look as close to I propose Ultimately this will make the logic a lot easier as well, as you have dot syntax for non-copying and then some other function syntax for copying. And the logic is preserved with or without literals. |
It's also import to realize how much of a pain parentheses are. something like |
Ugh bummer. I really hate that the df.x behavior will depend on whether :x is already defined. pandas has that problem and drives me nuts, though at least here it won’t be a silent failure (in pandas it “works” in that it just sticks y into a property, but doesn’t make a column). No hopes of preserving that behavior with the df.x=y syntax with some clever syntactic sugar behind the scenes? |
We are only talking about the case when you have a column name that is an invalid identifier. So here all stays unchanged, but because we want to deprecate For getting a column we have already settled for What is left is the syntax for setting the column. Up to my understanding we have two natural choices:
|
Oh -- ok, sorry, please excuse my ignorance. What is an example where we have an invalid identifier? Do you mean something like In that case, I'm fine with [edited for clarity] |
Your understanding is correct 😄. In fact
so we are essentially exposing an already existing inner function. My only hesitation is related to the decision that we might want to introduce |
It would be even simpler to say "constructors do not copy columns". If we don't have common uses cases for
Yes, it may be unexpected for Anyway this issue is relatively self-contained, as it just concerns convenience methods for |
|
Agreed. I have written down in the "main" post my current thinking, but let us discuss it in a separate thread (I will open it when we have finished underlying infrastructure implementation)
This is what is a current consensus so 👍. A nice thing is that catch-all OK - I will start implementing the agreed functionality. The fist step is copying columns behavior ("data ownership" PR update). |
#1772 adds If we do that, adding |
Just to be precise I think that, if we add such a function then |
If we don't deprecate
JuliaDB uses |
That I why in #1772 I wanted to give a reference implementation so that it is clear that |
Yes, I also think that's not ideal. Would a better approach be to deprecate From what I understand |
Why do you think |
From this discourse post, but my intuition agrees with yours, which is why I named the macro
|
I would close this issue. Is there anything left from it on the table? (maybe the |
So our current policy is that |
Some in-place operations like
append!
operate in-place on column vectors, but others likeallowmissing!
replace vectors. We should check whether this difference is justified and whether it is the most useful behavior.For example, it might not be very useful for
append!
to resize column vectors in-place, since that does not save any memory and prevents it from changing the column eltypes. OTOH that could be a useful difference fromvcat
if one wants to preserve the original eltypes.The text was updated successfully, but these errors were encountered: