Description
In general, almost all DataFrame and Series methods return new data, and thus make a copy if needed (that is, when no calculation took place and the data did not change, so the result would otherwise share data with the original). Some methods let you avoid this copy through an explicit copy keyword, which defaults to copy=True but can be set to copy=False manually.
Example:
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# by default a method returns a copy
>>> df2 = df.rename(columns=str.upper)
>>> df2.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4
# explicitly ask not to make a copy
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
     a  b
0  100  3
1    2  4
Now, if Copy-on-Write (CoW) is enabled, the above behaviour should not happen, because it updates one dataframe (df) by modifying another dataframe (df3).
In the specific case of rename, this already no longer happens, and df is not updated:
>>> pd.options.mode.copy_on_write = True
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4
This is because of how rename is implemented under the hood: it uses result = self.copy(deep=copy), so copy=False has always taken a shallow copy of the calling dataframe. With CoW enabled, a "shallow" copy no longer exists in its original meaning; it is now essentially a "lazy copy with CoW".
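The "lazy copy" behaviour described above can be observed directly. The following is a minimal sketch, assuming a pandas version that supports the mode.copy_on_write option; the helper shares_column is a hypothetical name introduced only for this illustration. It checks whether a shallow copy shares memory with its parent before and after the first write.

```python
import numpy as np
import pandas as pd

# Assumption: a pandas version that supports this option (1.5 or later).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
shallow = df.copy(deep=False)


def shares_column(left, right, column):
    """Hypothetical helper: True if `column` is backed by the same memory in both frames."""
    return np.shares_memory(left[column].to_numpy(), right[column].to_numpy())


# Nothing has been written yet, so no data has been copied.
shared_before = shares_column(df, shallow, "a")

# The first write into the shallow copy is what triggers the actual copy.
shallow.iloc[0, 0] = 100
shared_after = shares_column(df, shallow, "a")

print(shared_before, shared_after, df.iloc[0, 0])
```

With CoW enabled, the expectation is that the two frames share memory only until the write, and that df keeps its original value afterwards.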
For some other methods, however, copy=False does not yet behave as a lazy copy under CoW.
There are several issues/questions here:
1. Are we OK with copy=False now actually meaning a "lazy" copy for all those methods?
   - I don't think there is any alternative with the current CoW semantics, but it is worth making this explicit and tracking it, because we (a) should document this (it's a breaking change) and potentially add future warnings for it at some point, and (b) need to ensure this behaviour is correctly happening for all methods that have a copy keyword.
2. Should manually passing copy=True still give an actual hard / "eager" copy?
   - Probably yes (if we keep the keyword, see question 3 below), but we should also make sure to test this when CoW is enabled.
3. If (in the future, with CoW enabled) the default will be to not return a copy, is it still worth keeping the copy keyword?
   - Currently the default is copy=True, so people mostly use the keyword explicitly to set copy=False. But copy=False will become the default in the future, so it will no longer need to be specified explicitly.
   - People can still use copy=True in the future to ensure they get an "eager" copy (and not delay the copy / trigger a copy later on). But is that use case worth keeping the keyword around? (They can always call .copy() instead.)
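To illustrate the .copy() alternative mentioned in the last question, here is a minimal sketch, assuming the same mode.copy_on_write option as in the examples above. It chains .copy() after rename to force an eager copy, independently of what the method's own copy keyword does or defaults to.

```python
import numpy as np
import pandas as pd

# Assumption: a pandas version that supports this option (1.5 or later).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Chaining .copy() makes the copy happen immediately instead of on first write.
eager = df.rename(columns=str.upper).copy()

# The eager copy should not share memory with the parent, even before any write.
eager_shares = np.shares_memory(df["a"].to_numpy(), eager["A"].to_numpy())

eager.iloc[0, 0] = 100
print(eager_shares, df.iloc[0, 0], eager.iloc[0, 0])
```

This only shows that the use case can be covered without the keyword; whether that is reason enough to drop the keyword is the open question.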
DataFrame/Series methods that have a copy keyword (except for the constructors):
- align
- astype
- infer_objects
- merge
- reindex
- reindex_like
- rename
- rename_axis
- set_axis (only added in 1.5)
- set_flags (default False)
- swapaxes
- swaplevel
- to_numpy (default False)
- to_timestamp
- transpose (default False)
- truncate
- tz_convert
- tz_localize
- pd.concat
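One possible way to track the second point of question 1 (that copy=False behaves correctly for every method) is a small check like the one below. This is only a sketch under assumptions: the helper parent_is_protected and the selection of calls are hypothetical, the arguments are arbitrary, and it assumes a pandas version in which these methods still accept the copy keyword. On versions where the CoW work is incomplete, some entries may report that the parent was modified.

```python
import pandas as pd

# Assumption: a pandas version that supports this option (1.5 or later).
pd.set_option("mode.copy_on_write", True)


def parent_is_protected(call):
    """Hypothetical helper: apply `call` to a fresh frame, write into the
    result, and report whether the parent frame kept its original value."""
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    result = call(df)
    result.iloc[0, 0] = 100
    return bool(df.iloc[0, 0] == 1)


# An illustrative subset of the methods listed above, each called with copy=False.
calls = {
    "rename": lambda df: df.rename(columns=str.upper, copy=False),
    "rename_axis": lambda df: df.rename_axis("idx", copy=False),
    "set_axis": lambda df: df.set_axis(["x", "y"], axis=1, copy=False),
    "reindex": lambda df: df.reindex([0, 1], copy=False),
    "astype": lambda df: df.astype("int64", copy=False),
    "infer_objects": lambda df: df.infer_objects(copy=False),
}

report = {name: parent_is_protected(call) for name, call in calls.items()}
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'parent was modified'}")
```

The same pattern could be extended to the remaining methods, each of which needs its own suitable arguments (for example, a datetime index for tz_localize).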
xref CoW overview issue: #48998