-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle two warnings that pollute the output of sssom-py CLI #561
Conversation
Warning: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy. I just to just not use "inplace" here and the warning does not happen.
Previously this warning happened: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)` df.replace("", np.nan, inplace=True) setting the option stops Pandas from even trying to downcast (so it no longer has anything to warn about). I then force the downcast explicitly with infer_objects to ensure that the behaviour remains what it was, but not sure this is what I want here.
df.infer_objects(copy=False) does not work in a platform independent manner; and I dont even know if I need it
src/sssom/util.py
Outdated
|
||
# Context https://github.com/pandas-dev/pandas/issues/57734 | ||
try: | ||
pd.set_option("future.no_silent_downcasting", True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it to sssom/init.py, not sure this is the right thing to do (can this effect other libraries that might not want that option to be set for some reason?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally be fine having it in sssom/init.py
. This is forcing the behaviour to apply globally, which I believe is a good thing.
That being said, it is true that this could have cascading effects for people who are using sssom
as a library rather than as a command-line tool (people who import sssom
in their own Python projects, rather than calling sssom-py
from the command line), if they also happen to be using pandas
in the rest of their code (no problem if their only use of pandas
is through sssom
).
If you’re not comfortable forcing the “no downcasting” behaviour to all potential users of the sssom
Python module¹, I can think of two options:
(A) Setting the option somewhere at the beginning of the main
function in sssom/cli.py
. This means the option will be enabled globally in all of sssom
when it is used as a CLI tool, but not when it is used as an imported module in another Python project.
It will then be up to the people who imports sssom
to decide whether they themselves want to enable the “no downcasting” behaviour, by setting (or not) the option.
(B) Enabling the option within sssom
, but only where we need it and only for the duration we need it. That is, we enable it before calling replace
here, and disable it afterwards.
A context manager would come in handy here:
class pandas_no_silent_downcasting():
def __init__(self):
try:
self._already_set = pd.get_option('future.no_silent_downcasting')
self._supported = True
except pd.errors.OptionError:
self._supported = False
def __enter__(self):
if self._supported and not self._already_set:
# Entering the context, set the option
pd.set_option('future.no_silent_downcasting', True)
def __exit__(self, type, value, tb):
if self._supported and not self._already_set:
# Leaving the context, unset the option
pd.set_option('future.no_silent_downcasting', False)
It could then be used whenever we have to perform an operation and need to locally make sure the option (if it is available and if it is not already set) is enabled:
with pandas_no_silent_downcasting():
df.replace(...)
This is equivalent to:
pd.set_option('future.no_silent_downcasting', True)
df.replace(...)
pd.set_option('future.no_silent_downcasting', False)
except that it is more “pythonic” and it deals gracefully with the possible absence of that option or the possibility that the option has already been enabled elsewhere.
¹ But it should be noted that this behaviour will be forced upon them sooner or later anyway, since it will be the only behaviour available in Pandas 3.0… Better for those people to start dealing with it now.
src/sssom/util.py
Outdated
df.replace("", np.nan, inplace=True) | ||
# df.infer_objects(copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
infer_objects
does not work “in place” (and it doesn’t have a inplace
keyword to force it to do so), so that call does nothing since you’re not collecting the results.
You’d need
df = df.infer_objects(copy=False)
As for compatibility with various versions of Pandas: the copy
parameter was introduced in Pandas 2.0. You could leave it out:
df = df.infer_objects()
Without that copy=False
, the default behaviour of infer_objects
would be to make a copy of all values in the frame, even if they are not casted to another type. But this is not a concern here since you are not keeping the original frame anyway.
As for whether the downcasting is necessary to begin with, well, that would depend on the rest of the code downstream from here. For now (before this PR), the data frame is downcasted, as a side-effect of replace
. The question is, is there some code downstream that is depending on the dataframe columns having more precise types than object
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it back with a pandas version check, can you see what you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but as I said you could also have just left the copy
parameter out.
For users of Pandas 1.x, this does not change anything. A copy of the non-type-casted objects will be made no matter what, that’s the only available behaviour.
For users of Pandas 3.x, this also does not change anything, since copy-on-write means that even if a copy of the non-type-casted objects is made, the actual copy would never actually take place (since we’re re-assigning the inferred data frame to the original variable).
For users of Pandas 2.x, no copy=False
parameter means that the non-type-casted objects would be copied for nothing, which strictly speaking is a waste of time and memory. But I strongly suspect the effects on performances would be negligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Side note: I love the “should GitHub still exist when you read this”. :D )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I just want to do the thing that changes least about the existing behaviour - which I thought I understand form your explanations would be copy=False
for Pandas 2.x?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do
df = df.infer_objects(copy=False)
then copy=False
makes absolutely no difference to the result. All it does is making the operation slightly faster (which is why I included it in my example on Slack, but I now regret having done that if it led you to believe it was absolutely necessary), but the data frame you get at the end will be exactly the same as if you omitted the copy
parameter.
What does the copy
parameter does exactly?
Consider the following data frame:
>>> df
A B
0 alice 1
1 bob 2
2 charlie
Let’s create df1
where we replace the empty cell (B2) with np.nan
:
>>> pd.set_option('future.no_silent_downcasting', True)
>>> df1 = df.replace("", np.nan)
>>> df1
A B
0 alice 1
1 bob 2
2 charlie NaN
Now let’s downcast, first with copy=True
(default behaviour, same as no copy
parameter at all):
>>> df2_copy_true = df1.infer_objects(copy=True)
>>> df2_copy_true
A B
0 alice 1.0
1 bob 2.0
2 charlie NaN
Then with copy=False
:
>>> df2_copy_false = df1.infer_objects(copy=False)
>>> df2_copy_false
A B
0 alice 1.0
1 bob 2.0
2 charlie NaN
Observe that df2_copy_true
and df2_copy_false
are the same. So what’s the difference?
The difference is that the A column in df2_copy_false
share its data with the A column in the original df1
frame: A copy was not made, since there was no downcasting needed for that column.
This means that if you modify in place a value in the A column of df2_copy_false
, this will also modify the original df1
:
>>> df2_copy_false.replace("alice", "arthur", inplace=True)
>>> df2_copy_false
A B
0 arthur 1.0
1 bob 2.0
2 charlie NaN
>>> df1
A B
0 arthur 1
1 bob 2
2 charlie NaN
By contrast, df2_copy_true
is completely distinct from the original df1
. Even if the A column needed no downcast, infer_objects(copy=True)
made a complete copy of it.
From the above, you should understand that you do not need to worry about getting different results depending on the presence of copy=False
if all you do is re-assigning the output of infer_objects()
to the original variable – after the assignment the original data frame is no longer there anyway.
So when you do this:
df = df.infer_objects()
the only impact of the absence of copy=False
is that you will needlessly make a temporary copy of columns that didn’t need to be copied, so it’s slightly less efficient than
df = df.infer_objects(copy=False)
Whether you should be concerned about this inefficiency is up to you. I am personally not concerned enough to think that it warrants making the code more complicated by adding this _safe_infer_objects
intermediate, but it’s your call.
In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.
I would in fact be keen to enable copy-on-write globally (pd.set_option('mode.copy_on_write', True)
) right now, both for the performance improvement and to start making sure that we are ready to deal with it since it will become mandatory with Pandas 3.0.
But the concerns stated above for the people who use sssom
as a Python library would obviously also apply here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would in fact be keen to enable copy-on-write globally
Wait, we are in fact already enabling copy-on-write:
pd.options.mode.copy_on_write = True
So there’s definitely no need for copy=False
when we call infer_objects()
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for taking the time to explaining this! Really much appreciated!
As @gouttegd suggested: > I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function. I moved that code to __init__.py hopeing that is the right place for it.
As per @gouttegd suggestion here https://github.com/mapping-commons/sssom-py/pull/561/files#r1835130707, the codeblock has been refactored and simplified.
To ensure the old behaviour is retained, @gouttegd convinced me to not be lazy and actually add the "df.infer_objects" functionality when replacing empty values with nan's. Should github still exist when you read this, you can find his feedback here: https://github.com/mapping-commons/sssom-py/pull/561/files#r1835148927 As he questions, it is not clear this is _strictly needed_, but its probably safer to so if we dont know if its needed.
The rationale is to not force this global option on all users of the library. The main point for the fix is so that sssom-py's CLI print to stdout functionality is not compromised by this output!
As explained by @gouttegd in #561 (comment), we do not actually benefit from the small performance gain enabled by copy=False here, so we decided to drop it (which means a considerable simplification of the code as well)
@matentzn I really appreciate that you took the time to write meaningful commit messages in this PR, but you must realize that your commit messages are all lost if you do a squash merge after that! All that is now left in the history is a single commit with a decidedly non-meaningful message:
When you do a squash merge, you should ensure that the commit message of the single resulting commit accurately describes the final changes. |
See slack... |
See commits for a more detailed description of the changes; This PR should not introduce any functional changes.
In this commit we 1. introduce pd.set_option("future.no_silent_downcasting", True) to heed the warning pandas started recently producing: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)` df.replace("", np.nan, inplace=True) 2. Fix a case where we were using inplace=True when updating a pandas data frame in the wrong way, leading to the following warning. We used the opportunity to simplify that piece of code as well, thanks @gouttegd Warning: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy. 3. lastly, we simplified some code where we introduced pandas.DataFrame.infer_objects(), which previously had a complicated conditional depending on the specific pandas version on whether the copy=False parameter is needed. It is not needed, because we also set pandas.mode.copy_on_write to true in #4cad7d6d8b905728f14a90d3d4c6d591b520cd3b.
See commits for a more detailed description of the changes;
This PR should not introduce any functional changes.