-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None #18820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
2653750
2eac8b9
7949292
0b15c8f
2c532c3
5ab39cc
43fb6bd
b5424d9
a3939ba
37dfaa7
fcb617e
4b502bd
dfbcaf3
8f7953b
3e3823f
2946659
351be99
a09d3e9
bc7a231
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1362,8 +1362,8 @@ def replace(self, to_replace, value=None, subset=None): | |
| """Returns a new :class:`DataFrame` replacing a value with another value. | ||
| :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are | ||
| aliases of each other. | ||
| Values to_replace and value should contain either all numerics, all booleans, | ||
| or all strings. When replacing, the new value will be cast | ||
| Values to_replace and value must have the same type and can only be numerics, booleans, | ||
| or strings. Value can have None. When replacing, the new value will be cast | ||
| to the type of the existing column. | ||
| For numeric replacements all values to be replaced should have unique | ||
| floating point representation. In case of conflicts (for example with `{42: -1, 42.0: 1}`) | ||
|
|
@@ -1373,8 +1373,8 @@ def replace(self, to_replace, value=None, subset=None): | |
| Value to be replaced. | ||
| If the value is a dict, then `value` is ignored and `to_replace` must be a | ||
| mapping between a value and a replacement. | ||
| :param value: int, long, float, string, or list. | ||
| The replacement value must be an int, long, float, or string. If `value` is a | ||
| :param value: bool, int, long, float, string, list or None. | ||
| The replacement value must be a bool, int, long, float, string or None. If `value` is a | ||
| list, `value` should be of the same length and type as `to_replace`. | ||
| If `value` is a scalar and `to_replace` is a sequence, then `value` is | ||
| used as a replacement for each item in `to_replace`. | ||
|
|
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, subset=None): | |
| |null| null| null| | ||
| +----+------+-----+ | ||
|
|
||
| >>> df4.na.replace('Alice', None).show() | ||
| +----+------+----+ | ||
| | age|height|name| | ||
| +----+------+----+ | ||
| | 10| 80|null| | ||
| | 5| null| Bob| | ||
| |null| null| Tom| | ||
| |null| null|null| | ||
| +----+------+----+ | ||
|
|
||
| >>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show() | ||
| +----+------+----+ | ||
| | age|height|name| | ||
|
|
@@ -1425,12 +1435,13 @@ def all_of_(xs): | |
| valid_types = (bool, float, int, long, basestring, list, tuple) | ||
| if not isinstance(to_replace, valid_types + (dict, )): | ||
| raise ValueError( | ||
| "to_replace should be a float, int, long, string, list, tuple, or dict. " | ||
| "to_replace should be a bool, float, int, long, string, list, tuple, or dict. " | ||
| "Got {0}".format(type(to_replace))) | ||
|
|
||
| if not isinstance(value, valid_types) and not isinstance(to_replace, dict): | ||
| if not isinstance(value, valid_types) and value is not None \ | ||
| and not isinstance(to_replace, dict): | ||
| raise ValueError("If to_replace is not a dict, value should be " | ||
| "a float, int, long, string, list, or tuple. " | ||
| "a bool, float, int, long, string, list, tuple or None. " | ||
| "Got {0}".format(type(value))) | ||
|
|
||
| if isinstance(to_replace, (list, tuple)) and isinstance(value, (list, tuple)): | ||
|
|
@@ -1446,21 +1457,21 @@ def all_of_(xs): | |
| if isinstance(to_replace, (float, int, long, basestring)): | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| to_replace = [to_replace] | ||
|
|
||
| if isinstance(value, (float, int, long, basestring)): | ||
| value = [value for _ in range(len(to_replace))] | ||
|
|
||
| if isinstance(to_replace, dict): | ||
| rep_dict = to_replace | ||
| if value is not None: | ||
| warnings.warn("to_replace is a dict and value is not None. value will be ignored.") | ||
| else: | ||
| if isinstance(value, (float, int, long, basestring)) or value is None: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, really confusing. :) |
||
| value = [value for _ in range(len(to_replace))] | ||
| rep_dict = dict(zip(to_replace, value)) | ||
|
|
||
| if isinstance(subset, basestring): | ||
| subset = [subset] | ||
|
|
||
| # Verify we were not passed in mixed type generics." | ||
| if not any(all_of_type(rep_dict.keys()) and all_of_type(rep_dict.values()) | ||
| # Verify we were not passed in mixed type generics. | ||
| if not any(all_of_type(rep_dict.keys()) | ||
| and all_of_type(x for x in rep_dict.values() if x is not None) | ||
| for all_of_type in [all_of_bool, all_of_str, all_of_numeric]): | ||
| raise ValueError("Mixed type replacements are not supported") | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like now we allow something like
df4.na.replace('Alice').show(). We're better add it here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think this should be something to be fixed in
DataFrameNaFunctions.replacein this file ...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and was thinking of not doing this here as strictly it should be a followup for SPARK-19454. I am fine with doing this here too while we are here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change allows us to do
df4.na.replace('Alice'). I think SPARK-19454 doesn't?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it is
.na.replacevs.replace. I think both should be the same though. I just built against this PR and double checked as below:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not noticed that. Why we test
dataframe.na.replaceat the doc test ofdataframe.replace? We should testdataframe.replacehere.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we added an alise for
dataframe.replaceto promote usedataframe.na.replace? The doc says they are aliases anyway. I don't know but I tend to agree with paring doc tests and this looks renamed in ff26767.Let's leave this as is for now. I don't want to make this PR complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'm fine with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed a JIRA for mismatching default value between
replaceandna.replacein SPARK-21658