-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None #18820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR reopens #16225 |
|
ok to test |
python/pyspark/sql/dataframe.py
Outdated
| for all_of_type in [all_of_bool, all_of_str, all_of_numeric]): | ||
| if not any(key_all_of_type(rep_dict.keys()) and value_all_of_type(rep_dict.values()) | ||
| for (key_all_of_type, value_all_of_type) | ||
| in [all_of_bool, all_of_str, all_of_numeric]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just put None here and keep the various all_of_* variables defined as they were before? Seems like it would be clearer.
|
Hey @nchammas I made the logic much simpler. |
|
What if the field is not nullable? I did a test: I got: The field becomes nullable. |
|
Hi @holdenk and @gatorsmile, while you are here, could you trigger the Jenkins build? Looks I still have some problems with triggering it. |
|
Jenkins test this please. (Let's see if I still have the magic power.) |
|
ok to test |
|
Test build #80229 has finished for PR 18820 at commit
|
Why not? As long as we correctly update the schema from non-nullable to nullable, it seems OK to me. What would we be protecting against by disallowing this? |
This reverts commit fcb617e.
|
Hey @nchammas I don't have strong opinion on this and changed back to what it was. |
|
Test build #80252 has finished for PR 18820 at commit
|
python/pyspark/sql/dataframe.py
Outdated
| "Got {0}".format(type(to_replace))) | ||
|
|
||
| if not isinstance(value, valid_types) and not isinstance(to_replace, dict): | ||
| if not isinstance(value, valid_types + (type(None), )) and not isinstance(to_replace, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would check None by value is None.
| // Replace only the age column | ||
| val out1 = input.na.replace("age", Map( | ||
| // Replace only the age column and with null | ||
| val out1 = input.na.replace("age", Map[Any, Any]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about rather adding a separate test and leaving the existing test as is?
| case (k: String, null) => (k, null) | ||
| case (k: Boolean, null) => (k, null) | ||
| case (k, null) => (convertToDouble(k), null) | ||
| case _ @(k, v) => (convertToDouble(k), convertToDouble(v)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use case (k, v) => instead of case _ @(k, v) => ?
| |null| null| null| | ||
| +----+------+-----+ | ||
| >>> df4.na.replace('Alice', None).show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like now we allow something like df4.na.replace('Alice').show(). We're better add it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think this should be something to be fixed in DataFrameNaFunctions.replace in this file ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and was thinking of not doing this here as strictly it should be a followup for SPARK-19454. I am fine with doing this here too while we are here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change allows us to do df4.na.replace('Alice'). I think SPARK-19454 doesn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it is .na.replace vs .replace. I think both should be the same though. I just built against this PR and double checked as below:
>>> df = spark.createDataFrame([('Alice', 10, 80.0)])>>> df.replace("Alice").first()Row(_1=None, _2=10, _3=80.0)
>>> df.na.replace("Alice").first()Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: replace() takes at least 3 arguments (2 given)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not noticed that. Why we test dataframe.na.replace at the doc test of dataframe.replace? We should test dataframe.replace here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we added an alise for dataframe.replace to promote use dataframe.na.replace? The doc says they are aliases anyway. I don't know but I tend to agree with paring doc tests and this looks renamed in ff26767.
Let's leave this as is for now. I don't want to make this PR complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'm fine with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed a JIRA for mismatching default value between replace and na.replace in SPARK-21658
|
@bravo-zhang Could you update the PR description to explain what this PR is trying to achieve? So far, it is not clear enough to explain what you did in this PR. Thanks! |
| }.getMessage | ||
| assert(message.equals("Failed to merge fields 'b' and 'b'. " + | ||
| "Failed to merge incompatible data types FloatType and LongType")) | ||
| assert(message === "Failed to merge fields 'b' and 'b'. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: not related to this PR. Please revert it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change a valid improvement? I forgot about === when I pushed that commit. I can revert this back but do I need create another PR? With or without JIRA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does look a valid improvement but it makes backporting harder sometimes. Let's revert this one if we are fine. We could tell fixing this one when someone (or you) happens to fix some codes around here. I guess it is too trivial for a PR.
| case v: Boolean => replacement | ||
| case _ => replacement.map { case (k, v) => (convertToDouble(k), convertToDouble(v)) } | ||
| // replacementMap is either Map[String, String], Map[Double, Double], Map[Boolean,Boolean] | ||
| // while value can have null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the types are not these three types, what are the behaviors? Could you explain them here? Also, please add negative examples too. Thanks~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If replacement is Map[Any, Any] type, the replacementMap will not be confined to these 3 types.
We tell users to only use doubles, strings and booleans in the replacement map in the method doc. But user can still use df.na.replace("*", Map(10 -> 20, "Alpha" -> "Bravo")). The result is that only fields that have same type as the 1st key in the replacement map will perform replacement. This is due to the implementation of targetColumnType a few lines below.
I'll modify the comments here. But for the negative examples (like the one I mentioned in this comment), do I need explain in the method doc to users?
| * (Scala-specific) Replaces values matching keys in `replacement` map. | ||
| * Key and value of `replacement` map must have the same type, and | ||
| * can only be doubles, strings or booleans. | ||
| * `replacement` map value can have null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not put it here. It should be put in @parm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original comments put many @param up here. I will fix those as well.
|
Could you also add a test case to cover the end-to-end use case the JIRA mentioned? Also put it in the PR description, which will be part of the PR commit. Thanks! |
|
cc @ueshin Could you also take a look the code changes in the Python side? Thanks! |
|
Hi @HyukjinKwon @gatorsmile @viirya I addressed your comments, added more test coverage and provided more info in PR description. |
python/pyspark/sql/dataframe.py
Outdated
| mapping between a value and a replacement. | ||
| :param value: int, long, float, string, or list. | ||
| The replacement value must be an int, long, float, or string. If `value` is a | ||
| :param value: int, long, float, string, list or None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not related to this pr, but we should add bool to this type list here and in other descriptions?
|
Thanks! Will review it tomorrow. |
|
Test build #80379 has finished for PR 18820 at commit
|
|
Test build #80382 has finished for PR 18820 at commit
|
| assert(out1(4) === Row("Amy", null, null)) | ||
| assert(out1(5) === Row(null, null, null)) | ||
|
|
||
| // Replace String with String and null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create a separate test case
test("replace with null") {
...
}
| assert(out2(4) === Row("Amy", null, null)) | ||
| assert(out2(5) === Row(null, null, null)) | ||
|
|
||
| // Replace Double with null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a test case for boolean
|
LGTM except a few minor comments. |
| if value is not None: | ||
| warnings.warn("to_replace is a dict and value is not None. value will be ignored.") | ||
| else: | ||
| if isinstance(value, (float, int, long, basestring)) or value is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool is missing from the types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool inherits int in Python. We could add bool for readability though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, really confusing. :)
| "column name or None. Got {0}".format(type(subset))) | ||
|
|
||
| # Reshape input arguments if necessary | ||
| if isinstance(to_replace, (float, int, long, basestring)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool?
|
Please add the suggested tests then LGTM |
|
Yea, looks much safer. LGTM too except the comments above. |
|
Test build #80459 has finished for PR 18820 at commit
|
|
retest this please |
|
Test build #80464 has finished for PR 18820 at commit
|
|
Thanks! Merging to master. |
What changes were proposed in this pull request?
Currently
df.na.replace("*", Map[String, String]("NULL" -> null))will produce exception.This PR enables passing null/None as value in the replacement map in DataFrame.replace().
Note that the replacement map keys and values should still be the same type, while the values can have a mix of null/None and that type.
This PR enables following operations for example:
df.na.replace("*", Map[String, String]("NULL" -> null))(scala)df.na.replace("*", Map[Any, Any](60 -> null, 70 -> 80))(scala)df.na.replace('Alice', None)(python)df.na.replace([10, 20])(python, replacing with None is by default)One use case could be: I want to replace all the empty strings with null/None because they were incorrectly generated and then drop all null/None data
df.na.replace("*", Map("" -> null)).na.drop()(scala)df.replace(u'', None).dropna()(python)How was this patch tested?
Scala unit test.
Python doctest and unit test.