[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None #18820

jiayue-zhang · 2017-08-02T15:59:55Z

What changes were proposed in this pull request?

Currently, df.na.replace("*", Map[String, String]("NULL" -> null)) throws an exception.
This PR allows null/None to be passed as a value in the replacement map of DataFrame.replace().
Note that the keys and values of the replacement map must still be of the same type, although the values may be a mix of null/None and that type.
For example, this PR enables the following operations:
df.na.replace("*", Map[String, String]("NULL" -> null)) (Scala)
df.na.replace("*", Map[Any, Any](60 -> null, 70 -> 80)) (Scala)
df.na.replace('Alice', None) (Python)
df.na.replace([10, 20]) (Python; replacing with None is the default)
One use case: replace all empty strings with null/None because they were generated incorrectly, then drop all null/None data:
df.na.replace("*", Map("" -> null)).na.drop() (Scala)
df.replace(u'', None).dropna() (Python)

How was this patch tested?

Scala unit test.
Python doctest and unit test.

jiayue-zhang · 2017-08-02T16:06:06Z

This PR reopens #16225
Please take a look @gatorsmile @holdenk @HyukjinKwon Thanks!

HyukjinKwon · 2017-08-03T14:59:29Z

ok to test

nchammas · 2017-08-03T17:31:41Z

python/pyspark/sql/dataframe.py

-                   for all_of_type in [all_of_bool, all_of_str, all_of_numeric]):
+        if not any(key_all_of_type(rep_dict.keys()) and value_all_of_type(rep_dict.values())
+                   for (key_all_of_type, value_all_of_type)
+                   in [all_of_bool, all_of_str, all_of_numeric]):


Why not just put None here and keep the various all_of_* variables defined as they were before? Seems like it would be clearer.

jiayue-zhang · 2017-08-03T19:19:26Z

Hey @nchammas I made the logic much simpler.
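
A minimal sketch of what the simpler check could look like, assuming the all_of_* helpers keep their original single-argument form and None values are filtered out before the type check. The helper names mirror the diff above; is_valid_replacement is a hypothetical wrapper added for illustration, and str stands in for basestring so the snippet runs on Python 3.

```python
def all_of(types):
    """Return a predicate that is true when every item is an instance of `types`."""
    def all_of_(xs):
        return all(isinstance(x, types) for x in xs)
    return all_of_


all_of_bool = all_of(bool)
all_of_str = all_of(str)
all_of_numeric = all_of((float, int))


def is_valid_replacement(rep_dict):
    # Hypothetical helper: keys must share one type family, and values must
    # share that same family once None entries are ignored.
    keys = list(rep_dict.keys())
    non_null_values = [v for v in rep_dict.values() if v is not None]
    return any(
        all_of_type(keys) and all_of_type(non_null_values)
        for all_of_type in [all_of_bool, all_of_str, all_of_numeric]
    )
```

With this shape, {"Alice": None, "Bob": "B"} is accepted while {"Alice": 1} is rejected as a mixed-type replacement.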

jiayue-zhang · 2017-08-03T21:03:02Z

What if the field is not nullable? I did a test:

val rows = spark.sparkContext.parallelize(Seq(
  Row("Bravo", 28, 183.5),
  Row("Jessie", 18, 165.8)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true),
  StructField("height", DoubleType, nullable = true)))
val input1 = spark.createDataFrame(rows, schema)
val output2 = input1.na.replace("name", Map("Bravo" -> null))
input1.printSchema()
output2.printSchema()
output2.show(false)

I got:

root
 |-- name: string (nullable = false)
 |-- age: integer (nullable = true)
 |-- height: double (nullable = true)

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: double (nullable = true)

+------+---+------+
|name  |age|height|
+------+---+------+
|null  |28 |183.5 |
|Jessie|18 |165.8 |
+------+---+------+

The field becomes nullable.
I don't think we should allow users to change a field's nullability while doing a replace. What do you think? I'm going to make it throw an IllegalArgumentException.

HyukjinKwon · 2017-08-03T23:59:38Z

Hi @holdenk and @gatorsmile, while you are here, could you trigger the Jenkins build? It looks like I still have some problems triggering it.

nchammas · 2017-08-04T00:31:38Z

Jenkins test this please.

(Let's see if I still have the magic power.)

HyukjinKwon · 2017-08-04T02:49:24Z

ok to test

SparkQA · 2017-08-04T05:27:52Z

Test build #80229 has finished for PR 18820 at commit dfbcaf3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nchammas · 2017-08-04T05:30:23Z

I don't think we should allow user to change field nullability while doing replace.

Why not? As long as we correctly update the schema from non-nullable to nullable, it seems OK to me. What would we be protecting against by disallowing this?

jiayue-zhang · 2017-08-04T16:02:34Z

Hey @nchammas I don't have a strong opinion on this, so I changed it back to what it was.

SparkQA · 2017-08-04T18:38:20Z

Test build #80252 has finished for PR 18820 at commit 3e3823f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-08-06T06:28:49Z

python/pyspark/sql/dataframe.py

                "Got {0}".format(type(to_replace)))

-        if not isinstance(value, valid_types) and not isinstance(to_replace, dict):
+        if not isinstance(value, valid_types + (type(None), )) and not isinstance(to_replace, dict):


I would check for None using value is None.
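
A small sketch of the suggested form, assuming a stand-alone check_value function (hypothetical, not the actual method body) and a Python 3 stand-in for valid_types. Testing value is None first keeps type(None) out of the type tuple.

```python
# Python 3 stand-in for the valid_types tuple in the diff above (an assumption).
valid_types = (bool, float, int, str, list, tuple)


def check_value(to_replace, value):
    # Hypothetical stand-alone version of the argument check.
    # None is accepted explicitly via an identity test rather than isinstance.
    if (value is not None
            and not isinstance(value, valid_types)
            and not isinstance(to_replace, dict)):
        raise ValueError(
            "value has an unsupported type: {0}".format(type(value)))
    return True
```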

HyukjinKwon · 2017-08-06T06:36:38Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala

-    // Replace only the age column
-    val out1 = input.na.replace("age", Map(
+    // Replace only the age column and with null
+    val out1 = input.na.replace("age", Map[Any, Any](


How about rather adding a separate test and leaving the existing test as is?

HyukjinKwon · 2017-08-06T06:38:50Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala

+      case (k: String, null) => (k, null)
+      case (k: Boolean, null) => (k, null)
+      case (k, null) => (convertToDouble(k), null)
+      case _ @(k, v) => (convertToDouble(k), convertToDouble(v))


Could we use case (k, v) => instead of case _ @(k, v) => ?

viirya · 2017-08-07T03:43:31Z

python/pyspark/sql/dataframe.py

        |null|  null| null|
        +----+------+-----+

+        >>> df4.na.replace('Alice', None).show()


Looks like we now allow something like df4.na.replace('Alice').show(). We'd better add it here.

Actually, I think this should be something to be fixed in DataFrameNaFunctions.replace in this file ...

and was thinking of not doing this here as strictly it should be a followup for SPARK-19454. I am fine with doing this here too while we are here.

This change allows us to do df4.na.replace('Alice'). I think SPARK-19454 doesn't?

I guess it is .na.replace vs .replace. I think both should be the same though. I just built against this PR and double checked as below:

>>> df = spark.createDataFrame([('Alice', 10, 80.0)])
>>> df.replace("Alice").first()
Row(_1=None, _2=10, _3=80.0)
>>> df.na.replace("Alice").first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: replace() takes at least 3 arguments (2 given)

I hadn't noticed that. Why do we test dataframe.na.replace in the doctest of dataframe.replace? We should test dataframe.replace here.

I assume we added an alias for dataframe.replace to promote the use of dataframe.na.replace? The doc says they are aliases anyway. I don't know, but I tend to agree with pairing the doctests, and this looks like it was renamed in ff26767.

Let's leave this as is for now. I don't want to make this PR complicated.

OK. I'm fine with this.

I filed a JIRA, SPARK-21658, for the mismatched default value between replace and na.replace.
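
The mismatch discussed in this thread can be modeled in plain Python. This is only a toy sketch: DataFrameSketch and NaFunctionsSketch are hypothetical stand-ins, not the pyspark classes, and they assume the two replace signatures differ only in whether value has a default.

```python
class DataFrameSketch:
    """Toy stand-in for a DataFrame that holds rows as tuples."""

    def __init__(self, rows):
        self.rows = rows

    def replace(self, to_replace, value=None, subset=None):
        # value defaults to None, so replace("Alice") swaps matches for None.
        # subset is accepted but ignored in this sketch.
        new_rows = [
            tuple(value if cell == to_replace else cell for cell in row)
            for row in self.rows
        ]
        return DataFrameSketch(new_rows)

    @property
    def na(self):
        return NaFunctionsSketch(self)


class NaFunctionsSketch:
    """Toy stand-in for the .na accessor."""

    def __init__(self, df):
        self.df = df

    def replace(self, to_replace, value, subset=None):
        # No default for value: calling replace("Alice") raises TypeError.
        return self.df.replace(to_replace, value, subset)
```

Giving both methods the same default for value would make the two call paths behave identically.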

gatorsmile · 2017-08-07T04:11:38Z

@bravo-zhang Could you update the PR description to explain what this PR is trying to achieve? So far, it is not clear enough to explain what you did in this PR. Thanks!

gatorsmile · 2017-08-07T04:12:39Z

sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala

    }.getMessage
-    assert(message.equals("Failed to merge fields 'b' and 'b'. " +
-      "Failed to merge incompatible data types FloatType and LongType"))
+    assert(message === "Failed to merge fields 'b' and 'b'. " +


Nit: not related to this PR. Please revert it back.

Is this change a valid improvement? I forgot about === when I pushed that commit. I can revert it, but do I need to create another PR? With or without a JIRA?

It does look like a valid improvement, but it sometimes makes backporting harder. Let's revert this one if that is fine. We could fix it when someone (or you) happens to touch the code around here. I guess it is too trivial for its own PR.

gatorsmile · 2017-08-07T04:14:32Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala

-      case v: Boolean => replacement
-      case _ => replacement.map { case (k, v) => (convertToDouble(k), convertToDouble(v)) }
+    // replacementMap is either Map[String, String], Map[Double, Double], Map[Boolean,Boolean]
+    // while value can have null


If the types are not one of these three, what is the behavior? Could you explain it here? Also, please add negative examples too. Thanks~

If replacement is of type Map[Any, Any], the replacementMap will not be confined to these 3 types.
In the method doc we tell users to use only doubles, strings and booleans in the replacement map. But users can still call df.na.replace("*", Map(10 -> 20, "Alpha" -> "Bravo")). The result is that only fields that have the same type as the 1st key in the replacement map will be replaced. This is due to the implementation of targetColumnType a few lines below.
I'll modify the comments here. But for the negative examples (like the one I mentioned in this comment), do I need to explain them to users in the method doc?

gatorsmile · 2017-08-07T04:15:41Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala

   * (Scala-specific) Replaces values matching keys in `replacement` map.
   * Key and value of `replacement` map must have the same type, and
   * can only be doubles, strings or booleans.
+   * `replacement` map value can have null.


Do not put it here. It should be put in @param.

The original comments put many @param up here. I will fix those as well.

gatorsmile · 2017-08-07T04:18:25Z

Could you also add a test case to cover the end-to-end use case the JIRA mentioned? Also put it in the PR description, which will be part of the PR commit. Thanks!

gatorsmile · 2017-08-07T04:19:00Z

cc @ueshin Could you also take a look at the code changes on the Python side? Thanks!

jiayue-zhang · 2017-08-08T05:10:58Z

Hi @HyukjinKwon @gatorsmile @viirya I addressed your comments, added more test coverage and provided more info in the PR description.
One thing that is not clear to users is that they can still call df.na.replace("*", Map(10 -> 20, "Alpha" -> "Bravo")). The behavior is that only fields that have the same type as the 1st key in the replacement map will be replaced (so "Alpha" -> "Bravo" has no effect). This is due to the implementation of targetColumnType. It also creates a discrepancy: in Python we check that all keys and values are of the same type, while in Scala we don't check. This behavior existed before this PR.
I added a 1-line comment: // Only fields of targetColumnType will perform replacement. Is that enough for now? If we want to make it more elegant, would it be a valid task to accept any replacement map so long as each key-value pair has the same type? Another alternative is to do the type check in Scala just like in Python.
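
The first-key behavior described above can be illustrated with a plain-Python model. This is a sketch of the described semantics only, not the Scala implementation: target_column_type and replace_row are hypothetical names, and the schema is simplified to a dict of column name to one of "string", "numeric" or "boolean".

```python
def target_column_type(replacement):
    # The target type is decided by the first key only.
    first_key = next(iter(replacement))
    if isinstance(first_key, bool):  # check bool before numbers: bool is an int
        return "boolean"
    if isinstance(first_key, str):
        return "string"
    return "numeric"


def replace_row(row, schema, replacement):
    # Only columns whose type equals the target type are replaced;
    # entries of the map with other key types have no effect.
    target = target_column_type(replacement)
    return {
        name: replacement.get(value, value) if schema[name] == target else value
        for name, value in row.items()
    }


schema = {"name": "string", "age": "numeric"}
row = {"name": "Alpha", "age": 10}
print(replace_row(row, schema, {10: 20, "Alpha": "Bravo"}))
```

In this model the map {10: 20, "Alpha": "Bravo"} changes only the age column, because the first key is numeric.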

ueshin · 2017-08-08T06:29:30Z

python/pyspark/sql/dataframe.py

            mapping between a value and a replacement.
-        :param value: int, long, float, string, or list.
-            The replacement value must be an int, long, float, or string. If `value` is a
+        :param value: int, long, float, string, list or None.


It's not related to this PR, but should we add bool to the type list here and in the other descriptions?

gatorsmile · 2017-08-08T06:50:19Z

Thanks! Will review it tomorrow.

SparkQA · 2017-08-08T07:04:51Z

Test build #80379 has finished for PR 18820 at commit 351be99.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-08T10:03:47Z

Test build #80382 has finished for PR 18820 at commit a09d3e9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-09T07:02:09Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala

    assert(out1(4) === Row("Amy", null, null))
    assert(out1(5) === Row(null, null, null))
+
+    // Replace String with String and null


Create a separate test case

test("replace with null") { ... }

gatorsmile · 2017-08-09T07:05:03Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala

+    assert(out2(4) === Row("Amy", null, null))
+    assert(out2(5) === Row(null, null, null))
+
+    // Replace Double with null


Also add a test case for boolean

gatorsmile · 2017-08-09T07:09:00Z

LGTM except a few minor comments.

viirya · 2017-08-09T08:03:40Z

python/pyspark/sql/dataframe.py

            if value is not None:
                warnings.warn("to_replace is a dict and value is not None. value will be ignored.")
        else:
+            if isinstance(value, (float, int, long, basestring)) or value is None:


bool is missing from the types?

bool inherits int in Python. We could add bool for readability though.

Yeah, really confusing. :)
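
A short illustration of why the check above already accepts booleans. The is_scalar helper and the scalar_types tuple are hypothetical, with str standing in for basestring and int covering long on Python 3.

```python
# Python 3 stand-in for (float, int, long, basestring) from the diff above.
scalar_types = (float, int, str)


def is_scalar(value):
    # bool is a subclass of int, so True and False pass this check
    # even though bool is not listed explicitly.
    return isinstance(value, scalar_types) or value is None


# Listing bool explicitly changes nothing at runtime; it only documents intent.
explicit_scalar_types = (bool, float, int, str)

print(issubclass(bool, int))
print(is_scalar(True))
```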

viirya · 2017-08-09T08:04:24Z

python/pyspark/sql/dataframe.py

                             "column name or None. Got {0}".format(type(subset)))

        # Reshape input arguments if necessary
        if isinstance(to_replace, (float, int, long, basestring)):


viirya · 2017-08-09T08:59:05Z

Please add the suggested tests then LGTM

HyukjinKwon · 2017-08-09T09:00:36Z

Yea, looks much safer. LGTM too except the comments above.

SparkQA · 2017-08-09T18:26:40Z

Test build #80459 has finished for PR 18820 at commit bc7a231.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-09T18:32:15Z

retest this please

SparkQA · 2017-08-09T21:10:06Z

Test build #80464 has finished for PR 18820 at commit bc7a231.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-10T00:42:44Z

Thanks! Merging to master.

jiayue-zhang added 9 commits December 8, 2016 21:25

[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None

2653750

Scala test for df.replace with null

2eac8b9

Use pattern matching for null case

7949292

Fix indentation

0b15c8f

Fix Python style check

2c532c3

Merge branch 'master' into spark-14932

5ab39cc

Improve scala doc and pyspark test

43fb6bd

Fix python3 dict.values() syntax

b5424d9

Unify allowed null in Python and Scala

a3939ba

jiayue-zhang mentioned this pull request Aug 2, 2017

[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None #16225

Closed

nchammas reviewed Aug 3, 2017

Simplify all_of_type logic

37dfaa7

jiayue-zhang added 3 commits August 3, 2017 15:00

Throw exception when column is not nullable

fcb617e

Merge branch 'master' into spark-14932

4b502bd

Piggybacking a minor improvement on code I recently pushed

dfbcaf3

jiayue-zhang added 2 commits August 4, 2017 08:47

Revert "Throw exception when column is not nullable"

8f7953b

This reverts commit fcb617e.

Improve a comment

3e3823f

HyukjinKwon reviewed Aug 6, 2017

Check value is None and new scala test

2946659

viirya reviewed Aug 7, 2017

gatorsmile reviewed Aug 7, 2017

More tests, better comments

351be99

ueshin reviewed Aug 8, 2017

Add bool to comment and Error text

a09d3e9

gatorsmile reviewed Aug 9, 2017

viirya reviewed Aug 9, 2017

Separate test and boolean test

bc7a231

asfgit closed this in 84454d7 Aug 10, 2017
