Conversation

@jiayue-zhang
Contributor

@jiayue-zhang jiayue-zhang commented Aug 2, 2017

What changes were proposed in this pull request?

Currently df.na.replace("*", Map[String, String]("NULL" -> null)) throws an exception.
This PR enables passing null/None as the value in the replacement map in DataFrame.replace().
Note that the replacement map keys and values should still be of the same type, while the values can be a mix of null/None and that type.
This PR enables, for example, the following operations:
df.na.replace("*", Map[String, String]("NULL" -> null)) (Scala)
df.na.replace("*", Map[Any, Any](60 -> null, 70 -> 80)) (Scala)
df.na.replace('Alice', None) (Python)
df.na.replace([10, 20]) (Python; replacing with None is the default)
One use case: replace all empty strings with null/None because they were generated incorrectly, then drop all null/None data:
df.na.replace("*", Map("" -> null)).na.drop() (Scala)
df.replace(u'', None).dropna() (Python)
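The replace-then-drop use case can be illustrated without Spark. The sketch below is a plain-Python analogy for the semantics this PR enables (a replacement map whose values may be None, followed by dropping null rows); the helper names na_replace and na_drop are hypothetical, not Spark API.

```python
# Rows modeled as plain dicts; an empty string marks bad data.
rows = [{"name": ""}, {"name": "Jessie"}]

def na_replace(rows, col, mapping):
    # Replace cell values found as keys in `mapping`; unmatched values pass through.
    return [{**r, col: mapping.get(r[col], r[col])} for r in rows]

def na_drop(rows):
    # Drop any row containing a None value.
    return [r for r in rows if all(v is not None for v in r.values())]

cleaned = na_drop(na_replace(rows, "name", {"": None}))
print(cleaned)  # [{'name': 'Jessie'}]
```

The key point mirrored from the PR: the mapping's value may be None, which nulls the cell so a later drop removes the row.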

How was this patch tested?

Scala unit test.
Python doctest and unit test.

@jiayue-zhang
Contributor Author

This PR reopens #16225
Please take a look @gatorsmile @holdenk @HyukjinKwon Thanks!

@HyukjinKwon
Member

ok to test

for all_of_type in [all_of_bool, all_of_str, all_of_numeric]):
if not any(key_all_of_type(rep_dict.keys()) and value_all_of_type(rep_dict.values())
for (key_all_of_type, value_all_of_type)
in [all_of_bool, all_of_str, all_of_numeric]):
Contributor

@nchammas nchammas Aug 3, 2017

Why not just put None here and keep the various all_of_* variables defined as they were before? Seems like it would be clearer.
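For context, the check being discussed in the diff above can be sketched in plain Python. The all_of helper and the exact validator-pair structure are reconstructions (assumptions, not the actual pyspark source): each entry pairs a key validator with a value validator that additionally accepts None, so replacement-to-null passes the check.

```python
def all_of(types):
    # Predicate factory: every element is an instance of `types`.
    def inner(xs):
        return all(isinstance(x, types) for x in xs)
    return inner

# Hypothetical validator pairs: (key check, value check); the value
# check also permits None so that null replacements are accepted.
all_of_bool = (all_of(bool), all_of((bool, type(None))))
all_of_str = (all_of(str), all_of((str, type(None))))
all_of_numeric = (all_of((int, float)), all_of((int, float, type(None))))

def valid_replacement(rep_dict):
    # The dict is valid if some type family accepts all keys and all values.
    return any(key_all_of_type(rep_dict.keys()) and value_all_of_type(rep_dict.values())
               for key_all_of_type, value_all_of_type
               in [all_of_bool, all_of_str, all_of_numeric])

print(valid_replacement({"NULL": None}))   # True
print(valid_replacement({10: "x"}))        # False: mixed key/value families
```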

@jiayue-zhang
Contributor Author

Hey @nchammas I made the logic much simpler.

@jiayue-zhang
Contributor Author

jiayue-zhang commented Aug 3, 2017

What if the field is not nullable? I did a test:

val rows = spark.sparkContext.parallelize(Seq(
  Row("Bravo", 28, 183.5),
  Row("Jessie", 18, 165.8)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true),
  StructField("height", DoubleType, nullable = true)))
val input1 = spark.createDataFrame(rows, schema)
val output2 = input1.na.replace("name", Map("Bravo" -> null))
input1.printSchema()
output2.printSchema()
output2.show(false)

I got:

root
 |-- name: string (nullable = false)
 |-- age: integer (nullable = true)
 |-- height: double (nullable = true)

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: double (nullable = true)

+------+---+------+
|name  |age|height|
+------+---+------+
|null  |28 |183.5 |
|Jessie|18 |165.8 |
+------+---+------+

The field becomes nullable.
I don't think we should allow users to change field nullability while doing a replace. What do you think? I'm going to make it throw an IllegalArgumentException.

@HyukjinKwon
Member

HyukjinKwon commented Aug 3, 2017

Hi @holdenk and @gatorsmile, while you are here, could you trigger the Jenkins build? It looks like I still have some problems triggering it.

@nchammas
Contributor

nchammas commented Aug 4, 2017

Jenkins test this please.

(Let's see if I still have the magic power.)

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80229 has finished for PR 18820 at commit dfbcaf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor

nchammas commented Aug 4, 2017

I don't think we should allow user to change field nullability while doing replace.

Why not? As long as we correctly update the schema from non-nullable to nullable, it seems OK to me. What would we be protecting against by disallowing this?

@jiayue-zhang
Contributor Author

Hey @nchammas, I don't have a strong opinion on this, so I've changed it back to what it was.

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80252 has finished for PR 18820 at commit 3e3823f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"Got {0}".format(type(to_replace)))

if not isinstance(value, valid_types) and not isinstance(to_replace, dict):
if not isinstance(value, valid_types + (type(None), )) and not isinstance(to_replace, dict):
Member

I would check for None with value is None.

// Replace only the age column
val out1 = input.na.replace("age", Map(
// Replace only the age column and with null
val out1 = input.na.replace("age", Map[Any, Any](
Member

How about rather adding a separate test and leaving the existing test as is?

case (k: String, null) => (k, null)
case (k: Boolean, null) => (k, null)
case (k, null) => (convertToDouble(k), null)
case _ @(k, v) => (convertToDouble(k), convertToDouble(v))
Member

Could we use case (k, v) => instead of case _ @(k, v) => ?

|null| null| null|
+----+------+-----+
>>> df4.na.replace('Alice', None).show()
Member

Looks like now we allow something like df4.na.replace('Alice').show(). We'd better add it here.

Member

Actually, I think this should be something to be fixed in DataFrameNaFunctions.replace in this file ...

Member

and was thinking of not doing this here as strictly it should be a followup for SPARK-19454. I am fine with doing this here too while we are here.

Member

This change allows us to do df4.na.replace('Alice'). I think SPARK-19454 doesn't?

Member

@HyukjinKwon HyukjinKwon Aug 7, 2017

I guess it is .na.replace vs .replace. I think both should be the same though. I just built against this PR and double checked as below:

>>> df = spark.createDataFrame([('Alice', 10, 80.0)])
>>> df.replace("Alice").first()
Row(_1=None, _2=10, _3=80.0)
>>> df.na.replace("Alice").first()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: replace() takes at least 3 arguments (2 given)

Member

I hadn't noticed that. Why do we test dataframe.na.replace in the doctest of dataframe.replace? We should test dataframe.replace here.

Member

@HyukjinKwon HyukjinKwon Aug 7, 2017

I assume we added an alias for dataframe.replace to promote using dataframe.na.replace? The doc says they are aliases anyway. I don't know, but I tend to agree with pairing doctests, and this looks renamed in ff26767.

Let's leave this as is for now. I don't want to make this PR complicated.

Member

OK. I'm fine with this.

Member

I filed a JIRA for the mismatched default value between replace and na.replace: SPARK-21658.

@gatorsmile
Member

@bravo-zhang Could you update the PR description to explain what this PR is trying to achieve? So far, it is not clear enough about what you did in this PR. Thanks!

}.getMessage
assert(message.equals("Failed to merge fields 'b' and 'b'. " +
"Failed to merge incompatible data types FloatType and LongType"))
assert(message === "Failed to merge fields 'b' and 'b'. " +
Member

Nit: not related to this PR. Please revert it back.

Contributor Author

Is this change a valid improvement? I forgot about === when I pushed that commit. I can revert this, but do I need to create another PR? With or without a JIRA?

Member

@HyukjinKwon HyukjinKwon Aug 8, 2017

It does look like a valid improvement, but it makes backporting harder sometimes. Let's revert this one if that's fine. We could suggest fixing it when someone (or you) happens to touch the code around here. I guess it is too trivial for its own PR.

case v: Boolean => replacement
case _ => replacement.map { case (k, v) => (convertToDouble(k), convertToDouble(v)) }
// replacementMap is either Map[String, String], Map[Double, Double] or Map[Boolean, Boolean],
// while the values can be null
Member

If the types are not one of these three, what is the behavior? Could you explain it here? Also, please add negative examples too. Thanks~

Contributor Author

If replacement is of type Map[Any, Any], the replacementMap will not be confined to these 3 types.
We tell users in the method doc to only use doubles, strings and booleans in the replacement map, but a user can still write df.na.replace("*", Map(10 -> 20, "Alpha" -> "Bravo")). The result is that only fields with the same type as the first key in the replacement map will perform replacement. This is due to the implementation of targetColumnType a few lines below.
I'll modify the comments here. But for the negative examples (like the one I mentioned in this comment), do I need to explain them to users in the method doc?
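The "first key fixes the column type" behavior described here can be mimicked in a few lines of plain Python. This is an illustrative sketch of the described semantics, not the Scala targetColumnType implementation; replace_in_row is a hypothetical name.

```python
def replace_in_row(row, replacement):
    # The first key in the replacement map fixes the target column type;
    # pairs whose key has a different type are effectively ignored,
    # matching the behavior described in the comment above.
    target_type = type(next(iter(replacement)))
    return [replacement.get(v, v) if type(v) is target_type else v
            for v in row]

# The 10 -> 20 pair applies; "Alpha" -> "Bravo" is silently ignored
# because the first key (10) made Int the target type.
print(replace_in_row([10, "Alpha"], {10: 20, "Alpha": "Bravo"}))  # [20, 'Alpha']
```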

* (Scala-specific) Replaces values matching keys in `replacement` map.
* Key and value of `replacement` map must have the same type, and
* can only be doubles, strings or booleans.
* `replacement` map value can have null.
Member

Do not put it here. It should be put in @param.

Contributor Author

The original comment puts many @param descriptions up here. I will fix those as well.

@gatorsmile
Member

Could you also add a test case to cover the end-to-end use case the JIRA mentioned? Also put it in the PR description, which will be part of the PR commit. Thanks!

@gatorsmile
Member

cc @ueshin Could you also take a look at the code changes on the Python side? Thanks!

@jiayue-zhang
Contributor Author

jiayue-zhang commented Aug 8, 2017

Hi @HyukjinKwon @gatorsmile @viirya, I addressed your comments, added more test coverage, and provided more info in the PR description.
One thing that is not clear to users is that they can still use df.na.replace("*", Map(10 -> 20, "Alpha" -> "Bravo")). The behavior is that only fields with the same type as the first key in the replacement map will perform replacement (so "Alpha" -> "Bravo" has no effect). This is due to the implementation of targetColumnType. This also creates a discrepancy: in Python we check that all keys and values are of the same type, while in Scala we don't. This behavior existed before this PR.
I added a one-line comment: // Only fields of targetColumnType will perform replacement. Is that enough for now? If we want to make it more elegant, is it a valid task to accept any replacement map as long as each key-value pair has the same type? Another alternative is to do the type check in Scala just as in Python.
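The per-pair alternative raised here is easy to state concretely. The sketch below is a hypothetical version of that check (not Spark code, in either language): accept the map when every key/value pair shares a type, with None treated as compatible with any key type.

```python
def pairwise_same_type(rep_dict):
    # Accept the map if each key/value pair shares a type; a None value
    # is treated as compatible with any key type (replacement-to-null).
    return all(v is None or type(k) is type(v) for k, v in rep_dict.items())

print(pairwise_same_type({10: 20, "Alpha": "Bravo"}))  # True: each pair is homogeneous
print(pairwise_same_type({"NULL": None}))              # True: None value allowed
print(pairwise_same_type({10: "x"}))                   # False: mixed pair
```

Under this rule the mixed map Map(10 -> 20, "Alpha" -> "Bravo") would be accepted, unlike the current Python check which requires all keys and values to share one type.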

mapping between a value and a replacement.
:param value: int, long, float, string, or list.
The replacement value must be an int, long, float, or string. If `value` is a
:param value: int, long, float, string, list or None.
Member

It's not related to this PR, but shouldn't we add bool to this type list here and in the other descriptions?

@gatorsmile
Member

Thanks! Will review it tomorrow.

@SparkQA

SparkQA commented Aug 8, 2017

Test build #80379 has finished for PR 18820 at commit 351be99.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 8, 2017

Test build #80382 has finished for PR 18820 at commit a09d3e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(out1(4) === Row("Amy", null, null))
assert(out1(5) === Row(null, null, null))

// Replace String with String and null
Member

Create a separate test case

test("replace with null") {
  ...
}

assert(out2(4) === Row("Amy", null, null))
assert(out2(5) === Row(null, null, null))

// Replace Double with null
Member

Also add a test case for boolean

@gatorsmile
Member

LGTM except a few minor comments.

if value is not None:
warnings.warn("to_replace is a dict and value is not None. value will be ignored.")
else:
if isinstance(value, (float, int, long, basestring)) or value is None:
Member

bool is missing from the types?

Member

@HyukjinKwon HyukjinKwon Aug 9, 2017

bool inherits from int in Python. We could add bool for readability, though.
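This point is easy to verify interactively: because bool is a subclass of int, the existing isinstance checks against int already accept True/False, so listing bool would be purely for readability.

```python
# bool subclasses int, so isinstance checks against int accept booleans.
print(issubclass(bool, int))           # True
print(isinstance(True, int))           # True
print(isinstance(True, (float, int)))  # True
```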

Member

Yeah, really confusing. :)

"column name or None. Got {0}".format(type(subset)))

# Reshape input arguments if necessary
if isinstance(to_replace, (float, int, long, basestring)):
Member

bool?

@viirya
Member

viirya commented Aug 9, 2017

Please add the suggested tests, then LGTM.

@HyukjinKwon
Member

Yea, looks much safer. LGTM too except the comments above.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80459 has finished for PR 18820 at commit bc7a231.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80464 has finished for PR 18820 at commit bc7a231.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 84454d7 Aug 10, 2017