
Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

Add a value length check in _create_row, forbidding extra values for custom Row in PySpark.

How was this patch tested?

New UT in pyspark-sql

@SparkQA

SparkQA commented Aug 18, 2018

Test build #94920 has finished for PR 22140 at commit b8c6522.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

cc @HyukjinKwon

@HyukjinKwon
Member

cc @BryanCutler as well since we discussed an issue about this code path before.

@BryanCutler
Member

Does it make any sense to have fewer values than fields? Maybe we should check that they are equal, wdyt @HyukjinKwon ?

@xuanyuanking
Member Author

AFAIC, the fix should forbid passing illegal extra values. If there are fewer values than fields, it should raise an AttributeError on access, as currently implemented, rather than being banned here. What do you think :) @HyukjinKwon @BryanCutler Thanks.

@xuanyuanking
Member Author

gentle ping @HyukjinKwon @BryanCutler

@BryanCutler left a comment

Let's just leave the case of fewer values for another time since you already have this fix. I do think you should move the check to def __call__ in Row, just before _create_row is called. It is more user-facing that way.
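A rough sketch of that placement, using a hypothetical simplified Row (the real one in pyspark.sql is a tuple subclass; the names and structure here are illustrative only):

```python
class Row:
    """Illustrative stand-in for pyspark.sql.Row used as a row factory."""
    def __init__(self, *fields):
        self._fields = fields

    def __call__(self, *values):
        # Check here, before any internal _create_row-style construction,
        # so the error surfaces directly at the user-facing call site.
        if len(values) > len(self._fields):
            raise ValueError(
                "Can not create Row with fields %s, expected %d values but got %s"
                % (self._fields, len(self._fields), values))
        return dict(zip(self._fields, values))

row_class = Row("c1", "c2")
print(row_class(1, 2))   # matching arity is accepted
try:
    row_class(1, 2, 3)   # extra value fails fast at the call site
except ValueError as e:
    print("rejected:", e)
```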

struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)

def test_invalid_create_row(slef):
Member

typo: slef -> self

Member Author

Thanks, done in eb3f506.


def _create_row(fields, values):
    if len(values) > len(fields):
        raise ValueError("Can not create %s by %s" % (fields, values))
Member

I'd like to improve this message a little, maybe "Can not create Row with fields %s, expected %d values but got %s" % (fields, len(fields), values)

Member Author

Thanks, message improved and the check moved to __call__ in Row. eb3f506

self.assertRaises(TypeError, struct_field.typeName)

def test_invalid_create_row(slef):
rowClass = Row("c1", "c2")
Member

nit: rowClass -> row_class

Member Author

Thanks, done in eb3f506.

@HyukjinKwon
Member

@BryanCutler, for #22140 (comment), yea, to me it makes less sense actually but seems at least to be working for now:

from pyspark.sql import Row
rowClass = Row("c1", "c2")
spark.createDataFrame([rowClass(1)]).show()
+---+
| c1|
+---+
|  1|
+---+

I think we should consider disallowing it in 3.0.0 given the test above.

@SparkQA

SparkQA commented Sep 6, 2018

Test build #95756 has finished for PR 22140 at commit eb3f506.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

yea, to me it makes less sense actually but seems at least to be working for now:

good point, I guess it only fails when you supply a schema.

@BryanCutler left a comment

LGTM

asfgit pushed a commit that referenced this pull request Sep 6, 2018
## What changes were proposed in this pull request?

Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.

## How was this patch tested?

New UT in pyspark-sql

Closes #22140 from xuanyuanking/SPARK-25072.

Lead-authored-by: liyuanjian <[email protected]>
Co-authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit c84bc40)
Signed-off-by: Bryan Cutler <[email protected]>
asfgit pushed a commit that referenced this pull request Sep 6, 2018
(cherry picked from commit c84bc40)
Signed-off-by: Bryan Cutler <[email protected]>
@BryanCutler
Member

merged to master, branch 2.4 and 2.3. Thanks @xuanyuanking !

@asfgit asfgit closed this in c84bc40 Sep 6, 2018
@xuanyuanking
Member Author

Thanks @BryanCutler @HyukjinKwon !

@xuanyuanking xuanyuanking deleted the SPARK-25072 branch September 7, 2018 01:48
@gatorsmile
Member

@BryanCutler What is the reason to backport this PR? This sounds like a behavior change.

@xuanyuanking Could you please update the document?

@xuanyuanking
Member Author

@xuanyuanking Could you please update the document?

#22369 Thanks for the reminder, I'll pay attention to this in future work.

@BryanCutler
Member

@gatorsmile it seemed like a straightforward bug fix to me. Rows with extra values lead to incorrect output and exceptions when used in DataFrames, so it did not seem like there was any possibility this would break existing code. For example:

In [1]: MyRow = Row('a','b')

In [2]: print(MyRow(1,2,3))
Row(a=1, b=2)

In [3]: spark.createDataFrame([MyRow(1,2,3)])
Out[3]: DataFrame[a: bigint, b: bigint]

In [4]: spark.createDataFrame([MyRow(1,2,3)]).show()
18/09/08 21:55:48 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 3 values are provided.

In [5]: spark.createDataFrame([MyRow(1,2,3)], schema="x: int, y: int").show()

ValueError: Length of object (3) does not match with length of fields (2)

Maybe I was too hasty with backporting and this needed some discussion. Do you know of a use case that this change would break?

@HyukjinKwon
Member

HyukjinKwon commented Sep 10, 2018

Yea, actually I wouldn't backport this to branch-2.3 at least, since the release is very close. Looks like a bug to me as well.

One nitpicking is the case with RDD operation:

>>> from pyspark.sql import Row
>>> row_class = Row("c1", "c2")
>>> row = row_class(1, 2, 3)
>>> spark.sparkContext.parallelize([row]).map(lambda r: r.c1).collect()
[1]

This is really unlikely and I even wonder if it makes any sense (also given the nature of the Python language itself), but there might still be such a case, although creating the namedtuple-like row with invalid arguments should itself be disallowed, as fixed here.

Can we just simply take this out from branch-2.3?

@BryanCutler
Member

Can we just simply take this out from branch-2.3?

Thanks @HyukjinKwon , that is fine with me. What do you think @gatorsmile ?

@gatorsmile
Member

@BryanCutler @HyukjinKwon Thanks for your understanding. Normally, we are very conservative about introducing any potential behavior change into a released version.

I just reverted it from branch 2.3. Thanks!

@BryanCutler
Member

Thanks for your understanding. Normally, we are very conservative about introducing any potential behavior change into a released version.

Yes, I know. At the time it seemed like failing fast rather than later, and improving the error message, but best to be safe. Thanks!

@gatorsmile
Member

We are very conservative when backporting PRs to released versions.

