
Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

Add a value length check in _create_row, forbidding extra values for custom Row in PySpark.

How was this patch tested?

New UT in pyspark-sql

@SparkQA

SparkQA commented Aug 18, 2018

Test build #94920 has finished for PR 22140 at commit b8c6522.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

cc @HyukjinKwon

@HyukjinKwon
Member

cc @BryanCutler as well since we discussed an issue about this code path before.

@BryanCutler
Member

Does it make any sense to have fewer values than fields? Maybe we should check that they are equal, wdyt @HyukjinKwon ?

@xuanyuanking
Member Author

AFAIC, the fix should forbid passing illegal extra values. If there are fewer values than fields, it should raise an AttributeError on access, as currently implemented, rather than being banned here. What do you think :) @HyukjinKwon @BryanCutler Thanks.

@xuanyuanking
Member Author

gentle ping @HyukjinKwon @BryanCutler

@BryanCutler left a comment

Let's just leave the case of fewer values for another time since you already have this fix. I do think you should move the check to def __call__ in Row, just before _create_row is called. It is more user-facing that way.
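A rough sketch of that placement, using a hypothetical simplified Row (the real one in pyspark.sql is a tuple subclass; the names and structure here are illustrative only):

```python
class Row:
    """Illustrative stand-in for pyspark.sql.Row used as a row factory."""
    def __init__(self, *fields):
        self._fields = fields

    def __call__(self, *values):
        # Check here, before any internal _create_row-style construction,
        # so the error surfaces directly at the user-facing call site.
        if len(values) > len(self._fields):
            raise ValueError(
                "Can not create Row with fields %s, expected %d values but got %s"
                % (self._fields, len(self._fields), values))
        return dict(zip(self._fields, values))

row_class = Row("c1", "c2")
print(row_class(1, 2))   # matching arity is accepted
try:
    row_class(1, 2, 3)   # extra value fails fast at the call site
except ValueError as e:
    print("rejected:", e)
```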

struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)

def test_invalid_create_row(slef):
Member

typo: slef -> self

Member Author

Thanks, done in eb3f506.


def _create_row(fields, values):
    if len(values) > len(fields):
        raise ValueError("Can not create %s by %s" % (fields, values))
Member

I'd like to improve this message a little, maybe "Can not create Row with fields %s, expected %d values but got %s" % (fields, len(fields), values)

Member Author

Thanks, message improved and the check moved to __call__ in Row. eb3f506

self.assertRaises(TypeError, struct_field.typeName)

def test_invalid_create_row(slef):
rowClass = Row("c1", "c2")
Member

nit: rowClass -> row_class

Member Author

Thanks, done in eb3f506.

@HyukjinKwon
Member

@BryanCutler, for #22140 (comment), yea, to me it makes less sense actually but seems at least to be working for now:

from pyspark.sql import Row
rowClass = Row("c1", "c2")
spark.createDataFrame([rowClass(1)]).show()
+---+
| c1|
+---+
|  1|
+---+

I think we should consider disallowing it in 3.0.0 given the test above.

@SparkQA

SparkQA commented Sep 6, 2018

Test build #95756 has finished for PR 22140 at commit eb3f506.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

yea, to me it makes less sense actually but seems at least to be working for now:

good point, I guess it only fails when you supply a schema.

@BryanCutler left a comment

LGTM

asfgit pushed a commit that referenced this pull request Sep 6, 2018
## What changes were proposed in this pull request?

Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.

## How was this patch tested?

New UT in pyspark-sql

Closes #22140 from xuanyuanking/SPARK-25072.

Lead-authored-by: liyuanjian <[email protected]>
Co-authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit c84bc40)
Signed-off-by: Bryan Cutler <[email protected]>
asfgit pushed a commit that referenced this pull request Sep 6, 2018
(cherry picked from commit c84bc40)
Signed-off-by: Bryan Cutler <[email protected]>
@BryanCutler
Member

merged to master, branch 2.4 and 2.3. Thanks @xuanyuanking !

@asfgit asfgit closed this in c84bc40 Sep 6, 2018
@xuanyuanking
Member Author

Thanks @BryanCutler @HyukjinKwon !

@xuanyuanking xuanyuanking deleted the SPARK-25072 branch September 7, 2018 01:48
@gatorsmile
Member

@BryanCutler What is the reason to backport this PR? This sounds like a behavior change.

@xuanyuanking Could you please update the document?

@xuanyuanking
Member Author

@xuanyuanking Could you please update the document?

#22369 Thanks for the reminder, I'll pay attention to this in future work.

@BryanCutler
Member

@gatorsmile it seemed like a straightforward bug fix to me. Rows with extra values lead to incorrect output and exceptions when used in DataFrames, so it did not seem like there was any possibility this would break existing code. For example:

In [1]: MyRow = Row('a','b')

In [2]: print(MyRow(1,2,3))
Row(a=1, b=2)

In [3]: spark.createDataFrame([MyRow(1,2,3)])
Out[3]: DataFrame[a: bigint, b: bigint]

In [4]: spark.createDataFrame([MyRow(1,2,3)]).show()
18/09/08 21:55:48 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 3 values are provided.

In [5]: spark.createDataFrame([MyRow(1,2,3)], schema="x: int, y: int").show()

ValueError: Length of object (3) does not match with length of fields (2)

Maybe I was too hasty with backporting and this needed some discussion. Do you know of a use case that this change would break?

@HyukjinKwon
Member

HyukjinKwon commented Sep 10, 2018

Yea, actually I wouldn't backport this to branch-2.3 at least, since the release is very close. Looks like a bug to me as well.

One nitpicking is the case with RDD operation:

>>> from pyspark.sql import Row
>>> row_class = Row("c1", "c2")
>>> row = row_class(1, 2, 3)
>>> spark.sparkContext.parallelize([row]).map(lambda r: r.c1).collect()
[1]

This is really unlikely and I even wonder if it makes any sense (also given the nature of the Python language itself), but there might still be such a case, although creating the namedtuple-like row with invalid arguments should itself be disallowed, as fixed here.

Can we just simply take this out from branch-2.3?

@BryanCutler
Member

Can we just simply take this out from branch-2.3?

Thanks @HyukjinKwon , that is fine with me. What do you think @gatorsmile ?

@gatorsmile
Member

@BryanCutler @HyukjinKwon Thanks for your understanding. Normally, we are very conservative about introducing any potential behavior change into a released version.

I just reverted it from branch 2.3. Thanks!

@BryanCutler
Member

Thanks for your understanding. Normally, we are very conservative about introducing any potential behavior change into a released version.

Yes, I know. At the time it seemed like failing fast rather than later, and improving the error message, but best to be safe. Thanks!

@gatorsmile
Member

We are very conservative when backporting PRs to released versions.

