[SPARK-25072][PySpark] Forbid extra value for custom Row #22140
Conversation
Test build #94920 has finished for PR 22140 at commit
cc @HyukjinKwon

cc @BryanCutler as well since we discussed an issue about this code path before.
Does it make any sense to have fewer values than fields? Maybe we should check that they are equal, wdyt @HyukjinKwon?
AFAIC, the fix should forbid passing illegal extra values. If there are fewer values than fields, it should get a
gentle ping @HyukjinKwon @BryanCutler
BryanCutler left a comment:

Let's just leave the case of fewer values for another time, since you already have this fix. I do think you should move the check to `def __call__` in Row, just before `_create_row` is called. It is more user-facing that way.
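For reference, a minimal sketch (not the author's exact patch) of what relocating the length check into `Row.__call__` in `python/pyspark/sql/types.py` could look like:

```python
# Sketch: inside class Row (which subclasses tuple, so len(self) is the
# number of field names when Row is used as a factory like Row("c1", "c2")).
def __call__(self, *args):
    """Create a new Row object from this Row's field names."""
    if len(args) > len(self):  # reject extra values up front
        raise ValueError("Can not create %s by %s" % (self, args))
    return _create_row(self, args)
```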
python/pyspark/sql/tests.py (outdated)

```python
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)

def test_invalid_create_row(slef):
```
typo: slef -> self
Thanks, done in eb3f506.
python/pyspark/sql/types.py (outdated)

```python
def _create_row(fields, values):
    if len(values) > len(fields):
        raise ValueError("Can not create %s by %s" % (fields, values))
```
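For context, the rest of `_create_row` (a simplified reading of `python/pyspark/sql/types.py`) just wraps the values in a Row and attaches the field names, which is why an oversized value tuple previously slipped through without complaint:

```python
def _create_row(fields, values):
    # Build a Row from the values, then attach the field names;
    # nothing below compares the two lengths, hence the added check above.
    row = Row(*values)
    row.__fields__ = fields
    return row
```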
I'd like to improve this message a little, maybe: `"Can not create Row with fields %s, expected %d values but got %s" % (fields, len(fields), values)`
Thanks, the message is improved and the check moved to `__call__` in Row: eb3f506.
python/pyspark/sql/tests.py (outdated)

```python
self.assertRaises(TypeError, struct_field.typeName)

def test_invalid_create_row(slef):
    rowClass = Row("c1", "c2")
```
nit: rowClass -> row_class
Thanks, done in eb3f506.
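Putting the review comments together, a sketch of how the test might read once `slef` becomes `self` and `rowClass` becomes `row_class` (assuming the check raises `ValueError`, per the message discussed above):

```python
def test_invalid_create_row(self):
    row_class = Row("c1", "c2")
    # Three values for two fields should now be rejected.
    self.assertRaises(ValueError, lambda: row_class(1, 2, 3))
```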
@BryanCutler, for #22140 (comment), yeah, to me it makes less sense actually but it at least seems to work for now:

```python
from pyspark.sql import Row
rowClass = Row("c1", "c2")
spark.createDataFrame([rowClass(1)]).show()
```

I think we should consider disallowing it in 3.0.0 given the test above.
Test build #95756 has finished for PR 22140 at commit
good point, I guess it only fails when you supply a schema. |
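To illustrate that point, a hypothetical snippet (assuming a live `SparkSession` named `spark`, and pre-fix behavior where `row_class(1, 2, 3)` still succeeds): supplying an explicit schema surfaces the mismatch, while the schemaless path built a frame with incorrect contents:

```python
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, StructField, StructType

row_class = Row("c1", "c2")
row = row_class(1, 2, 3)  # extra value; allowed before this fix

schema = StructType([StructField("c1", IntegerType()),
                     StructField("c2", IntegerType())])
# With the schema, the extra value fails; without it, it went unnoticed.
spark.createDataFrame([row], schema).show()
```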
BryanCutler left a comment:
LGTM
## What changes were proposed in this pull request?

Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.

## How was this patch tested?

New UT in pyspark-sql

Closes #22140 from xuanyuanking/SPARK-25072.

Lead-authored-by: liyuanjian <[email protected]>
Co-authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit c84bc40)
Signed-off-by: Bryan Cutler <[email protected]>
merged to master, branch-2.4 and branch-2.3. Thanks @xuanyuanking!
Thanks @BryanCutler @HyukjinKwon!
@BryanCutler What is the reason to backport this PR? This sounds like a behavior change. @xuanyuanking Could you please update the document?
#22369 Thanks for the reminder, I'll pay attention to this in future work.
@gatorsmile it seemed like a straightforward bug to me. Rows with extra values lead to incorrect output and exceptions when used in

Maybe I was too hasty with backporting and this needed some discussion. Do you know of a use case that this change would break?
Yea, actually I wouldn't backport this to branch-2.3 at least, since the release is very close. Looks like a bug to me as well. One nitpick is the case with an RDD operation:

```python
>>> from pyspark.sql import Row
>>> row_class = Row("c1", "c2")
>>> row = row_class(1, 2, 3)
>>> spark.sparkContext.parallelize([row]).map(lambda r: r.c1).collect()
[1]
```

This is really unlikely and I even wonder if it makes any sense (also given the nature of the Python language itself), but there might still be such a case, although creating a namedtuple-like row with invalid arguments should itself be disallowed, as fixed here. Can we just take this out of branch-2.3?
Thanks @HyukjinKwon, that is fine with me. What do you think @gatorsmile?
@BryanCutler @HyukjinKwon Thanks for your understanding. Normally, we are very conservative about introducing any potential behavior change into a released version. I just reverted it from branch-2.3. Thanks!
Yes, I know. At the time it seemed like failing fast rather than later and improving the error message, but best to be safe. Thanks!
We are very conservative when backporting PRs to a released version.
What changes were proposed in this pull request?

Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.

How was this patch tested?

New UT in pyspark-sql
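For illustration, a hedged before/after of the behavior change (the `Person` row is just an example name, not from the patch):

```python
from pyspark.sql import Row

Person = Row("name", "age")
Person("Alice", 11)        # OK: Row(name='Alice', age=11)
Person("Alice", 11, 80.5)  # before this fix: a silently malformed Row;
                           # after it: raises ValueError
```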