-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-51920][SS][PYTHON] Fix type handling of namedTuple for transfromWithState #53314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[SPARK-51920][SS][PYTHON] Fix type handling of namedTuple for transfromWithState #53314
Conversation
| # Named tuples (collections.namedtuple or typing.NamedTuple) have a | ||
| # _fields attribute. Spark Row has __fields__. Both require positional | ||
| # arguments and cannot be instantiated with a generator expression. | ||
| if hasattr(v, '_fields') or hasattr(v, '__fields__'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think checking the type is much safer than checking an attribute, especially considering that _fields is a not a rare attribute name. If we know what type we are targeting, we should just check type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! You are so fast. Did not get a chance to add a UT yet 😛 just convert to draft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @gaogaotiantian, I use
if (
isinstance(v, Row) or
(isinstance(v, tuple) and hasattr(v, "_fields"))
):
instead. isinstance(v, NamedTuple) won’t work because typing.NamedTuple is a class factory, not a runtime parent of instances. Checking isinstance(v, tuple) and _fields is the correct way. Please take another look. Thanks!
bogao007
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, one minor comment on the test.
python/pyspark/sql/tests/pandas/helper/helper_pandas_transform_with_state.py
Outdated
Show resolved
Hide resolved
|
|
||
| # A stateful processor that contains composite python type inside Value, List and Map state variable | ||
| class PandasStatefulProcessorCompositeType(StatefulProcessor): | ||
| from typing import NamedTuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to the top of the test module? Is there a specific reason that you want it in the class definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right. Just moved it the top!
What changes were proposed in this pull request?
Fix type handling of namedTuple for transfromWithState
Why are the changes needed?
We hit the issue when using namedTuple as value of structType like
The
_serialize_to_bytescannot construct the namedTuple correctly and hitIt's because NamedTuple cannot accept generator as parameter.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
Was this patch authored or co-authored using generative AI tooling?
No