[SPARK-13842][PYSPARK] pyspark.sql.types.StructType accessor enhancements #12251
Conversation
Do this to cover the anticipated new APIs.
Do this to support more of the Pythonic "magic" methods.
Do this because the more direct syntax is now available.
Do this so there is less chance for someone to get the stateful attributes out of sync.
python/pyspark/sql/types.py (Outdated)

```python
def __getitem__(self, key):
    """Access fields by name or slice."""
    if isinstance(key, str):
        _dict_fields = {field.name: field for field in self}
```
It seems a bit expensive to compute this dictionary every time, no?
I figured the performance of the metadata manipulation would be negligible as soon as you interacted with even a small dataset.
I could look into other lookup methods, but this seemed the most straightforward and pythonic.
A brief bit of googling suggests the only reasonable alternative would be a direct iteration with an early exit when you find a match. I can switch to that if you prefer; just let me know.
Looking some more at the Scala implementation: it is written so that fields are returned in the order they appear in the schema. It would be good to preserve this behaviour in the Python API for consistency's sake (and I'm not sure this implementation does, since I'm not super sure of Python slicing behaviour with dictionaries). It would probably be good to add a test for this behaviour as well. What do you think?
The dictionary lookup only applies when accessing via a str key, so it can only ever return a single field.
It did just occur to me that using an actual slice object currently returns a list of StructField objects instead of a StructType object in this code. I'll go fix that now.
There we go, I added another commit to correctly return StructType objects when accessing fields via a slice. This was a quirk of Python I was only recently made aware of. So in summary, this pull request proposes:

- A `str` key returns the named `StructField`: `my_structtype['my_structfield_name']`
- A single `int` key returns the `StructField` at that index: `my_structtype[2]`
- A `slice` object (e.g. `[2:3]`) returns a new `StructType` with the chosen fields: `my_structtype[2:3]`

I'm happy to expand the tests and/or the documentation if you think any section could be clearer.
Thanks.
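For concreteness, here is what those three access patterns look like in use (a minimal sketch; the schema and field names are hypothetical, not from this PR):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical example schema, just to illustrate the new accessors.
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('city', StringType(), True),
])

schema['age']    # str key -> the named StructField
schema[1]        # int key -> the StructField at that index
schema[0:2]      # slice   -> a new StructType with the first two fields
```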
I think it's OK if this is not called for every row. In that case, the index should be used for performance.
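One way to read that suggestion (a hypothetical sketch, not code from this PR): when the same lookup will be repeated for many rows, resolve the name to an index once and use the integer index on the hot path.

```python
# Hypothetical pattern: resolve the field name to an index once,
# then use the cheap integer index for the per-row work.
idx = schema.names.index('age')   # one-time name lookup
for row in rows:                  # 'rows' is assumed to exist
    value = row[idx]              # per-row access by index
```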
Discussed most of these changes ahead of time with @holdenk and @davies in the associated JIRA. The last commit might be considered extraneous, but I thought it was related enough and a good idea, so I went ahead and offered it up. Easy enough to remove if people would prefer. Ran just the … tests. Should be ready to review/discuss. Thanks!
Can you update the description to include the ways that are being expanded?
I updated the pull request description, but I also realize your comment could have meant the user documentation. I'll spend a moment to see if I can find the appropriate location for that.
Do this so the new accessors are better documented.
It appears the user documentation is nicely scraped from the docstrings, so I went ahead and expanded the docstring slightly (and added a couple of …). Let me know if you intended something different.
Do this because slicing containers in Python should return container objects of the same type.
I updated the pull request description to reflect the fix for using the `slice` accessor.
cc @davies for review
```python
data_type_f = data_type
self.fields.append(StructField(field, data_type_f, nullable, metadata))
self.names.append(field)
self._needSerializeAnyField = any(f.needConversion() for f in self.fields)
```
This is used when unpickling; it's better to keep it (for performance).
@skparkes These accessors look good, but I have some concern about the performance of serialization; please don't remove the `_needSerializeAnyField` caching without a good reason.
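For context on why that flag is cached: it sits on the per-row serialization path, roughly like the following simplified sketch (illustrative, not the actual Spark source):

```python
def toInternal(self, obj):
    # Simplified sketch: the flag is consulted once per row, so
    # recomputing any(f.needConversion() for f in self.fields) here
    # would add a cost to every row. Precomputing it whenever the
    # field list changes keeps the common path cheap.
    if self._needSerializeAnyField:
        return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
    return tuple(obj)
```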
This reverts commit 831f03a.
Alright, I reverted the commit that removed the `_needSerializeAnyField` caching.
LGTM
Test build #2831 has finished for PR 12251 at commit …
I apologize, I did not realize you were still targeting Python 2.6 compatibility. You guys really are gluttons for punishment. I'll get a commit up today that should provide 2.6 compatibility.
@skparkes We had decided to drop Python 2.6 support, so we should run the tests with Python 2.7; I will send a PR to fix that.
After reading the discussion again, it seems that we still need some work before we can drop Python 2.6, so please go ahead and fix this for 2.6.
Do this because Spark is still targeting Python 2.6 compatibility.
I believe this pull request should now be compatible with Python 2.6. I did stand up a Python 2.6 environment to run the pertinent tests locally, and they now pass.
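For reference, the incompatibility here: dict comprehensions only arrived in Python 2.7, so on 2.6 the same mapping has to be spelled with the `dict()` constructor and a generator expression, along these lines:

```python
# Python 2.7+ (a SyntaxError on Python 2.6):
# _dict_fields = {field.name: field for field in self}

# Python 2.6-compatible equivalent:
_dict_fields = dict((field.name, field) for field in self)
```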
Now that I reflect on it, the final code looks a little silly without a dictionary comprehension. Let me swap that over to returning directly from the iteration.
Do this because without the dictionary comprehension syntax this was looking very busy.
Okay, I made the switch over to the boring linear-search method when accessing by field name. I'm comfortable with the state of this pull request and believe it should pass tests on Python 2.6 now. Thanks.
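The end state is then a linear scan along these lines (a sketch consistent with the behaviour described in this thread, not necessarily the exact merged code):

```python
def __getitem__(self, key):
    """Access fields by name, index, or slice."""
    if isinstance(key, str):
        # Linear search in schema order; fine since this is metadata
        # access rather than a per-row hot path.
        for field in self:
            if field.name == key:
                return field
        raise KeyError('No StructField named {0}'.format(key))
    elif isinstance(key, int):
        try:
            return self.fields[key]
        except IndexError:
            raise IndexError('StructType index out of range')
    elif isinstance(key, slice):
        # Re-wrap so slicing a StructType yields a StructType.
        return StructType(self.fields[key])
    else:
        raise TypeError('StructType keys should be strings, integers or slices')
```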
Test build #2838 has finished for PR 12251 at commit …
Merging this into master, thanks!
Thank you guys for working through this.
What changes were proposed in this pull request?

Expand the possible ways to interact with the contents of a `pyspark.sql.types.StructType` instance:

- Iterating a `StructType` will iterate its fields: `[field.name for field in my_structtype]`
- Accessing by name returns the named `StructField`: `my_structtype['my_field_name']`
- Accessing by index returns the `StructField` at that index: `my_structtype[0]`
- Slicing returns a new `StructType` with just the chosen fields: `my_structtype[1:3]`
- `len()` returns the number of fields: `len(my_structtype) == 2`

How was this patch tested?

Extended the unit test coverage in the accompanying `tests.py`.
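A sketch of the kind of coverage that implies (illustrative; the actual assertions in `tests.py` may differ):

```python
import unittest
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

class StructTypeAccessorTests(unittest.TestCase):
    def test_struct_type_accessors(self):
        struct = StructType([
            StructField('f1', StringType(), True),
            StructField('f2', IntegerType(), True),
        ])
        self.assertEqual(len(struct), 2)
        self.assertEqual([f.name for f in struct], ['f1', 'f2'])
        self.assertEqual(struct['f2'].name, 'f2')   # access by name
        self.assertEqual(struct[0].name, 'f1')      # access by index
        sliced = struct[0:1]                        # access by slice
        self.assertIsInstance(sliced, StructType)   # same container type
        self.assertEqual([f.name for f in sliced], ['f1'])
```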