Conversation

@shea-parkes
Contributor

shea-parkes commented Apr 8, 2016

What changes were proposed in this pull request?

Expand the possible ways to interact with the contents of a pyspark.sql.types.StructType instance; a usage sketch follows the list.

  • Iterating a StructType will iterate its fields
    • [field.name for field in my_structtype]
  • Indexing with a string will return a field by name
    • my_structtype['my_field_name']
  • Indexing with an integer will return a field by position
    • my_structtype[0]
  • Indexing with a slice will return a new StructType with just the chosen fields:
    • my_structtype[1:3]
  • The length is the number of fields (should also provide "truthiness" for free)
    • len(my_structtype) == 2
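For concreteness, here is a usage sketch of the proposed accessors (the schema below is a hypothetical example, not taken from the patch):

    from pyspark.sql.types import StructType, StructField, StringType, LongType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", LongType(), True),
    ])

    [field.name for field in schema]  # iteration -> ['name', 'age']
    schema['name']                    # str key -> the named StructField
    schema[0]                         # int key -> the first StructField
    schema[0:1]                       # slice -> a new StructType with one field
    len(schema)                       # -> 2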

How was this patch tested?

Extended the unit test coverage in the accompanying tests.py.

Commit messages:

  • Do this to cover the anticipated new APIs.
  • Do this to support more of the Pythonic "magic" methods.
  • Do this because the more direct syntax is now available.
  • Do this so there is less chance for someone to get the stateful attributes out of sync.
    def __getitem__(self, key):
        """Access fields by name or slice."""
        if isinstance(key, str):
            _dict_fields = {field.name: field for field in self}
Contributor

It seems a bit expensive to compute this dictionary every time, no?

Contributor Author

I figured the performance of the metadata manipulation would be negligible as soon as you interacted with even a small dataset.

I could look into other lookup methods, but this seemed the most straightforward and pythonic.

Contributor Author

A brief bit of googling suggests the only reasonable alternative would be a direct iteration with an early exit when you find a match. I can switch to that if you prefer; just let me know.

Contributor

Looking some more at the Scala implementation, the fields are returned in the order they appear in the schema. It would be good to preserve that behaviour in the Python API for consistency's sake, and I'm not sure this implementation does (I'm not certain about Python slicing behaviour with dictionaries). It would probably be good to add a test for this behaviour as well. What do you think?

Contributor Author

The dictionary will only apply if they access via a str, so it can only return a single field.

It did just occur to me that, as the code currently stands, indexing with an actual slice object will return a list of StructField objects rather than a StructType. I'll go fix that now.
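For illustration, the quirk in question (hypothetical field names; self.fields is the plain list of StructField objects backing the StructType):

    from pyspark.sql.types import StructType, StructField, StringType

    struct = StructType([StructField(name, StringType(), True)
                         for name in ("a", "b", "c")])

    # Delegating the slice straight to the backing list yields a plain list...
    struct.fields[1:3]              # -> [StructField(b,...), StructField(c,...)]
    # ...so the result must be re-wrapped to stay a StructType:
    StructType(struct.fields[1:3])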

Contributor Author

There we go, I added another commit to correctly return StructType objects when accessing fields via a slice. This was a quirk of Python I was only recently made aware of. So in summary, this pull request proposes to:

  • Using a str key will return the named StructField: my_structtype['my_structfield_name']
  • Using a single int key will return the StructField by index: my_structtype[2]
  • Using a slice object will return a new StructType with just the chosen fields: my_structtype[2:3]

I'm happy to expand the tests and/or the documentation if you think any section could be clearer.

Thanks.

Contributor

I think it's OK if this is not called for every row. In that case, the index should be used for performance.

@shea-parkes
Contributor Author

Discussed most of these changes ahead of time with @holdenk and @davies in the associated JIRA. The last commit might be considered extraneous, but I thought it was related enough and a good idea so I went ahead and offered it up. Easy enough to remove if people would prefer.

Ran just the pyspark.sql.types.StructType tests locally on a Python 3.5 setup. Tried to be mindful of coding for Python 2.7 compatibility (although I already miss type hints...).

Should be ready to review/discuss.

Thanks!

@rxin
Contributor

rxin commented Apr 8, 2016

Can you update the description to include the ways that are being expanded?

@shea-parkes
Contributor Author

I updated the pull request description, but I also realize your comment could have meant the user documentation. I'll spend a moment to see if I can find the appropriate location for that.

Do this so the new accessors are better documented.
@shea-parkes
Contributor Author

It appears the user documentation is nicely scraped from the docstrings, so I went ahead and expanded the docstring slightly (and added a couple doctests).

Let me know if you intended something different.
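For example, a doctest along these lines (illustrative only, not copied verbatim from the patch):

    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>> struct = StructType([StructField("f1", StringType(), True)])
    >>> struct["f1"].name
    'f1'
    >>> len(struct)
    1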

Do this because slicing containers in Python should return container
objects of the same type.
@shea-parkes
Contributor Author

I updated the pull request description to reflect the fix for the slice syntax. I believe this offers a solid balance of functionality and the principle of least astonishment. I'm still happy to revise further (or even trim down the proposed changes); just let me know.

@rxin
Contributor

rxin commented Apr 19, 2016

cc @davies for review

    data_type_f = data_type
    self.fields.append(StructField(field, data_type_f, nullable, metadata))
    self.names.append(field)
    self._needSerializeAnyField = any(f.needConversion() for f in self.fields)
Contributor

@davies davies Apr 19, 2016

This is used when unpickling; it's better to keep it (for performance).
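Roughly, the point of this cached flag (a sketch of the idea; the scopes noted in the comments are assumptions about where the code lives, not excerpts of the merged implementation):

    # In __init__()/add(): recompute the cached flag whenever the schema
    # changes, so it stays in sync with self.fields.
    self._needSerializeAnyField = any(f.needConversion() for f in self.fields)

    # In the per-row conversion path, roughly:
    #     if not self._needSerializeAnyField:
    #         ... skip the expensive per-field conversion entirely ...
    # i.e. one precomputed boolean check per row instead of rescanning
    # every field on every row.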

@davies
Contributor

davies commented Apr 19, 2016

@shea-parkes These accessors look good, but I have some concern about the performance of serialization; please don't remove the cached _needSerializeAnyField without a good reason.

@shea-parkes
Contributor Author

Alright, I had all the @property work rolled into a single commit. I've reverted it so only the accessor work should be left. Let me know if you see anything else you think should be changed.

@davies
Contributor

davies commented Apr 19, 2016

LGTM

@SparkQA

SparkQA commented Apr 19, 2016

Test build #2831 has finished for PR 12251 at commit 387cdfe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shea-parkes
Contributor Author

I apologize; I did not realize you were still targeting Python 2.6 compatibility. You guys really are gluttons for punishment. I'll get a commit up today that should provide 2.6 compatibility.
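(For reference, the incompatibility is presumably the dict comprehension shown earlier: {field.name: field for field in self} is Python 2.7+ syntax. The 2.6-compatible spelling would be:)

    # Python 2.6 compatible: feed a generator expression to dict()
    _dict_fields = dict((field.name, field) for field in self)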

@davies
Contributor

davies commented Apr 20, 2016

@shea-parkes We had decided to drop Python 2.6 support, so we should run the tests with Python 2.7; I will send a PR to fix that.

@davies
Contributor

davies commented Apr 20, 2016

After reading the discussion again, it seems we still need some work before we can drop Python 2.6. So please go ahead and fix this for 2.6.

Do this because Spark is still targeting Python 2.6 compatibility.
@shea-parkes
Contributor Author

I believe this pull request should now be compatible with Python 2.6. I stood up a Python 2.6 environment to run the pertinent tests locally, and they now pass.

@shea-parkes
Contributor Author

The final code looks a little silly without a dictionary comprehension, now that I reflect on it. Let me swap it over to returning directly from the iteration.

Do this because without the dictionary comprehension syntax this was
looking very busy.
@shea-parkes
Contributor Author

Okay, I made the switch over to the boring linear-search method when accessing by field name. I'm comfortable with the state of this pull request and believe it should pass tests on Python 2.6 now. Thanks.
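A sketch of what the final linear-search accessor plausibly looks like (reconstructed from the discussion above, not copied from the merged commit):

    def __getitem__(self, key):
        """Access fields by name, index, or slice."""
        if isinstance(key, str):
            # Linear search preserves schema order and avoids rebuilding
            # a lookup dict (2.7+ syntax) on every access.
            for field in self:
                if field.name == key:
                    return field
            raise KeyError('No StructField named {0}'.format(key))
        elif isinstance(key, int):
            try:
                return self.fields[key]
            except IndexError:
                raise IndexError('StructType index out of range')
        elif isinstance(key, slice):
            # Re-wrap so slicing returns a StructType, not a plain list.
            return StructType(self.fields[key])
        else:
            raise TypeError('StructType keys should be strings, ints, or slices')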

@SparkQA

SparkQA commented Apr 20, 2016

Test build #2838 has finished for PR 12251 at commit fc7898b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Apr 20, 2016

Merging this into master, thanks!

@asfgit asfgit closed this in e7791c4 Apr 20, 2016
@shea-parkes
Contributor Author

Thank you guys for working through this.
