Conversation

@shea-parkes
Contributor

shea-parkes commented Apr 8, 2016

What changes were proposed in this pull request?

Expand the possible ways to interact with the contents of a pyspark.sql.types.StructType instance; a usage sketch follows the list.

  • Iterating a StructType will iterate its fields
    • [field.name for field in my_structtype]
  • Indexing with a string will return a field by name
    • my_structtype['my_field_name']
  • Indexing with an integer will return a field by position
    • my_structtype[0]
  • Indexing with a slice will return a new StructType with just the chosen fields:
    • my_structtype[1:3]
  • The length is the number of fields (should also provide "truthiness" for free)
    • len(my_structtype) == 2
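For concreteness, here is a usage sketch of the proposed accessors (the schema below is a hypothetical example, not taken from the patch):

    from pyspark.sql.types import StructType, StructField, StringType, LongType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", LongType(), True),
    ])

    [field.name for field in schema]  # iteration -> ['name', 'age']
    schema['name']                    # str key -> the named StructField
    schema[0]                         # int key -> the first StructField
    schema[0:1]                       # slice -> a new StructType with one field
    len(schema)                       # -> 2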

How was this patch tested?

Extended the unit test coverage in the accompanying tests.py.

Commit messages:

  • Do this to cover the anticipated new APIs.
  • Do this to support more of the Pythonic "magic" methods.
  • Do this because the more direct syntax is now available.
  • Do this so there is less chance for someone to get the stateful attributes out of sync.
    def __getitem__(self, key):
        """Access fields by name or slice."""
        if isinstance(key, str):
            _dict_fields = {field.name: field for field in self}
Contributor

It seems a bit expensive to compute this dictionary every time, no?

Contributor Author

I figured the performance of the metadata manipulation would be negligible as soon as you interacted with even a small dataset.

I could look into other lookup methods, but this seemed the most straightforward and pythonic.

Contributor Author

A brief bit of googling suggests the only reasonable alternative would be a direct iteration with an early exit when you find a match. I can switch to that if you prefer; just let me know.

Contributor

Looking some more at the Scala implementation, the fields are returned in the order they appear in the schema. It would be good to preserve that behaviour in the Python API for consistency's sake, and I'm not sure this implementation does (I'm not certain about Python slicing behaviour with dictionaries). It would probably be good to add a test for this behaviour as well. What do you think?

Contributor Author

The dictionary will only apply if they access via a str, so it can only return a single field.

It did just occur to me that, as the code currently stands, indexing with an actual slice object will return a list of StructField objects rather than a StructType. I'll go fix that now.
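For illustration, the quirk in question (hypothetical field names; self.fields is the plain list of StructField objects backing the StructType):

    from pyspark.sql.types import StructType, StructField, StringType

    struct = StructType([StructField(name, StringType(), True)
                         for name in ("a", "b", "c")])

    # Delegating the slice straight to the backing list yields a plain list...
    struct.fields[1:3]              # -> [StructField(b,...), StructField(c,...)]
    # ...so the result must be re-wrapped to stay a StructType:
    StructType(struct.fields[1:3])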

Contributor Author

There we go, I added another commit to correctly return StructType objects when accessing fields via a slice. This was a quirk of Python I was only recently made aware of. So in summary, this pull request proposes to:

  • Using a str key will return the named StructField: my_structtype['my_structfield_name']
  • Using a single int key will return the StructField by index: my_structtype[2]
  • Using a slice object will return a new StructType with just the chosen fields: my_structtype[2:3]

I'm happy to expand the tests and/or the documentation if you think any section could be clearer.

Thanks.

Contributor

I think it's OK if this is not called for every row. In that case, the index should be used for performance.

@shea-parkes
Contributor Author

Discussed most of these changes ahead of time with @holdenk and @davies in the associated JIRA. The last commit might be considered extraneous, but I thought it was related enough and a good idea so I went ahead and offered it up. Easy enough to remove if people would prefer.

Ran just the pyspark.sql.types.StructType tests locally on a Python 3.5 setup. Tried to be mindful of coding for Python 2.7 compatibility (although I already miss type hints...).

Should be ready to review/discuss.

Thanks!

@rxin
Contributor

rxin commented Apr 8, 2016

Can you update the description to include the ways that are being expanded?

@shea-parkes
Contributor Author

I updated the pull request description, but I also realize your comment could have meant the user documentation. I'll spend a moment to see if I can find the appropriate location for that.

Do this so the new accessors are better documented.
@shea-parkes
Contributor Author

It appears the user documentation is nicely scraped from the docstrings, so I went ahead and expanded the docstring slightly (and added a couple doctests).

Let me know if you intended something different.
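For example, a doctest along these lines (illustrative only, not copied verbatim from the patch):

    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>> struct = StructType([StructField("f1", StringType(), True)])
    >>> struct["f1"].name
    'f1'
    >>> len(struct)
    1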

Do this because slicing containers in Python should return container
objects of the same type.
@shea-parkes
Contributor Author

I updated the pull request description to reflect the fix for the slice syntax. I believe this offers a solid balance of functionality and the principle of least astonishment. I'm still happy to revise further (or even trim down the proposed changes); just let me know.

@rxin
Contributor

rxin commented Apr 19, 2016

cc @davies for review

    data_type_f = data_type
    self.fields.append(StructField(field, data_type_f, nullable, metadata))
    self.names.append(field)
    self._needSerializeAnyField = any(f.needConversion() for f in self.fields)
Contributor

@davies davies Apr 19, 2016

This is used when unpickling; it's better to keep it (for performance).
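Roughly, the point of this cached flag (a sketch of the idea; the scopes noted in the comments are assumptions about where the code lives, not excerpts of the merged implementation):

    # In __init__()/add(): recompute the cached flag whenever the schema
    # changes, so it stays in sync with self.fields.
    self._needSerializeAnyField = any(f.needConversion() for f in self.fields)

    # In the per-row conversion path, roughly:
    #     if not self._needSerializeAnyField:
    #         ... skip the expensive per-field conversion entirely ...
    # i.e. one precomputed boolean check per row instead of rescanning
    # every field on every row.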

@davies
Contributor

davies commented Apr 19, 2016

@shea-parkes These accessors look good, but I have some concern about the performance of serialization; please don't remove the cached _needSerializeAnyField without a good reason.

@shea-parkes
Contributor Author

Alright, I had all the @property work rolled into a single commit. I've reverted it so only the accessor work should be left. Let me know if you see anything else you think should be changed.

@davies
Contributor

davies commented Apr 19, 2016

LGTM

@SparkQA

SparkQA commented Apr 19, 2016

Test build #2831 has finished for PR 12251 at commit 387cdfe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shea-parkes
Contributor Author

I apologize; I did not realize you were still targeting Python 2.6 compatibility. You guys really are gluttons for punishment. I'll get a commit up today that should provide 2.6 compatibility.
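(For reference, the incompatibility is presumably the dict comprehension shown earlier: {field.name: field for field in self} is Python 2.7+ syntax. The 2.6-compatible spelling would be:)

    # Python 2.6 compatible: feed a generator expression to dict()
    _dict_fields = dict((field.name, field) for field in self)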

@davies
Contributor

davies commented Apr 20, 2016

@shea-parkes We had decided to drop Python 2.6 support, so we should run the tests with Python 2.7; I will send a PR to fix that.

@davies
Contributor

davies commented Apr 20, 2016

After reading the discussion again, it seems we still need some work before we can drop Python 2.6. So please go ahead and fix this for 2.6.

Do this because Spark is still targeting Python 2.6 compatibility.
@shea-parkes
Contributor Author

I believe this pull request should now be compatible with Python 2.6. I stood up a Python 2.6 environment to run the pertinent tests locally, and they now pass.

@shea-parkes
Contributor Author

The final code looks a little silly without a dictionary comprehension, now that I reflect on it. Let me swap it over to returning directly from the iteration.

Do this because without the dictionary comprehension syntax this was
looking very busy.
@shea-parkes
Contributor Author

Okay, I made the switch over to the boring linear-search method when accessing by field name. I'm comfortable with the state of this pull request and believe it should pass tests on Python 2.6 now. Thanks.
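A sketch of what the final linear-search accessor plausibly looks like (reconstructed from the discussion above, not copied from the merged commit):

    def __getitem__(self, key):
        """Access fields by name, index, or slice."""
        if isinstance(key, str):
            # Linear search preserves schema order and avoids rebuilding
            # a lookup dict (2.7+ syntax) on every access.
            for field in self:
                if field.name == key:
                    return field
            raise KeyError('No StructField named {0}'.format(key))
        elif isinstance(key, int):
            try:
                return self.fields[key]
            except IndexError:
                raise IndexError('StructType index out of range')
        elif isinstance(key, slice):
            # Re-wrap so slicing returns a StructType, not a plain list.
            return StructType(self.fields[key])
        else:
            raise TypeError('StructType keys should be strings, ints, or slices')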

@SparkQA

SparkQA commented Apr 20, 2016

Test build #2838 has finished for PR 12251 at commit fc7898b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Apr 20, 2016

Merging this into master, thanks!

@asfgit asfgit closed this in e7791c4 Apr 20, 2016
@shea-parkes
Contributor Author

Thank you guys for working through this.
