[SPARK-20090][PYTHON] Add StructType.fieldNames in PySpark #18618

HyukjinKwon · 2017-07-13T05:52:47Z

What changes were proposed in this pull request?

This PR proposes StructType.fieldNames, which returns a copy of the field name list, as an alternative to the (undocumented) StructType.names.

There are two points here:

- API consistency with Scala/Java.
- Provide a safe way to get the field names. Mutating the list exposed by StructType.names changes the schema itself and can cause unexpected behaviour, as below:

from pyspark.sql.types import *


struct = StructType([StructField("f1", StringType(), True)])
names = struct.names
del names[0]
spark.createDataFrame([{"f1": 1}], struct).show()

...
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 1 fields are required while 0 values are provided.
	at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:138)
	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
...
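
The aliasing hazard above does not depend on Spark itself. The following is a minimal pure-Python sketch, assuming a hypothetical stand-in class `MiniStruct` (not part of PySpark) whose `names` attribute and `fieldNames()` method mirror the behaviour described in this PR:

```python
class MiniStruct:
    """Hypothetical stand-in for StructType; only tracks field names."""

    def __init__(self, names):
        self.names = list(names)  # internal list, exposed directly

    def fieldNames(self):
        # Return a copy so callers cannot alter the internal list.
        return list(self.names)


struct = MiniStruct(["f1"])

safe = struct.fieldNames()
del safe[0]                    # only the copy changes
names_after_copy_edit = list(struct.names)

alias = struct.names
del alias[0]                   # the struct's own list changes
names_after_alias_edit = list(struct.names)

print(names_after_copy_edit)   # ['f1']
print(names_after_alias_edit)  # []
```

Editing the copy leaves the stand-in intact, while editing the alias empties its field list, which is the same kind of mismatch that produces the exception above.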

How was this patch tested?

Added tests in python/pyspark/sql/tests.py.

HyukjinKwon · 2017-07-13T05:53:01Z

python/pyspark/sql/types.py

+    def fieldNames(self):
+        """
+        Returns all field names in a tuple.
+


HyukjinKwon · 2017-07-13T05:53:22Z

python/pyspark/sql/types.py

    This is the data type representing a :class:`Row`.

-    Iterating a :class:`StructType` will iterate its :class:`StructField`s.
+    Iterating a :class:`StructType` will iterate its :class:`StructField`\\s.


Thanks for fixing the documentation issue while you were here :) +1

HyukjinKwon · 2017-07-13T05:57:19Z

Okay, @jkbradley, I tried to put together some arguments for this API here, although I am actually rather neutral on it (the reasons above might not be enough to justify adding an API).

Could you take a look please? I am also fine with closing this PR/JIRA.

SparkQA · 2017-07-13T06:22:12Z

Test build #79578 has finished for PR 18618 at commit efe113f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-07-14T17:45:56Z

Thanks @HyukjinKwon ! I'm still in favor of adding this, partly to match Scala and partly to have API docs for it.

I just had one question: Is there a reason fieldNames should return a tuple in Python, rather than a list? (I just always think of Python lists being the analogue of Scala Arrays.)

HyukjinKwon · 2017-07-15T04:03:07Z

Either way is fine with me. Let me update this to return a list. I was just thinking that struct/row are tuple-like, so the output of this could be a tuple as well.
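
Either return type would protect the schema, since both `tuple(...)` and `list(...)` build a new object. A small sketch of the trade-off in plain Python, with a hypothetical `internal_names` list standing in for the struct's own `names`:

```python
internal_names = ["f1", "f2"]  # hypothetical stand-in for StructType.names

as_tuple = tuple(internal_names)  # immutable snapshot
as_list = list(internal_names)    # mutable, independent copy

try:
    del as_tuple[0]
    tuple_mutated = True
except TypeError:
    # Tuples do not support item deletion.
    tuple_mutated = False

as_list.append("f3")  # allowed, and does not touch internal_names
```

The list form lets callers reorder or extend the result freely, which is closer to how a Scala `Array` is typically used.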

HyukjinKwon · 2017-07-15T04:21:48Z

python/pyspark/sql/types.py

+        >>> struct.fieldNames()
+        ['f1']
+        """
+        return list(self.names)


Just to note that this list call is required to make a copy, which prevents the unexpected behaviour described in the PR description when a caller manipulates the returned names.

>>> df = spark.range(1)
>>> a = df.schema.fieldNames()
>>> b = df.schema.names
>>> df.schema.names[0] = "a"
>>> a
['id']
>>> b
['a']
>>> a[0] = "aaaa"
>>> a
['aaaa']
>>> b
['a']

SparkQA · 2017-07-15T04:44:11Z

Test build #79631 has finished for PR 18618 at commit eaa910d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-07-17T23:21:30Z

@jkbradley, does this make sense to you in general?

holdenk

This looks like a good improvement. Would it make sense to deprecate the current undocumented API and move our internal usage to _name?

HyukjinKwon · 2017-07-21T00:58:28Z

@holdenk, sure, makes sense, but let me just leave a deprecation note for StructType.names if you are okay with that too (at least I use this a lot in production code ...).

HyukjinKwon · 2017-07-21T01:02:24Z

python/pyspark/sql/types.py


+    .. note:: `names` attribute is deprecated in 2.3. Use `fieldNames` method instead
+        to get a list of field names.
+


@holdenk, would you maybe still prefer to deprecate it? I am willing to follow your decision.

This is good enough :)
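
The PR only adds a documentation note. For comparison, a runtime deprecation along the lines suggested earlier in the review might look like the sketch below. This is not what was merged; `MiniStruct` and the private `_names` attribute are hypothetical names used only for illustration.

```python
import warnings


class MiniStruct:
    """Hypothetical stand-in for StructType, not the PySpark class."""

    def __init__(self, names):
        self._names = list(names)  # private storage for internal use

    @property
    def names(self):
        # Warn on access to the old attribute, then return the internal list
        # so that existing callers keep their current behaviour.
        warnings.warn(
            "names is deprecated; use fieldNames() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._names

    def fieldNames(self):
        return list(self._names)  # copy, safe to mutate


struct = MiniStruct(["f1"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = struct.names         # triggers the warning
    new_style = struct.fieldNames()  # no warning
```

A doc-only note avoids emitting warnings in existing production code that reads `names`, which matches the concern raised in the thread.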

SparkQA · 2017-07-21T01:28:54Z

Test build #79814 has finished for PR 18618 at commit 86493be.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung

LGTM

HyukjinKwon · 2017-07-29T03:57:24Z

Would you guys mind if I give my first merge a shot :)?

HyukjinKwon · 2017-07-29T04:01:53Z

retest this please

holdenk · 2017-07-29T04:05:04Z

oh I'm sorry I just merged this to master, but I'll leave the cherry-pick open if you want to back port it to 2.2.X?

HyukjinKwon · 2017-07-29T04:20:17Z

Oh, that's fine. I don't want to back port this. Will give a try in another small and safe one. Thank you!

SparkQA · 2017-07-29T04:27:24Z

Test build #80036 has finished for PR 18618 at commit 86493be.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

This PR proposes `StructType.fieldNames` that returns a copy of a field name list rather than a (undocumented) `StructType.names`.

There are two points here:

- API consistency with Scala/Java
- Provide a safe way to get the field names. Manipulating these might cause unexpected behaviour as below:

```python
from pyspark.sql.types import *

struct = StructType([StructField("f1", StringType(), True)])
names = struct.names
del names[0]
spark.createDataFrame([{"f1": 1}], struct).show()
```

```
...
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 1 fields are required while 0 values are provided.
	at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:138)
	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
...
```

## How was this patch tested?

Added tests in `python/pyspark/sql/tests.py`.

Author: hyukjinkwon <[email protected]>

Closes #18618 from HyukjinKwon/SPARK-20090.

jkbradley · 2017-08-01T00:30:29Z

@holdenk Thanks for merging it! Just wondering: Why is the "pushed a commit" notification from hubot? Did you use the dev/merge_spark_pr.py script?

HyukjinKwon · 2017-08-01T00:35:08Z

(Yea, I was wondering too..)

Add StructType.fieldNames in PySpark

efe113f

Make the return type to a list instead of a tuple

eaa910d

holdenk reviewed Jul 20, 2017

Add a note for deprecation for names attribute in StructType

86493be

felixcheung approved these changes Jul 29, 2017

HyukjinKwon closed this Jul 29, 2017

HyukjinKwon deleted the SPARK-20090 branch January 2, 2018 03:41


5 participants
