Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jul 13, 2017

What changes were proposed in this pull request?

This PR proposes StructType.fieldNames that returns a copy of a field name list rather than a (undocumented) StructType.names.

There are two points here:

  • API consistency with Scala/Java

  • Provide a safe way to get the field names. Manipulating these might cause unexpected behaviour as below:

    from pyspark.sql.types import *
    
    
    struct = StructType([StructField("f1", StringType(), True)])
    names = struct.names
    del names[0]
    spark.createDataFrame([{"f1": 1}], struct).show()
    ...
    java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 1 fields are required while 0 values are provided.
    	at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:138)
    	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
    	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
    ...
    

How was this patch tested?

Added tests in python/pyspark/sql/tests.py.

def fieldNames(self):
"""
Returns all field names in a tuple.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2017-07-13 1 50 58

This is the data type representing a :class:`Row`.
Iterating a :class:`StructType` will iterate its :class:`StructField`s.
Iterating a :class:`StructType` will iterate its :class:`StructField`\\s.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before

2017-07-13 2 28 03

After

2017-07-13 2 48 36

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank's for fixing the documentation issue while you were here :) +1

@HyukjinKwon
Copy link
Member Author

Okay, @jkbradley, I tried to find and build some arguments for this API here although actually I am rather neutral on this (as the reasons above might not be worth enough adding an API).

Could you take a look please? I am also fine with closing this PR/JIRA.

@SparkQA
Copy link

SparkQA commented Jul 13, 2017

Test build #79578 has finished for PR 18618 at commit efe113f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Thanks @HyukjinKwon ! I'm still in favor of adding this, partly to match Scala and partly to have API docs for it.

I just had one question: Is there a reason fieldNames should return a tuple in Python, rather than a list? (I just always think of Python lists being the analogue of Scala Arrays.)

@HyukjinKwon
Copy link
Member Author

Either way is fine to me. Let me update this to return a list. I was just thinking struct/row are a tuple-like and the output for this could be as so.

>>> struct.fieldNames()
['f1']
"""
return list(self.names)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note that this list call is required to make a copy to prevent an unexpected behaviour described in the PR description by manipulating this names.

>>> df = spark.range(1)
>>> a = df.schema.fieldNames()
>>> b = df.schema.names
>>> df.schema.names[0] = "a"
>>> a
['id']
>>> b
['a']
>>> a[0] = "aaaa"
>>> a
['aaaa']
>>> b
['a']

@SparkQA
Copy link

SparkQA commented Jul 15, 2017

Test build #79631 has finished for PR 18618 at commit eaa910d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

@jkbradley, does this make sense to you in general?

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good improvement. Would it make sense to make the current undocumented API deprecated and more our internal usage to _name?

@HyukjinKwon
Copy link
Member Author

@holdenk, sure, makes sense but let me just leave a deprecation note for the StructType.names if you are okay with it too (at least I use this a lot in the production codes ...).

.. note:: `names` attribute is deprecated in 2.3. Use `fieldNames` method instead
to get a list of field names.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2017-07-21 10 00 39

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk, would you maybe still prefer to deprecate it? I am willing to follow your decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good enough :)

@SparkQA
Copy link

SparkQA commented Jul 21, 2017

Test build #79814 has finished for PR 18618 at commit 86493be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon
Copy link
Member Author

Would you guys mind if I give a shot for my first merge :)?

@HyukjinKwon
Copy link
Member Author

retest this please

@holdenk
Copy link
Contributor

holdenk commented Jul 29, 2017

oh I'm sorry I just merged this to master, but I'll leave the cherry-pick open if you want to back port it to 2.2.X?

@HyukjinKwon
Copy link
Member Author

Oh, that's fine. I don't want to back port this. Will give a try in another small and safe one. Thank you!

@SparkQA
Copy link

SparkQA commented Jul 29, 2017

Test build #80036 has finished for PR 18618 at commit 86493be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hubot pushed a commit that referenced this pull request Jul 29, 2017
## What changes were proposed in this pull request?

This PR proposes `StructType.fieldNames` that returns a copy of a field name list rather than a (undocumented) `StructType.names`.

There are two points here:

  - API consistency with Scala/Java

  - Provide a safe way to get the field names. Manipulating these might cause unexpected behaviour as below:

    ```python
    from pyspark.sql.types import *

    struct = StructType([StructField("f1", StringType(), True)])
    names = struct.names
    del names[0]
    spark.createDataFrame([{"f1": 1}], struct).show()
    ```

    ```
    ...
    java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 1 fields are required while 0 values are provided.
    	at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:138)
    	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
    	at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
    ...
    ```

## How was this patch tested?

Added tests in `python/pyspark/sql/tests.py`.

Author: hyukjinkwon <[email protected]>

Closes #18618 from HyukjinKwon/SPARK-20090.
@jkbradley
Copy link
Member

@holdenk Thanks for merging it! Just wondering: Why is the "pushed a commit" notification from hubot? Did you use the dev/merge_spark_pr.py script?

@HyukjinKwon
Copy link
Member Author

(Yea, I was wondering too..)

@HyukjinKwon HyukjinKwon deleted the SPARK-20090 branch January 2, 2018 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants