[SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names #40310
Conversation
zhengruifeng left a comment
LGTM
LGTM!
```py
if schema is None:
    _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
elif isinstance(schema, (list, tuple)) and _num_cols < len(data.columns):
    _cols = _cols + [f"_{i + 1}" for i in range(_num_cols, len(data.columns))]
```
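For illustration, the padding step can be exercised in isolation. This is a plain-Python sketch; the function name `pad_column_names` and the stand-in list/count arguments are hypothetical and not part of the patch:

```python
# Hypothetical stand-alone sketch of the padding in the diff above:
# extend a user-supplied name list with autogenerated "_{i + 1}" names.
def pad_column_names(names, n_columns):
    padded = [str(x) if not isinstance(x, str) else x for x in names]
    if len(padded) < n_columns:
        padded = padded + [f"_{i + 1}" for i in range(len(padded), n_columns)]
    return padded

print(pad_column_names(["col1"], 2))  # ['col1', '_2']
```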
In fact, I guess we could probably do a bit more: make sure the user-provided column names are not the same as the auto-generated ones.
Though the probability of a collision is small, so maybe this is not a big concern.
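For illustration of the collision concern, a user-chosen name that matches the autogenerated pattern would yield duplicate column names (plain Python, hypothetical values):

```python
# Illustration of the collision mentioned above: a user-chosen name that
# matches the autogenerated "_{i + 1}" pattern produces duplicate columns.
user_names = ["_2"]  # the user happens to pick the autogenerated style
n_columns = 2
padded = user_names + [f"_{i + 1}" for i in range(len(user_names), n_columns)]
print(padded)  # ['_2', '_2'] -- two columns share one name
```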
itholic left a comment
LGTM if tests pass!
Seems like the mypy check failed:

```
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/connect/session.py:238: error: Unsupported operand types for > ("int" and "None")  [operator]
python/pyspark/sql/connect/session.py:238: note: Left operand is of type "Optional[int]"
python/pyspark/sql/connect/session.py:239: error: Unsupported left operand type for + ("None")  [operator]
python/pyspark/sql/connect/session.py:239: note: Left operand is of type "Optional[List[str]]"
python/pyspark/sql/connect/session.py:239: error: Argument 1 to "range" has incompatible type "Optional[int]"; expected "SupportsIndex"  [arg-type]
python/pyspark/sql/connect/session.py:315: error: Unsupported operand types for > ("int" and "None")  [operator]
python/pyspark/sql/connect/session.py:315: note: Left operand is of type "Optional[int]"
Found 4 errors in 1 file (checked 418 source files)
```
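These errors come from applying `>`, `+`, and `range` to values mypy still sees as `Optional`; a common remedy is to narrow the types before the arithmetic. A minimal sketch of that pattern (the function `pad` and its signature are hypothetical, not the actual `session.py` code):

```python
from typing import List, Optional

def pad(_cols: Optional[List[str]], _num_cols: Optional[int], total: int) -> List[str]:
    # Narrowing the Optionals first lets mypy accept the comparison,
    # the list concatenation, and the range() call below.
    assert _cols is not None and _num_cols is not None
    if total > _num_cols:
        _cols = _cols + [f"_{i + 1}" for i in range(_num_cols, total)]
    return _cols

print(pad(["col1"], 1, 2))  # ['col1', '_2']
```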
Merged to master and branch-3.4.
…ssing column names

### What changes were proposed in this pull request?

Fixes `createDataFrame` to autogenerate missing column names.

### Why are the changes needed?

Currently, when the number of column names specified to `createDataFrame` does not match the actual number of columns, it raises an error:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
```

but it should auto-generate the missing column names.

### Does this PR introduce _any_ user-facing change?

It will auto-generate the missing column names:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
DataFrame[col1: string, _2: string]
```

### How was this patch tested?

Enabled the related test.

Closes #40310 from ueshin/issues/SPARK-42022/columns.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 056ed5d)
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

Fixes `createDataFrame` to autogenerate missing column names.

Why are the changes needed?

Currently, when the number of column names specified to `createDataFrame` does not match the actual number of columns, it raises an error:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
```

but it should auto-generate the missing column names.

Does this PR introduce any user-facing change?

It will auto-generate the missing column names:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
DataFrame[col1: string, _2: string]
```

How was this patch tested?

Enabled the related test.