
Conversation

@ueshin ueshin (Member) commented Mar 7, 2023

What changes were proposed in this pull request?

Fixes createDataFrame to autogenerate missing column names.

Why are the changes needed?

Currently, when the number of column names specified to createDataFrame does not match the actual number of columns, it raises an error:

>>> spark.createDataFrame([["a", "b"]], ["col1"])
Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

but it should auto-generate the missing column names.

Does this PR introduce any user-facing change?

It will auto-generate the missing columns:

>>> spark.createDataFrame([["a", "b"]], ["col1"])
DataFrame[col1: string, _2: string]

How was this patch tested?

Enabled the related test.
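The padding behavior the PR describes can be sketched standalone. This is a simplified illustration in plain Python; `pad_column_names` is a hypothetical helper, not the actual `session.py` code:

```python
def pad_column_names(cols, num_data_cols):
    """Pad a short list of column names with auto-generated `_<n>` names,
    where <n> is the 1-based column position."""
    # Coerce any non-string names to strings, as the PR's hunk does.
    cols = [c if isinstance(c, str) else str(c) for c in cols]
    if len(cols) < num_data_cols:
        # Auto-generate names for the remaining positional columns.
        cols = cols + [f"_{i + 1}" for i in range(len(cols), num_data_cols)]
    return cols

print(pad_column_names(["col1"], 2))  # ['col1', '_2']
```

This mirrors the doctest above: the one supplied name is kept and the second column becomes `_2`.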

@zhengruifeng zhengruifeng (Contributor) left a comment

LGTM

@amaliujia (Contributor)

LGTM!

if schema is None:
_cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
elif isinstance(schema, (list, tuple)) and _num_cols < len(data.columns):
_cols = _cols + [f"_{i + 1}" for i in range(_num_cols, len(data.columns))]
@amaliujia amaliujia (Contributor) commented Mar 7, 2023

In fact, I guess we could probably do a bit more: we need to make sure the user-provided column names are not the same as the auto-generated ones.

Though the probability of a collision is small, so maybe this is not a big concern.

@itholic itholic (Contributor) left a comment

LGTM if tests pass!

Seems like mypy check failed:

starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/connect/session.py:238: error: Unsupported operand types for > ("int" and "None")  [operator]
python/pyspark/sql/connect/session.py:238: note: Left operand is of type "Optional[int]"
python/pyspark/sql/connect/session.py:239: error: Unsupported left operand type for + ("None")  [operator]
python/pyspark/sql/connect/session.py:239: note: Left operand is of type "Optional[List[str]]"
python/pyspark/sql/connect/session.py:239: error: Argument 1 to "range" has incompatible type "Optional[int]"; expected "SupportsIndex"  [arg-type]
python/pyspark/sql/connect/session.py:315: error: Unsupported operand types for > ("int" and "None")  [operator]
python/pyspark/sql/connect/session.py:315: note: Left operand is of type "Optional[int]"
Found 4 errors in 1 file (checked 418 source files)
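All four errors stem from doing arithmetic and comparisons directly on `Optional` values; a common fix is to narrow them to concrete types first so mypy accepts the operators. A sketch under assumed names (the real `session.py` code differs):

```python
from typing import List, Optional

def pad(_cols: Optional[List[str]], _num_cols: Optional[int], total: int) -> List[str]:
    # Narrow the Optionals before using < and + so mypy can type-check them.
    cols: List[str] = _cols if _cols is not None else []
    num_cols: int = _num_cols if _num_cols is not None else len(cols)
    if num_cols < total:
        cols = cols + [f"_{i + 1}" for i in range(num_cols, total)]
    return cols

print(pad(["col1"], 1, 2))  # ['col1', '_2']
```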

@HyukjinKwon (Member)

Merged to master and branch-3.4.

HyukjinKwon pushed a commit that referenced this pull request Mar 8, 2023
…ssing column names

### What changes were proposed in this pull request?

Fixes `createDataFrame` to autogenerate missing column names.

### Why are the changes needed?

Currently, when the number of column names specified to `createDataFrame` does not match the actual number of columns, it raises an error:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
```

but it should auto-generate the missing column names.

### Does this PR introduce _any_ user-facing change?

It will auto-generate the missing columns:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
DataFrame[col1: string, _2: string]
```

### How was this patch tested?

Enabled the related test.

Closes #40310 from ueshin/issues/SPARK-42022/columns.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 056ed5d)
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…ssing column names


Closes apache#40310 from ueshin/issues/SPARK-42022/columns.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 056ed5d)
Signed-off-by: Hyukjin Kwon <[email protected]>
