
Conversation

@ueshin ueshin (Member) commented Mar 7, 2023

What changes were proposed in this pull request?

Fixes createDataFrame to autogenerate missing column names.

Why are the changes needed?

Currently, when the number of column names specified to createDataFrame does not match the actual number of columns, it raises an error:

>>> spark.createDataFrame([["a", "b"]], ["col1"])
Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

but it should auto-generate the missing column names.

Does this PR introduce any user-facing change?

It will auto-generate the missing columns:

>>> spark.createDataFrame([["a", "b"]], ["col1"])
DataFrame[col1: string, _2: string]

How was this patch tested?

Enabled the related test.
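The padding behavior the PR describes can be sketched standalone. This is a simplified illustration in plain Python; `pad_column_names` is a hypothetical helper, not the actual `session.py` code:

```python
def pad_column_names(cols, num_data_cols):
    """Pad a short list of column names with auto-generated `_<n>` names,
    where <n> is the 1-based column position."""
    # Coerce any non-string names to strings, as the PR's hunk does.
    cols = [c if isinstance(c, str) else str(c) for c in cols]
    if len(cols) < num_data_cols:
        # Auto-generate names for the remaining positional columns.
        cols = cols + [f"_{i + 1}" for i in range(len(cols), num_data_cols)]
    return cols

print(pad_column_names(["col1"], 2))  # ['col1', '_2']
```

This mirrors the doctest above: the one supplied name is kept and the second column becomes `_2`.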

@zhengruifeng zhengruifeng (Contributor) left a comment

LGTM

@amaliujia (Contributor)

LGTM!

if schema is None:
_cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
elif isinstance(schema, (list, tuple)) and _num_cols < len(data.columns):
_cols = _cols + [f"_{i + 1}" for i in range(_num_cols, len(data.columns))]
@amaliujia amaliujia (Contributor) commented Mar 7, 2023

In fact, I guess we could probably do a bit more: we need to make sure the user-provided column names are not the same as the auto-generated ones.

Though the probability of a collision is small, so maybe this is not a big concern.

@itholic itholic (Contributor) left a comment

LGTM if tests pass!

Seems like mypy check failed:

starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/connect/session.py:238: error: Unsupported operand types for > ("int" and "None")  [operator]
python/pyspark/sql/connect/session.py:238: note: Left operand is of type "Optional[int]"
python/pyspark/sql/connect/session.py:239: error: Unsupported left operand type for + ("None")  [operator]
python/pyspark/sql/connect/session.py:239: note: Left operand is of type "Optional[List[str]]"
python/pyspark/sql/connect/session.py:239: error: Argument 1 to "range" has incompatible type "Optional[int]"; expected "SupportsIndex"  [arg-type]
python/pyspark/sql/connect/session.py:315: error: Unsupported operand types for > ("int" and "None")  [operator]
python/pyspark/sql/connect/session.py:315: note: Left operand is of type "Optional[int]"
Found 4 errors in 1 file (checked 418 source files)
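All four errors stem from doing arithmetic and comparisons directly on `Optional` values; a common fix is to narrow them to concrete types first so mypy accepts the operators. A sketch under assumed names (the real `session.py` code differs):

```python
from typing import List, Optional

def pad(_cols: Optional[List[str]], _num_cols: Optional[int], total: int) -> List[str]:
    # Narrow the Optionals before using < and + so mypy can type-check them.
    cols: List[str] = _cols if _cols is not None else []
    num_cols: int = _num_cols if _num_cols is not None else len(cols)
    if num_cols < total:
        cols = cols + [f"_{i + 1}" for i in range(num_cols, total)]
    return cols

print(pad(["col1"], 1, 2))  # ['col1', '_2']
```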

@HyukjinKwon (Member)

Merged to master and branch-3.4.

HyukjinKwon pushed a commit that referenced this pull request Mar 8, 2023
…ssing column names

### What changes were proposed in this pull request?

Fixes `createDataFrame` to autogenerate missing column names.

### Why are the changes needed?

Currently, when the number of column names specified to `createDataFrame` does not match the actual number of columns, it raises an error:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
```

but it should auto-generate the missing column names.

### Does this PR introduce _any_ user-facing change?

It will auto-generate the missing columns:

```py
>>> spark.createDataFrame([["a", "b"]], ["col1"])
DataFrame[col1: string, _2: string]
```

### How was this patch tested?

Enabled the related test.

Closes #40310 from ueshin/issues/SPARK-42022/columns.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 056ed5d)
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…ssing column names


Closes apache#40310 from ueshin/issues/SPARK-42022/columns.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 056ed5d)
Signed-off-by: Hyukjin Kwon <[email protected]>
