[SPARK-46540][PYTHON] Respect column names when Python data source read function outputs named Row objects #44531

allisonwang-db · 2023-12-29T02:26:34Z

What changes were proposed in this pull request?

This PR fixes an issue that occurs when the read method of a Python DataSourceReader yields named Row objects.
Currently, the field names in the Row object are ignored and values are assigned to columns by position:

def read(self,...):
    yield Row(a=1, b=2)
    yield Row(b=3, a=2)

The result should be [Row(a=1, b=2), Row(a=2, b=3)] instead of [Row(a=1, b=2), Row(a=3, b=2)].
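The difference between the two results can be sketched in plain Python without Spark. This is only an illustration: `schema`, `rows`, `by_position`, and `by_name` are hypothetical names, and each yielded Row is modeled as a list of (field name, value) pairs in the order the fields were written.

```python
# Hypothetical stand-ins, not PySpark code: each row is a list of
# (field_name, value) pairs in the order the fields were written.
schema = ["a", "b"]
rows = [
    [("a", 1), ("b", 2)],  # models Row(a=1, b=2)
    [("b", 3), ("a", 2)],  # models Row(b=3, a=2)
]


def by_position(row, schema):
    # Ignores field names: the i-th value goes to the i-th schema column.
    return {col: value for col, (_, value) in zip(schema, row)}


def by_name(row, schema):
    # Looks up each schema column by its field name.
    lookup = dict(row)
    return {col: lookup[col] for col in schema}


positional = [by_position(r, schema) for r in rows]
named = [by_name(r, schema) for r in rows]

print(positional)  # second row ends up as a=3, b=2 (the reported bug)
print(named)       # second row ends up as a=2, b=3 (the intended result)
```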

Why are the changes needed?

To fix incorrect behavior.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

allisonwang-db · 2024-01-02T01:47:02Z

cc @HyukjinKwon @ueshin

HyukjinKwon · 2024-01-02T02:00:50Z

python/pyspark/sql/worker/plan_data_source_read.py

-                        pylist[col].append(column_converters[col](result[col]))
+                    # Assign output values by name of the field, not position, if the result is a
+                    # named `Row` object.
+                    if isinstance(result, Row) and hasattr(result, "__fields__"):


Can we match the implementation with the Python worker? See assign_cols_by_name in worker.py.

allisonwang-db · 2024-01-02

This is actually different from assign_cols_by_name, which rearranges Arrow batch columns by the Arrow type names. Here we want to match a single named Row object to the return schema. The only way to tell a named Row(a=1, b=1) from an unnamed Row(1, 2) is to check for __fields__.
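A minimal sketch of the `__fields__` check described above, written without a Spark dependency. `MiniRow` is a hypothetical stand-in that assumes Row is a tuple subclass which stores field names in `__fields__` only when built from keyword arguments. `to_schema_order` is likewise an illustrative helper, not the code in plan_data_source_read.py.

```python
class MiniRow(tuple):
    """Hypothetical stand-in for a Row: a tuple that records field names
    only when it is created with keyword arguments."""

    def __new__(cls, *args, **kwargs):
        if args and kwargs:
            raise ValueError("use positional or keyword arguments, not both")
        if kwargs:
            row = tuple.__new__(cls, list(kwargs.values()))
            # Only named rows carry this attribute.
            row.__fields__ = list(kwargs.keys())
            return row
        return tuple.__new__(cls, args)


def to_schema_order(result, column_names):
    """Return the values of `result` in the order given by `column_names`."""
    if isinstance(result, MiniRow) and hasattr(result, "__fields__"):
        # Named row: match values to schema columns by field name.
        by_name = dict(zip(result.__fields__, result))
        return [by_name[name] for name in column_names]
    # Unnamed row or plain tuple: fall back to positional order.
    return list(result)


print(to_schema_order(MiniRow(b=3, a=2), ["a", "b"]))  # matched by name
print(to_schema_order(MiniRow(3, 2), ["a", "b"]))      # kept in position
```

Checking for the attribute, rather than only the type, keeps unnamed rows and plain tuples on the positional path.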

HyukjinKwon · 2024-01-02T07:20:01Z

Merged to master.

github-actions bot added SQL PYTHON labels Dec 29, 2023

fix

4b44669

allisonwang-db force-pushed the spark-46540-named-rows branch from 400d876 to 4b44669 Compare December 29, 2023 02:59

HyukjinKwon reviewed Jan 2, 2024


HyukjinKwon approved these changes Jan 2, 2024


HyukjinKwon closed this in 48a09c4 Jan 2, 2024
