@allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR fixes an issue that occurs when the read method of a Python DataSourceReader yields named Row objects.
Currently, the field names in the Row object are ignored and values are assigned by position:

def read(self, ...):
    yield Row(a=1, b=2)
    yield Row(b=3, a=2)

The result should be [Row(a=1, b=2), Row(a=2, b=3)], not [Row(a=1, b=2), Row(a=3, b=2)].
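
To make the difference concrete, here is a minimal sketch that uses plain dicts as stand-ins for named Row objects. This is an assumption for illustration only: it does not use PySpark, and the schema `["a", "b"]` and the helper names `by_position` and `by_name` are hypothetical. It contrasts assigning values by position with assigning them by field name.

```python
# Hypothetical output schema field names; not taken from the PR.
schema_fields = ["a", "b"]

# Dicts stand in for named Row objects; insertion order mirrors keyword order.
yielded = [{"a": 1, "b": 2}, {"b": 3, "a": 2}]


def by_position(record):
    # Behavior before the fix: ignore names, take values in the order given.
    return dict(zip(schema_fields, record.values()))


def by_name(record):
    # Behavior after the fix: look up each schema field by its name.
    return {name: record[name] for name in schema_fields}


positional = [by_position(r) for r in yielded]
named = [by_name(r) for r in yielded]
print(positional)
print(named)
```

In the positional version, the second record ends up with `a=3, b=2` because the value `3` was written first; the by-name version recovers `a=2, b=3`.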

Why are the changes needed?

To fix incorrect behavior.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

@allisonwang-db
Contributor Author

cc @HyukjinKwon @ueshin

pylist[col].append(column_converters[col](result[col]))
# Assign output values by name of the field, not position, if the result is a
# named `Row` object.
if isinstance(result, Row) and hasattr(result, "__fields__"):
Member

Can we match the implementation in the Python worker? See assign_cols_by_name in worker.py.

Contributor Author

This is actually different from assign_cols_by_name, which rearranges Arrow batch columns by the Arrow type names. Here we want to match a single named Row object to the return schema. The only way to tell a named Row(a=1, b=1) from an unnamed Row(1, 2) is to check for __fields__.
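
A rough sketch of that check, for readers unfamiliar with it. The `Row` class below is a simplified stand-in written for this example (an assumption, not the real pyspark.sql.Row), and `values_in_schema_order` is a hypothetical helper, not the code in this PR. It only illustrates the idea that a Row built from keyword arguments carries a `__fields__` attribute while a positional one does not.

```python
class Row(tuple):
    """Simplified stand-in for pyspark.sql.Row (illustration only)."""

    def __new__(cls, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Cannot mix positional and keyword arguments")
        if kwargs:
            # Named row: remember the field names alongside the values.
            row = tuple.__new__(cls, list(kwargs.values()))
            row.__fields__ = list(kwargs.keys())
            return row
        # Unnamed row: plain positional values, no __fields__ attribute.
        return tuple.__new__(cls, args)


def values_in_schema_order(result, schema_fields):
    """Hypothetical helper: order one result's values to match the schema."""
    if isinstance(result, Row) and hasattr(result, "__fields__"):
        # Named Row: match each value to the schema by field name.
        lookup = dict(zip(result.__fields__, result))
        return [lookup[name] for name in schema_fields]
    # Unnamed Row or plain tuple: keep the positional order.
    return list(result)


named_values = values_in_schema_order(Row(b=3, a=2), ["a", "b"])
unnamed_values = values_in_schema_order(Row(1, 2), ["a", "b"])
print(named_values, unnamed_values)
```

Falling back to positional order when `__fields__` is absent keeps unnamed rows and plain tuples working as before.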

@HyukjinKwon
Member

Merged to master.
