Conversation

@amaliujia
Contributor

What changes were proposed in this pull request?

Currently `columns` is implemented on top of `limit`, which runs a job to fetch data and then reads the schema off the collected result. A more efficient way is to call the `schema` API, which only needs to analyze the plan without collecting any data. This approach should be faster in most cases.
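The difference between the two paths can be sketched with a toy stand-in for the Connect client (illustrative only; `MockClient`, `execute_limit`, and `analyze_schema` are hypothetical names, not real PySpark Connect APIs):

```python
class MockClient:
    """Stand-in for the Spark Connect client; counts what each path costs."""

    def __init__(self, names):
        self.names = names
        self.jobs_run = 0        # jobs that actually fetch data
        self.analyze_calls = 0   # plan-only analysis requests

    def execute_limit(self, n):
        # Old path: schedule a job and move data to the client.
        self.jobs_run += 1
        return self.names

    def analyze_schema(self):
        # New path: the server analyzes the plan; no data is fetched.
        self.analyze_calls += 1
        return self.names


class DataFrame:
    def __init__(self, client):
        self._client = client

    def columns_via_limit(self):
        # Old approach: limit(1) runs a job just to learn the column names.
        return self._client.execute_limit(1)

    @property
    def columns(self):
        # New approach: derive the names from the analyzed schema.
        return self._client.analyze_schema()


client = MockClient(["id", "name"])
df = DataFrame(client)
print(df.columns)        # column names obtained without running a job
print(client.jobs_run)   # 0: no data-fetching job was scheduled
```

The point of the change is visible in the counters: the schema path never increments `jobs_run`, so no executors are touched just to enumerate column names.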

Why are the changes needed?

Efficiency

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

@amaliujia
Contributor Author

R: @zhengruifeng @HyukjinKwon

@AmplabJenkins

Can one of the admins verify this patch?

Contributor


Maybe we don't need to cache `columns`; just return `self.schema.names`?

Contributor Author

@amaliujia Nov 9, 2022


I would prefer not to depend on the underlying API when doing caching...

E.g., what if someday the cache on `schema` is gone but this API is not aware of it?

Basically, do not make assumptions :)

Member


Hmm, why do we need `_cache`? I think we can just remove it.

Contributor Author

@amaliujia Nov 10, 2022


Hmm, if users call this API multiple times, we only need one gRPC call. That should be useful, right?

Something like:

df.columns()   # first call: one gRPC round trip
...
df.columns()   # answered from the cache
...
df.columns()   # answered from the cache
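The memoization being argued for amounts to something like the following sketch (the `_cached_columns` field and `CountingClient` are hypothetical names for illustration, not the PR's actual code):

```python
class CountingClient:
    """Stand-in client that counts schema RPC round trips."""

    def __init__(self, names):
        self.names = names
        self.schema_calls = 0

    def analyze_schema(self):
        self.schema_calls += 1   # pretend gRPC round trip
        return self.names


class DataFrame:
    def __init__(self, client):
        self._client = client
        self._cached_columns = None   # hypothetical cache field

    @property
    def columns(self):
        # Only the first access triggers the RPC; later accesses reuse it.
        if self._cached_columns is None:
            self._cached_columns = self._client.analyze_schema()
        return self._cached_columns


client = CountingClient(["a", "b"])
df = DataFrame(client)
for _ in range(3):
    df.columns
print(client.schema_calls)   # 1: three accesses, one RPC
```

The trade-off raised in the review applies here: this cache knows nothing about any caching `schema` itself might do, so the two layers can drift apart.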

Contributor Author

@amaliujia Nov 10, 2022


If you think this is a bit over-engineered, I can remove it.

Member


For that case, we should probably have a proper cache layer instead of doing this only in `names`.

Member


We can do that for all metadata-ish cases, e.g., `schema` or even collected results.

Contributor Author


I removed the caching in this PR. Once we support enough of the API surface, we can go back and build a cache layer for all metadata-like APIs to save RPC calls.
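One possible shape for such a shared layer (purely a sketch; none of these names exist in the PR or in PySpark): cache metadata-ish results per request kind, so every API that only needs plan analysis shares one store.

```python
class MetadataCache:
    """Hypothetical per-plan cache for metadata-ish RPC results
    (schema, explain string, ...), keyed by request kind."""

    def __init__(self):
        self._store = {}

    def get_or_fetch(self, key, fetch):
        # fetch() is only invoked on a cache miss.
        if key not in self._store:
            self._store[key] = fetch()
        return self._store[key]


rpc_calls = []

def fetch_schema():
    rpc_calls.append("schema")   # pretend gRPC round trip
    return ["id", "name"]

cache = MetadataCache()
for _ in range(3):
    names = cache.get_or_fetch("schema", fetch_schema)

print(names)            # ['id', 'name']
print(len(rpc_calls))   # 1: three lookups, one RPC
```

Centralizing the cache this way avoids the problem raised earlier in the thread, where a cache buried inside one accessor can silently disagree with the API underneath it.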

@amaliujia force-pushed the improve_python_columns branch from be1216f to c6ae61b on November 10, 2022 at 18:25
@zhengruifeng
Contributor

merged into master

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…o avoid data fetching


Closes apache#38546 from amaliujia/improve_python_columns.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
