[SPARK-41216][CONNECT][PYTHON] Implement DataFrame.{isLocal, isStreaming, printSchema, inputFiles}
#38742
Conversation
I think we can also put catalog methods like listTables/getTable in AnalysisTask
Catalog APIs don't require a plan; it may be better to have a separate RPC.
Yes, that is why I actually wanted to model each of the Catalog methods as an RPC, because that is closer to the nature of RPC.
Force-pushed from 5c112f0 to b7f7cc2
Do we really want to expose this in Connect? The problem is hash stability: the same client can connect to different Spark versions and get different hashes for the same plan.
will remove it
Are they equal, or do they produce the same result?
one e2e test was added for it
Will remove `semantic_hash` and `same_semantics` since they are developer APIs, although they were also in PySpark.
Honestly this is a client-side thing. They already have the schema, so they can construct it themselves.
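For illustration, a minimal client-side sketch of that idea, assuming flat (non-nested) fields for brevity; the `tree_string` helper is our own name here, not an API from this PR:

```python
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Hypothetical helper: render a schema tree from a StructType the client
# already holds, with no extra round trip to the server. Nested types are
# not handled, to keep the sketch short.
def tree_string(schema: StructType) -> str:
    lines = ["root"]
    for field in schema.fields:
        lines.append(
            f" |-- {field.name}: {field.dataType.simpleString()}"
            f" (nullable = {str(field.nullable).lower()})"
        )
    return "\n".join(lines)

schema = StructType([
    StructField("id", LongType(), False),
    StructField("name", StringType(), True),
])
print(tree_string(schema))
# root
#  |-- id: bigint (nullable = false)
#  |-- name: string (nullable = true)
```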
We also ask the server to provide the string for df.show and df.explain; maybe it's simpler to also do this for printSchema.
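A rough sketch of that server-string approach, mirroring how df.explain is handled in this change; treating `tree_string` as the response field name is an assumption of this sketch:

```python
# Hypothetical method body: ask the server to render the schema tree as
# part of analysis and just print the returned string.
def printSchema(self) -> None:
    if self._plan is None:
        raise Exception("Cannot analyze on empty plan.")
    query = self._plan.to_proto(self._session)
    print(self._session._analyze(query).tree_string)
```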
For just having one optional int, that is a weird message.
is this really useful here?
it had some usages anyway
Again, why an extra message type just to encapsulate an enum?
the message for Explain was not changed, just moved
What does this actually mean here? What is the use case for multiple analysis tasks?
Multiple analysis tasks are for this case: a user can get all attributes in a single RPC and then cache them for reuse.
Why would the request contain so much detail?
doc
There is no symmetry to the request, so it shouldn't be in the request. What is the value of this for the customer? Is this part of the Spark public API?
Do we need this for Spark Connect now?
The methods added here are all public APIs, and are used by users.
printSchema is frequently used, but I also added the others along the way.
grundprinzip
left a comment
I think we need to simplify this change to avoid exposing too many Spark internals.
Could you document what the default value is?
will do
This is a developer API in Dataset; do we really need to provide it in Spark Connect?
Oh, I did not notice that. I am fine with removing sameSemantics and semanticHash.
Force-pushed from 87d5fa2 to 0145a2f
grundprinzip
left a comment
We had an async discussion on this. I request the following changes to the current implementation:
- Analysis is done one RPC at a time; there is no need for a list of tasks.
- AnalysisRequest's only configurable parameter is the EXPLAIN_MODE.
- AnalysisResponse will contain all the information required by other consumers, like `schema`, `is_local`, etc. (see the sketch below).
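A sketch of what consuming that single-RPC shape could look like on the client; this is an illustration only, and the exact response field names below are assumptions:

```python
# One analyze round trip returns everything the client may later need,
# so each DataFrame property becomes a plain field access on the response.
query = self._plan.to_proto(self._session)
resp = self._session._analyze(query)

resp.schema        # the resolved schema of the plan
resp.is_local      # whether collect()/take() can run locally
resp.is_streaming  # whether the plan reads a streaming source
resp.input_files   # best-effort list of files backing the plan
```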
Force-pushed from 0145a2f to 72fcb53
grundprinzip
left a comment
Thank you! This looks much cleaner!
amaliujia
left a comment
LGTM
Thanks for the great simplification!
Force-pushed from 72fcb53 to 536265c
DataFrame.{isLocal, isStreaming, printSchema, inputFiles}
Merged into master, thank you all!
```python
if self._plan is None:
    raise Exception("Cannot analyze on empty plan.")
query = self._plan.to_proto(self._session)
return self._session._analyze(query).is_local
```
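For context, a usage sketch of the APIs this PR implements, against a hypothetical Spark Connect session `spark` with an illustrative input path:

```python
# Illustrative only: the path and the expected results depend on the data.
df = spark.read.parquet("/tmp/example")

df.printSchema()   # prints the server-rendered schema tree
df.isLocal()       # False here: collect() would launch a job
df.isStreaming()   # False: parquet is a batch source
df.inputFiles()    # best-effort list of files backing the DataFrame
```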
are we going to cache the analyze result later?
I think we will do the caching in the near future.
We literally can cache everything for each DataFrame since it is immutable. But I guess we need a design/discussion to clarify the details of how and when.
There is another interesting question: do we want to do caching on the server side?
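For example, a rough client-side memoization sketch, assuming plans are immutable; the `_cached_analysis` attribute and `_analyze_once` helper are hypothetical names, not part of this PR:

```python
# Hypothetical caching inside DataFrame: run the analyze RPC at most once
# per instance and reuse the response for every metadata accessor.
# Assumes `self._cached_analysis = None` is set in __init__.
def _analyze_once(self):
    if self._cached_analysis is None:
        if self._plan is None:
            raise Exception("Cannot analyze on empty plan.")
        query = self._plan.to_proto(self._session)
        self._cached_analysis = self._session._analyze(query)
    return self._cached_analysis

def isLocal(self) -> bool:
    return self._analyze_once().is_local

def isStreaming(self) -> bool:
    return self._analyze_once().is_streaming
```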
…eSemantics`, `_repr_html_`

### What changes were proposed in this pull request?
Disable `semanticHash`, `sameSemantics`, `_repr_html_`.

### Why are the changes needed?
1, Disable `semanticHash`, `sameSemantics` according to the discussions in #38742.
2, Disable `_repr_html_` since it requires [eager mode](https://github.com/apache/spark/blob/40a9a6ef5b89f0c3d19db4a43b8a73decaa173c3/python/pyspark/sql/dataframe.py#L878); otherwise, it just returns `None`:
```
In [2]: spark.range(start=0, end=10)._repr_html_() is None
Out[2]: True
```

### Does this PR introduce _any_ user-facing change?
For these three methods, throw `NotImplementedError`.

### How was this patch tested?
added test cases

Closes #38815 from zhengruifeng/connect_disable_repr_html_sematic.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…ming, printSchema, inputFiles}`

### What changes were proposed in this pull request?
~~1, Make `AnalyzePlan` support specified multiple analysis tasks, that is, we can get `isLocal`, `schema`, `semanticHash` together in a single RPC if we want.~~
2, Implement the following APIs:
- isLocal
- isStreaming
- printSchema
- ~~semanticHash~~
- ~~sameSemantics~~
- inputFiles

### Why are the changes needed?
for API coverage

### Does this PR introduce _any_ user-facing change?
yes, new APIs

### How was this patch tested?
added UTs

Closes apache#38742 from zhengruifeng/connect_df_print_schema.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
What changes were proposed in this pull request?
~~1, Make `AnalyzePlan` support specified multiple analysis tasks, that is, we can get `isLocal`, `schema`, `semanticHash` together in a single RPC if we want.~~

2, Implement the following APIs:
- isLocal
- isStreaming
- printSchema
- ~~semanticHash~~
- ~~sameSemantics~~
- inputFiles

Why are the changes needed?
for API coverage
Does this PR introduce any user-facing change?
yes, new APIs
How was this patch tested?
added UTs