Conversation

@zhengruifeng (Contributor) commented Nov 21, 2022

What changes were proposed in this pull request?

1. Make `AnalyzePlan` support multiple specified analysis tasks, so that we can fetch `isLocal`, `schema`, and `semanticHash` together in a single RPC if we want.
2. Implement the following APIs:

  • isLocal
  • isStreaming
  • printSchema
  • semanticHash
  • sameSemantics
  • inputFiles

Why are the changes needed?

For API coverage.

Does this PR introduce any user-facing change?

Yes, new APIs.

How was this patch tested?

Added unit tests.

@zhengruifeng (Contributor Author) commented Nov 22, 2022

I think we can also put catalog methods like listTables/getTable in AnalysisTask

Catalog APIs don't require a plan; it may be better to have a separate RPC.

@amaliujia (Contributor) commented Nov 22, 2022

Yes, that is why I actually wanted to model each Catalog method as its own RPC, since that is closer to the nature of an RPC.

@zhengruifeng force-pushed the connect_df_print_schema branch from 5c112f0 to b7f7cc2 on November 22, 2022.
Contributor:

Do we really want to expose this in Connect? The problem is hash stability: the same client can connect to different Spark versions and get different hashes for the same plan.

Contributor Author:

Will remove it.

Contributor:

Are they equal, or do they produce the same result?

Contributor Author:

One e2e test was added for it.

Contributor Author:

Will remove `semantic_hash` and `same_semantics` since they are developer APIs, although they are also in PySpark.

Contributor:

Honestly, this is a client-side thing. They already have the schema, so they can construct it themselves.

Contributor Author:

We also ask the server to provide the string for `df.show` and `df.explain`; it may be simpler to do the same for `printSchema`.
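The pattern described here (the server renders the string, the client only prints it) can be sketched with mocks; `MockSession`, `_analyze`, and the `tree_string` field are illustrative names for this sketch, not the actual Spark Connect proto or client internals:

```python
class MockAnalyzeResult:
    """Stand-in for an AnalyzePlan response; `tree_string` is a hypothetical field."""
    def __init__(self, tree_string: str):
        self.tree_string = tree_string

class MockSession:
    def _analyze(self, query):
        # The server renders the schema string, just as it already renders
        # the output of df.show and df.explain.
        return MockAnalyzeResult("root\n |-- id: long (nullable = false)")

class DataFrame:
    def __init__(self, session):
        self._session = session

    def print_schema(self) -> None:
        # The client only prints what the server rendered.
        print(self._session._analyze(query=None).tree_string)

df = DataFrame(MockSession())
df.print_schema()
```

The design trade-off is that the client stays thin, at the cost of one extra round trip per call unless the result is cached.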

Contributor:

For just one optional int, that is a weird message.

Contributor:

Is this really useful here?

Contributor Author:

It had some usages anyway.

Contributor:

Again, why an extra message type just to encapsulate an enum?

Contributor Author:

The message for `Explain` was not changed, just moved.

Contributor:

What does this actually mean here? What is the use case for multiple analysis tasks?

Contributor Author:

Multiple analysis tasks cover this case: a user can fetch all attributes in a single RPC and then cache them for reuse.
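The fetch-once-then-reuse idea can be sketched as follows. This is a minimal illustration with invented names (`AnalyzeResult`, `_analyze`, the mock session), not the real Spark Connect client:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnalyzeResult:
    # All attributes returned together by one AnalyzePlan RPC
    # (field names here are illustrative, not the real proto).
    is_local: bool
    is_streaming: bool
    input_files: List[str]

class MockSession:
    def __init__(self):
        self.rpc_calls = 0

    def _analyze(self, plan) -> AnalyzeResult:
        self.rpc_calls += 1
        return AnalyzeResult(is_local=True, is_streaming=False, input_files=[])

class DataFrame:
    def __init__(self, session, plan):
        self._session = session
        self._plan = plan
        self._analysis: Optional[AnalyzeResult] = None  # cached RPC result

    def _analyzed(self) -> AnalyzeResult:
        # The first access pays one round trip; later accesses reuse the result.
        if self._analysis is None:
            self._analysis = self._session._analyze(self._plan)
        return self._analysis

    @property
    def is_local(self) -> bool:
        return self._analyzed().is_local

    @property
    def is_streaming(self) -> bool:
        return self._analyzed().is_streaming

session = MockSession()
df = DataFrame(session, plan=None)
df.is_local, df.is_streaming  # both attributes served by a single RPC
```

Reading both properties triggers only one `_analyze` call in this sketch, which is the reuse the comment describes.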

Contributor:

Why would the request contain so much detail?

Contributor:

Needs documentation.

Contributor:

There is no symmetry to the request, so it shouldn't be in the request. What is the value of this for the customer? Is this part of the Spark public API?

Do we need this for Spark Connect now?

Contributor Author:

The methods added here are all public APIs and are used by users.

Contributor Author:

`printSchema` is frequently used, but I also added the others along the way.

@grundprinzip (Contributor) left a review comment:

I think we need to simplify this change to avoid exposing too many Spark internals.

Contributor:

Could you document the default value?

Contributor Author:

Will do.

Contributor:

This is a developer API in Dataset; do we really need to provide it in Spark Connect?

Contributor Author:

Oh, I did not notice that. I am fine with removing `sameSemantics` and `semanticHash`.

@zhengruifeng force-pushed the connect_df_print_schema branch from 87d5fa2 to 0145a2f on November 24, 2022.
@zhengruifeng changed the title from "[SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/semanticHash/sameSemantics/inputFiles" to "[SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/inputFiles" on Nov 24, 2022.
@grundprinzip (Contributor) left a review comment:

We had an async discussion on this. I request the following changes to the current implementation:

  • Analysis is done one RPC at a time; there is no need for a list of tasks.
  • AnalysisRequest's only configurable parameter is the EXPLAIN_MODE.
  • AnalysisResponse will contain all the information required by other consumers, like schema, is_local, etc.
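The requested shape can be sketched as a pair of plain data types. All message and field names below are guesses for illustration (the actual protobuf definitions live in the Spark Connect proto files, not here):

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class ExplainMode(Enum):
    SIMPLE = 1
    EXTENDED = 2

@dataclass
class AnalyzePlanRequest:
    plan: object
    explain_mode: ExplainMode = ExplainMode.SIMPLE  # the only configurable knob

@dataclass
class AnalyzePlanResponse:
    # A single response carries everything downstream consumers need,
    # so no task list is required on the request side.
    schema: str
    is_local: bool
    is_streaming: bool
    input_files: List[str]
    explain_string: str

req = AnalyzePlanRequest(plan=None)
resp = AnalyzePlanResponse(
    schema="struct<id:bigint>",
    is_local=False,
    is_streaming=False,
    input_files=[],
    explain_string="== Physical Plan ==",
)
```

The request stays minimal while the response is deliberately wide, which trades a few unused bytes per call for a much simpler protocol.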

@zhengruifeng changed the title from "[SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/inputFiles" to "[SPARK-41216][CONNECT][PYTHON] Implement isLocal/isStreaming/printSchema/inputFiles" on Nov 24, 2022.
@zhengruifeng force-pushed the connect_df_print_schema branch from 0145a2f to 72fcb53 on November 24, 2022.
@grundprinzip (Contributor) left a review comment:

Thank you! This looks much cleaner!

@amaliujia (Contributor) left a review comment:

LGTM

Thanks for the great simplification!

@zhengruifeng force-pushed the connect_df_print_schema branch from 72fcb53 to 536265c on November 25, 2022.
@zhengruifeng changed the title from "[SPARK-41216][CONNECT][PYTHON] Implement isLocal/isStreaming/printSchema/inputFiles" to "[SPARK-41216][CONNECT][PYTHON] Implement DataFrame.{isLocal, isStreaming, printSchema, inputFiles}" on Nov 25, 2022.
@zhengruifeng (Contributor Author) commented:

Merged into master, thank you all!

@zhengruifeng deleted the connect_df_print_schema branch on November 25, 2022.
```python
# Excerpt under review (client-side isLocal implementation):
if self._plan is None:
    raise Exception("Cannot analyze on empty plan.")
query = self._plan.to_proto(self._session)
return self._session._analyze(query).is_local
```
Contributor:

Are we going to cache the analyze result later?

Contributor Author:

#38742 (comment)

I think we will do the caching in the near future.

@amaliujia (Contributor) commented Nov 25, 2022

We can literally cache everything for each DataFrame since it is immutable. But I guess we need a design discussion to clarify the details of how and when.

Contributor:

Another interesting question is whether we want to do caching on the server side.
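Client-side caching, by contrast, is straightforward precisely because a DataFrame is immutable. A sketch using `functools.cached_property` and a hypothetical `_analyze` call (the mock session stands in for the real client internals):

```python
from functools import cached_property

class MockSession:
    def __init__(self):
        self.rpc_calls = 0

    def _analyze(self, plan):
        self.rpc_calls += 1
        return {"is_local": True, "is_streaming": False}

class DataFrame:
    def __init__(self, session, plan):
        self._session = session
        self._plan = plan

    @cached_property
    def _analysis(self):
        # Memoizing is safe because the plan behind a DataFrame never changes.
        return self._session._analyze(self._plan)

    @property
    def is_local(self):
        return self._analysis["is_local"]

session = MockSession()
df = DataFrame(session, plan=None)
df.is_local  # triggers the RPC
df.is_local  # served from the cache
```

Server-side caching would instead key results by a plan fingerprint across clients, which raises the invalidation and versioning questions discussed above.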

zhengruifeng added a commit that referenced this pull request Nov 28, 2022
…eSemantics`, `_repr_html_ `

### What changes were proposed in this pull request?
Disable `semanticHash`, `sameSemantics`, and `_repr_html_`

### Why are the changes needed?
1. Disable `semanticHash` and `sameSemantics` according to the discussions in #38742.
2. Disable `_repr_html_` since it requires [eager mode](https://github.com/apache/spark/blob/40a9a6ef5b89f0c3d19db4a43b8a73decaa173c3/python/pyspark/sql/dataframe.py#L878); otherwise it just returns `None`:

```
In [2]: spark.range(start=0, end=10)._repr_html_() is None
Out[2]: True

```

### Does this PR introduce _any_ user-facing change?
For these three methods, throw `NotImplementedError`.

### How was this patch tested?
Added test cases.

Closes #38815 from zhengruifeng/connect_disable_repr_html_sematic.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
…ming, printSchema, inputFiles}`

### What changes were proposed in this pull request?
~~1. Make `AnalyzePlan` support specified multiple analysis tasks, that is, we can get `isLocal`, `schema`, `semanticHash` together in single RPC if we want.~~
2. Implement the following APIs:

- isLocal
- isStreaming
- printSchema
- ~~semanticHash~~
- ~~sameSemantics~~
- inputFiles

### Why are the changes needed?
For API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, new APIs.

### How was this patch tested?
Added unit tests.

Closes apache#38742 from zhengruifeng/connect_df_print_schema.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>