Skip to content

[SPARK-41122][CONNECT] Explain API can support different modes#38638

Closed
amaliujia wants to merge 2 commits intoapache:masterfrom
amaliujia:explain_extended_not
Closed

[SPARK-41122][CONNECT] Explain API can support different modes#38638
amaliujia wants to merge 2 commits intoapache:masterfrom
amaliujia:explain_extended_not

Conversation

@amaliujia
Copy link
Contributor

@amaliujia amaliujia commented Nov 13, 2022

What changes were proposed in this pull request?

  1. This PR extends Explain API that can support formatted, cost, codegen, extended and simple modes.
  2. This PR also refactors to split Request into ExecutePlanRequest and AnalyzePlanRequest for corresponding RPC endpoints (same for the response).

Why are the changes needed?

Improve Debuggability

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

@amaliujia
Copy link
Contributor Author

@cloud-fan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we group the fields in Request?

I notice that there are more methods seems should be placed in AnalyzePlan:

  • isLocal
  • isStreaming
  • inputFiles
  • semanticHash
  • catalog APIs (not sure about this)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also cc @HyukjinKwon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually feel like we should start to add more RPC methods. Maybe we should flatten AnalyzePlan to more methods so we don't overload protos....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df.show can be a proto message, why not df.explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm yeah I guess we can go to the flatten approach which extracts sutff into message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I make Explain as a message now. However given how we split the rpc call, this message needs to stay in base.proto:

// Main interface for the SparkConnect service.
service SparkConnectService {

  // Executes a request that contains the query and returns a stream of [[Response]].
  rpc ExecutePlan(Request) returns (stream Response) {}

  // Analyzes a query and returns a [[AnalyzeResponse]] containing metadata about the query.
  rpc AnalyzePlan(Request) returns (AnalyzeResponse) {}
}

There might be better way to organize it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df.show can be a proto message, why not df.explain?

I think we can put all functions into relations.proto, but one difference is that df.show will trigger a job, while df.explain only analyzes, maybe better to move it to AnalyzePlan

@amaliujia amaliujia force-pushed the explain_extended_not branch from 4905a0a to f514fc8 Compare November 14, 2022 19:33
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably the right time to split of the Request for Analyze and Execute because the explain message should probably not be part of the overall request.

In the same way this actually, would follow the proto guidelines for the naming of the request objects to be named after their RPC method name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@amaliujia amaliujia force-pushed the explain_extended_not branch from c3c0b94 to 3718a3b Compare November 17, 2022 23:53
@amaliujia
Copy link
Contributor Author

@HyukjinKwon @cloud-fan @grundprinzip @zhengruifeng please take another look.

@amaliujia amaliujia force-pushed the explain_extended_not branch from 3718a3b to ab96e1a Compare November 18, 2022 00:53
@HyukjinKwon
Copy link
Member

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Nov 19, 2022
… Python

### What changes were proposed in this pull request?

Fix out of sync generated files for Python.

This happens on a rare case for protobuf version change. #38693 downgraded protobuf versions.

There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](#38693) was based on old code before #38638 so the protobuf generates based on stale code which leads to stale generated files.

The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo.

### Why are the changes needed?

Fix out of sync generated files for Python.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes #38718 from amaliujia/fix_out_of_sync_proto.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

1. This PR extends `Explain` API that can support `formatted`, `cost`, `codegen`, `extended` and `simple` modes.
2. This PR also refactors to split `Request` into `ExecutePlanRequest` and `AnalyzePlanRequest` for corresponding RPC endpoints (same for the response).

### Why are the changes needed?

Improve Debuggability

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes apache#38638 from amaliujia/explain_extended_not.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
… Python

### What changes were proposed in this pull request?

Fix out of sync generated files for Python.

This happens on a rare case for protobuf version change. apache#38693 downgraded protobuf versions.

There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](apache#38693) was based on old code before apache#38638 so the protobuf generates based on stale code which leads to stale generated files.

The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo.

### Why are the changes needed?

Fix out of sync generated files for Python.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes apache#38718 from amaliujia/fix_out_of_sync_proto.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
### What changes were proposed in this pull request?

1. This PR extends `Explain` API that can support `formatted`, `cost`, `codegen`, `extended` and `simple` modes.
2. This PR also refactors to split `Request` into `ExecutePlanRequest` and `AnalyzePlanRequest` for corresponding RPC endpoints (same for the response).

### Why are the changes needed?

Improve Debuggability

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes apache#38638 from amaliujia/explain_extended_not.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
… Python

### What changes were proposed in this pull request?

Fix out of sync generated files for Python.

This happens on a rare case for protobuf version change. apache#38693 downgraded protobuf versions.

There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](apache#38693) was based on old code before apache#38638 so the protobuf generates based on stale code which leads to stale generated files.

The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo.

### Why are the changes needed?

Fix out of sync generated files for Python.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes apache#38718 from amaliujia/fix_out_of_sync_proto.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
### What changes were proposed in this pull request?

1. This PR extends `Explain` API that can support `formatted`, `cost`, `codegen`, `extended` and `simple` modes.
2. This PR also refactors to split `Request` into `ExecutePlanRequest` and `AnalyzePlanRequest` for corresponding RPC endpoints (same for the response).

### Why are the changes needed?

Improve Debuggability

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes apache#38638 from amaliujia/explain_extended_not.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
… Python

### What changes were proposed in this pull request?

Fix out of sync generated files for Python.

This happens on a rare case for protobuf version change. apache#38693 downgraded protobuf versions.

There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](apache#38693) was based on old code before apache#38638 so the protobuf generates based on stale code which leads to stale generated files.

The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo.

### Why are the changes needed?

Fix out of sync generated files for Python.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

UT

Closes apache#38718 from amaliujia/fix_out_of_sync_proto.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants