[SPARK-41122][CONNECT] Explain API can support different modes#38638
[SPARK-41122][CONNECT] Explain API can support different modes#38638amaliujia wants to merge 2 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
should we group the fields in Request?
I notice that there are more methods seems should be placed in AnalyzePlan:
- isLocal
- isStreaming
- inputFiles
- semanticHash
- catalog APIs (not sure about this)
There was a problem hiding this comment.
I actually feel like we should start to add more RPC methods. Maybe we should flatten AnalyzePlan to more methods so we don't overload protos....
There was a problem hiding this comment.
df.show can be a proto message, why not df.explain?
There was a problem hiding this comment.
hmmm yeah I guess we can go to the flatten approach which extracts sutff into message.
There was a problem hiding this comment.
I make Explain as a message now. However given how we split the rpc call, this message needs to stay in base.proto:
// Main interface for the SparkConnect service.
service SparkConnectService {
// Executes a request that contains the query and returns a stream of [[Response]].
rpc ExecutePlan(Request) returns (stream Response) {}
// Analyzes a query and returns a [[AnalyzeResponse]] containing metadata about the query.
rpc AnalyzePlan(Request) returns (AnalyzeResponse) {}
}
There might be better way to organize it.
There was a problem hiding this comment.
df.show can be a proto message, why not df.explain?
I think we can put all functions into relations.proto, but one difference is that df.show will trigger a job, while df.explain only analyzes, maybe better to move it to AnalyzePlan
4905a0a to
f514fc8
Compare
There was a problem hiding this comment.
I think this is probably the right time to split of the Request for Analyze and Execute because the explain message should probably not be part of the overall request.
In the same way this actually, would follow the proto guidelines for the naming of the request objects to be named after their RPC method name.
|
Can one of the admins verify this patch? |
c3c0b94 to
3718a3b
Compare
|
@HyukjinKwon @cloud-fan @grundprinzip @zhengruifeng please take another look. |
3718a3b to
ab96e1a
Compare
|
Merged to master. |
… Python ### What changes were proposed in this pull request? Fix out of sync generated files for Python. This happens on a rare case for protobuf version change. #38693 downgraded protobuf versions. There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](#38693) was based on old code before #38638 so the protobuf generates based on stale code which leads to stale generated files. The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo. ### Why are the changes needed? Fix out of sync generated files for Python. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #38718 from amaliujia/fix_out_of_sync_proto. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? 1. This PR extends `Explain` API that can support `formatted`, `cost`, `codegen`, `extended` and `simple` modes. 2. This PR also refactors to split `Request` into `ExecutePlanRequest` and `AnalyzePlanRequest` for corresponding RPC endpoints (same for the response). ### Why are the changes needed? Improve Debuggability ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes apache#38638 from amaliujia/explain_extended_not. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… Python ### What changes were proposed in this pull request? Fix out of sync generated files for Python. This happens on a rare case for protobuf version change. apache#38693 downgraded protobuf versions. There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](apache#38693) was based on old code before apache#38638 so the protobuf generates based on stale code which leads to stale generated files. The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo. ### Why are the changes needed? Fix out of sync generated files for Python. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes apache#38718 from amaliujia/fix_out_of_sync_proto. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? 1. This PR extends `Explain` API that can support `formatted`, `cost`, `codegen`, `extended` and `simple` modes. 2. This PR also refactors to split `Request` into `ExecutePlanRequest` and `AnalyzePlanRequest` for corresponding RPC endpoints (same for the response). ### Why are the changes needed? Improve Debuggability ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes apache#38638 from amaliujia/explain_extended_not. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… Python ### What changes were proposed in this pull request? Fix out of sync generated files for Python. This happens on a rare case for protobuf version change. apache#38693 downgraded protobuf versions. There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](apache#38693) was based on old code before apache#38638 so the protobuf generates based on stale code which leads to stale generated files. The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo. ### Why are the changes needed? Fix out of sync generated files for Python. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes apache#38718 from amaliujia/fix_out_of_sync_proto. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? 1. This PR extends `Explain` API that can support `formatted`, `cost`, `codegen`, `extended` and `simple` modes. 2. This PR also refactors to split `Request` into `ExecutePlanRequest` and `AnalyzePlanRequest` for corresponding RPC endpoints (same for the response). ### Why are the changes needed? Improve Debuggability ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes apache#38638 from amaliujia/explain_extended_not. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… Python ### What changes were proposed in this pull request? Fix out of sync generated files for Python. This happens on a rare case for protobuf version change. apache#38693 downgraded protobuf versions. There were something not generated before but with the protobuf version downgraded that was generated (and this is why there was no merge conflict). However [the downgrading PR](apache#38693) was based on old code before apache#38638 so the protobuf generates based on stale code which leads to stale generated files. The way to better avoid this is when upon such change, it should lock the repo (partially on some directory to reduce impact), do the work, merge, and enforce pending PR to rebase. However this is not feasible (or too heavy) for concurrent development on Spark repo. ### Why are the changes needed? Fix out of sync generated files for Python. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes apache#38718 from amaliujia/fix_out_of_sync_proto. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
ExplainAPI that can supportformatted,cost,codegen,extendedandsimplemodes.RequestintoExecutePlanRequestandAnalyzePlanRequestfor corresponding RPC endpoints (same for the response).Why are the changes needed?
Improve Debuggability
Does this PR introduce any user-facing change?
NO
How was this patch tested?
UT