[SPARK-52807][SDP] Proto changes to support analysis inside Declarative Pipelines query functions #52154
Conversation
cc @hvanhovell for review
```protobuf
message QueryFunctionFailure {
  // Identifier for a dataset within the graph that the query function needed to know the schema
  // of but which had not yet been analyzed itself.
  optional string missing_dependency = 1;
```
Is it possible for us to return all missing dependencies in one go? That should help us be more efficient during initialization.
This is a very good point. This proto basically describes the case where a flow's query function cannot be lazily analyzed by the Spark Connect client, because it triggers some eager analysis (e.g., df.schema / df.isStreaming) and the dependencies referenced by the df have not yet been resolved by the dataflow graph.
It's very possible for the df to contain multiple unresolved dependencies (e.g., a multi-table join). Therefore, instead of storing a string identifier for a single missing dependency, we can directly pass the entire logical plan to the server and let the server filter out which leaf nodes in the plan have not yet been resolved by the dataflow graph.
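For illustration, a minimal sketch of what that proposal could look like; the field name `unresolved_plan` below is an assumption, not the final definition:

```protobuf
message QueryFunctionFailure {
  // The entire unresolved plan produced by the query function. The server can
  // walk its leaf nodes to find every dependency that the dataflow graph has
  // not resolved yet, rather than receiving a single missing dependency.
  optional spark.connect.Relation unresolved_plan = 1;
}
```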
wdyt @sryza
Makes sense to me. My understanding was that, with the current backend implementation, we only get one unresolved dependency at a time (because we basically find out that it's missing when we hit a failure). However:
- I might be wrong about that implementation
- We could make a more sophisticated implementation in the future, and it would be good if the protocol supports that
Based on further thinking and discussion, it seems like we might be able to just leave this out for now: when the server fails to analyze a plan, it knows what flow that plan was associated with, so it can just do the bookkeeping on its side.
There might be some edge situations (e.g. at the beginning) where this means that we end up needing one more query function invocation than we otherwise would, for query functions that do analysis. But we can bias towards simplicity for now and optimize later if we need to.
Sounds good Sandy, let me update the proto to reflect this
sql/connect/common/src/main/protobuf/spark/connect/pipelines.proto (outdated review thread, resolved)
```diff
- // An unresolved relation that defines the dataset's flow.
- optional spark.connect.Relation relation = 4;
+ // [Deprecated] An unresolved relation that defines the dataset's flow.
+ optional spark.connect.Relation relation = 4 [deprecated = true];
```
Since this is unreleased, we can just remove it.
Since SDP still uses this field extensively right now, wdyt about marking the proto field as deprecated in this PR, then following up with a PR that updates the code to use the new proto? After that, we can remove the deprecated field.
```protobuf
message QueryFunctionResult {
  oneof flow_function_evaluation_result {
    // If the query function executed successfully, the unresolved logical plan produced by it.
    spark.connect.Relation analyzed_plan = 1;
```
It won't be analyzed yet, right? It's an unresolved plan that will be analyzed once it gets to the server? Also, I think we should use the word "relation" instead of "plan". I think this was my bad originally, and Dongjoon recently pointed out that "relation" is more accurate.
SG let me update it!
Let me also create a ticket to rename the plan field to relation for the DefineFlow RPC before SDP is released. Oops, looks like it's already done.
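As a rough sketch, the message after that rename might read as follows; the failure branch and the field numbers are assumptions added here for illustration, not the exact definition in the PR:

```protobuf
message QueryFunctionResult {
  oneof flow_function_evaluation_result {
    // If the query function executed successfully, the unresolved relation it
    // produced. It is only analyzed once it reaches the server.
    spark.connect.Relation relation = 1;
    // If the query function could not be evaluated yet, details on why.
    QueryFunctionFailure failure = 2;
  }
}
```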
```protobuf
    // The logical plan that the query function needed to eagerly analyze in order to know
    // the schema / isStreaming / etc of the plan it produced, but could not because it has
    // unresolved dependencies.
    spark.connect.Relation unresolved_dependency_plan = 1;
```
Vs. passing a list of missing dependencies, which would lead to a simpler backend implementation?
```diff
- // An unresolved relation that defines the dataset's flow.
+ // An unresolved relation that defines the dataset's flow. Empty if the query function
+ // that defines the flow cannot be analyzed at the time of flow definition.
```
Instead, the server would do the bookkeeping of the unresolved dependencies from the previously triggered AnalyzePlan RPC. The Spark Connect client can just send None as the relation when defining the flow, and the server would use the tracked unresolved dependencies to register the flow in the DataflowGraph.
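Sketched below under that assumption: DefineFlow would carry an optional query function result rather than a required relation. The field name and number here are illustrative, not the actual definition:

```protobuf
message DefineFlow {
  // ... existing flow metadata fields (graph id, flow name, target, etc.) elided ...

  // Result of evaluating the flow's query function. Left unset when the query
  // function could not be analyzed at flow-definition time; the server then
  // falls back to the unresolved dependencies it tracked from the earlier
  // failed AnalyzePlan request.
  optional QueryFunctionResult result = 4;
}
```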
sryza left a comment:
LGTM!
Will give a little more time for @hvanhovell to take a look if he's interested, will then merge.
[SPARK-52807][SDP] Proto changes to support analysis inside Declarative Pipelines query functions
### What changes were proposed in this pull request?
Introduces a mechanism for lazy execution of Declarative Pipelines query functions. A query function is something like the `mv1` in this example:
```python
@materialized_view
def mv1():
    return spark.table("upstream_table").filter(some_condition)
```
Currently, query functions are always executed eagerly. I.e. the implementation of the `materialized_view` decorator immediately invokes the function that it decorates and then registers the resulting DataFrame with the server.
This PR introduces Spark Connect proto changes that enable executing query functions later on, initiated by the server during graph resolution. After all datasets and flows have been registered with the server, the server can tell the client to execute the query functions for flows that haven't yet successfully been executed. The way this works is that the client initiates an RPC with the server, and then the server streams back responses that indicate to the client when it's time to execute a query function for one of its flows. Relevant changes:
- New `QueryFunctionFailure` message
- New `QueryFunctionResult` message
- Replace relation field in `DefineFlow` with `query_function_result` field
- New `DefineFlowQueryFunctionResult` message
- New `GetQueryFunctionExecutionSignalStream` message
- New `PipelineQueryFunctionExecutionSignal` message
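For concreteness, a hedged sketch of the two signal-stream messages named above. The field names and numbers are assumptions for illustration, not the exact definitions in pipelines.proto:

```protobuf
// Request that opens a server-to-client stream of execution signals for a graph.
message GetQueryFunctionExecutionSignalStream {
  // The dataflow graph whose unresolved flows still need their query functions executed.
  optional string dataflow_graph_id = 1;
}

// One streamed response: tells the client which flow's query function to run next.
message PipelineQueryFunctionExecutionSignal {
  // Name of the flow whose query function the client should execute.
  optional string flow_name = 1;
}
```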
### Why are the changes needed?
There are some situations where we can't resolve the relation immediately at the time we're registering a flow.
E.g. consider this situation:
file 1:
```python
@materialized_view
def mv1():
    data = [("Alice", 10), ("Bob", 15), ("Alice", 5)]
    return spark.createDataFrame(data, ["name", "amount"])
```
file 2:
```python
@materialized_view
def mv2():
    return spark.table("mv1").groupBy("name").agg(sum("amount").alias("total_amount"))
```
Unlike some other transformations, which get analyzed lazily, `groupBy` can trigger an `AnalyzePlan` Spark Connect request immediately. If the query function for `mv2` gets executed before `mv1`, then it will hit an error, because `mv1` doesn't exist yet. `groupBy` isn't the only example here (`df.schema`, etc).
Other examples of these kinds of situations:
- The set of columns for a downstream table is determined from the set of columns in an upstream table.
- When `spark.sql` is used.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
These are proto-only changes. We will follow up with unit tests and E2E tests once we add the implementation.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#52154 from SCHJonathan/jonathan-chang_data/proto-changes.
Authored-by: Yuheng Chang <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>