[SPARK-42941][SS][CONNECT][1/2] StreamingQueryListener - Event Serde in JSON format #41540
Conversation
PS. ChatGPT is especially helpful in doing such boilerplate jobs :P
```python
inputRowsPerSecond=j["inputRowsPerSecond"],
processedRowsPerSecond=j["processedRowsPerSecond"],
observedMetrics={
    k: Row(*row_dict.keys())(*row_dict.values())  # Assume no nested rows
```
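A minimal self-contained sketch of what the `Row(*row_dict.keys())(*row_dict.values())` idiom above does (`row_dict` here is a hypothetical deserialized dict, not from the PR): `Row(*keys)` builds a Row *class* with those field names, and calling it with the values instantiates it.

```python
from pyspark.sql import Row

row_dict = {"row_count": 10, "max_value": 99}  # hypothetical deserialized metrics
row = Row(*row_dict.keys())(*row_dict.values())
print(row)  # Row(row_count=10, max_value=99)
```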
Can there be nested rows for this field?
Checking the original PR #26127: the intended use case of the observe method is to construct this Row by aggregating on some fields, as sketched below. I don't think we need to handle nested rows here, but I'm open to discussion.
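For context, a minimal sketch of that observe usage (assuming a streaming DataFrame `df`; the metric names are illustrative, not from the PR). Each aggregate becomes one field of a flat Row delivered to the listener via `progress.observedMetrics["my_metrics"]`:

```python
from pyspark.sql.functions import count, max

# Attach named observed metrics to a (streaming) DataFrame. The aggregates
# are computed per epoch and surfaced as a flat Row in query progress events.
observed_df = df.observe(
    "my_metrics",
    count("*").alias("row_count"),
    max("value").alias("max_value"),
)
```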
@rangadi @HyukjinKwon Can you take a look? Thanks!

cc @HeartSaVioR too
HyukjinKwon left a comment:
Seems fine from a cursory look.
rangadi left a comment:
Made a few comments.
```python
@classmethod
def fromJson(cls, j: Dict[str, Any]) -> "QueryStartedEvent":
```
What is the context where these are used? Thanks for the detailed description; you mentioned this is for 'step 3'. Could you point to any WIP code that uses this?
Sure! Here is the pointer: https://github.com/apache/spark/pull/41096/files#r1230025798

Note that it's draft code, so it's really messy. Basically, since we register a Scala listener but run Python code inside it, we need a way to send the events to the Python process, and JSON is a safe way to do that.
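To make the flow concrete, a minimal sketch of the round trip (the payload below is hypothetical, mimicking what the Scala side's toJson would emit for a QueryStartedEvent):

```python
import json
import uuid

# Hypothetical JSON produced by the Scala listener's toJson and handed
# to the Python process over the wire.
event_json = json.dumps(
    {
        "id": str(uuid.uuid4()),
        "runId": str(uuid.uuid4()),
        "name": None,
        "timestamp": "2023-06-09T10:00:00.000Z",
    }
)

# The Python side parses the string and rebuilds the event object:
j = json.loads(event_json)
# event = QueryStartedEvent.fromJson(j)  # the classmethod added in this PR
```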
I see. Scala does toJson and the Python code does fromJson.

Should this be under the connect directory, since it is Connect-specific code?
I think we still need to let users use `import pyspark.sql.streaming.Query<xxx>Event` in Connect. If we put the code in the connect folder, they might need to do `import pyspark.sql.connect.streaming.Query<xxx>Event` instead.

One way I can think of to do this is to refactor the code and make the event classes abstract here:

```python
import uuid
from abc import ABC, abstractmethod

class QueryStartedEvent(ABC):
    @property
    @abstractmethod
    def id(self) -> uuid.UUID:
        ...
```
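A hypothetical sketch of how a concrete Connect-side subclass could then satisfy that interface (the class name and constructor are illustrative, not from the PR):

```python
class ConnectQueryStartedEvent(QueryStartedEvent):
    def __init__(self, id: uuid.UUID) -> None:
        self._id = id

    @property
    def id(self) -> uuid.UUID:
        return self._id
```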
Do you think we should do that?
> I think we need to still let users to use import pyspark.sql.streaming.Query<xxx>Event

@HyukjinKwon is this true? I think users have only the Spark Connect Python code.

> import pyspark.sql.connect.streaming.Query<xxx>Event

User code in a StreamingListener in Connect is Spark Connect code. I think our import statements import the right version (connect or legacy, depending on the environment).
Yeah, actually I didn't want users to directly import the event class (that's why I mentioned the constructor is private). Do we need end users to be able to import it in Spark Connect?
They would write StreamingListener code just like the rest of their Spark Connect code, i.e. they have access only to those packages that are available in Spark Connect (they don't import the 'connect' version directly).

What that would mean for this class is a bit uncertain to me. If @HyukjinKwon is ok with this, I am ok.
Merged to master.
What changes were proposed in this pull request?
Following the discussion of the foreachBatch implementation, we decided to implement the Connect StreamingQueryListener in a way that the server runs the listener code, rather than the client. Following this POC: #41096, this is going to be done by registering a Scala StreamingQueryListener on the server, which initializes the Python process and runs the user's Python listener code inside it. (Details of this step still depend on the foreachBatch implementation.)

This PR focuses on step 3, the serialization and deserialization of the events.
It also finishes a TODO to check the exception in QueryTerminatedEvent.
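For illustration, a minimal self-contained sketch of the serde pattern this PR adds (not the PR's exact code; the field handling is an assumption based on the event's documented fields):

```python
import uuid
from typing import Any, Dict, Optional

class QueryTerminatedEvent:
    def __init__(self, id: uuid.UUID, runId: uuid.UUID, exception: Optional[str]) -> None:
        self._id = id
        self._runId = runId
        self._exception = exception

    @classmethod
    def fromJson(cls, j: Dict[str, Any]) -> "QueryTerminatedEvent":
        return cls(
            id=uuid.UUID(j["id"]),
            runId=uuid.UUID(j["runId"]),
            exception=j.get("exception"),  # absent/None when the query stopped cleanly
        )
```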
Why are the changes needed?
For implementing the Connect StreamingQueryListener.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests