
Conversation

@WweiL (Contributor) commented Jun 9, 2023

What changes were proposed in this pull request?

Following the discussion of the foreachBatch implementation, we decided to implement the Connect StreamingQueryListener such that the server runs the listener code, rather than the client.

Following this POC, #41096, this is going to be done as follows:

  1. The client sends serialized Python code to the server.
  2. The server initializes a Scala StreamingQueryListener, which initializes the Python process and runs the Python code. (Details of this step still depend on the foreachBatch implementation.)
  3. When a new StreamingQuery event comes in, the JVM serializes it to JSON and sends it to the Python process to handle.

This PR focuses on step 3: the serialization and deserialization of the events.

It also finishes a TODO to check the exception in QueryTerminatedEvent.
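
For step 3, the shape of the change is roughly as follows. This is only a sketch modeled on QueryStartedEvent; the actual field set mirrors the JSON produced by the JVM event:

    import uuid
    from typing import Any, Dict, Optional

    class QueryStartedEvent:
        def __init__(
            self, id: uuid.UUID, runId: uuid.UUID, name: Optional[str], timestamp: str
        ) -> None:
            self._id = id
            self._runId = runId
            self._name = name
            self._timestamp = timestamp

        @classmethod
        def fromJson(cls, j: Dict[str, Any]) -> "QueryStartedEvent":
            # The JVM serializes the event to JSON; UUID fields arrive as
            # strings, and "name" may be None for unnamed queries.
            return cls(
                id=uuid.UUID(j["id"]),
                runId=uuid.UUID(j["runId"]),
                name=j["name"],
                timestamp=j["timestamp"],
            )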

Why are the changes needed?

For implementing the Connect StreamingQueryListener.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests

@WweiL (Contributor, Author) commented Jun 9, 2023

PS. ChatGPT is especially helpful in doing such boilerplate jobs :P

    inputRowsPerSecond=j["inputRowsPerSecond"],
    processedRowsPerSecond=j["processedRowsPerSecond"],
    observedMetrics={
        k: Row(*row_dict.keys())(*row_dict.values())  # Assume no nested rows
@WweiL (Contributor, Author) commented on this hunk:

Can there be a nested Row for this field?

@WweiL (Contributor, Author) commented Jun 12, 2023

Checking the original PR, #26127: the intended use case of the observe method is to construct this Row by aggregating on some fields. I think we don't need to handle nested rows here, but I'm open to discussion.
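
For illustration, a typical observe call produces a flat Row of aggregate values (the metric and column names below are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["value"])

    # Each metric is an aggregate expression, so the Row that ends up in
    # observedMetrics["my_metrics"] is flat, e.g. Row(cnt=2, max_value=2).
    observed = df.observe(
        "my_metrics",
        F.count(F.lit(1)).alias("cnt"),
        F.max("value").alias("max_value"),
    )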

@WweiL (Contributor, Author) commented Jun 9, 2023

@rangadi @HyukjinKwon Can you take a look? Thanks!

@HyukjinKwon (Member) commented:
cc @HeartSaVioR too

@HyukjinKwon (Member) left a review:

Seems fine from a cursory look.

@rangadi left a review:

Made a few comments.

    )

    @classmethod
    def fromJson(cls, j: Dict[str, Any]) -> "QueryStartedEvent":
@rangadi commented on this hunk:

What is the context where these are used? Thanks for the detailed description; you mentioned this is for 'step 3'. Could you point to any WIP code that uses this?

@WweiL (Contributor, Author):

Sure! Here is the pointer: https://github.com/apache/spark/pull/41096/files#r1230025798
Note that it's draft code, so it's really messy. Basically, since we register a Scala listener but run Python code inside it, we need a way to send the events to the Python process, and JSON is a safe way to do that.
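
As a hedged sketch of that flow on the Python side (the wrapper function is hypothetical; fromJson is what this PR adds):

    import json

    from pyspark.sql.streaming.listener import (
        QueryStartedEvent,
        StreamingQueryListener,
    )

    # Hypothetical dispatch: the server-side Scala listener serializes the
    # event and forwards the JSON string; the Python process rebuilds the
    # typed event and invokes the user's listener callback.
    def handle_query_started(
        listener: StreamingQueryListener, event_json: str
    ) -> None:
        event = QueryStartedEvent.fromJson(json.loads(event_json))
        listener.onQueryStarted(event)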

@rangadi:

I see. Scala does toJson and Python does fromJson.
Should this be under the connect directory, since it is Connect-specific code?

@WweiL (Contributor, Author) commented Jun 14, 2023

I think we still need to let users use import pyspark.sql.streaming.Query<xxx>Event in Connect. If we put the code in the connect folder, they might need to do import pyspark.sql.connect.streaming.Query<xxx>Event instead.

One way I can think of to do this is to refactor the code and make the event classes here abstract:

class QueryStartedEvent(ABC):
    @property
    @abstractmethod
    def id(self) -> uuid.UUID:
        ...
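
A concrete Connect-side subclass could then implement it, for illustration (the class name and placement are hypothetical):

    import uuid

    # Hypothetical: the Connect module subclasses the abstract event, so
    # pyspark.sql.streaming.QueryStartedEvent stays the user-facing type.
    class ConnectQueryStartedEvent(QueryStartedEvent):
        def __init__(self, id: uuid.UUID) -> None:
            self._id = id

        @property
        def id(self) -> uuid.UUID:
            return self._id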

Do you think we should do that?

@rangadi:

> I think we still need to let users use import pyspark.sql.streaming.Query<xxx>Event in Connect.

@HyukjinKwon is this true? I think users only have the Spark Connect Python code.

> import pyspark.sql.connect.streaming.Query<xxx>Event

User code in a StreamingQueryListener in Connect is Spark Connect code. I think our import statements import the right version (connect or legacy, depending on the environment).
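
As a hedged illustration of that dispatch pattern (simplified; the exact mechanism in pyspark may differ):

    import os

    # Simplified sketch of environment-based dispatch, in the spirit of
    # pyspark's remote-session check: pick the Connect implementation when
    # a remote session is configured, the legacy one otherwise.
    def is_remote() -> bool:
        return "SPARK_REMOTE" in os.environ

    if is_remote():
        from pyspark.sql.connect.streaming.query import StreamingQuery
    else:
        from pyspark.sql.streaming.query import StreamingQuery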

@HyukjinKwon (Member) commented Jun 15, 2023

Yeah, actually I didn't want users to directly import the event classes (and that's why I mentioned that the constructor is private). Do we need end users to be able to import them in Spark Connect?

@rangadi:

They would write StreamingQueryListener code just like the rest of their Spark Connect code, i.e. they have access only to those packages that are available in Spark Connect (they don't import the 'connect' version directly).

What that would mean for this class is a bit uncertain to me. If @HyukjinKwon is ok with this, I am ok.

@HyukjinKwon (Member) commented:
Merged to master.
