
Conversation

@WweiL (Contributor) commented Jul 24, 2023

What changes were proposed in this pull request?

Implement the Python streaming query listener and the `addListener` and `removeListener` methods. A follow-up, SPARK-44516, is filed to actually terminate the query listener process when `removeListener` is called; SPARK-44516 depends on SPARK-44433.
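
For readers skimming the API surface, here is a minimal sketch of the listener interface exercised by this PR, rewritten from the one-line REPL transcript below into multi-line form. `spark` is assumed to be an active Spark Connect session, and the print message is illustrative.

```
from pyspark.sql.streaming.listener import (
    StreamingQueryListener,
    QueryStartedEvent,
    QueryProgressEvent,
    QueryIdleEvent,
    QueryTerminatedEvent,
)


class MyListener(StreamingQueryListener):
    """Same listener as in the manual test below, laid out one method per line."""

    def onQueryStarted(self, event: QueryStartedEvent) -> None:
        # Runs in the Python listener worker started on the server side.
        print("hi, event query id is: " + str(event.id))

    def onQueryProgress(self, event: QueryProgressEvent) -> None:
        pass

    def onQueryIdle(self, event: QueryIdleEvent) -> None:
        pass

    def onQueryTerminated(self, event: QueryTerminatedEvent) -> None:
        pass


# Register and later unregister the listener on a Spark Connect session.
listener = MyListener()
spark.streams.addListener(listener)
spark.streams.removeListener(listener)
```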

Why are the changes needed?

SS Connect development

Does this PR introduce any user-facing change?

Yes. Users can now use the streaming query listener with Spark Connect.

How was this patch tested?

Manually tested; also added unit tests.

addListener:

```
# Client side:
>>> from pyspark.sql.streaming.listener import StreamingQueryListener;from pyspark.sql.streaming.listener import (QueryStartedEvent, QueryProgressEvent, QueryTerminatedEvent, QueryIdleEvent)
>>> class MyListener(StreamingQueryListener):
...     def onQueryStarted(self, event: QueryStartedEvent) -> None: print("hi, event query id is: " +  str(event.id)); df=self.spark.createDataFrame(["10","11","13"], "string").toDF("age"); df.write.saveAsTable("tbllistener1")
...     def onQueryProgress(self, event: QueryProgressEvent) -> None: pass
...     def onQueryIdle(self, event: QueryIdleEvent) -> None: pass
...     def onQueryTerminated(self, event: QueryTerminatedEvent) -> None: pass
...
>>> spark.streams.addListener(MyListener())
>>> q = spark.readStream.format("rate").load().writeStream.format("console").start()
>>> q.stop()
>>> spark.read.table("tbllistener1").collect()
[Row(age='13'), Row(age='10'), Row(age='11')]

# Server side:
##### event_type received from python process is 0
hi, event query id is: dd7ba1c4-6c8f-4369-9c3c-5dede22b8a2f
```

removeListener:

```
# Client side:
>>> listener = MyListener(); spark.streams.addListener(listener)
>>> spark.streams.removeListener(listener)

# Server side:
# Nothing is printed here: the listener is removed from the server-side StreamingQueryManager and from the cache in SessionHolder, but the Python process still hangs. Follow-up SPARK-44516 is filed to stop this process.
```

@bogao007 (Contributor) left a comment


Can we add unit tests, including tests that use the Spark session inside the listener?
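
For illustration, a rough sketch of the kind of test being requested here, modeled on the manual transcript in the description. The test class name, the sleep duration, and the harness-provided `self.spark` Connect session are assumptions; this is not the test actually added in this PR.

```
import time
import unittest

from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent


class SparkSessionInListenerTest(unittest.TestCase):
    # `self.spark` is assumed to be a Spark Connect session set up by the test harness.

    def test_spark_session_usable_inside_listener(self):
        class TableWritingListener(StreamingQueryListener):
            def onQueryStarted(self, event: QueryStartedEvent) -> None:
                # Uses the listener's own Spark session, as in the manual test.
                df = self.spark.createDataFrame(["10", "11", "13"], "string").toDF("age")
                df.write.saveAsTable("tbllistener1")

            def onQueryProgress(self, event) -> None:
                pass

            def onQueryIdle(self, event) -> None:
                pass

            def onQueryTerminated(self, event) -> None:
                pass

        self.spark.sql("DROP TABLE IF EXISTS tbllistener1")
        listener = TableWritingListener()
        self.spark.streams.addListener(listener)
        try:
            query = (
                self.spark.readStream.format("rate").load()
                .writeStream.format("console").start()
            )
            time.sleep(10)  # give the listener worker time to handle onQueryStarted
            query.stop()
            rows = self.spark.read.table("tbllistener1").collect()
            self.assertEqual({r.age for r in rows}, {"10", "11", "13"})
        finally:
            self.spark.streams.removeListener(listener)
```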

@bogao007 (Contributor) left a comment


LGTM

```
command.getRemoveListener.getListenerPayload.toByteArray,
Utils.getContextOrSparkClassLoader)
.id
val listenerId = command.getRemoveListener.getId
```

Thanks for doing this change!

@WweiL (Contributor, Author) commented Jul 28, 2023

@HyukjinKwon @ueshin can you take another look? Thanks! This also needs to go into 3.5; sorry for the trouble!

@WweiL (Contributor, Author) commented Jul 31, 2023

Hi Takuya @ueshin, could you check if this could be merged? Thanks!

@ueshin (Member) commented Jul 31, 2023

Thanks! Merging to master/3.5.

@ueshin ueshin closed this in 799ab87 Jul 31, 2023
@ueshin (Member) commented Jul 31, 2023

@WweiL There was a conflict with 3.5. Could you submit another PR to backport this? Thanks.

bogao007 pushed a commit to bogao007/spark that referenced this pull request Jul 31, 2023
### What changes were proposed in this pull request?

Implement the Python streaming query listener and the `addListener` and `removeListener` methods. A follow-up, SPARK-44516, is filed to actually terminate the query listener process when `removeListener` is called; SPARK-44516 depends on SPARK-44433.

### Why are the changes needed?

SS Connect development

### Does this PR introduce _any_ user-facing change?

Yes. Users can now use the streaming query listener with Spark Connect.

### How was this patch tested?

Manual test and added unit test

#### addListener:
```
# Client side:
>>> from pyspark.sql.streaming.listener import StreamingQueryListener;from pyspark.sql.streaming.listener import (QueryStartedEvent, QueryProgressEvent, QueryTerminatedEvent, QueryIdleEvent)
>>> class MyListener(StreamingQueryListener):
...     def onQueryStarted(self, event: QueryStartedEvent) -> None: print("hi, event query id is: " +  str(event.id)); df=self.spark.createDataFrame(["10","11","13"], "string").toDF("age"); df.write.saveAsTable("tbllistener1")
...     def onQueryProgress(self, event: QueryProgressEvent) -> None: pass
...     def onQueryIdle(self, event: QueryIdleEvent) -> None: pass
...     def onQueryTerminated(self, event: QueryTerminatedEvent) -> None: pass
...
>>> spark.streams.addListener(MyListener())
>>> q = spark.readStream.format("rate").load().writeStream.format("console").start()
>>> q.stop()
>>> spark.read.table("tbllistener1").collect()
[Row(age='13'), Row(age='10'), Row(age='11')]

# Server side:
##### event_type received from python process is 0
hi, event query id is: dd7ba1c4-6c8f-4369-9c3c-5dede22b8a2f
```

#### removeListener:
```
# Client side:
>>> listener = MyListener(); spark.streams.addListener(listener)
>>> spark.streams.removeListener(listener)

# Server side:
# Nothing is printed here: the listener is removed from the server-side StreamingQueryManager and from the cache in SessionHolder, but the Python process still hangs. Follow-up SPARK-44516 is filed to stop this process.
```

Closes apache#42116 from WweiL/listener-poc-newest.

Lead-authored-by: Wei Liu <[email protected]>
Co-authored-by: pengzhon-db <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
@bogao007 (Contributor) commented Jul 31, 2023

> @WweiL There was a conflict with 3.5. Could you submit another PR to backport this? Thanks.

@ueshin Created a backport PR to the 3.5 branch, #42250. Could you help take a look? Thanks!

bogao007 pushed a commit to bogao007/spark that referenced this pull request Jul 31, 2023
Implement the Python streaming query listener and the `addListener` and `removeListener` methods. A follow-up, SPARK-44516, is filed to actually terminate the query listener process when `removeListener` is called; SPARK-44516 depends on SPARK-44433.

SS Connect development

Yes. Users can now use the streaming query listener with Spark Connect.

Manual test and added unit test

```
>>> from pyspark.sql.streaming.listener import StreamingQueryListener;from pyspark.sql.streaming.listener import (QueryStartedEvent, QueryProgressEvent, QueryTerminatedEvent, QueryIdleEvent)
>>> class MyListener(StreamingQueryListener):
...     def onQueryStarted(self, event: QueryStartedEvent) -> None: print("hi, event query id is: " +  str(event.id)); df=self.spark.createDataFrame(["10","11","13"], "string").toDF("age"); df.write.saveAsTable("tbllistener1")
...     def onQueryProgress(self, event: QueryProgressEvent) -> None: pass
...     def onQueryIdle(self, event: QueryIdleEvent) -> None: pass
...     def onQueryTerminated(self, event: QueryTerminatedEvent) -> None: pass
...
>>> spark.streams.addListener(MyListener())
>>> q = spark.readStream.format("rate").load().writeStream.format("console").start()
>>> q.stop()
>>> spark.read.table("tbllistener1").collect()
[Row(age='13'), Row(age='10'), Row(age='11')]

hi, event query id is: dd7ba1c4-6c8f-4369-9c3c-5dede22b8a2f
```

```
>>> listener = MyListener(); spark.streams.addListener(listener)
>>> spark.streams.removeListener(listener)

```

Closes apache#42116 from WweiL/listener-poc-newest.

Lead-authored-by: Wei Liu <[email protected]>
Co-authored-by: pengzhon-db <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Aug 4, 2023
…ss with `removeListener` and improvements

### What changes were proposed in this pull request?

This is a followup to #42116. It addresses the following issues:

1. When `removeListener` is called on a listener, the Python process, which previously was left running, is now also stopped.
2. When `removeListener` is called multiple times on the same listener, subsequent calls are a no-op in non-Connect mode. Before this PR, Connect threw an error instead, which did not align with the existing behavior; this PR fixes that.
3. Set the socket timeout to None (infinite) for `foreachBatch_worker` and `listener_worker`, because there can be a long gap between microbatches. Without this, the socket times out and cannot process new data, as in the traceback below (a minimal sketch of the timeout behavior follows it).

```
scala> Streaming query listener worker is starting with url sc://localhost:15002/;user_id=wei.liu and sessionId 886191f0-2b64-4c44-b067-de511f04b42d.
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/sql/connect/streaming/worker/listener_worker.py", line 95, in <module>
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/sql/connect/streaming/worker/listener_worker.py", line 82, in main
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/serializers.py", line 557, in loads
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/serializers.py", line 594, in read_int
  File "/usr/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
```
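
To make item 3 concrete, here is a small self-contained sketch of the timeout behavior using a plain socket pair; it stands in for the channel between the server and the worker process and is not the actual Spark worker code.

```
import socket

# A connected socket pair stands in for the server <-> listener worker channel.
server_side, worker_side = socket.socketpair()

# With a finite timeout, a long gap between microbatches (nothing written by
# the server) makes the worker's blocking read raise socket.timeout, as in the
# traceback above:
worker_side.settimeout(0.1)
try:
    worker_side.recv(4)
except socket.timeout:
    print("timed out waiting for the next event")

# The fix: a timeout of None means "block indefinitely", so the worker simply
# waits for the next event no matter how long the gap is.
worker_side.settimeout(None)
```

Item 2 is observable from the client by calling `spark.streams.removeListener(listener)` twice on the same listener: after this change the second call is a no-op in Connect, matching the non-Connect behavior.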

### Why are the changes needed?

Necessary improvements

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test + unit test

Closes #42283 from WweiL/SPARK-44433-listener-process-termination.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
ueshin pushed a commit that referenced this pull request Aug 5, 2023
…process with removeListener and improvements

### Master Branch PR: #42283

### What changes were proposed in this pull request?

This is a followup to #42116. It addresses the following issues:

1. When `removeListener` is called on a listener, the Python process, which previously was left running, is now also stopped.
2. When `removeListener` is called multiple times on the same listener, subsequent calls are a no-op in non-Connect mode. Before this PR, Connect threw an error instead, which did not align with the existing behavior; this PR fixes that.
3. Set the socket timeout to None (infinite) for `foreachBatch_worker` and `listener_worker`, because there can be a long gap between microbatches. Without this, the socket times out and cannot process new data, as in the traceback below.

```
scala> Streaming query listener worker is starting with url sc://localhost:15002/;user_id=wei.liu and sessionId 886191f0-2b64-4c44-b067-de511f04b42d.
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/sql/connect/streaming/worker/listener_worker.py", line 95, in <module>
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/sql/connect/streaming/worker/listener_worker.py", line 82, in main
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/serializers.py", line 557, in loads
  File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/serializers.py", line 594, in read_int
  File "/usr/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
```

### Why are the changes needed?

Necessary improvements

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test + unit test

Closes #42340 from WweiL/SPARK-44433-listener-followup-3.5.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>