[SPARK-43031] [SS] [Connect] Enable unit test and doctest for streaming #40691
Conversation
```python
    return self.load(path=path, format="json")
else:
    raise TypeError("path can be only a single string")
```
Please ignore the changes in this file; they were added and are to be reviewed in #40689.
rangadi left a comment:
Overall LGTM. Made a few comments.
```python
from pyspark.testing.sqlutils import ReusedSQLTestCase


class StreamingTestsForeachFamilyMixin:
```
Wondering why 'foreach family'. What are the other foreach() methods tested here?
Right, I put both foreach and foreachBatch in this class. My naming sense isn't that great, so I welcome any suggestion on its name here...
```python
class StreamingTestsForeachFamilyMixin:
    class ForeachWriterTester:
```
When I tried these tests, I got a serialization error after moving this to the mixin. We can handle it when we run into it with the connect tests.
All tests pass for me when running `python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming_foreach_family`. I renamed the mixin part in the new commit, though.
Can we move foreach back to its original place? foreach() and foreachBatch() are very different: the latter takes a DataFrame, while the former is more like a UDF.
Please add the foreach() API to one of the JIRAs, or file a new JIRA.
Or you could keep it here. Either is OK.
I see! I could make another class for foreach then. IIRC I already added the JIRA for foreach support to the epic.
```python
)
self.assertTrue(df.isStreaming)
self.assertEqual(df.schema.simpleString(), "struct<data:string>")
# TODO: Moving this outside of with block will trigger the following error,
```
It does not need to be a TODO; testing it inside the with block is the right thing. Calls like schema are executed on the server with the updated state of the conf.
I see, I'll remove the comments. So this is a behavior change when migrating to connect?
Yes. Something like schema is evaluated with the conf at the time of the schema call.
```python
q.stop()
shutil.rmtree(tmpPath)


class ForeachWriterTester:
```
Moving foreach tests to a different file SGTM.
python/pyspark/sql/tests/streaming/test_streaming_foreach_family.py (outdated; resolved)
```diff
 Return whether the query has terminated or not within 5 seconds
->>> sq.awaitTermination(5)
+>>> sq.awaitTermination(5)  # doctest: +SKIP
```
Why are these needed? This won't be tested in connect yet, right?
Right... this actually silences this line in the doctest for both connect and non-connect, I believe (@HyukjinKwon please correct me if I'm wrong). It's just some temporary pain we have to bear, since connect doesn't support this yet.
What is the pain? What fails if you remove this diff?
Nothing fails, but we would also skip some tests in the non-connect scenario.
Let's not skip any tests. Any skip should have a TODO comment.
Sure! I've moved the TODOs to the top of these methods so they don't show up in the PySpark docs.
If we're absolutely sure that we will enable these again very soon, I am fine with it. As @rangadi said, I try my best to avoid skipping tests, but I have done it a few times when I was about to enable them again very soon.
I see. I think I'll just remove all of the SKIP flags and disable the doctest for now. Some of the functionality might not be supported that soon, I think.
A better solution is to comment out the setting of `__doc__` for awaitTermination() in connect's query.py.
Thanks! Done
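As background for the `__doc__` suggestion above: doctest only collects examples from docstrings, so removing a `__doc__` assignment drops the doctest without touching the shared docstring. A minimal sketch, with hypothetical names rather than Spark's actual code:

```python
# Illustrative sketch (not Spark code): doctest collects examples from
# docstrings, so a method whose __doc__ is removed contributes no doctests.
import doctest


class Query:  # hypothetical stand-in for connect's StreamingQuery
    def await_termination(self):
        """
        >>> 1 + 1
        2
        """
        return True


finder = doctest.DocTestFinder()

# With the docstring present, doctest finds one example.
n_before = sum(len(t.examples) for t in finder.find(Query.await_termination))

# Simulate "commenting out the __doc__ assignment" in connect's query.py.
Query.await_termination.__doc__ = None
n_after = sum(len(t.examples) for t in finder.find(Query.await_termination))

print(n_before, n_after)  # 1 0
```

Since connect's query.py assigns docstrings from the non-connect classes, commenting out a single assignment there has the same effect.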
```diff
 >>> sq.stop()
->>> sq.isActive
+>>> sq.isActive  # doctest: +SKIP
```
Is this required? Why is it skipped?
Calling isActive after the query is stopped throws an error right now, until better session management is implemented.
I see, please add a TODO here with a brief comment.
### What changes were proposed in this pull request?

This PR adds the `orc`, `parquet`, and `text` APIs in connect's DataStreamReader.

### Why are the changes needed?

Part of the Streaming Connect project.

### Does this PR introduce _any_ user-facing change?

Yes, the three APIs are now enabled, but everything is still largely under development so far.

### How was this patch tested?

Manually tested; unit tests will be added in SPARK-43031 as a follow-up PR #40691.

Closes #40689 from WweiL/SPARK-42951-reader-apis.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Hi @HyukjinKwon, could you please take another look? Thanks!
```python
>>> with tempfile.TemporaryDirectory() as d:
...     # Create a table with Rate source.
...     df.writeStream.toTable(
...         "my_table", checkpointLocation=d)  # doctest: +ELLIPSIS
```
Curious: what does `# doctest: +ELLIPSIS` mean?
My understanding is that with this flag set, doctest checks the result in a regex-like way. The line below is now `<...streaming.query.StreamingQuery object at 0x...>`, but before it was `<pyspark.sql.streaming.query.StreamingQuery object at 0x...>`, which would conflict with connect's test, where the repr is `<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x...>`.
So to make this test work for both connect and non-connect, we enable this regex-like check and replace the module prefix with `...`.
It doesn't really matter here anyway, since we enable the flag in the test options in the `__main__` method below.
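For reference, a minimal standalone illustration of the `+ELLIPSIS` directive (the class and function names are hypothetical, not Spark's):

```python
# Minimal illustration of doctest's +ELLIPSIS directive: "..." in the expected
# output matches any text, so one docstring can match object reprs from
# different modules (e.g. pyspark.sql.streaming vs pyspark.sql.connect.streaming).
import doctest


class Query:  # hypothetical stand-in for StreamingQuery
    pass


def start_query():
    """
    >>> start_query()  # doctest: +ELLIPSIS
    <...Query object at 0x...>
    """
    return Query()


runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(
        start_query, globs={"start_query": start_query}):
    runner.run(test)
print(runner.failures)  # 0
```

The `...` before `Query` absorbs whatever module prefix the repr happens to have, and the `...` after `0x` absorbs the memory address.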
```python
from pyspark.testing.connectutils import ReusedConnectTestCase


class StreamingParityTests(StreamingTestsMixin, ReusedConnectTestCase):
```
Not sure I understand the intent of this file. It looks like all tests in this file are skipped. Where do we test the already-supported Spark streaming connect functionality?
uhoh, the tests here shouldn't be skipped.
This file marks all tests that should currently be skipped because the corresponding functionality is not yet supported in connect.
It executes all tests from StreamingTestsMixin, except the ones below marked as skipped.
When anyone adds support for the corresponding functions, they should delete the @unittest.skip decorator and the overriding method, unless they want to make a connect-specific change.
So ideally, in the end, this class should just be:

```python
class StreamingParityTests(StreamingTestsMixin, ReusedConnectTestCase):
    pass
```

and then it runs all tests from StreamingTestsMixin.
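The parity pattern described in this thread can be sketched as a runnable toy example (the test names and JIRA number are made up for illustration):

```python
# Sketch of the parity-test pattern: the connect test class inherits every
# test from the shared mixin and overrides only the ones that must be
# skipped until connect supports them.
import unittest


class StreamingTestsMixin:
    def test_stream_simple(self):
        self.assertTrue(True)  # stands in for a real streaming assertion

    def test_stream_await_termination(self):
        self.assertTrue(True)  # stands in for a real streaming assertion


class StreamingParityTests(StreamingTestsMixin, unittest.TestCase):
    @unittest.skip("TODO(SPARK-XXXXX): not supported in connect yet")
    def test_stream_await_termination(self):
        super().test_stream_await_termination()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(StreamingParityTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, len(result.skipped))  # 2 1
```

Deleting the overriding skipped method re-enables the mixin's version automatically, which is why the class ideally shrinks to a bare `pass` over time.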
Each of the skipped tests here has a SPARK ticket associated with it. We will enable these soon. LGTM.
HyukjinKwon left a comment:
looks fine from a cursory look
rangadi left a comment:
LGTM. @WweiL ping Hyukjin when this is ready to be merged.
@HyukjinKwon can you merge this? Thank you!
Merged to master.
### What changes were proposed in this pull request?

Enable unit tests and doctests for streaming queries. Many are skipped and need to be un-skipped as development goes on.

Note that I also separated the `foreach` and `foreachBatch` tests from the original test suite, because they are currently not implemented in connect and it seems unnecessary to manually add a skip for all of them in `StreamingParityTests`. It also doesn't hurt to separate them, since these test suites are already large enough.

### Why are the changes needed?

More tests are always better than fewer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It is a test itself.