[SPARK-43031] [SS] [Connect] Enable unit test and doctest for streaming #40691
Conversation
```python
    return self.load(path=path, format="json")
else:
    raise TypeError("path can be only a single string")
```
Please ignore the changes in this file; they were added and are to be reviewed in #40689.
rangadi left a comment:
Overall LGTM. Made a few comments.
```python
from pyspark.testing.sqlutils import ReusedSQLTestCase


class StreamingTestsForeachFamilyMixin:
```
Wondering why 'foreach family'. What are the other foreach() methods tested here?
Right, I put both foreach and foreachBatch in this class. My naming sense isn't that great, so I welcome any suggestion on its name here...
```python
class StreamingTestsForeachFamilyMixin:
    class ForeachWriterTester:
```
When I tried these tests, I got a serialization error after moving this to the mixin. We can handle it when we run into it with the connect tests.
All tests pass for me when running `python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming_foreach_family`. I renamed the mixin part in the new commit, though.
Can we move foreach back to its original place? foreach() and foreachBatch() are very different: the latter takes a DataFrame, while the former is more like a UDF.
Please add the foreach() API to one of the JIRAs, or file a new JIRA.
Or you could keep it here. Either is OK.
I see! I could make another class for foreach then. IIRC I already added the JIRA for foreach support to the epic.
```python
)
self.assertTrue(df.isStreaming)
self.assertEqual(df.schema.simpleString(), "struct<data:string>")
# TODO: Moving this outside of with block will trigger the following error,
```
It does not need to be a TODO; testing it inside the with block is the right thing. Calls like schema are executed on the server with the updated state of the conf.
I see, I'll remove the comments. So this is a behavior change when migrating to connect?
Yes. Something like schema is evaluated with the conf at the time of the schema call.
```python
q.stop()
shutil.rmtree(tmpPath)


class ForeachWriterTester:
```
Moving foreach tests to a different file SGTM.
python/pyspark/sql/tests/streaming/test_streaming_foreach_family.py (outdated; resolved)
```diff
 Return whether the query has terminated or not within 5 seconds
->>> sq.awaitTermination(5)
+>>> sq.awaitTermination(5)  # doctest: +SKIP
```
Why are these needed? This won't be tested in connect yet, right?
Right... this actually silences this line in the doctest for both connect and non-connect, I believe (@HyukjinKwon please correct me if I'm wrong). It's just some temporary pain we have to bear, since connect doesn't support this yet.
What is the pain? What fails if you remove this diff?
Nothing fails, but we would also skip some tests in the non-connect scenario.
Let's not skip any tests. Any skip should have a TODO comment.
Sure! I've moved the TODOs to the top of these methods so they don't show up in the PySpark docs.
If we're absolutely sure that we will enable these again very soon, I am fine with it. As @rangadi said, I try my best to avoid skipping tests, but I have done it a few times when I was about to enable them again very soon.
I see. I think I'll just remove all of the SKIP flags and disable the doctest for now. Some of the functionality might not be supported that soon, I think.
A better solution is to comment out the setting of `__doc__` for awaitTermination() in connect's query.py.
Thanks! Done
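As background for the `__doc__` suggestion above: doctest only collects examples from docstrings, so removing a `__doc__` assignment drops the doctest without touching the shared docstring. A minimal sketch, with hypothetical names rather than Spark's actual code:

```python
# Illustrative sketch (not Spark code): doctest collects examples from
# docstrings, so a method whose __doc__ is removed contributes no doctests.
import doctest


class Query:  # hypothetical stand-in for connect's StreamingQuery
    def await_termination(self):
        """
        >>> 1 + 1
        2
        """
        return True


finder = doctest.DocTestFinder()

# With the docstring present, doctest finds one example.
n_before = sum(len(t.examples) for t in finder.find(Query.await_termination))

# Simulate "commenting out the __doc__ assignment" in connect's query.py.
Query.await_termination.__doc__ = None
n_after = sum(len(t.examples) for t in finder.find(Query.await_termination))

print(n_before, n_after)  # 1 0
```

Since connect's query.py assigns docstrings from the non-connect classes, commenting out a single assignment there has the same effect.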
```diff
 >>> sq.stop()
->>> sq.isActive
+>>> sq.isActive  # doctest: +SKIP
```
Is this required? Why is it skipped?
Calling isActive after the query is stopped throws an error right now, until better session management is implemented.
I see, please add a TODO here with a brief comment.
### What changes were proposed in this pull request?

This PR adds the `orc`, `parquet`, and `text` APIs in connect's DataStreamReader.

### Why are the changes needed?

Part of the Streaming Connect project.

### Does this PR introduce _any_ user-facing change?

Yes, the three APIs are now enabled, but everything is still largely under development so far.

### How was this patch tested?

Manually tested; unit tests will be added in SPARK-43031 as a follow-up PR #40691.

Closes #40689 from WweiL/SPARK-42951-reader-apis.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Hi @HyukjinKwon, could you please take another look? Thanks!
```python
>>> with tempfile.TemporaryDirectory() as d:
...     # Create a table with Rate source.
...     df.writeStream.toTable(
...         "my_table", checkpointLocation=d)  # doctest: +ELLIPSIS
```
Curious: what does `# doctest: +ELLIPSIS` mean?
My understanding is that with this flag set, doctest checks the result in a regex-like way. The line below is now `<...streaming.query.StreamingQuery object at 0x...>`, but before it was `<pyspark.sql.streaming.query.StreamingQuery object at 0x...>`, which would conflict with connect's test, where the repr is `<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x...>`.
So to make this test work for both connect and non-connect, we enable this regex-like check and replace the module prefix with `...`.
It doesn't really matter here anyway, since we enable the flag in the test options in the `__main__` method below.
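For reference, a minimal standalone illustration of the `+ELLIPSIS` directive (the class and function names are hypothetical, not Spark's):

```python
# Minimal illustration of doctest's +ELLIPSIS directive: "..." in the expected
# output matches any text, so one docstring can match object reprs from
# different modules (e.g. pyspark.sql.streaming vs pyspark.sql.connect.streaming).
import doctest


class Query:  # hypothetical stand-in for StreamingQuery
    pass


def start_query():
    """
    >>> start_query()  # doctest: +ELLIPSIS
    <...Query object at 0x...>
    """
    return Query()


runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(
        start_query, globs={"start_query": start_query}):
    runner.run(test)
print(runner.failures)  # 0
```

The `...` before `Query` absorbs whatever module prefix the repr happens to have, and the `...` after `0x` absorbs the memory address.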
```python
from pyspark.testing.connectutils import ReusedConnectTestCase


class StreamingParityTests(StreamingTestsMixin, ReusedConnectTestCase):
```
Not sure I understand the intent of this file. It looks like all tests in this file are skipped. Where do we test the already-supported Spark streaming connect functionality?
uhoh, the tests here shouldn't be skipped.
This file marks all tests that should currently be skipped because the corresponding functionality is not yet supported in connect.
It executes all tests from StreamingTestsMixin, except the ones below marked as skipped.
When anyone adds support for the corresponding functions, they should delete the @unittest.skip decorator and the overriding method, unless they want to make a connect-specific change.
So ideally, in the end, this class should just be:

```python
class StreamingParityTests(StreamingTestsMixin, ReusedConnectTestCase):
    pass
```

and then it runs all tests from StreamingTestsMixin.
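The parity pattern described in this thread can be sketched as a runnable toy example (the test names and JIRA number are made up for illustration):

```python
# Sketch of the parity-test pattern: the connect test class inherits every
# test from the shared mixin and overrides only the ones that must be
# skipped until connect supports them.
import unittest


class StreamingTestsMixin:
    def test_stream_simple(self):
        self.assertTrue(True)  # stands in for a real streaming assertion

    def test_stream_await_termination(self):
        self.assertTrue(True)  # stands in for a real streaming assertion


class StreamingParityTests(StreamingTestsMixin, unittest.TestCase):
    @unittest.skip("TODO(SPARK-XXXXX): not supported in connect yet")
    def test_stream_await_termination(self):
        super().test_stream_await_termination()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(StreamingParityTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, len(result.skipped))  # 2 1
```

Deleting the overriding skipped method re-enables the mixin's version automatically, which is why the class ideally shrinks to a bare `pass` over time.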
Each of the skipped tests here has a SPARK ticket associated with it. We will enable these soon. LGTM.
HyukjinKwon left a comment:
looks fine from a cursory look
rangadi left a comment:
LGTM. @WweiL ping Hyukjin when this is ready to be merged.
@HyukjinKwon can you merge this? Thank you!
Merged to master.
### What changes were proposed in this pull request?

Enable unit tests and doctests for streaming queries. Many are skipped and need to be un-skipped as development goes on.

Note that I also separated the `foreach` and `foreachBatch` tests from the original test suite, because they are currently not implemented in connect and it seems unnecessary to manually add a skip for all of them in `StreamingParityTests`. It also doesn't hurt to separate them, since these test suites are already large enough.

### Why are the changes needed?

More tests are always better than fewer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It is a test itself.