[SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API #27331
Conversation
Test build #117270 has finished for PR 27331 at commit
Test build #117307 has finished for PR 27331 at commit
Waiting for resolution of discussion on dev - Revert and revisit the public custom expression API for partition (a.k.a. Transform API)
Force-pushed from 41f4e18 to 8de4978
Test build #117849 has finished for PR 27331 at commit
@HyukjinKwon Glancing over the discussion, it doesn't seem like the upstream feature is going to be reverted, does it?
No, it seems not. But I am not very sure we should expose these APIs, considering that DSv2 is still under development and incomplete. These APIs are considered exceptions in terms of compatibility across versions, and they are unstable at this moment.
WDYT @rdblue, @dbtsai, @cloud-fan, @brkyvz?
Makes sense. |
I think it's a good idea to keep Python up to date with the Scala API. Thanks for fixing this, @zero323! If there is a stated strategy that prevents us from merging this, then I'm fine waiting for now. But if we don't have an existing policy to avoid adding experimental APIs to Python, I think we should add it.
Test build #119264 has finished for PR 27331 at commit
I am okay with adding it in PySpark. I just wanted to make sure whether we're going to change the DSv2 API shape more or not. If that's expected, let's add them later to reduce dev overhead, as new APIs should target 3.1 anyway. If it's expected not to change, I am good with adding it.
I don't have a strong opinion here, but I guess one consideration is how much keeping this out of PySpark limits the feedback that can be gathered and used to drive the evolution of the API.
Test build #124325 has finished for PR 27331 at commit
@rdblue WDYT? I don't know the details about its stability. Does it look to you like the APIs are not going to change soon? Just a rough estimate is fine.
As I said at the time, I think this should have been merged.
Sorry, there's no policy to block this, but I also don't think it's practically a good idea to merge if it's expected to change. Once you have PySpark and SparkR APIs, you have to fix all the APIs together, along with their tests, every time you change the APIs, and I don't believe all the devs are used to all the languages, which is overhead. That's why we have a bunch of inconsistencies between the SQL function APIs in other languages, as an example. @rdblue, can I ask for a rough estimate of the stability?
Just my 2 cents ‒ if only APIs that have stabilized are exposed, then Python users, who, if I recall correctly, make up around half of the whole user base, are essentially excluded from the shaping and testing process, aren't they? That's a huge issue.
```python
        return self

    @since(3.1)
    def partitionedBy(self, col, *cols):
```
Maybe it's important to describe what is expected for col. Only columns and the partition transform functions are allowed, not regular Spark Columns.
I still don't like that we made this API look like it takes regular Spark Columns; they are mutually exclusive given the last discussion on the dev mailing list. This was one of the reasons why Pandas UDFs were redesigned and separated into two groups. Let's at least clarify it.
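For context, a minimal sketch of the intended usage (table and column names are hypothetical): `partitionedBy` takes the partition transform functions added in this PR, not arbitrary Column expressions.

```python
from pyspark.sql.functions import years, bucket

# df is an existing DataFrame with columns ts (timestamp) and id (long).
# Partition the target table by year(ts) and a 4-way hash bucket over id;
# plain Column expressions (e.g. df.ts + 1) are not valid here.
(df.writeTo("catalog.db.events")
   .partitionedBy(years("ts"), bucket(4, "id"))
   .create())
```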
@rdblue, @brkyvz, @cloud-fan, should we maybe at least use a different class for these partition column expressions, such as PartitioningColumn, like we do for TypedColumn, and add asPartitioningColumn to Column?
I remember we basically want to remove these partitioning-specific expressions, per [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API), if we find a better way to do it.
I suspect a PartitioningColumn is okay as a temporary solution (?) because we can guard it by typing, and we can move these partitioning-specific expressions into a separate package. I think this at least makes them distinguishable; see the sketch below. I can work on it as well if this sounds fine to you guys.
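A hypothetical Python-side sketch of that separation (PartitioningColumn does not exist in Spark; this was discussed but not adopted):

```python
class PartitioningColumn(object):
    """Hypothetical wrapper marking a JVM expression as a partition
    transform, analogous to how TypedColumn is distinct from Column."""

    def __init__(self, jc):
        self._jc = jc  # underlying JVM partition transform expression


# With such a type, DataFrameWriterV2.partitionedBy could reject plain
# Column expressions up front:
def partitionedBy(self, col, *cols):
    for c in (col,) + cols:
        if not isinstance(c, (str, PartitioningColumn)):
            raise TypeError("partitionedBy only accepts column names "
                            "and partition transform expressions")
    ...
```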
I don't see the need for separation here that doesn't exist in Scala.
@rdblue, I don't mean that we should do that here. I mean to suggest/discuss making the separation in Scala first, because the confusion propagates to the PySpark API side as well.
They are different things, so I am suggesting we make them different. I hope we can focus more on the discussion itself.
Short version: I don't think it is relevant; on its merits, I expect the API to be stable; I don't know what other people will do. I haven't replied because I don't see how it is an important concern. An experimental API in Scala can be an experimental API in Python. There's no strategy about not porting APIs that might change. And, it's easier to maintain compatibility on the Python side. That's why I think it should be merged, regardless of the question of stability. But if stability matters to you, I'll quickly try to address it.
@rdblue, I explained multiple times why I think this is relevant and important: once you add them, you have to fix them on the Python and R sides too, and I don't believe all devs are used to the Python and R sides, given my interactions over many years in Spark dev. Ignoring a point because you don't think it's important or relevant isn't a great way to discuss. I just wanted to know the rough picture rather than asking you to assert stability, because you are the one who drove DSv2 in the community, and I do believe you're the right person to ask. I fully understand that things can change. I am here to help and make progress rather than to nitpick or place blame for something not done. I fully understand the pain we had with DSv2. It would be nicer if we could be more cooperative next time.
To clarify, this isn't a discussion. There isn't more for me to say since I've already added my perspective. |
```python
        self.mode(mode)._jwrite.jdbc(url, table, jprop)


class DataFrameWriterV2(object):
```
shall we move it to a new file?
If you think that's a better approach. I don't have a strong opinion, though; the feature is small and unlikely to be used directly.
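For reference, a rough skeleton of the class under discussion; this is a simplified, hedged reconstruction rather than the exact diff, with the Py4J wiring assumed to follow the existing DataFrameWriter pattern:

```python
from pyspark import since


class DataFrameWriterV2(object):
    """Interface for writing a DataFrame to external storage
    using the v2 API."""

    def __init__(self, df, table):
        self._df = df
        self._spark = df.sql_ctx
        # Delegate to the JVM-side DataFrameWriterV2 via Py4J.
        self._jwriter = df._jdf.writeTo(table)

    @since(3.1)
    def using(self, provider):
        """Specify a provider for the underlying output data source."""
        self._jwriter.using(provider)
        return self

    @since(3.1)
    def create(self):
        """Create a new table from the contents of the DataFrame."""
        self._jwriter.create()
```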
Okay, @zero323, can you address the comments except #27331 (comment)? Let's just merge this one. I will make a PR myself to fix the things I pointed out.
Test build #126123 has finished for PR 27331 at commit
Test build #126122 has finished for PR 27331 at commit
Test build #126124 has finished for PR 27331 at commit
Test build #126131 has finished for PR 27331 at commit
I discussed offline with @rdblue, @cloud-fan and @brkyvz. I will take a look myself and try to make a fix soon.
@rdblue, @brkyvz, @cloud-fan, I am merging this since I am going to make a followup soon. Let me know if there are any more comments here; I will address them in the followup.
HyukjinKwon left a comment
LGTM
Merged to master. |
What changes were proposed in this pull request?

- Adds `DataFrameWriterV2` class.
- Adds `writeTo` method to `pyspark.sql.DataFrame`.
- Adds related SQL partitioning functions (`years`, `months`, ..., `bucket`).

Why are the changes needed?

Feature parity.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

TODO: Should we test against `org.apache.spark.sql.connector.InMemoryTableCatalog`? If so, how to expose it in Python tests?

Closes apache#27331 from zero323/SPARK-29157.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
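On the TODO above: one possible approach (an untested sketch, assuming the catalyst test artifacts that contain InMemoryTableCatalog are on the driver classpath) is to register the JVM test catalog through the standard catalog plugin configuration:

```python
from pyspark.sql import SparkSession

# Hypothetical test setup: expose the JVM InMemoryTableCatalog under the
# name "testcat" via the catalog plugin config key spark.sql.catalog.<name>.
spark = (SparkSession.builder
         .config("spark.sql.catalog.testcat",
                 "org.apache.spark.sql.connector.InMemoryTableCatalog")
         .getOrCreate())

# Tables in the test catalog can then be targeted from writeTo().
spark.range(10).writeTo("testcat.ns.table").create()
```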