
Conversation

zero323 (Member) commented Jan 23, 2020

What changes were proposed in this pull request?

  • Adds `DataFrameWriterV2` class.
  • Adds `writeTo` method to `pyspark.sql.DataFrame`.
  • Adds related SQL partitioning functions (`years`, `months`, ..., `bucket`). A usage sketch follows below.
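
A minimal sketch of the intended usage (the catalog, table, and column names are placeholders; the target catalog must support the DataSource V2 catalog API):

```python
from pyspark.sql.functions import years, bucket

# Placeholder identifier: "testcat.db.events" must resolve to a
# catalog that implements the DSv2 TableCatalog API.
(df.writeTo("testcat.db.events")
    .using("parquet")
    .partitionedBy(years("ts"), bucket(16, "id"))
    .createOrReplace())
```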

Why are the changes needed?

Feature parity.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

TODO: Should we test against org.apache.spark.sql.connector.InMemoryTableCatalog? If so, how to expose it in Python tests?
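
One hypothetical approach, assuming the test-scope catalog class can be put on the Python test classpath (which is exactly the open question above), would be to register it via configuration:

```python
# Hypothetical sketch: register the JVM test catalog under a name and
# write to it through the v2 API; requires the test jar on the classpath.
spark.conf.set(
    "spark.sql.catalog.testcat",
    "org.apache.spark.sql.connector.InMemoryTableCatalog")
df.writeTo("testcat.test_table").create()
```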

SparkQA commented Jan 23, 2020

Test build #117270 has finished for PR 27331 at commit 81fac11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

SparkQA commented Jan 23, 2020

Test build #117307 has finished for PR 27331 at commit 7a1aa6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

@zero323 zero323 closed this Jan 25, 2020
@zero323 zero323 deleted the SPARK-29157 branch January 25, 2020 00:01
@zero323 zero323 restored the SPARK-29157 branch January 25, 2020 00:01
@zero323 zero323 reopened this Jan 25, 2020
zero323 (Member, Author) commented Jan 26, 2020

Waiting for resolution of the dev-list discussion: Revert and revisit the public custom expression API for partition (a.k.a. Transform API).

@zero323 zero323 force-pushed the SPARK-29157 branch 2 times, most recently from 41f4e18 to 8de4978 Compare February 4, 2020 15:41
SparkQA commented Feb 4, 2020

Test build #117849 has finished for PR 27331 at commit 8de4978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

zero323 (Member, Author) commented Mar 3, 2020

@HyukjinKwon Glancing over the discussion, it doesn't seem like the upstream feature is going to be reverted, does it?

HyukjinKwon (Member) commented Mar 3, 2020

No, it seems not. But I am not very sure we should expose these APIs, considering that DSv2 is still under development and incomplete. These APIs are considered exceptions in terms of compatibility across versions, and they are unstable at the moment.

HyukjinKwon (Member) commented:

WDYT @rdblue, @dbtsai, @cloud-fan, @brkyvz?

zero323 (Member, Author) commented Mar 4, 2020

> No, it seems not. But I am not very sure we should expose these APIs, considering that DSv2 is still under development and incomplete. These APIs are not considered exceptions, and they are unstable at the moment.

Makes sense.

rdblue (Contributor) commented Mar 4, 2020

I think it's a good idea to keep Python up to date with the Scala API. Thanks for fixing this, @zero323!

If there is a stated strategy that prevents us from merging this, then I'm fine waiting for now. But if we don't have an existing policy to avoid adding experimental APIs to Python, I think we should add it.

SparkQA commented Mar 4, 2020

Test build #119264 has finished for PR 27331 at commit ac9eab4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

HyukjinKwon (Member) commented:

I am okay with adding it in PySpark. I just wanted to make sure whether we're going to change the DSv2 API shape more or not. If that's expected, let's add them later to reduce dev overhead, as new APIs should target 3.1 anyway. If it's expected not to change, I am good with adding it.

@zero323 zero323 changed the title [WIP][SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API Mar 4, 2020
zero323 (Member, Author) commented Mar 4, 2020

> I am okay with adding it in PySpark. I just wanted to make sure whether we're going to change the DSv2 API shape more or not. If that's expected, let's add them later to reduce dev overhead, as new APIs should target 3.1 anyway. If it's expected not to change, I am good with adding it.

I don't have a strong opinion here, but I guess one consideration is how much keeping this out of PySpark limits the feedback that can be gathered and used to drive the evolution of the API.

SparkQA commented Jun 20, 2020

Test build #124325 has finished for PR 27331 at commit 24ec8f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) commented Jun 21, 2020

@rdblue WDYT? I don't know the details of its stability here. Does it look to you like the APIs are not going to change soon? Just a rough estimate is fine.

rdblue (Contributor) commented Jun 22, 2020

As I said at the time, I think this should have been merged.

HyukjinKwon (Member) commented Jun 23, 2020

Sorry, there's no policy to block this, but I also don't think it's practically a good idea to merge if it's expected to change.

Once you have PySpark and SparkR APIs, you have to fix all of them together, tests included, every time you change the API, and I don't believe all developers are used to all the languages, which is overhead. That's why, as an example, we have a bunch of inconsistencies between the SQL function APIs across languages.

@rdblue, can I ask for a rough estimate of the stability?

zero323 (Member, Author) commented Jun 24, 2020

> it's a good idea to merge if it's expected to change.

Just my 2 cents ‒ if only APIs that have stabilized are exposed, then Python users, who, if I recall correctly, make up around half of the whole user base, are essentially excluded from the shaping and testing process, aren't they? That's a huge issue.

```python
        return self

    @since(3.1)
    def partitionedBy(self, col, *cols):
```
HyukjinKwon (Member) commented Jun 25, 2020

Maybe it's important to describe what is expected for col: only column references and the partition transform functions are allowed, not regular Spark Column expressions.

I still don't like that we made this API look like it takes regular Spark Columns: they are mutually exclusive, given the last discussion on the dev mailing list. This was one of the reasons why Pandas UDFs were redesigned and split into two separate groups. Let's at least clarify it.
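
For example (a sketch with hypothetical table and column names):

```python
from pyspark.sql.functions import col, days

writer = df.writeTo("testcat.db.events")  # hypothetical identifier

# Allowed: simple column references (identity partitioning) and the
# dedicated partition transform functions (years, months, days, hours, bucket).
writer.partitionedBy(col("region"), days("ts"))

# Not allowed: arbitrary Column expressions, e.g.
#   writer.partitionedBy(col("ts") + 1)
# which is rejected as an invalid partition transformation.
```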

HyukjinKwon (Member) commented Jun 25, 2020

@rdblue, @brkyvz, @cloud-fan, should we maybe at least use a different class for these partition column expressions, such as PartitioningColumn, like we do for TypedColumn, and add asPartitioningColumn to Column?

I remember we basically wanted to remove these partitioning-specific expressions in [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API) if we find a better way to do it.

I suspect a PartitioningColumn is okay as a temporary solution (?) because we can guard it by typing, and we can move these partitioning-specific expressions into a separate package. I think this at least makes them distinguishable. I can work on it as well if this sounds fine to you guys. A rough sketch of the idea follows below.
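
A purely illustrative Python analogue of that suggestion (nothing here exists in this PR; `PartitioningColumn` and the stub class are hypothetical): a dedicated wrapper type would let `partitionedBy` guard its inputs by typing instead of accepting any `Column`.

```python
class PartitioningColumn(object):
    """Hypothetical wrapper marking a value as a partition transform,
    analogous to how TypedColumn is distinguished from Column in Scala."""
    def __init__(self, jcol):
        self._jcol = jcol


class DataFrameWriterV2Sketch(object):
    """Illustrative stub, not the class added by this PR."""
    def partitionedBy(self, col, *cols):
        # Guard by typing: only column names and explicit partition
        # transforms pass; arbitrary expressions are rejected up front.
        for c in (col,) + cols:
            if not isinstance(c, (str, PartitioningColumn)):
                raise TypeError(
                    "expected a column name or PartitioningColumn, "
                    "got %s" % type(c).__name__)
        return self
```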

rdblue (Contributor) commented:

I don't see the need for a separation here that doesn't exist in Scala.

HyukjinKwon (Member) commented:

@rdblue, I don't mean that we should do that here. I mean to suggest/discuss making the separation in Scala first, because the confusion propagates to the PySpark API side as well.

They are different things, so I am suggesting making them different. I hope we can focus more on the discussion itself.

rdblue (Contributor) commented Jun 25, 2020

> nobody knows the answer about the stability

Short version: I don't think it is relevant; on its merits, I expect the API to be stable; I don't know what other people will do.

I haven't replied because I don't see how it is an important concern. An experimental API in Scala can be an experimental API in Python. There's no strategy about not porting APIs that might change. And, it's easier to maintain compatibility on the Python side. That's why I think it should be merged, regardless of the question of stability.

But if stability matters to you, I'll quickly try to address it.

  • Stability in an API is hard to judge without real-world use. That's why we wait some amount of time, instead of just declaring it stable right away. I think this is very likely to be stable because it is a translation of the underlying SQL primitives, which are stable. But...
  • I am not in control of the changes. Maintaining an API as stable is a choice; the previous DataFrameWriter is a poor API, but has been maintained as stable. In contrast, the 2.4 DSv2 API was good enough, but rewritten. I can't say what other members of the community might try to change in this API.

HyukjinKwon (Member) commented Jun 26, 2020

> I haven't replied because I don't see how it is an important concern.

@rdblue, I explained multiple times why I think this is relevant and important: once you add these APIs, you have to fix them on the Python and R side too. I don't believe all developers are used to the Python and R sides, given my interactions over many years in Spark dev.
I support adding it for 3.1, but not now at this early stage if it's unstable. As I explained earlier, I take this DSv2 case as an exceptional case. See the concern in #27331 (comment) too.

This isn't a great way to discuss, when you ignore something because you don't think it's important or relevant.

I just wanted to know the rough picture rather than asking you to assert the stability here, because you are the one who drove DSv2 in the community, and I do believe you're the right one to ask. I fully understand that things can change.

I am here to help and make progress rather than to nitpick or assign blame for something not done. I fully understand the pain we had with DSv2. It would be nicer if we could be more cooperative next time.

rdblue (Contributor) commented Jun 26, 2020

> This isn't a great way to discuss

To clarify, this isn't a discussion. There isn't more for me to say since I've already added my perspective.

```python
        self.mode(mode)._jwrite.jdbc(url, table, jprop)


class DataFrameWriterV2(object):
```
A contributor commented:

shall we move it to a new file?

zero323 (Member, Author) commented:

Sure, if you think that's a better approach. I don't have a strong opinion, though the feature is small and unlikely to be used directly.

HyukjinKwon (Member) commented:

Okay, @zero323, can you address the comments except #27331 (comment)? Let's just merge this one. I will make a PR myself to fix the things I pointed out.

SparkQA commented Jul 19, 2020

Test build #126123 has finished for PR 27331 at commit 90ddbcc.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2020

Test build #126122 has finished for PR 27331 at commit c8fe7e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2020

Test build #126124 has finished for PR 27331 at commit 3093c35.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2020

Test build #126131 has finished for PR 27331 at commit 9197c84.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) commented:

I discussed this offline with @rdblue, @cloud-fan and @brkyvz. I will take a look myself and try to make a fix soon.
Thanks for working on this, @zero323, and thanks for bearing with me here, guys.

HyukjinKwon (Member) commented:

@rdblue, @brkyvz, @cloud-fan, I am merging this since I am going to make a follow-up soon. Let me know if there are any more comments here; I will address them in the follow-up.

HyukjinKwon (Member) left a comment:

LGTM

HyukjinKwon (Member) commented:

Merged to master.

@zero323 zero323 deleted the SPARK-29157 branch July 20, 2020 05:51
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020

Closes apache#27331 from zero323/SPARK-29157.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>