[SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API #27331
@@ -18,7 +18,7 @@
from py4j.java_gateway import JavaClass

from pyspark import RDD, since
-from pyspark.sql.column import _to_seq
+from pyspark.sql.column import _to_seq, _to_java_column
from pyspark.sql.types import *
from pyspark.sql import utils
from pyspark.sql.utils import to_str


@@ -1075,6 +1075,145 @@ def jdbc(self, url, table, mode=None, properties=None):
        self.mode(mode)._jwrite.jdbc(url, table, jprop)

class DataFrameWriterV2(object):

Contributor: shall we move it to a new file?

Member (Author): If you think that's a better approach. I don't have a strong opinion, though the feature is small and unlikely to be used directly.
    """
    Interface used to write a :class:`pyspark.sql.dataframe.DataFrame`
    to external storage using the v2 API.

    .. versionadded:: 3.1.0
    """

    def __init__(self, df, table):
        self._df = df
        self._spark = df.sql_ctx
        self._jwriter = df._jdf.writeTo(table)
    @since(3.1)
    def using(self, provider):
        """
        Specifies a provider for the underlying output data source.
        Spark's default catalog supports "parquet", "json", etc.
        """
        self._jwriter.using(provider)
        return self
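As a quick illustration of the fluent chain (a sketch, not part of the diff: it assumes an active SparkSession bound to `spark`, the `DataFrame.writeTo` entry point added elsewhere in this PR, and a purely hypothetical table name):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)

    # Each builder method returns self, so configuration chains
    # until a terminal action such as create() is called.
    df.writeTo("prod.db.events").using("parquet").create()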
    @since(3.1)
    def option(self, key, value):
        """
        Add a write option.
        """
        self._jwriter.option(key, to_str(value))
        return self
    @since(3.1)
    def options(self, **options):
        """
        Add write options.
        """
        options = {k: to_str(v) for k, v in options.items()}
        self._jwriter.options(options)
        return self
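For illustration, a short sketch of the two option forms (the option keys below are examples only; which keys are meaningful depends on the data source). Both paths coerce values through `to_str`, so Python booleans become the strings "true"/"false":

    (df.writeTo("prod.db.events")
       .using("parquet")
       .option("compression", "snappy")    # single option
       .options(someFlag=True, retries=3)  # hypothetical keys; True -> "true", 3 -> "3"
       .createOrReplace())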
    @since(3.1)
    def tableProperty(self, property, value):
        """
        Add a table property.
        """
        self._jwriter.tableProperty(property, value)
        return self
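A hypothetical sketch of setting a property (property semantics are catalog-specific; `comment` is used here only as an illustrative, commonly reserved key):

    (df.writeTo("prod.db.events")
       .using("parquet")
       .tableProperty("comment", "hourly event ingest")  # illustrative property
       .create())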
    @since(3.1)
    def partitionedBy(self, col, *cols):

Member: Maybe it's important to describe what is expected for `col` and `cols`. I still don't like that we made this API look like it takes regular Spark Columns - they are mutually exclusive. Given the last discussion on the dev mailing list, this was one of the reasons why Pandas UDFs were redesigned and separated into two groups ... let's at least clarify it.

Member: @rdblue, @brkyvz, @cloud-fan, should we maybe at least use a different class for these partition column expressions, such as ... I remember we basically want to remove these partitioning-specific expressions at [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API). I suspect doing ...

Contributor: I don't see the need for separation here that doesn't exist in Scala.

Member: @rdblue, I don't mean that we should do that here. I mean to suggest/discuss making the separation in Scala first, because the current design propagates the confusion to the PySpark API side as well. They are different things, so I am suggesting making them different. I hope we can focus more on the discussion itself.
        """
        Partition the output table created by `create`, `createOrReplace`, or `replace` using
        the given columns or transforms.

        When specified, the table data will be stored by these values for efficient reads.

        For example, when a table is partitioned by day, it may be stored
        in a directory layout like:

        * `table/day=2019-06-01/`
        * `table/day=2019-06-02/`

        Partitioning is one of the most widely used techniques to optimize physical data layout.
        It provides a coarse-grained index for skipping unnecessary data reads when queries have
        predicates on the partitioned columns. In order for partitioning to work well, the number
        of distinct values in each column should typically be less than tens of thousands.

        `col` and `cols` support only the following functions:

        * :py:func:`pyspark.sql.functions.years`
        * :py:func:`pyspark.sql.functions.months`
        * :py:func:`pyspark.sql.functions.days`
        * :py:func:`pyspark.sql.functions.hours`
        * :py:func:`pyspark.sql.functions.bucket`
        """
        col = _to_java_column(col)
        cols = _to_seq(self._spark._sc, [_to_java_column(c) for c in cols])
        self._jwriter.partitionedBy(col, cols)
        return self
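To make the transform usage concrete, a small sketch using two of the functions listed above (it assumes a hypothetical `events` frame with `ts` and `user_id` columns):

    from pyspark.sql.functions import days, bucket

    (events.writeTo("prod.db.events")
       .using("parquet")
       .partitionedBy(days(events.ts), bucket(16, events.user_id))  # daily partitions, 16 hash buckets
       .create())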
    @since(3.1)
    def create(self):
        """
        Create a new table from the contents of the data frame.

        The new table's schema, partition layout, properties, and other configuration will be
        based on the configuration set on this writer.
        """
        self._jwriter.create()
    @since(3.1)
    def replace(self):
        """
        Replace an existing table with the contents of the data frame.

        The existing table's schema, partition layout, properties, and other configuration will be
        replaced with the contents of the data frame and the configuration set on this writer.
        """
        self._jwriter.replace()
    @since(3.1)
    def createOrReplace(self):
        """
        Create a new table or replace an existing table with the contents of the data frame.

        The output table's schema, partition layout, properties,
        and other configuration will be based on the contents of the data frame
        and the configuration set on this writer.
        If the table exists, its configuration and data will be replaced.
        """
        self._jwriter.createOrReplace()
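To summarize the three creation modes described in the docstrings above (a sketch; the table name is hypothetical): `create` errors if the table already exists, `replace` errors if it does not, and `createOrReplace` succeeds either way.

    writer = df.writeTo("prod.db.events").using("parquet")

    writer.create()             # errors if prod.db.events already exists
    # writer.replace()          # errors if prod.db.events does not exist
    # writer.createOrReplace()  # works either way, like CREATE OR REPLACE TABLE ... AS SELECT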
    @since(3.1)
    def append(self):
        """
        Append the contents of the data frame to the output table.
        """
        self._jwriter.append()
    @since(3.1)
    def overwrite(self, condition):
        """
        Overwrite rows matching the given filter condition with the contents of the data frame in
        the output table.
        """
        condition = _to_java_column(condition)
        self._jwriter.overwrite(condition)
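For example, replacing exactly one day of data (a sketch; the condition is an ordinary Column expression, and the table and column names are hypothetical):

    from pyspark.sql.functions import col

    # rows where day = 2019-06-01 are deleted, then the frame's contents are written
    df.writeTo("prod.db.events").overwrite(col("day") == "2019-06-01")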
    @since(3.1)
    def overwritePartitions(self):
        """
        Overwrite all partitions for which the data frame contains at least one row with the
        contents of the data frame in the output table.

        This operation is equivalent to Hive's `INSERT OVERWRITE ... PARTITION`, which replaces
        partitions dynamically depending on the contents of the data frame.
        """
        self._jwriter.overwritePartitions()
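A sketch of the dynamic behavior (assuming a hypothetical `incoming` frame and a target table partitioned by day): only the partitions present in the incoming frame are rewritten; every other partition is left untouched.

    # incoming holds rows only for day=2019-06-01 and day=2019-06-02,
    # so only those two partitions of prod.db.events are replaced
    incoming.writeTo("prod.db.events").overwritePartitions()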

def _test():
    import doctest
    import os