Conversation

@pan3793 (Member) commented May 29, 2023

This PR aims to make `IcebergSource` extend `SessionConfigSupport` to improve the Spark DataSource v2 API coverage.

/**
 * A mix-in interface for {@link TableProvider}. Data sources can implement this interface to
 * propagate session configs with the specified key-prefix to all data source operations in this
 * session.
 *
 * @since 3.0.0
 */
@Evolving
public interface SessionConfigSupport extends TableProvider {

  /**
   * Key prefix of the session configs to propagate, which is usually the data source name. Spark
   * will extract all session configs that starts with `spark.datasource.$keyPrefix`, turn
   * `spark.datasource.$keyPrefix.xxx -> yyy` into `xxx -> yyy`, and propagate them to all
   * data source operations in this session.
   */
  String keyPrefix();
}
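
For Iceberg, the change presumably boils down to implementing this interface on IcebergSource and returning the `iceberg` prefix (a sketch of the idea, not the exact diff):

// sketch: IcebergSource additionally implements SessionConfigSupport
@Override
public String keyPrefix() {
  return "iceberg";
}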

It allows setting read/write options via Spark session configuration when using the DataFrame API to read or write tables. For example:

// set a write option through the session configuration
spark.sql("SET spark.datasource.iceberg.<write-opt-key>=<value>")

// equivalent to the SET statement above
df.write
  .format("iceberg")
  .option("<write-opt-key>", "<value>")
  ...
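
The read path works the same way. A minimal sketch, assuming an existing SparkSession `spark` and a hypothetical table `db.table` (the snapshot id is made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// session-level read option, propagated to Iceberg reads in this session
spark.conf().set("spark.datasource.iceberg.snapshot-id", "10963874102873");

// equivalent to .option("snapshot-id", "10963874102873") on the reader
Dataset<Row> df = spark.read().format("iceberg").load("db.table");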

@dramaticlly (Contributor) left a comment

LGTM, excited to use SQL syntax to influence Spark write options. I am curious whether something similar can be applied to DELETE and MERGE INTO, which today can only be done via SQL in Iceberg.

@pan3793 (Member, Author) commented Jun 1, 2023

@dramaticlly thanks, I refined the test code.

@pan3793 (Member, Author) commented Jun 6, 2023

Kindly ping @RussellSpitzer @aokolnychyi @rdblue

@github-actions bot commented Sep 2, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 2, 2024
@github-actions bot commented Sep 9, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@szehon-ho (Member) left a comment

This looks good to me; small suggestion for the test, let me know what you think.


withSQLConf(
    // set write option through session configuration
    ImmutableMap.of("spark.datasource.iceberg.overwrite-mode", "dynamic"),
@szehon-ho (Member)

Suggestion: WDYT about testing with snapshot-property and asserting that it's set explicitly? It may make the test a bit clearer without the reader needing to understand what overwrite-mode is.

@pan3793 (Member, Author) Nov 18, 2024

@szehon-ho Hmm.. sorry I don't get your point.

Let me explain my idea briefly: the test case should cover both the read and write paths (a rough sketch follows the list):

  1. create a table, write some data into the table, and record the snapshot as s1
  2. overwrite the table with dynamic overwrite mode (test setting write options through session conf) and check the current snapshot of the table
  3. read the table from the snapshot s1 (test setting read options through session conf) and check the data
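
A rough, untested sketch of those three steps, using plain Spark APIs rather than the project's test helpers (the table name local.db.t is hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// 1. create a partitioned table, write some data, and remember the first snapshot as s1
spark.sql("CREATE TABLE local.db.t (id BIGINT, data STRING) USING iceberg PARTITIONED BY (data)");
spark.sql("INSERT INTO local.db.t VALUES (1, 'a'), (2, 'b')");
long s1 = spark.sql("SELECT snapshot_id FROM local.db.t.snapshots ORDER BY committed_at")
    .first().getLong(0);

// 2. write path: the dynamic overwrite mode comes from the session conf, not an explicit option
spark.conf().set("spark.datasource.iceberg.overwrite-mode", "dynamic");
spark.sql("SELECT 3L AS id, 'b' AS data")
    .write().format("iceberg").mode("overwrite").save("local.db.t");
// ...assert on the table's current snapshot

// 3. read path: select snapshot s1 through the session conf and check the original data comes back
spark.conf().set("spark.datasource.iceberg.snapshot-id", String.valueOf(s1));
Dataset<Row> atS1 = spark.read().format("iceberg").load("local.db.t");
// ...assert that atS1 contains only the two original rows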

@szehon-ho (Member) Nov 18, 2024

Yea sorry, I was not clear. It's just a suggestion.

I think, because the test only exercises the SessionConfigSupport functionality, it may be clearer for the reader if the first check (on the write part) is something like:

'spark.datasource.iceberg.snapshot-property.foo=bar'
and then checking whether foo is set on the latest snapshot summary?

Because I think the reader of the test needs to know what 'dynamic overwrite' mode is to understand the assertion (it's not related to the feature), whereas the above is a bit more self-explanatory, IMO.

I think the read part is decently understandable without additional context.
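
A sketch of what that could look like in the test, assuming a test method declared `throws Exception`, a hypothetical table local.db.t, and Spark3Util plus AssertJ available on the test classpath:

import static org.assertj.core.api.Assertions.assertThat;

import org.apache.iceberg.Table;
import org.apache.iceberg.spark.Spark3Util;

// the snapshot property comes from the session conf, not an explicit write option
spark.conf().set("spark.datasource.iceberg.snapshot-property.foo", "bar");
df.write().format("iceberg").mode("append").save("local.db.t");

// the custom property should show up in the latest snapshot's summary
Table table = Spark3Util.loadIcebergTable(spark, "local.db.t");
assertThat(table.currentSnapshot().summary()).containsEntry("foo", "bar");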

@pan3793 (Member, Author)

Thanks for the detailed description. I learned something new: I didn't know that Iceberg could add custom snapshot properties through options. I updated the test case to follow the suggestions.

@github-actions github-actions bot removed the stale label Nov 18, 2024
@pan3793 pan3793 force-pushed the SessionConfigSupport branch from 75d88fa to b21335a Compare November 18, 2024 15:40
@pan3793 (Member, Author) commented Nov 18, 2024

rebased on the latest main branch



* over any namespace resolution.
*/
-public class IcebergSource implements DataSourceRegister, SupportsCatalogOptions {
+public class IcebergSource
@szehon-ho (Member)

Another comment: there are now multiple ways to configure properties (including #4011), which may be confusing to users. It's worth adding documentation about it, listing the precedence, i.e.:

I guess, when using the DataFrame API (to be double-checked):

  • explicit dataframe option
  • dataframe session default
  • if table exists, explicit table option
  • if table exists, table default

@pan3793 (Member, Author)

I updated the docs and hope it's clear now.

@github-actions github-actions bot added the docs label Nov 19, 2024
### Write options

-Spark write options are passed when configuring the DataFrameWriter, like this:
+Spark write options are passed when configuring the DataFrameWriterV2, like this:
@pan3793 (Member, Author)

I replaced the example with DataFrameWriterV2 because:

The v1 DataFrame `write` API is still supported, but is not recommended.

@szehon-ho (Member) left a comment

Hi @pan3793, I think it looks good, but it may take a few iterations on the doc part. Should we leave that for another PR to make this one smaller? (We can get the functionality in first.)

| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | (none) | A timestamp in milliseconds to stream from; if before the oldest known ancestor snapshot, the oldest will be used |
Iceberg 1.8.0 and later support setting read options by Spark session configuration `spark.datasource.iceberg.<key>=<value>`
@szehon-ho (Member)

I think this is good, but I was also thinking of adding a section for priority, as mentioned.

@szehon-ho (Member)

This can be in its own section, like "session level configuration"?

when using DataFrame to read Iceberg tables, for example: `spark.datasource.iceberg.split-size=512m`, it has lower priority
than options explicitly passed to DataFrameReader.

| Spark option | Default | Description |
@szehon-ho (Member)

I think we can revert the change to this table?

@pan3793 (Member, Author)

it was auto-formatted by IDEA, reverted

@pan3793 pan3793 force-pushed the SessionConfigSupport branch from 362ccac to ea7a515 Compare November 21, 2024 09:14
@pan3793 (Member, Author) commented Nov 21, 2024

@szehon-ho WDYT of the current state? I kept the docs changes minimal in this patch.

@szehon-ho (Member) left a comment

Thanks @pan3793 for minimizing the changes. I think the doc can still be improved; I put in the comments.

But I think it'd be faster, if you want to get the code changes in first, to split the doc work into another PR.


.append()
```

Iceberg 1.8.0 and later support setting write options by Spark session configuration `spark.datasource.iceberg.<key>=<value>`
@szehon-ho (Member)

If we extract it to its own section, there's no need to repeat it?

@pan3793 (Member, Author)

I wrote it here because it's under "Write options". Actually, Spark has several concepts that let formats/extensions control behavior, i.e. table properties, session configurations, and options.

.table("catalog.db.table")
```

Iceberg 1.8.0 and later support setting read options by Spark session configuration `spark.datasource.iceberg.<key>=<value>`
@szehon-ho (Member) Nov 21, 2024

I still think we need a new section, like 'Configuration Priority', where we can explain the order of precedence:
DataFrame Writes:

  • explicit dataframeWriter option
  • dataframe session default
  • if table exists, explicit table option
  • if table exists, table default

DataFrame Reads:

  • explicit dataFrameReader option
  • dataframe session default
  • if table exists, explicit table option
  • if table exists, table default

(please double check)

@pan3793 (Member, Author)

I hesitate to write such a section because the situation is more complex; some configurations can also be set by a dedicated session configuration, for example:

  public boolean localityEnabled() {
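    // precedence here: explicit read option first, then the dedicated session conf, then the computed default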
    boolean defaultValue = Util.mayHaveBlockLocations(table.io(), table.location());
    return confParser
        .booleanConf()
        .option(SparkReadOptions.LOCALITY)
        .sessionConf(SparkSQLProperties.LOCALITY)
        .defaultValue(defaultValue)
        .parse();
  }

@pan3793 (Member, Author) commented Nov 22, 2024

@szehon-ho I made a minor change to the assertion statement after your approval, and also created two backport PRs for Spark 3.3 and 3.5. Thanks for your detailed review.

@szehon-ho (Member)

Sure, thanks, it's a good catch; assertThat is better.

@szehon-ho szehon-ho merged commit 9cc13b1 into apache:main Nov 23, 2024
31 checks passed
@szehon-ho (Member)

Merged, thanks @pan3793

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024