Conversation

@pan3793 (Member) commented May 30, 2023

This PR aims to allow users to control whether locality is enabled on reads through the session conf spark.sql.iceberg.locality.enabled.

Previously, it was enabled by default for HDFS and could be disabled by setting a read option, but that is not friendly for SQL use cases.

As described in #2577, the locality calculation may put significant pressure on the NameNode. After this PR, users can conveniently disable it by setting spark.sql.iceberg.locality.enabled=false.
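
For illustration, a minimal usage sketch in the Java API (assumes a local SparkSession; only the conf key itself comes from this PR):

    import org.apache.spark.sql.SparkSession;

    // Disable locality planning for every subsequent Iceberg read in this session.
    SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
    spark.conf().set("spark.sql.iceberg.locality.enabled", "false");
    // SQL equivalent: SET spark.sql.iceberg.locality.enabled=false;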

@github-actions github-actions bot added the spark label May 30, 2023
@pan3793 (Member, Author) commented May 30, 2023

The change should be straightforward. No test is added because I could not find an existing test for locality in the Spark module.

@advancedxy (Contributor) commented:

If #7732 is merged, is this new setting spark.sql.iceberg.locality.enabled still needed?
I think you can disable locality by setting spark.datasource.iceberg.locality=false?

@pan3793 (Member, Author) commented May 30, 2023

> If #7732 is merged, is this new setting spark.sql.iceberg.locality.enabled still needed? I think you can disable locality by setting spark.datasource.iceberg.locality=false?

This is needed. SessionConfigSupport is a mix-in interface of TableProvider; it only takes effect for Iceberg tables loaded through the table provider, not for tables loaded through a catalog.
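
For context, a simplified sketch of how SessionConfigSupport hooks in. This is illustrative only, not Iceberg's actual source; the class name is made up:

    import java.util.Map;
    import org.apache.spark.sql.connector.catalog.SessionConfigSupport;
    import org.apache.spark.sql.connector.catalog.Table;
    import org.apache.spark.sql.connector.catalog.TableProvider;
    import org.apache.spark.sql.connector.expressions.Transform;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    // Illustrative stand-in for Iceberg's source class, not the real implementation.
    public class ExampleIcebergSource implements TableProvider, SessionConfigSupport {

      // Spark copies session confs named spark.datasource.<keyPrefix>.* into the
      // read/write options, but only when the table is resolved through this
      // TableProvider (format("iceberg") / USING iceberg), never via a CatalogPlugin.
      @Override
      public String keyPrefix() {
        return "iceberg";
      }

      @Override
      public StructType inferSchema(CaseInsensitiveStringMap options) {
        throw new UnsupportedOperationException("stub; schema inference omitted");
      }

      @Override
      public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
        throw new UnsupportedOperationException("stub; table loading omitted");
      }
    }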

@advancedxy (Contributor) commented:

> If #7732 is merged, is this new setting spark.sql.iceberg.locality.enabled still needed? I think you can disable locality by setting spark.datasource.iceberg.locality=false?

> This is needed. SessionConfigSupport is a mix-in interface of TableProvider; it only takes effect for Iceberg tables loaded through the table provider, not for tables loaded through a catalog.

Ah, yeah. The options are not effective for tables loaded through a catalog. But it may confuse users, for example:

set spark.datasource.iceberg.locality=false; -- works only for DataFrame
set spark.sql.iceberg.locality.enabled=false; -- works both DataFrame and Spark SQL

I was thinking maybe we could inject a new ResolutionRule in IcebergSparkSessionExtensions that collects session configurations (including spark.datasource.iceberg.xxx and spark.sql.iceberg.xxx values) and sets options on Iceberg tables. That way, configurations would be unified for DataFrame and catalog tables. WDYT?

@pan3793 (Member, Author) commented May 30, 2023

> set spark.datasource.iceberg.locality=false; -- works only for DataFrame
> set spark.sql.iceberg.locality.enabled=false; -- works both DataFrame and Spark SQL

That division is not accurate; the key point here is whether the table is loaded through TableProvider or CatalogPlugin.

  1. Tables loaded through TableProvider, examples:
     • DataFrame cases:
       spark.read.format("iceberg").xxx
       df.write.format("iceberg").xxx
     • SQL cases:
       create table t_iceberg (...) using iceberg;
       select ... from t_iceberg;
       insert into t_iceberg select ...;
  2. Tables loaded through CatalogPlugin (assume the iceberg catalog is set up properly), examples:
     • DataFrame cases:
       spark.table("iceberg.db.tbl")...
       df.writeTo("iceberg.db.tbl")...
     • SQL cases:
       select ... from iceberg.db.tbl;
       insert into iceberg.db.tbl select ...;

@pan3793 (Member, Author) commented May 30, 2023

> I was thinking maybe we could inject a new ResolutionRule in IcebergSparkSessionExtensions that collects session configurations (including spark.datasource.iceberg.xxx and spark.sql.iceberg.xxx values) and sets options on Iceberg tables. That way, configurations would be unified for DataFrame and catalog tables. WDYT?

In my experience, allowing the user to control some behaviors by using SET xxx=yyy is fantastic, but it seems that Iceberg only allows a small set of configurations to be overridden by SQL session configuration. So the questions here are:

  • Are there principles for which kinds of configurations should be exposed to the session conf?
  • Since we are building a DataSource on the Spark DSv2 API, and the API has limitations that prevent overriding options through SQL syntax, I would rather respect the API design, or promote the API change upstream, than do such a hacky thing.

@advancedxy (Contributor) commented:

> That division is not accurate; the key point here is whether the table is loaded through TableProvider or CatalogPlugin. [...]

Thanks for the detailed explanation. By DataFrame I meant the spark.read.format and df.write.format cases; the create table t_iceberg (...) using iceberg case didn't occur to me, but it is indeed loaded by TableProvider. Anyway, you get my point: tables loaded through TableProvider and through CatalogPlugin don't have unified configuration settings.

@advancedxy (Contributor) commented:

> In my experience, allowing the user to control some behaviors by using SET xxx=yyy is fantastic

Yes, it gives users great flexibility.

> Are there principles for which kinds of configurations should be exposed to the session conf?

If #7732 is merged, then all the options in SparkReadOptions and SparkWriteOptions would be exposed to the session conf?

> Since we are building a DataSource on the Spark DSv2 API, and the API has limitations that prevent overriding options through SQL syntax, I would rather respect the API design, or promote the API change upstream, than do such a hacky thing.

I don't think there are any limitations on overriding options using SQL syntax; it's just that Spark doesn't pass any options when loading a table through a CatalogPlugin. It would be great if such a change were promoted on the Spark side; however, it may take some time for that kind of change to be accepted.
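
To make the contrast concrete, a hedged sketch using the Java API (the catalog name iceberg_cat and the path are made up):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().getOrCreate();

    // TableProvider path: per-read options are forwarded to the source.
    Dataset<Row> viaProvider = spark.read()
        .format("iceberg")
        .option("locality", "false")
        .load("/warehouse/db/tbl");

    // CatalogPlugin path: there is no option hook; Spark resolves the table
    // through the catalog without forwarding any per-read options.
    Dataset<Row> viaCatalog = spark.table("iceberg_cat.db.tbl");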

@pan3793 (Member, Author) commented Jun 1, 2023

@wypoon (Contributor) left a comment:

This is indeed a useful configuration to have. We have also found a need for it when using Iceberg on HDFS, where locality is enabled by default.
I'd like to see it in spark/v3.3 as well.

@pan3793 (Member, Author) commented Jul 19, 2023

@rdblue any chance to review this PR? It and the related PRs were discussed on the mailing list for a while.

@wypoon it's easy to backport to older versions, as long as the community reaches a consensus on exposing it through the SQL conf.

Comment on lines +76 to +81
    return confParser
        .booleanConf()
        .option(SparkReadOptions.LOCALITY)
        .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
        .defaultValue(defaultValue)
        .parse();

Contributor commented:

On further thought, I think it would be better to do

    if (defaultValue) {
      return confParser
          .booleanConf()
          .option(SparkReadOptions.LOCALITY)
          .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
          .defaultValue(true)
          .parse();
    } else {
      return false;
    }

If defaultValue is false, then locality should be disabled, regardless of the option or session conf, as block locations will not be available.

Member commented:

Is there any issue with passing through "true" when that has no effect?

Member commented:

I would probably rename "defaultValue" to "hasBlockLocations" or "canReadLocal" or something.

Contributor commented:

If SparkReadConf#localityEnabled returns true even though the table is not in HDFS, then Iceberg will try to get the block locations. I am not sure about this since I haven't tested it, but I think the HDFS call could throw an IOException, which Iceberg would rethrow. That would be a bad thing.

Contributor commented:

Actually, maybe it wouldn't throw an exception. But the default implementation of FileSystem#getFileBlockLocations does:

    String[] name = new String[]{"localhost:50010"};
    String[] host = new String[]{"localhost"};
    return new BlockLocation[]{new BlockLocation(name, host, 0L, file.getLen())};

and from that, we get a String[] of the hosts, and that probably won't be too helpful to Spark.
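
A hedged sketch of what fetching those locations looks like against the Hadoop API (the helper name preferredHosts is made up):

    import java.io.IOException;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;

    // Hypothetical helper: hosts reported for the first block of a file. With the
    // default FileSystem implementation quoted above, this yields only
    // {"localhost"}, which gives Spark no useful locality information.
    static String[] preferredHosts(FileSystem fs, FileStatus file) throws IOException {
      BlockLocation[] blocks = fs.getFileBlockLocations(file, 0L, file.getLen());
      if (blocks.length == 0) {
        return new String[0];
      }
      return blocks[0].getHosts();
    }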

Contributor commented:

+1 to hasBlockLocations.
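
Putting the two suggestions together, the reviewed snippet might read roughly like this (a sketch, not the merged code; it reuses the confParser chain from the diff above):

    // Early-return when block locations are unavailable, with the parameter
    // renamed per the review; locality can never help without block locations.
    private boolean localityEnabled(boolean hasBlockLocations) {
      if (!hasBlockLocations) {
        return false;
      }
      return confParser
          .booleanConf()
          .option(SparkReadOptions.LOCALITY)
          .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
          .defaultValue(true)
          .parse();
    }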

// Controls whether vectorized reads are enabled
public static final String VECTORIZATION_ENABLED = "spark.sql.iceberg.vectorization.enabled";

// Controls whether locality reads are enabled
Member commented:

Can we add a note that this only applies to Hadoop FileIO?
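
A possible wording (a sketch; the constant and its value are the ones this PR adds):

    // Controls whether locality reads are enabled; this only applies to tables
    // backed by Hadoop FileIO (e.g. files on HDFS), since other FileIO
    // implementations do not expose block locations.
    public static final String LOCALITY_ENABLED = "spark.sql.iceberg.locality.enabled";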

@github-actions bot commented Sep 2, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 2, 2024
@github-actions bot commented Sep 9, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 9, 2024