Conversation

@pan3793 (Member) commented May 30, 2023

This PR aims to allow users to control whether locality is enabled on reads through the session conf spark.sql.iceberg.locality.enabled.

Previously, it was enabled by default for HDFS and could be disabled by setting a read option, but that is not friendly for SQL use cases.

As described in #2577, the locality calculation may put significant pressure on the NameNode. After this PR, users can conveniently disable it by setting spark.sql.iceberg.locality.enabled=false.
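
For illustration, a minimal usage sketch in the Java API (assumes a local SparkSession; only the conf key itself comes from this PR):

    import org.apache.spark.sql.SparkSession;

    // Disable locality planning for every subsequent Iceberg read in this session.
    SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
    spark.conf().set("spark.sql.iceberg.locality.enabled", "false");
    // SQL equivalent: SET spark.sql.iceberg.locality.enabled=false;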

@github-actions github-actions bot added the spark label May 30, 2023
@pan3793 (Member, Author) commented May 30, 2023

The change should be straightforward. No test is added because I could not find an existing test for locality in the Spark module.

@advancedxy (Contributor) commented:

If #7732 is merged, is this new setting spark.sql.iceberg.locality.enabled still needed?
I think you can disable locality by setting spark.datasource.iceberg.locality=false?

@pan3793 (Member, Author) commented May 30, 2023

> If #7732 is merged, is this new setting spark.sql.iceberg.locality.enabled still needed? I think you can disable locality by setting spark.datasource.iceberg.locality=false?

This is needed. SessionConfigSupport is a mix-in interface of TableProvider; it only takes effect for Iceberg tables loaded through the table provider, not for tables loaded through a catalog.
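
For context, a simplified sketch of how SessionConfigSupport hooks in. This is illustrative only, not Iceberg's actual source; the class name is made up:

    import java.util.Map;
    import org.apache.spark.sql.connector.catalog.SessionConfigSupport;
    import org.apache.spark.sql.connector.catalog.Table;
    import org.apache.spark.sql.connector.catalog.TableProvider;
    import org.apache.spark.sql.connector.expressions.Transform;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    // Illustrative stand-in for Iceberg's source class, not the real implementation.
    public class ExampleIcebergSource implements TableProvider, SessionConfigSupport {

      // Spark copies session confs named spark.datasource.<keyPrefix>.* into the
      // read/write options, but only when the table is resolved through this
      // TableProvider (format("iceberg") / USING iceberg), never via a CatalogPlugin.
      @Override
      public String keyPrefix() {
        return "iceberg";
      }

      @Override
      public StructType inferSchema(CaseInsensitiveStringMap options) {
        throw new UnsupportedOperationException("stub; schema inference omitted");
      }

      @Override
      public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
        throw new UnsupportedOperationException("stub; table loading omitted");
      }
    }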

@advancedxy (Contributor) commented:

> If #7732 is merged, is this new setting spark.sql.iceberg.locality.enabled still needed? I think you can disable locality by setting spark.datasource.iceberg.locality=false?

> This is needed. SessionConfigSupport is a mix-in interface of TableProvider; it only takes effect for Iceberg tables loaded through the table provider, not for tables loaded through a catalog.

Ah, yeah. The options are not effective for tables loaded through a catalog. But it may confuse users, for example:

set spark.datasource.iceberg.locality=false; -- works only for DataFrame
set spark.sql.iceberg.locality.enabled=false; -- works both DataFrame and Spark SQL

I was thinking maybe we could inject a new ResolutionRule in IcebergSparkSessionExtensions that collects session configurations (including spark.datasource.iceberg.xxx and spark.sql.iceberg.xxx values) and sets options on Iceberg tables. That way, configurations would be unified for DataFrame and catalog tables. WDYT?

@pan3793 (Member, Author) commented May 30, 2023

> set spark.datasource.iceberg.locality=false; -- works only for DataFrame
> set spark.sql.iceberg.locality.enabled=false; -- works both DataFrame and Spark SQL

That division is not accurate; the key point here is whether the table is loaded through TableProvider or CatalogPlugin.

  1. Tables loaded through TableProvider, examples:
     • DataFrame cases:
       spark.read.format("iceberg").xxx
       df.write.format("iceberg").xxx
     • SQL cases:
       create table t_iceberg (...) using iceberg;
       select ... from t_iceberg;
       insert into t_iceberg select ...;
  2. Tables loaded through CatalogPlugin (assume the iceberg catalog is set up properly), examples:
     • DataFrame cases:
       spark.table("iceberg.db.tbl")...
       df.writeTo("iceberg.db.tbl")...
     • SQL cases:
       select ... from iceberg.db.tbl;
       insert into iceberg.db.tbl select ...;

@pan3793 (Member, Author) commented May 30, 2023

> I was thinking maybe we could inject a new ResolutionRule in IcebergSparkSessionExtensions that collects session configurations (including spark.datasource.iceberg.xxx and spark.sql.iceberg.xxx values) and sets options on Iceberg tables. That way, configurations would be unified for DataFrame and catalog tables. WDYT?

In my experience, allowing the user to control some behaviors by using SET xxx=yyy is fantastic, but it seems that Iceberg only allows a small set of configurations to be overridden by SQL session configuration. So the questions here are:

  • Are there principles for which kinds of configurations should be exposed to the session conf?
  • Since we are building a DataSource on the Spark DSv2 API, and the API has limitations that prevent overriding options through SQL syntax, I would rather respect the API design, or promote the API change upstream, than do such a hacky thing.

@advancedxy (Contributor) commented:

> That division is not accurate; the key point here is whether the table is loaded through TableProvider or CatalogPlugin. [...]

Thanks for the detailed explanation. By DataFrame I meant the spark.read.format and df.write.format cases; the create table t_iceberg (...) using iceberg case didn't occur to me, but it is indeed loaded by TableProvider. Anyway, you get my point: tables loaded through TableProvider and through CatalogPlugin don't have unified configuration settings.

@advancedxy (Contributor) commented:

> In my experience, allowing the user to control some behaviors by using SET xxx=yyy is fantastic

Yes, it gives users great flexibility.

> Are there principles for which kinds of configurations should be exposed to the session conf?

If #7732 is merged, then all the options in SparkReadOptions and SparkWriteOptions would be exposed to the session conf?

> Since we are building a DataSource on the Spark DSv2 API, and the API has limitations that prevent overriding options through SQL syntax, I would rather respect the API design, or promote the API change upstream, than do such a hacky thing.

I don't think there are any limitations on overriding options using SQL syntax; it's just that Spark doesn't pass any options when loading a table through a CatalogPlugin. It would be great if such a change were promoted on the Spark side; however, it may take some time for that kind of change to be accepted.
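
To make the contrast concrete, a hedged sketch using the Java API (the catalog name iceberg_cat and the path are made up):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().getOrCreate();

    // TableProvider path: per-read options are forwarded to the source.
    Dataset<Row> viaProvider = spark.read()
        .format("iceberg")
        .option("locality", "false")
        .load("/warehouse/db/tbl");

    // CatalogPlugin path: there is no option hook; Spark resolves the table
    // through the catalog without forwarding any per-read options.
    Dataset<Row> viaCatalog = spark.table("iceberg_cat.db.tbl");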

@pan3793 (Member, Author) commented Jun 1, 2023

@wypoon (Contributor) left a comment:

This is indeed a useful configuration to have. We have also found a need for it when using Iceberg on HDFS, where locality is enabled by default.
I'd like to see it in spark/v3.3 as well.

@pan3793 (Member, Author) commented Jul 19, 2023

@rdblue any chance to review this PR? It and the related PRs were discussed on the mailing list for a while.

@wypoon it's easy to backport to older versions, as long as the community reaches a consensus on exposing it through the SQL conf.

Comment on lines +76 to +81
    return confParser
        .booleanConf()
        .option(SparkReadOptions.LOCALITY)
        .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
        .defaultValue(defaultValue)
        .parse();

Contributor commented:

On further thought, I think it would be better to do

    if (defaultValue) {
      return confParser
          .booleanConf()
          .option(SparkReadOptions.LOCALITY)
          .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
          .defaultValue(true)
          .parse();
    } else {
      return false;
    }

If defaultValue is false, then locality should be disabled, regardless of the option or session conf, as block locations will not be available.

Member commented:

Is there any issue with passing through "true" when that has no effect?

Member commented:

I would probably rename "defaultValue" to "hasBlockLocations" or "canReadLocal" or something.

Contributor commented:

If SparkReadConf#localityEnabled returns true even though the table is not in HDFS, then Iceberg will try to get the block locations. I am not sure about this since I haven't tested it, but I think the HDFS call could throw an IOException, which Iceberg would rethrow. That would be a bad thing.

Contributor commented:

Actually, maybe it wouldn't throw an exception. But the default implementation of FileSystem#getFileBlockLocations does:

    String[] name = new String[]{"localhost:50010"};
    String[] host = new String[]{"localhost"};
    return new BlockLocation[]{new BlockLocation(name, host, 0L, file.getLen())};

and from that, we get a String[] of the hosts, and that probably won't be too helpful to Spark.
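
A hedged sketch of what fetching those locations looks like against the Hadoop API (the helper name preferredHosts is made up):

    import java.io.IOException;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;

    // Hypothetical helper: hosts reported for the first block of a file. With the
    // default FileSystem implementation quoted above, this yields only
    // {"localhost"}, which gives Spark no useful locality information.
    static String[] preferredHosts(FileSystem fs, FileStatus file) throws IOException {
      BlockLocation[] blocks = fs.getFileBlockLocations(file, 0L, file.getLen());
      if (blocks.length == 0) {
        return new String[0];
      }
      return blocks[0].getHosts();
    }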

Contributor commented:

+1 to hasBlockLocations.
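
Putting the two suggestions together, the reviewed snippet might read roughly like this (a sketch, not the merged code; it reuses the confParser chain from the diff above):

    // Early-return when block locations are unavailable, with the parameter
    // renamed per the review; locality can never help without block locations.
    private boolean localityEnabled(boolean hasBlockLocations) {
      if (!hasBlockLocations) {
        return false;
      }
      return confParser
          .booleanConf()
          .option(SparkReadOptions.LOCALITY)
          .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
          .defaultValue(true)
          .parse();
    }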

// Controls whether vectorized reads are enabled
public static final String VECTORIZATION_ENABLED = "spark.sql.iceberg.vectorization.enabled";

// Controls whether locality reads are enabled
Member commented:

Can we add a note that this only applies to Hadoop FileIO?
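
A possible wording (a sketch; the constant and its value are the ones this PR adds):

    // Controls whether locality reads are enabled; this only applies to tables
    // backed by Hadoop FileIO (e.g. files on HDFS), since other FileIO
    // implementations do not expose block locations.
    public static final String LOCALITY_ENABLED = "spark.sql.iceberg.locality.enabled";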

@github-actions bot commented Sep 2, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 2, 2024
@github-actions bot commented Sep 9, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 9, 2024