SparkReadConf.java
@@ -73,7 +73,12 @@ public boolean caseSensitive() {
 
   public boolean localityEnabled() {
     boolean defaultValue = Util.mayHaveBlockLocations(table.io(), table.location());
-    return PropertyUtil.propertyAsBoolean(readOptions, SparkReadOptions.LOCALITY, defaultValue);
+    return confParser
+        .booleanConf()
+        .option(SparkReadOptions.LOCALITY)
+        .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
+        .defaultValue(defaultValue)
+        .parse();
Comment on lines +76 to +81
Contributor:
On further thought, I think it would be better to do

    if (defaultValue) {
      return confParser
          .booleanConf()
          .option(SparkReadOptions.LOCALITY)
          .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
          .defaultValue(true)
          .parse();
    } else {
      return false;
    }

If defaultValue is false, then locality should be disabled, regardless of the option or session conf, as block locations will not be available.

Member:
Is there any issue with passing through "true" when that has no effect?

Member:
I would probably rename "defaultValue" to "hasBlockLocations" or "canReadLocal" or something.

Contributor:
If SparkReadConf#localityEnabled returns true even though the table is not in HDFS, then Iceberg will try to get the block locations. I am not sure about this since I haven't tested it, but I think the filesystem call could throw an IOException, which Iceberg would rethrow. That would be a bad thing.

Contributor:
Actually, maybe it wouldn't throw an exception. But the default implementation of FileSystem#getFileBlockLocations does

    String[] name = new String[] {"localhost:50010"};
    String[] host = new String[] {"localhost"};
    return new BlockLocation[] {new BlockLocation(name, host, 0L, file.getLen())};

and from that, we get a String[] of the hosts, and that probably won't be too helpful to Spark.
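A minimal sketch of that caller's-eye view (the FileSystem and BlockLocation calls are the real Hadoop API; the helper class and its name are illustrative, not actual Iceberg code):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;

    class PreferredHosts {
      // Gathers the hosts that would be offered to Spark as preferred locations.
      // With the default getFileBlockLocations implementation quoted above, the
      // result is just {"localhost"}, which matches no real executor.
      static Set<String> collect(FileSystem fs, FileStatus file) throws IOException {
        Set<String> hosts = new HashSet<>();
        for (BlockLocation block : fs.getFileBlockLocations(file, 0L, file.getLen())) {
          Collections.addAll(hosts, block.getHosts());
        }
        return hosts;
      }
    }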

Contributor:
+1 to hasBlockLocations.
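Putting the rename and the short-circuit together, the method would look roughly like this (a sketch of the suggestions above, not the committed code):

    public boolean localityEnabled() {
      boolean hasBlockLocations = Util.mayHaveBlockLocations(table.io(), table.location());
      if (!hasBlockLocations) {
        // no block locations to fetch, so locality is off regardless of the option or session conf
        return false;
      }

      return confParser
          .booleanConf()
          .option(SparkReadOptions.LOCALITY)
          .sessionConf(SparkSQLProperties.LOCALITY_ENABLED)
          .defaultValue(true)
          .parse();
    }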

   }
 
   public Long snapshotId() {
SparkSQLProperties.java
@@ -25,6 +25,9 @@ private SparkSQLProperties() {}
   // Controls whether vectorized reads are enabled
   public static final String VECTORIZATION_ENABLED = "spark.sql.iceberg.vectorization.enabled";
 
+  // Controls whether locality reads are enabled
Member:
Can we add that this is only for Hadoop FileIO?
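For example (one possible wording, not a committed change):

    // Controls whether locality-aware task planning is enabled (Hadoop FileIO only)
    public static final String LOCALITY_ENABLED = "spark.sql.iceberg.locality.enabled";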

+  public static final String LOCALITY_ENABLED = "spark.sql.iceberg.locality.enabled";
 
   // Controls whether reading/writing timestamps without timezones is allowed
   @Deprecated
   public static final String HANDLE_TIMESTAMP_WITHOUT_TIMEZONE =
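For reference, a hypothetical end-to-end usage of the new knobs (the session setup and table name are illustrative, and the "locality" string is assumed to be the key behind SparkReadOptions.LOCALITY):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class LocalityUsage {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("locality-demo").getOrCreate();

        // Disable locality-aware planning for the whole session.
        spark.conf().set("spark.sql.iceberg.locality.enabled", "false");

        // Override it for a single read: in the confParser chain above, the
        // read option is consulted before the session conf.
        Dataset<Row> df = spark.read()
            .format("iceberg")
            .option("locality", "true") // assumed key for SparkReadOptions.LOCALITY
            .load("db.table"); // illustrative table identifier
        df.show();
      }
    }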