Support S3FileIO in Hadoop and Nessie backed Iceberg tables. #20352

Merged
tdcmeehan merged 1 commit into prestodb:master from hmadison:hm/iceberg-nessie-s3 on Jul 21, 2023
@hmadison
Contributor

@hmadison hmadison commented Jul 20, 2023

See also apache/iceberg#3546.

When a non-Hive catalog is used to back Iceberg, Presto does not set up the s3, s3a, or s3n file systems for Hadoop. This prevents non-Hive catalogs from accessing files in object storage unless a Hive configuration is passed in manually.

As a small usability improvement, I've updated IcebergResourceFactory to apply S3ConfigurationUpdater#updateConfiguration when iceberg.hadoop.config.resources is not set. This has the effect of configuring the object storage file systems for Hadoop by default, with the same configuration properties as the Hive connector.
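The conditional described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Presto code: a plain `Map` stands in for Hadoop's `Configuration`, `applyS3Defaults` is a hypothetical stand-in for `S3ConfigurationUpdater#updateConfiguration`, the property values are placeholders, and the `else` branch (loading the listed resources instead) is an assumption about the surrounding code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the factory logic; a Map replaces Hadoop's Configuration.
class S3DefaultsSketch {
    // Stand-in for S3ConfigurationUpdater#updateConfiguration (assumed behavior):
    // register a file system implementation for each S3 URI scheme.
    static void applyS3Defaults(Map<String, String> configuration) {
        configuration.put("fs.s3.impl", "PrestoS3FileSystem");   // placeholder value
        configuration.put("fs.s3a.impl", "PrestoS3FileSystem");  // placeholder value
        configuration.put("fs.s3n.impl", "PrestoS3FileSystem");  // placeholder value
    }

    static Map<String, String> buildConfiguration(List<String> hadoopConfigResources) {
        Map<String, String> configuration = new HashMap<>();
        if (hadoopConfigResources.isEmpty()) {
            // No explicit Hadoop resources: fall back to the Hive-style S3 setup.
            applyS3Defaults(configuration);
        }
        else {
            // Assumption: explicit resources are used on their own, without S3 defaults.
            for (String resource : hadoopConfigResources) {
                configuration.put("resource:" + resource, "loaded");
            }
        }
        return configuration;
    }

    public static void main(String[] args) {
        System.out.println(buildConfiguration(List.of()).keySet());
        System.out.println(buildConfiguration(List.of("/etc/hadoop/core-site.xml")).keySet());
    }
}
```

With an empty resource list the S3 schemes are registered; with any resource listed, the sketch leaves them untouched.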

Test plan

I ran a test query (via Docker Compose) that created a new table from tpch.tiny.region in a Nessie/MinIO-backed catalog. I then queried the new table and compared the results to the source tpch.tiny.region table.

Test Resources

Docker compose file:

---
services:
  nessie:
    image: "ghcr.io/projectnessie/nessie"
    ports:
      - "19120:19120"
  minio:
    image: "quay.io/minio/minio"
    ports:
      - "9001:9001"
      - "9000:9000"
    command: |
      server /data --console-address ":9001"

Test Query:

create schema test;
create table test.region as (select * from tpch.tiny.region);
== RELEASE NOTES ==

General Changes
* Iceberg tables that use Hadoop or Nessie catalogs now support reading from and writing to S3 with the same configuration options as the Hive catalog.

@hmadison hmadison requested a review from a team as a code owner July 20, 2023 20:43
@hmadison hmadison requested a review from presto-oss July 20, 2023 20:43
@tdcmeehan
Contributor

FYI the issue linked seems irrelevant

@hmadison
Contributor Author

> FYI the issue linked seems irrelevant

My apologies, I meant to link to the Iceberg issue which discusses, in part, the error message.

@tdcmeehan
Contributor

No problem, do you want to update the commit message with apache/iceberg#3546 so it correctly links?

@hmadison hmadison force-pushed the hm/iceberg-nessie-s3 branch from fcd4548 to a3e6d50 Compare July 21, 2023 15:36
@tdcmeehan tdcmeehan merged commit 532e77f into prestodb:master Jul 21, 2023
@hmadison hmadison deleted the hm/iceberg-nessie-s3 branch July 21, 2023 16:49
@wanglinsong wanglinsong mentioned this pull request Jul 27, 2023
28 tasks
Configuration configuration = new Configuration(false);

if (hadoopConfigResources.isEmpty()) {
    s3ConfigurationUpdater.updateConfiguration(configuration);
Member

@agrawalreetika agrawalreetika Aug 29, 2023

Thank you @hmadison for fixing up support for S3-backed Iceberg tables.

I have a question: is there any reason we are not always updating the S3 configs?
Consider the scenario where an Iceberg table is backed by an S3 file system and we also want to supply an extra Hadoop config resource in the catalog. In that case, how would the S3 configs be updated?
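To make the scenario concrete, here is a hedged sketch of one possible alternative: always apply the S3 defaults, then layer any extra resource properties on top. It is not code from this PR. A plain `Map` stands in for Hadoop's `Configuration`, the class and method names are hypothetical, the property values are placeholders, and letting the resource properties override the S3 defaults is only one of several possible orderings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: S3 defaults are always applied, extra resources layered on top.
class AlwaysApplyS3Sketch {
    static Map<String, String> buildConfiguration(Map<String, String> resourceProperties) {
        Map<String, String> configuration = new LinkedHashMap<>();
        // Always register the S3 file systems (placeholder values).
        configuration.put("fs.s3.impl", "PrestoS3FileSystem");
        configuration.put("fs.s3a.impl", "PrestoS3FileSystem");
        configuration.put("fs.s3n.impl", "PrestoS3FileSystem");
        // Assumption: properties from the extra Hadoop resources win on conflict.
        configuration.putAll(resourceProperties);
        return configuration;
    }

    public static void main(String[] args) {
        // "example.extra.property" is a made-up key used only for illustration.
        System.out.println(buildConfiguration(Map.of("example.extra.property", "value")));
    }
}
```

Under this ordering, a catalog with extra Hadoop resources would still get the S3 file systems unless a resource explicitly overrides them.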

3 participants