Conversation


@kbendick commented Jul 7, 2021

…er catalog

Allows users to set Hadoop configuration overrides on any Iceberg tables that come from an Iceberg-enabled Spark catalog.

Users specify these configurations the same way they would specify global Hadoop configuration overrides on the Spark session via the Spark config.

E.g. for a catalog foo, to override the Hadoop config property fs.s3a.max.connections for Iceberg tables in that catalog, a config would be added to the Spark session config via --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4.
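For illustration, the same kind of override can be set when building the session programmatically instead of via --conf flags; the catalog name foo and the hive catalog type below are placeholder assumptions, not part of this change:

```scala
import org.apache.spark.sql.SparkSession

// "foo" is a placeholder catalog name; "hive" is an assumed catalog type.
val spark = SparkSession.builder()
  .appName("per-catalog-hadoop-conf-example")
  .config("spark.sql.catalog.foo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.foo.type", "hive")
  // Anything under spark.sql.catalog.foo.hadoop.* overrides the session-wide
  // Hadoop configuration, but only for tables loaded through catalog "foo".
  .config("spark.sql.catalog.foo.hadoop.fs.s3a.max.connections", "4")
  .getOrCreate()
```

Only tables loaded through foo see the override; the session-wide Hadoop configuration is unchanged for everything else.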

For now this only works for Spark catalogs; in the future we should consider making this possible for other engines.

This closes #2607

@github-actions github-actions bot added the spark label Jul 7, 2021

@kbendick force-pushed the pull-spark-sql-hadoop-confs-per-catalog branch 5 times, most recently from 5192b6d to f174ae7 on July 7, 2021 22:54
@kbendick force-pushed the pull-spark-sql-hadoop-confs-per-catalog branch from d743768 to d18bae1 on July 8, 2021 00:51

@kbendick commented Jul 8, 2021

@rdblue Could we possibly add this to the 0.12 release track? It's a relatively small change but it would be beneficial to us to have this released with 0.12 for users who might maintain their own environments.

If not possible, no worries.


@flyrain commented Jul 8, 2021

I believe this is per catalog, not per table, right? Does it also support table-level settings?

> E.g. for a table foo, to override a hadoop config property fs.s3a.max.connections, a config would be added to the spark session config via --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4.


@kbendick commented Jul 9, 2021

> I believe this is per catalog, not per table, right? Does it also support table-level settings?
>
> E.g. for a table foo, to override a hadoop config property fs.s3a.max.connections, a config would be added to the spark session config via --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4.

That's correct. I have updated the description to mention catalog instead (there's no way to set the Hadoop config differently for individual Iceberg tables within the same catalog).

@flyrain left a comment

LGTM

@kbendick force-pushed the pull-spark-sql-hadoop-confs-per-catalog branch from de67e05 to 16634c3 on July 9, 2021 23:53
@nastra left a comment

LGTM with some nits

@szehon-ho left a comment

Hi, I added just a few comments

@kbendick force-pushed the pull-spark-sql-hadoop-confs-per-catalog branch from b0b981b to 605870f on July 15, 2021 22:07
@kbendick

I just rebased and fixed the merge conflicts. I've also added a synchronized block, @szehon-ho.

@kbendick force-pushed the pull-spark-sql-hadoop-confs-per-catalog branch from bc48137 to 746434e on July 15, 2021 23:06
@kbendick force-pushed the pull-spark-sql-hadoop-confs-per-catalog branch from 746434e to 0adbea5 on July 15, 2021 23:11
@kbendick

@pvary @marton-bod, would you mind taking a look at this, as you were in the original discussion on the issue (and are much more informed about Hadoop configs than I am 🙂)?

@RussellSpitzer

Solves #2607 - thanks @kbendick! Thanks @pvary, @nastra, @flyrain, and @szehon-ho for reviewing!

@RussellSpitzer merged commit 1b3dbb6 into apache:master Jul 19, 2021
minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
…apache#2792)

Previously, Iceberg catalogs loaded into Spark would always use the Hadoop Configuration owned by the underlying Spark session. This made it impossible to use a different set of configuration values, which may be required to connect to a remote catalog. This patch allows Spark catalogs to carry Hadoop configuration overrides per catalog, permitting different configurations for different underlying Iceberg catalogs.
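As a rough sketch of what this permits (the catalog names, types, endpoint, and warehouse values below are hypothetical), two catalogs in the same session can now carry different Hadoop settings:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical catalogs "prod" and "staging"; the endpoints and warehouse
// path are illustrative values only.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.prod.type", "hive")
  // Hadoop override applied only to tables in the "prod" catalog.
  .config("spark.sql.catalog.prod.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
  .config("spark.sql.catalog.staging", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.staging.type", "hadoop")
  .config("spark.sql.catalog.staging.warehouse", "s3a://staging-bucket/warehouse")
  // A different endpoint for the "staging" catalog, independent of "prod".
  .config("spark.sql.catalog.staging.hadoop.fs.s3a.endpoint", "http://localhost:9000")
  .getOrCreate()
```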
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR mainly address?

Merge upstream/master to pick up recent bug fixes and optimizations.

## What changes does this MR make?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: fix the error reported when writing an empty dataset; simply skip the write, apache#2960
> Flink: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs in the Flink UI, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework to incrementally support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support a config property to skip delete snapshots, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel support: add schema-id (Core: add schema id to snapshot)
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?

Unit tests (UT).