From 37110df7ee9dba380e54defde43bd833dee7fb21 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Wed, 2 Feb 2022 01:26:25 -0800 Subject: [PATCH 1/8] Add 0.13.0 release note --- landing-page/config.toml | 2 +- .../content/common/releases/release-notes.md | 99 ++++++++++++++++++- 2 files changed, 98 insertions(+), 3 deletions(-) diff --git a/landing-page/config.toml b/landing-page/config.toml index a9be5fdd0..3b13e879d 100644 --- a/landing-page/config.toml +++ b/landing-page/config.toml @@ -4,7 +4,7 @@ title = "Apache Iceberg" [params] description = "The open table format for analytic datasets." - latestVersions.iceberg = "0.12.1" + latestVersions.iceberg = "0.13.0" docsBaseURL = "" [[params.social]] diff --git a/landing-page/content/common/releases/release-notes.md b/landing-page/content/common/releases/release-notes.md index ba400862f..8e6ec7649 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -62,6 +62,103 @@ To add a dependency on Iceberg in Maven, add the following to your `pom.xml`: ``` +## 0.13.0 Release Notes + +Apache Iceberg 0.13.0 was released on February 4th, 2022. + +**High-level features:** + +* **Core** + * Partition spec ID is added to the `data_files` spec and can be queried in related metadata tables [[\#3015](https://github.com/apache/iceberg/pull/3015)] + * ORC delete file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)] + * Catalog caching supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] + * Legacy Parquet tables (e.g. 
produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)] + * `NOT_STARTS_WITH` expression support is added for improved predicate-pushdown query performance [[\#2062](https://github.com/apache/iceberg/pull/2062)] + * Hadoop catalog can support atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] + * Iceberg catalog supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)] +* **Vendor Integrations** + * `ResolvingFileIO` is added to support using multiple `FileIO`s [[\#3593](https://github.com/apache/iceberg/pull/3593)] + * Google Cloud `FileIO` support is added [[\#3711](https://github.com/apache/iceberg/pull/3711)] + * Aliyun OSS `FileIO` support is added [[\#3553](https://github.com/apache/iceberg/pull/3553)] + * AWS `S3FileIO` supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)] + * S3-compatible cloud storage can use `S3FileIO` for vendor integration [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)] +* **Spark** + * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] + * Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] + * `RewriteDataFiles` action supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] + * Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)] + * Spark SQL time travel support is added. 
It also uses snapshot schema instead of table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] + * Spark supports vectorized merge-on-read read [[\#3557](https://github.com/apache/iceberg/pull/3557)] + * Call procedure `ancestors_of` is added to access snapshot ancestor information [[\#3444](https://github.com/apache/iceberg/pull/3444)] + * Truncate [[\#3708](https://github.com/apache/iceberg/pull/3708)] and bucket [[\#3089](https://github.com/apache/iceberg/pull/3368)] UDFs for partition transform value calculation are fully supported +* **Flink** + * Flink 1.13 and 1.14 support is added [[\#3116](https://github.com/apache/iceberg/pull/3116)] [[\#3434](https://github.com/apache/iceberg/pull/3434)] + * Flink connector support is added [[\#2666](https://github.com/apache/iceberg/pull/2666)] + * Upsert write option is added [[\#2863](https://github.com/apache/iceberg/pull/2863)] + * Avro delete file read support is added [[\#3540](https://github.com/apache/iceberg/pull/3540)] +* **Hive** + * `IcebergInputFormat` supports reading Hive table during Hive-to-Iceberg table migration through name mapping [[\#3312](https://github.com/apache/iceberg/pull/3312)] + * Table listing in Hive catalog can skip non-Iceberg tables using flag `list-all-tables` [[\#3908](https://github.com/apache/iceberg/pull/3908)] + * `uuid` is a reserved table property and exposed for Iceberg table in a Hive metastore for duplication check [[\#3914](https://github.com/apache/iceberg/pull/3914)] + +**Important bug fixes:** + +* **Core** + * Iceberg new data file root path is configured through `write.data.path`.
`write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)] + * Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)] + * Metrics mode for sort order source columns is default to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)] + * RowDelta transactions can commit delete files of multiple partition specs instead of just a single one [[\#2985](https://github.com/apache/iceberg/pull/2985)] + * Hadoop catalog returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] + * ORC vectorized read can be configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)] + * Dynamic loading of `Catalog` and `FileIO` no longer require Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] + * Dropping table deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)] + * Iceberg thread pool now uses at least 2 threads for query planning instead of (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)] + * `history` and `snapshots` metadata tables now support querying table with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] + * `partition` metadata table supports partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)] + * Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in 
another program [[\#3873](https://github.com/apache/iceberg/pull/3873)] + * `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)] + * Deleting and adding partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)] + * Parquet file writing now succeeds for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] + * Delete manifests with only existing files are included instead of ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)] +* **Vendor Integrations** + * AWS-related client connection resources are properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)] + * AWS Glue catalog displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] +* **Spark** + * `RewriteDataFiles` action is improved to produce more balanced output files in size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] + * `REFRESH TABLE` can be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] + * Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)] + * Insert overwrite mode skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] + * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files
[[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning when scanning files to import [[\#3745](https://github.com/apache/iceberg/issues/3745)] + * Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)] + * Snapshot expiration supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] + * REPLACE TABLE AS SELECT can work for tables with columns that have changed partition transform, where each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)] + * SQL containing binary or fixed literals can be parsed instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)] +* **Flink** + * A `ValidationException` will be thrown if a user configures both `catalog-type` and `catalog-impl`. Previously `catalog-type` always took precedence. The new behavior makes Flink consistent with Spark and Hive [[\#3308](https://github.com/apache/iceberg/issues/3308)] + * Change log tables can be queried without RowData serialization issues.
[[\#3240](https://github.com/apache/iceberg/pull/3240)] + * Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)] +* **Hive** + * Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)] + * Hive catalog can be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] + * Table creation can succeed when some columns do not have comments instead of throwing exception [[\#3531](https://github.com/apache/iceberg/pull/3531)] + * `VectorizedOrcInputFormat` supports split offsets in `OrcTail` for better read performance [[\#3748](https://github.com/apache/iceberg/pull/3748)] + * `FileIO` serialization can be disabled using Hadoop config `iceberg.mr.config.serialization.disabled` for better performance [[\#3752](https://github.com/apache/iceberg/pull/3752)] +**Other notable changes:** + +* The community has finalized the long-term strategy of multi-version support of Spark, Flink and Hive. See [Multi-engine Support](/multi-engine-support) for more details. +* The Iceberg Python module is renamed to [python_legacy](https://github.com/apache/iceberg/tree/master/python_legacy) [[\#3074](https://github.com/apache/iceberg/pull/3074)]. A [new Python module](https://github.com/apache/iceberg/tree/master/python) is under development to provide a better user experience for the Python community. See the [Github Project](https://github.com/apache/iceberg/projects/7) for progress. +* Iceberg has started publishing daily snapshots to the [Apache snapshot repository](https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/) [[\#3353](https://github.com/apache/iceberg/pull/3353)] for developers who would like to consume the latest unreleased artifacts.
+* Iceberg website is now managed by a separate repository [iceberg-docs](https://github.com/apache/iceberg-docs/) with a new layout. See [README](https://github.com/apache/iceberg-docs/blob/main/README.md) for how to make documentation contributions going forward. +* An OpenAPI specification has been developed for the Iceberg catalog to prepare for a REST-based Iceberg catalog implementation [[\#3770](https://github.com/apache/iceberg/pull/3770)] +* Dependency version upgrades: + * Gradle is upgraded to 7.3 [[\#3525](https://github.com/apache/iceberg/pull/3525)] + * ORC is upgraded to 1.7 [[\#3493](https://github.com/apache/iceberg/pull/3493)] + * Nessie is upgraded to 0.18 [[\#3890](https://github.com/apache/iceberg/pull/3890)] + * Arrow is upgraded to 6.0 [[\#3446](https://github.com/apache/iceberg/pull/3446)] + * Parquet is upgraded to 1.12.2 [[\#3551](https://github.com/apache/iceberg/pull/3551)] + +## Past releases + ## 0.12.1 Release Notes Apache Iceberg 0.12.1 was released on November 8th, 2021. @@ -80,8 +177,6 @@ Important bug fixes and changes: A more exhaustive list of changes is available under the [0.12.1 release milestone](https://github.com/apache/iceberg/milestone/15?closed=1). -## Past releases - ### 0.12.0 Apache Iceberg 0.12.0 was released on August 15, 2021. It consists of 395 commits authored by 74 contributors over a 139-day period. From 73c1d0a6230d24afe5ffcff2adfc05a233c89aae Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Fri, 4 Feb 2022 13:23:55 -0800 Subject: [PATCH 2/8] update 0.12.1 jar locations --- landing-page/content/common/releases/release-notes.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/landing-page/content/common/releases/release-notes.md b/landing-page/content/common/releases/release-notes.md index 8e6ec7649..f430c9a9f 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -163,6 +163,13 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022.
Apache Iceberg 0.12.1 was released on November 8th, 2021. +* Git tag: [0.12.1](https://github.com/apache/iceberg/releases/tag/apache-iceberg-0.12.1) +* [0.12.1 source tar.gz](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-0.12.1/apache-iceberg-0.12.1.tar.gz) -- [signature](https://downloads.apache.org/iceberg/apache-iceberg-0.12.1/apache-iceberg-0.12.1.tar.gz.asc) -- [sha512](https://downloads.apache.org/iceberg/apache-iceberg-0.12.1/apache-iceberg-0.12.1.tar.gz.sha512) +* [0.12.1 Spark 3.x runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.12.1/iceberg-spark3-runtime-0.12.1.jar) +* [0.12.1 Spark 2.4 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.12.1/iceberg-spark-runtime-0.12.1.jar) +* [0.12.1 Flink runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime/0.12.1/iceberg-flink-runtime-0.12.1.jar) +* [0.12.1 Hive runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.12.1/iceberg-hive-runtime-0.12.1.jar) + Important bug fixes and changes: * [\#3258](https://github.com/apache/iceberg/pull/3258) fixes validation failures that occurred after snapshot expiration when writing Flink CDC streams to Iceberg tables.
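The Hive runtime JAR linked above is loaded into a Hive session with `ADD JAR`, as these notes describe; a minimal sketch, in which the local path and table name are illustrative placeholders:

```sql
-- Register the downloaded Iceberg Hive runtime before querying Iceberg tables.
-- /tmp/iceberg-hive-runtime-0.12.1.jar stands in for wherever the JAR was saved.
ADD JAR /tmp/iceberg-hive-runtime-0.12.1.jar;

-- Afterwards, Iceberg-backed tables can be queried like any other Hive table.
SELECT count(*) FROM db.sample_iceberg_table;
```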
From fde110629038801975a90ecc51d2c39f86063a2e Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Fri, 4 Feb 2022 13:30:24 -0800 Subject: [PATCH 3/8] update 0.13.0 new jars for different engine versions --- .../content/common/releases/release-notes.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/landing-page/content/common/releases/release-notes.md b/landing-page/content/common/releases/release-notes.md index f430c9a9f..515d5fea0 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -25,14 +25,18 @@ url: releases The latest version of Iceberg is [{{% icebergVersion %}}](https://github.com/apache/iceberg/releases/tag/apache-iceberg-{{% icebergVersion %}}). * [{{% icebergVersion %}} source tar.gz](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz) -- [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc) -- [sha512](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512) +* [{{% icebergVersion %}} Spark 3.2 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-{{% icebergVersion %}}-3.2_2.12.jar) +* [{{% icebergVersion %}} Spark 3.1 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-{{% icebergVersion %}}-3.1_2.12.jar) * [{{% icebergVersion %}} Spark 3.0 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/{{% icebergVersion %}}/iceberg-spark3-runtime-{{% icebergVersion %}}.jar) * [{{% icebergVersion %}} Spark 2.4 runtime 
Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/{{% icebergVersion %}}/iceberg-spark-runtime-{{% icebergVersion %}}.jar) -* [{{% icebergVersion %}} Flink runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime/{{% icebergVersion %}}/iceberg-flink-runtime-{{% icebergVersion %}}.jar) +* [{{% icebergVersion %}} Flink 1.14 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.14/{{% icebergVersion %}}/iceberg-flink-runtime-{{% icebergVersion %}}-1.14.jar) +* [{{% icebergVersion %}} Flink 1.13 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/{{% icebergVersion %}}/iceberg-flink-runtime-{{% icebergVersion %}}-1.13.jar) +* [{{% icebergVersion %}} Flink 1.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/{{% icebergVersion %}}/iceberg-flink-runtime-{{% icebergVersion %}}-1.12.jar) * [{{% icebergVersion %}} Hive runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar) -To use Iceberg in Spark, download the runtime JAR and add it to the jars folder of your Spark install. Use iceberg-spark3-runtime for Spark 3, and iceberg-spark-runtime for Spark 2.4. +To use Iceberg in Spark/Flink, download the runtime JAR based on your Spark/Flink version and add it to the jars folder of your Spark/Flink install. -To use Iceberg in Hive, download the iceberg-hive-runtime JAR and add it to Hive using `ADD JAR`. +To use Iceberg in Hive, download the `iceberg-hive-runtime` JAR and add it to Hive using `ADD JAR`. ### Gradle @@ -145,7 +149,7 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. 
* `FileIO` serialization can be disabled using Hadoop config `iceberg.mr.config.serialization.disabled` for better performance [[\#3752](https://github.com/apache/iceberg/pull/3752)] **Other notable changes:** -* The community has finalized the long-term strategy of multi-version support of Spark, Flink and Hive. See [Multi-engine Support](/multi-engine-support) for more details. +* The community has finalized the long-term strategy of multi-version support of Spark, Flink and Hive. Iceberg will start to provide version-specific implementations and runtime executables to ensure a smooth integration experience with the latest engine features for Iceberg users. * The Iceberg Python module is renamed to [python_legacy](https://github.com/apache/iceberg/tree/master/python_legacy) [[\#3074](https://github.com/apache/iceberg/pull/3074)]. A [new Python module](https://github.com/apache/iceberg/tree/master/python) is under development to provide a better user experience for the Python community. See the [Github Project](https://github.com/apache/iceberg/projects/7) for progress. * Iceberg has started publishing daily snapshots to the [Apache snapshot repository](https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/) [[\#3353](https://github.com/apache/iceberg/pull/3353)] for developers who would like to consume the latest unreleased artifacts. * Iceberg website is now managed by a separate repository [iceberg-docs](https://github.com/apache/iceberg-docs/) with a new layout. See [README](https://github.com/apache/iceberg-docs/blob/main/README.md) for how to make documentation contributions going forward.
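The catalog cache expiration feature called out in the 0.13.0 notes is driven by a per-catalog property; a hypothetical Spark configuration fragment, where the catalog name `prod`, the `type=hive` choice, and the 30-second interval are all illustrative assumptions:

```properties
# spark-defaults.conf sketch: expire cached table metadata after 30 seconds
# via the cache.expiration-interval-ms catalog property added in #3543.
spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.type=hive
spark.sql.catalog.prod.cache-enabled=true
spark.sql.catalog.prod.cache.expiration-interval-ms=30000
```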
From a51c35e45b88742e8b97ae905a8dea3eff3e8a80 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Fri, 4 Feb 2022 14:20:44 -0800 Subject: [PATCH 4/8] minor fixes --- .../content/common/releases/release-notes.md | 109 +++++++++--------- 1 file changed, 55 insertions(+), 54 deletions(-) diff --git a/landing-page/content/common/releases/release-notes.md b/landing-page/content/common/releases/release-notes.md index 515d5fea0..4ce3f36ef 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -25,18 +25,18 @@ url: releases The latest version of Iceberg is [{{% icebergVersion %}}](https://github.com/apache/iceberg/releases/tag/apache-iceberg-{{% icebergVersion %}}). * [{{% icebergVersion %}} source tar.gz](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz) -- [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc) -- [sha512](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512) -* [{{% icebergVersion %}} Spark 3.2 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-{{% icebergVersion %}}-3.2_2.12.jar) -* [{{% icebergVersion %}} Spark 3.1 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-{{% icebergVersion %}}-3.1_2.12.jar) +* [{{% icebergVersion %}} Spark 3.2 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.2_2.12-{{% icebergVersion %}}.jar) +* [{{% icebergVersion %}} Spark 3.1 runtime 
Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.1_2.12-{{% icebergVersion %}}.jar) * [{{% icebergVersion %}} Spark 3.0 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/{{% icebergVersion %}}/iceberg-spark3-runtime-{{% icebergVersion %}}.jar) * [{{% icebergVersion %}} Spark 2.4 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/{{% icebergVersion %}}/iceberg-spark-runtime-{{% icebergVersion %}}.jar) -* [{{% icebergVersion %}} Flink 1.14 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.14/{{% icebergVersion %}}/iceberg-flink-runtime-{{% icebergVersion %}}-1.14.jar) -* [{{% icebergVersion %}} Flink 1.13 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/{{% icebergVersion %}}/iceberg-flink-runtime-{{% icebergVersion %}}-1.13.jar) -* [{{% icebergVersion %}} Flink 1.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/{{% icebergVersion %}}/iceberg-flink-runtime-{{% icebergVersion %}}-1.12.jar) +* [{{% icebergVersion %}} Flink 1.14 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.14/{{% icebergVersion %}}/iceberg-flink-runtime-1.14-{{% icebergVersion %}}.jar) +* [{{% icebergVersion %}} Flink 1.13 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/{{% icebergVersion %}}/iceberg-flink-runtime-1.13-{{% icebergVersion %}}.jar) +* [{{% icebergVersion %}} Flink 1.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/{{% icebergVersion %}}/iceberg-flink-runtime-1.12-{{% icebergVersion %}}.jar) * [{{% icebergVersion %}} Hive runtime 
Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar) To use Iceberg in Spark/Flink, download the runtime JAR based on your Spark/Flink version and add it to the jars folder of your Spark/Flink install. -To use Iceberg in Hive, download the `iceberg-hive-runtime` JAR and add it to Hive using `ADD JAR`. +To use Iceberg in Hive, download the Hive runtime JAR and add it to Hive using `ADD JAR`. ### Gradle @@ -73,97 +73,98 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. **High-level features:** * **Core** - * Partition spec ID is added to the `data_files` spec and can be queried in related metadata tables [[\#3015](https://github.com/apache/iceberg/pull/3015)] + * Partition spec ID (`spec_id`) is added to the `data_files` spec and can be queried in related metadata tables [[\#3015](https://github.com/apache/iceberg/pull/3015)] * ORC delete file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)] - * Catalog caching supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] + * Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] * Legacy Parquet tables (e.g. 
produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)] - * `NOT_STARTS_WITH` expression support is added for improved predicate-pushdown query performance [[\#2062](https://github.com/apache/iceberg/pull/2062)] - * Hadoop catalog can support atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] - * Iceberg catalog supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)] + * `NOT_STARTS_WITH` expression support is added to improve Iceberg predicate-pushdown query performance [[\#2062](https://github.com/apache/iceberg/pull/2062)] + * Hadoop catalog now supports atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] + * Iceberg catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)] * **Vendor Integrations** - * `ResolvingFileIO` is added to support using multiple `FileIO`s [[\#3593](https://github.com/apache/iceberg/pull/3593)] - * Google Cloud `FileIO` support is added [[\#3711](https://github.com/apache/iceberg/pull/3711)] - * Aliyun OSS `FileIO` support is added [[\#3553](https://github.com/apache/iceberg/pull/3553)] - * AWS `S3FileIO` supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)] - * S3-compatible cloud storage can use `S3FileIO` for vendor integration [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)] + * `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. 
[[\#3593](https://github.com/apache/iceberg/pull/3593)] + * Google Cloud Storage (GCS) `FileIO` support is added [[\#3711](https://github.com/apache/iceberg/pull/3711)] + * Aliyun Object Storage Service (OSS) `FileIO` support is added [[\#3553](https://github.com/apache/iceberg/pull/3553)] + * AWS `S3FileIO` now supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)] + * S3-compatible cloud storages (e.g. MinIO) can now be accessed through AWS `S3FileIO` with custom endpoint and credential configurations [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)] * **Spark** * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] * Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] - * `RewriteDataFiles` action supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] + * `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] * Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)] - * Spark SQL time travel support is added. It also uses snapshot schema instead of table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] - * Spark supports vectorized merge-on-read read [[\#3557](https://github.com/apache/iceberg/pull/3557)] + * Spark SQL time travel support is added. 
Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] + * Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)] * Call procedure `ancestors_of` is added to access snapshot ancestor information [[\#3444](https://github.com/apache/iceberg/pull/3444)] - * Truncate [[\#3708](https://github.com/apache/iceberg/pull/3708)] and bucket [[\#3089](https://github.com/apache/iceberg/pull/3368)] UDFs for partition transform value calculation are fully supported + * Truncate [[\#3708](https://github.com/apache/iceberg/pull/3708)] and bucket [[\#3368](https://github.com/apache/iceberg/pull/3368)] UDFs are added for calculating partition transform values * **Flink** * Flink 1.13 and 1.14 support is added [[\#3116](https://github.com/apache/iceberg/pull/3116)] [[\#3434](https://github.com/apache/iceberg/pull/3434)] * Flink connector support is added [[\#2666](https://github.com/apache/iceberg/pull/2666)] * Upsert write option is added [[\#2863](https://github.com/apache/iceberg/pull/2863)] * Avro delete file read support is added [[\#3540](https://github.com/apache/iceberg/pull/3540)] * **Hive** - * `IcebergInputFormat` supports reading Hive table during Hive-to-Iceberg table migration through name mapping [[\#3312](https://github.com/apache/iceberg/pull/3312)] + * Hive tables can now be read through name mapping during Hive-to-Iceberg table migration [[\#3312](https://github.com/apache/iceberg/pull/3312)] * Table listing in Hive catalog can skip non-Iceberg tables using flag `list-all-tables` [[\#3908](https://github.com/apache/iceberg/pull/3908)] - * `uuid` is a reserved table property and exposed for Iceberg table in a Hive metastore for duplication check [[\#3914](https://github.com/apache/iceberg/pull/3914)] + * `uuid` is now a reserved Iceberg table property and exposed for Iceberg tables in a Hive
metastore for duplication check [[\#3914](https://github.com/apache/iceberg/pull/3914)] **Important bug fixes:** * **Core** - * Iceberg new data file root path is configured through `write.data.path`. `write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)] + * Iceberg new data file root path is configured through `write.data.path` going forward. `write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)] * Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)] * Metrics mode for sort order source columns is default to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)] - * RowDelta transactions can commit delete files of multiple partition specs instead of just a single one [[\#2985](https://github.com/apache/iceberg/pull/2985)] - * Hadoop catalog returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] + * `RowDelta` transactions can commit delete files of multiple partition specs instead of just a single one [[\#2985](https://github.com/apache/iceberg/pull/2985)] + * Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] * ORC vectorized read can be configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)] - * Dynamic loading of `Catalog` and `FileIO` no longer require Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] - * Dropping table deletes old metadata files instead of leaving them strained 
[[\#3622](https://github.com/apache/iceberg/pull/3622)] - * Iceberg thread pool now uses at least 2 threads for query planning instead of (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)] - * `history` and `snapshots` metadata tables now support querying table with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] - * `partition` metadata table supports partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)] + * Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] + * Dropping table now deletes old metadata files instead of leaving them stranded [[\#3622](https://github.com/apache/iceberg/pull/3622)] + * Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)] + * `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] + * `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)] * Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)] * `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)] - * Deleting and adding partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] 
[[\#3954](https://github.com/apache/iceberg/pull/3954)] - * Parquet file writing now succeeds for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] - * Delete manifests with only existing files are included instead of ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)] + * Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)] + * Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] + * Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)] * **Vendor Integrations** - * AWS related client connection resources are properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)] - * AWS Glue catalog displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] + * AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)] + * AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] * **Spark** - * `RewriteDataFiles` action is improved to produce more balanced output files in size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] - * `REFRESH TABLE` can be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] + * `RewriteDataFiles` 
action is improved to produce output files with more balanced sizes [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] + * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] * Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)] - * Insert overwrite mode skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] - * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2895](https://github.com/apache/iceberg/issues/2779)], skips folder without file [[\#2895](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#2895](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning when scanning files to import [[\#3745](https://github.com/apache/iceberg/issues/3745)] + * Insert overwrite mode now skips empty partitions instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] + * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)] * Reading unknown partition transform (e.g. 
old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)] - * Snapshot expiration supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] - * REPLACE TABLE AS SELECT can work for tables with columns that have changed partition transform, where each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)] - * SQL containing binary or fixed literals can be parsed instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)] + * Snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] + * `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)] + * SQLs containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)] * **Flink** - * A `ValidationException` will be thrown if a user configures both `catalog-type` and `catalog-impl`. Previously it always used a `catalog-type`. The new behavior brings Flink consistent with Spark and Hive [[\#3308](https://github.com/apache/iceberg/issues/3308)] - * Change log tables can be queried without RowData serialization issues. [[\#3240](https://github.com/apache/iceberg/pull/3240)] + * A `ValidationException` will be thrown if a user configures both `catalog-type` and `catalog-impl`. Previously it chose to use `catalog-type`. 
The new behavior makes Flink consistent with Spark and Hive [[\#3308](https://github.com/apache/iceberg/issues/3308)] + * Changelog tables can now be queried without `RowData` serialization issues [[\#3240](https://github.com/apache/iceberg/pull/3240)] * Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)] * **Hive** * Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)] - * Hive catalog can be initialized using a `null` Hadoop configuration instead of throwing exception [[\3252](https://github.com/apache/iceberg/pull/3252)] - * Table creation can succeed when some columns do not have comments instead of throwing exception [[\#3531](https://github.com/apache/iceberg/pull/3531)] - * `VectorizedOrcInputFormat` supports split offsets in `OrcTail` for better read performance [[\#3748](https://github.com/apache/iceberg/pull/3748)] - * `FileIO` serialization can be disabled using Hadoop config `iceberg.mr.config.serialization.disabled` for better performance [[\#3752](https://github.com/apache/iceberg/pull/3752)] + * Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] + * Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)] + * Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)] + * Read performance can now be improved by disabling `FileIO` serialization using Hadoop config `iceberg.mr.config.serialization.disabled` [[\#3752](https://github.com/apache/iceberg/pull/3752)] + **Other notable changes:** -* The community has finalized the long-term strategy of multi-version support of Spark, Flink and Hive. 
Iceberg will start to provide version-specific implementations and runtime executables to ensure a smooth integration experience with latest engine features for Iceberg users. +* The community has finalized the long-term strategy of Spark, Flink and Hive support. Iceberg will start to provide version-specific implementations and runtime executables to ensure a smooth integration experience for Iceberg users. * The Iceberg Python module is renamed as [python_legacy](https://github.com/apache/iceberg/tree/master/python_legacy) [[\#3074](https://github.com/apache/iceberg/pull/3074)]. A [new Python module](https://github.com/apache/iceberg/tree/master/python) is under development to provide better user experience for the Python community. See the [Github Project](https://github.com/apache/iceberg/projects/7) for progress. -* Iceberg starts to publish daily snapshot to the [Apache snapshot repository](https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/) [[\#3353](https://github.com/apache/iceberg/pull/3353)] for developers that would like to consume the latest unreleased artifact. -* Iceberg website is now managed by a separated repository [iceberg-docs](https://github.com/apache/iceberg-docs/) with a new layout. See [README](https://github.com/apache/iceberg-docs/blob/main/README.md) for how to make documentation contributions going forward. +* Iceberg now publishes daily snapshots to the [Apache snapshot repository](https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/) [[\#3353](https://github.com/apache/iceberg/pull/3353)] for developers who would like to consume the latest unreleased artifacts. +* The Iceberg website is now managed in a separate repository [iceberg-docs](https://github.com/apache/iceberg-docs/) with a new look. See the [README](https://github.com/apache/iceberg-docs/blob/main/README.md) for contribution guidelines going forward. 
* An OpenAPI specification is developed for Iceberg catalog to prepare for a REST-based Iceberg catalog implementation [[\#3770](https://github.com/apache/iceberg/pull/3770)] * Dependency version upgrades: * Gradle is upgraded to 7.3 [[\#3525](https://github.com/apache/iceberg/pull/3525)] + * Parquet is upgraded to 1.12.2 [[\#3551](https://github.com/apache/iceberg/pull/3551)] * ORC is upgraded to 1.7 [[\#3493](https://github.com/apache/iceberg/pull/3493)] - * Nessie is upgraded to 0.18 [[\#3890](https://github.com/apache/iceberg/pull/3890)] * Arrow is upgraded to 6.0 [[\#3446](https://github.com/apache/iceberg/pull/3446)] - * Parquet is upgraded to 1.12.2 [[\#3551](https://github.com/apache/iceberg/pull/3551)] + * Nessie is upgraded to 0.18 [[\#3890](https://github.com/apache/iceberg/pull/3890)] ## Past releases -## 0.12.1 Release Notes +### 0.12.1 Apache Iceberg 0.12.1 was released on November 8th, 2021. From b642ee29bde3d4a23a6559f40c6f78147887df8c Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Mon, 7 Feb 2022 13:43:41 -0800 Subject: [PATCH 5/8] remove some release items, add engine support page --- .../common/project/multi-engine-support.md | 93 +++++++++++++++++++ .../content/common/releases/release-notes.md | 71 +++++--------- 2 files changed, 116 insertions(+), 48 deletions(-) create mode 100644 landing-page/content/common/project/multi-engine-support.md diff --git a/landing-page/content/common/project/multi-engine-support.md b/landing-page/content/common/project/multi-engine-support.md new file mode 100644 index 000000000..3b9c67628 --- /dev/null +++ b/landing-page/content/common/project/multi-engine-support.md @@ -0,0 +1,93 @@ +--- +title: "Multi-Engine Support" +bookHidden: true +url: multi-engine-support +--- + + +# Multi-Engine Support + +Multi-engine support is a core tenet of Apache Iceberg. 
+The community continuously improves Iceberg core library components to enable integrations with different compute engines that power analytics, business intelligence, machine learning, etc. +Support for [Apache Spark](../../../docs/spark-configuration), [Apache Flink](../../../docs/flink) and [Apache Hive](../../../docs/hive) is provided inside the Iceberg main repository. + +## Multi-Version Support + +Engines maintained within the Iceberg repository have multi-version support. +This means each new engine version that introduces a backwards-incompatible upgrade has its own dedicated integration codebase and release artifacts. +For example, the code for the Iceberg Spark 3.1 integration is under `/spark/v3.1`, and the code for the Iceberg Spark 3.2 integration is under `/spark/v3.2`. +Different artifacts (`iceberg-spark-3.1_2.12` and `iceberg-spark-3.2_2.12`) are released for users to consume. +By doing this, changes across versions are isolated. New features in Iceberg can be developed against the latest features of an engine without breaking support for old APIs in past engine versions. + +## Engine Version Lifecycle + +Each engine version undergoes the following lifecycle stages: + +1. **Beta**: a new engine version is supported, but still in the experimental stage. The engine version itself may still be in preview (e.g. Spark `3.0.0-preview`), or the integration may not yet have full feature parity with older versions. This stage allows Iceberg to release support for an engine version without waiting for feature parity, shortening the release time. +2. **Maintained**: an engine version is actively maintained by the community. Users can expect parity for most features across all the maintained versions. If a feature has to leverage new engine functionality that older versions don't have, then feature parity across maintained versions is not guaranteed. +3. **Deprecated**: an engine version is no longer actively maintained. 
People who are still interested in the version can backport any necessary feature or bug fix from newer versions, but the community will not spend effort in achieving feature parity. Iceberg recommends that users move to a newer version. Contributions to a deprecated version are expected to diminish over time, until eventually no changes are added to it. +4. **End-of-life**: a vote can be initiated in the community to fully remove a deprecated version from the Iceberg repository, marking its end of life. + +## Current Engine Version Lifecycle Status + +### Apache Spark + +| Version    | Lifecycle Stage    | | ---------- | ------------------ | | 2.4        | Deprecated         | | 3.0        | Maintained         | | 3.1        | Maintained         | | 3.2        | Beta               | + +### Apache Flink + +Following the Flink community's guidelines, only the latest two minor versions are actively maintained. +Users should continuously upgrade their Flink version to stay up-to-date. + +| Version    | Lifecycle Stage   | | ---------- | ----------------- | | 1.12       | Deprecated        | | 1.13       | Maintained        | | 1.14       | Maintained        | ### Apache Hive  | Version                         | Lifecycle Stage   | | ------------------------------- | ----------------- | | 2 (recommended >= 2.3)          | Maintained        | | 3                               | Maintained        | + +## Developer Guide + +### Maintaining existing engine versions + +Iceberg recommends the following for developers who are maintaining existing engine versions: + +1. New features should always be prioritized first in the latest version, which is either a maintained or beta version. +2. For features that could be backported, contributors are encouraged to either perform backports to all maintained versions, or at least create issues to track the backports. +3. If the change is small enough, updating all versions in a single PR is acceptable. Otherwise, using separate PRs for each version is recommended. 
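+
+As a sketch of how the version-specific artifacts described above are consumed (using the `iceberg-spark-3.2_2.12` artifact name mentioned earlier and the 0.13.0 release version; check the published release for the exact coordinates), a Spark 3.2 user would add to `pom.xml`:
+
+```xml
+<!-- Spark 3.2 integration; Spark 3.1 users would depend on iceberg-spark-3.1_2.12 instead -->
+<dependency>
+  <groupId>org.apache.iceberg</groupId>
+  <artifactId>iceberg-spark-3.2_2.12</artifactId>
+  <version>0.13.0</version>
+</dependency>
+```
+
+Because each engine version has its own artifact, upgrading the engine means switching artifacts rather than waiting for a combined release.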
+ +### Supporting new engines + +Iceberg recommends that new engines build support by importing the Iceberg libraries into the engine's project. +This allows the Iceberg support to evolve with the engine. +Projects such as [Trino](https://trino.io/docs/current/connector/iceberg.html) and [Presto](https://prestodb.io/docs/current/connector/iceberg.html) are good examples of this support strategy. + +In this approach, an Iceberg version upgrade is needed for an engine to consume new Iceberg features. +To facilitate engine development against unreleased Iceberg features, snapshot release versions can be found in the [Apache snapshot repository](https://repository.apache.org/content/repositories/snapshots/org/apache/iceberg/). + +If bringing an engine directly into the Iceberg main repository is needed, please raise a discussion thread in the [Iceberg community](../community). \ No newline at end of file diff --git a/landing-page/content/common/releases/release-notes.md b/landing-page/content/common/releases/release-notes.md index 4ce3f36ef..e42ffec70 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -73,94 +73,69 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. **High-level features:** * **Core** - * Partition spec ID (`spec_id`) is added to the `data_files` spec and can be queried in related metadata tables [[\#3015](https://github.com/apache/iceberg/pull/3015)] - * ORC delete file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)] * Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] - * Legacy Parquet tables (e.g. 
produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)] - * `NOT_STARTS_WITH` expression support is added to improve Iceberg predicate-pushdown query performance [[\#2062](https://github.com/apache/iceberg/pull/2062)] + * Catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)] * Hadoop catalog now supports atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] - * Iceberg catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)] * **Vendor Integrations** - * `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)] - * Google Cloud Storage (GCS) `FileIO` support is added [[\#3711](https://github.com/apache/iceberg/pull/3711)] + * Google Cloud Storage (GCS) `FileIO` support is added with optimized read and write using GCS streaming transfer [[\#3711](https://github.com/apache/iceberg/pull/3711)] * Aliyun Object Storage Service (OSS) `FileIO` support is added [[\#3553](https://github.com/apache/iceberg/pull/3553)] + * Any S3-compatible storage (e.g. MinIO) can now be accessed through AWS `S3FileIO` with custom endpoint and credential configurations [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)] * AWS `S3FileIO` now supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)] - * S3-compatible cloud storages (e.g. 
MinIO) can now be accessed through AWS `S3FileIO` with custom endpoint and credential configurations [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)] + * AWS `GlueCatalog` now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] + * `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme [[\#3593](https://github.com/apache/iceberg/pull/3593)] +* **File Formats** + * Reading legacy Parquet files (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)] to facilitate Hive-to-Iceberg table migration. + * ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)] * **Spark** - * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] - * Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] - * `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] - * Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)] + * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] + * `RewriteDataFiles` action now supports sort-based table optimization [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction 
[[\#3454](https://github.com/apache/iceberg/pull/3454)]. The corresponding Spark call procedure `rewrite_data_files` is also added [[\#3375](https://github.com/apache/iceberg/pull/3375)] * Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] * Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)] - * Call procedure `ancestors_of` is added to access snapshot ancestor information [[\#3444](https://github.com/apache/iceberg/pull/3444)] - * Truncate [[\#3708](https://github.com/apache/iceberg/pull/3708)] and bucket [[\#3089](https://github.com/apache/iceberg/pull/3368)] UDFs are added for calculating for partition transform values * **Flink** * Flink 1.13 and 1.14 supports are added [[\#3116](https://github.com/apache/iceberg/pull/3116)] [[\#3434](https://github.com/apache/iceberg/pull/3434)] * Flink connector support is added [[\#2666](https://github.com/apache/iceberg/pull/2666)] * Upsert write option is added [[\#2863](https://github.com/apache/iceberg/pull/2863)] - * Avro delete file read support is added [[\#3540](https://github.com/apache/iceberg/pull/3540)] * **Hive** - * Hive tables can now be read through name mapping during Hive-to-Iceberg table migration [[\#3312](https://github.com/apache/iceberg/pull/3312)] - * Table listing in Hive catalog can skip non-Iceberg tables using flag `list-all-tables` [[\#3908](https://github.com/apache/iceberg/pull/3908)] - * `uuid` is now a reserved Iceberg table property and exposed for Iceberg table in a Hive metastore for duplication check [[\#3914](https://github.com/apache/iceberg/pull/3914)] + * Table listing in Hive catalog can now skip non-Iceberg tables using flag `list-all-tables` [[\#3908](https://github.com/apache/iceberg/pull/3908)] + * Hive tables imported to Iceberg can now be read by 
`IcebergInputFormat` [[\#3312](https://github.com/apache/iceberg/pull/3312)] **Important bug fixes:** * **Core** * Iceberg new data file root path is configured through `write.data.path` going forward. `write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)] * Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)] - * Metrics mode for sort order source columns is default to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)] - * `RowDelta` transactions can commit delete files of multiple partition specs instead of just a single one [[\#2985](https://github.com/apache/iceberg/pull/2985)] * Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] - * ORC vectorized read can be configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)] - * Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] - * Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)] - * Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)] - * `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] - * `partition` metadata table supports tables with a partition column named `partition` 
[[\#3845](https://github.com/apache/iceberg/pull/3845)] - * Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)] - * `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)] - * Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)] - * Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] - * Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)] + * Using non-Hadoop `Catalog` and `FileIO` no longer fails when Hadoop dependencies are missing from the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] + * Dropping table now also deletes old metadata files instead of leaving them stranded [[\#3622](https://github.com/apache/iceberg/pull/3622)] + * `history` and `snapshots` metadata tables can now query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] * **Vendor Integrations** - * AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)] - * AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] + * AWS clients are now auto-closed when `FileIO` or `Catalog` is closed. 
There is no need to close the AWS clients separately [[\#2878](https://github.com/apache/iceberg/pull/2878)] +* **File Formats** + * Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] + * ORC vectorized read is now configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)] * **Spark** - * `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] - * Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)] * Insert overwrite mode now skips empty partitions instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)] * Reading unknown partition transform (e.g. 
old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)] * Snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] * `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)] - * SQLs containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)] + * Spark SQL statements containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)] * **Flink** * A `ValidationException` will be thrown if a user configures both `catalog-type` and `catalog-impl`. Previously it chose to use `catalog-type`. 
The new behavior makes Flink consistent with Spark and Hive [[\#3308](https://github.com/apache/iceberg/issues/3308)] * Changelog tables can now be queried without `RowData` serialization issues [[\#3240](https://github.com/apache/iceberg/pull/3240)] - * Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)] + * `java.sql.Time` data type can now be written without data overflow problem [[\#3740](https://github.com/apache/iceberg/pull/3740)] * **Hive** - * Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)] * Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] * Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)] - * Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)] - * Read performance can now be improved by disabling `FileIO` serialization using Hadoop config `iceberg.mr.config.serialization.disabled` [[\#3752](https://github.com/apache/iceberg/pull/3752)] **Other notable changes:** -* The community has finalized the long-term strategy of Spark, Flink and Hive support. Iceberg will start to provide version-specific implementations and runtime executables to ensure a smooth integration experience for Iceberg users. +* The community has finalized the long-term strategy of Spark, Flink and Hive support. See the [Multi-Engine Support](../multi-engine-support) page for more details. * The Iceberg Python module is renamed as [python_legacy](https://github.com/apache/iceberg/tree/master/python_legacy) [[\#3074](https://github.com/apache/iceberg/pull/3074)]. 
A [new Python module](https://github.com/apache/iceberg/tree/master/python) is under development to provide a better user experience for the Python community. See the [Github Project](https://github.com/apache/iceberg/projects/7) for progress. * Iceberg starts to publish daily snapshots in the [Apache snapshot repository](https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/) [[\#3353](https://github.com/apache/iceberg/pull/3353)] for developers who would like to consume the latest unreleased artifact. * Iceberg website is now managed by a separate repository [iceberg-docs](https://github.com/apache/iceberg-docs/) with a new look. See [README](https://github.com/apache/iceberg-docs/blob/main/README.md) for contribution guidelines going forward. -* An OpenAPI specification is developed for Iceberg catalog to prepare for a REST-based Iceberg catalog implementation [[\#3770](https://github.com/apache/iceberg/pull/3770)] -* Dependency version upgrades: - * Gradle is upgraded to 7.3 [[\#3525](https://github.com/apache/iceberg/pull/3525)] - * Parquet is upgraded to 1.12.2 [[\#3551](https://github.com/apache/iceberg/pull/3551)] - * ORC is upgraded to 1.7 [[\#3493](https://github.com/apache/iceberg/pull/3493)] - * Arrow is upgraded to 6.0 [[\#3690](https://github.com/apache/iceberg/pull/3446)] - * Nessie is upgraded to 0.18 [[\#3890](https://github.com/apache/iceberg/pull/3890)] +* An OpenAPI specification for Iceberg catalog is approved by the community, and a REST-based Iceberg catalog based on the specification is currently under development [[\#3770](https://github.com/apache/iceberg/pull/3770)] ## Past releases From 3e27a912dc6f429b015eaf76df0db097d778283c Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Mon, 7 Feb 2022 13:55:36 -0800 Subject: [PATCH 6/8] fix typos --- .../content/common/releases/release-notes.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/landing-page/content/common/releases/release-notes.md 
b/landing-page/content/common/releases/release-notes.md index e42ffec70..b2b45aad7 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -84,7 +84,7 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. * AWS `GlueCatalog` now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] * `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)] * **File Formats** - * Reading legacy Parquet file (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)] to facilitate Hive to Iceberg table migration. + * Reading legacy Parquet file (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported to facilitate Hive to Iceberg table migration [[\#3723](https://github.com/apache/iceberg/pull/3723)] * ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)] * **Spark** * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] @@ -96,7 +96,7 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. 
* Flink connector support is added [[\#2666](https://github.com/apache/iceberg/pull/2666)] * Upsert write option is added [[\#2863](https://github.com/apache/iceberg/pull/2863)] * **Hive** - * Table listing in Hive catalog can now skip non-Iceberg tables using flag `list-all-tables` [[\#3908](https://github.com/apache/iceberg/pull/3908)] + * Table listing in Hive catalog can now skip non-Iceberg tables by disabling flag `list-all-tables` [[\#3908](https://github.com/apache/iceberg/pull/3908)] * Hive tables imported to Iceberg can now be read by `IcebergInputFormat` [[\#3312](https://github.com/apache/iceberg/pull/3312)] **Important bug fixes:** @@ -105,20 +105,20 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. * Iceberg new data file root path is configured through `write.data.path` going forward. `write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)] * Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)] * Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] - * Using non-Hadoop `Catalog` and `FileIO` no longer fails when missing Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] * Dropping table now also deletes old metadata files instead of leaving them stranded [[\#3622](https://github.com/apache/iceberg/pull/3622)] * `history` and `snapshots` metadata tables can now query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] * **Vendor Integrations** + * Using cloud service integrations such as AWS `GlueCatalog` and `S3FileIO` no longer fails when missing Hadoop dependencies in the execution environment 
[[\#3590](https://github.com/apache/iceberg/pull/3590)] * AWS clients are now auto-closed when `FileIO` or `Catalog` is closed. There is no need to close the AWS clients separately [[\#2878](https://github.com/apache/iceberg/pull/2878)] * **File Formats** - * Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] + * Parquet file writing issue is fixed for string data with over 16 unparseable chars (e.g. high/low surrogates) [[\#3760](https://github.com/apache/iceberg/pull/3760)] * ORC vectorized read is now configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)] * **Spark** * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] * Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)] * Reading unknown partition transform (e.g. 
old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)] - * Snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] + * Spark snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] * `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)] * Spark SQL statements containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)] * **Flink** @@ -126,8 +126,8 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. * Changelog tables can now be queried without `RowData` serialization issues [[\#3240](https://github.com/apache/iceberg/pull/3240)] * `java.sql.Time` data type can now be written without data overflow problem [[\#3740](https://github.com/apache/iceberg/pull/3740)] * **Hive** - * Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] - * Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)] + * Hive catalog can now be initialized with a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] + * Table creation can now succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)] **Other notable changes:** From a38f7e3cd1bf0ea8cc96cd31cffe19f4cee0795a Mon Sep 17 00:00:00 
2001 From: Jack Ye Date: Mon, 7 Feb 2022 14:13:20 -0800 Subject: [PATCH 7/8] address comments --- .../common/project/multi-engine-support.md | 2 +- .../content/common/releases/release-notes.md | 18 +++++++----------- 2 files changed, 8 insertions(+), 12 deletions(-) diff --git a/landing-page/content/common/project/multi-engine-support.md b/landing-page/content/common/project/multi-engine-support.md index 3b9c67628..27e27be79 100644 --- a/landing-page/content/common/project/multi-engine-support.md +++ b/landing-page/content/common/project/multi-engine-support.md @@ -88,6 +88,6 @@ This allows the Iceberg support to evolve with the engine. Projects such as [Trino](https://trino.io/docs/current/connector/iceberg.html) and [Presto](https://prestodb.io/docs/current/connector/iceberg.html) are good examples of such support strategy. In this approach, an Iceberg version upgrade is needed for an engine to consume new Iceberg features. -To facilitate engine development against unreleased Iceberg features, a snapshot release version could be found at the [Apache snapshot repository](https://repository.apache.org/content/repositories/snapshots/org/apache/iceberg/). +To facilitate engine development against unreleased Iceberg features, a daily snapshot is published in the [Apache snapshot repository](https://repository.apache.org/content/repositories/snapshots/org/apache/iceberg/). If bringing an engine directly to the Iceberg main repository is needed, please raise a discussion thread in the [Iceberg community](../community). 
\ No newline at end of file diff --git a/landing-page/content/common/releases/release-notes.md b/landing-page/content/common/releases/release-notes.md index b2b45aad7..8f24f057f 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -34,7 +34,7 @@ The latest version of Iceberg is [{{% icebergVersion %}}](https://github.com/apa * [{{% icebergVersion %}} Flink 1.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/{{% icebergVersion %}}/iceberg-flink-runtime-1.12-{{% icebergVersion %}}.jar) * [{{% icebergVersion %}} Hive runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar) -To use Iceberg in Spark/Flink, download the runtime JAR based on your Spark/Flink version and add it to the jars folder of your Spark/Flink install. +To use Iceberg in Spark or Flink, download the runtime JAR for your engine version and add it to the jars folder of your installation. To use Iceberg in Hive, download the Hive runtime JAR and add it to Hive using `ADD JAR`. @@ -75,7 +75,7 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. 
* **Core** * Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] * Catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)] - * Hadoop catalog now supports atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] + * Hadoop catalog now supports atomic commit using a lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] * **Vendor Integrations** * Google Cloud Storage (GCS) `FileIO` support is added with optimized read and write using GCS streaming transfer [[\#3711](https://github.com/apache/iceberg/pull/3711)] * Aliyun Object Storage Service (OSS) `FileIO` support is added [[\#3553](https://github.com/apache/iceberg/pull/3553)] @@ -89,8 +89,9 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. * **Spark** * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] * `RewriteDataFiles` action now supports sort-based table optimization [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]. The corresponding Spark call procedure `rewrite_data_files` is also added [[\#3375](https://github.com/apache/iceberg/pull/3375)] - * Spark SQL time travel support is added. 
Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] - * Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)] + * Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] + * Spark vectorized reads now support row-level deletes [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)] + * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)] * **Flink** * Flink 1.13 and 1.14 supports are added [[\#3116](https://github.com/apache/iceberg/pull/3116)] [[\#3434](https://github.com/apache/iceberg/pull/3434)] * Flink connector support is added [[\#2666](https://github.com/apache/iceberg/pull/2666)] @@ -104,7 +105,6 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. * **Core** * Iceberg new data file root path is configured through `write.data.path` going forward. 
`write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)] * Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)] - * Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] * Dropping table now also deletes old metadata files instead of leaving them stranded [[\#3622](https://github.com/apache/iceberg/pull/3622)] * `history` and `snapshots` metadata tables can now query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] * **Vendor Integrations** @@ -114,10 +114,8 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. * Parquet file writing issue is fixed for string data with over 16 unparseable chars (e.g. high/low surrogates) [[\#3760](https://github.com/apache/iceberg/pull/3760)] * ORC vectorized read is now configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)] * **Spark** - * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] - * Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] - * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import 
[[\#3745](https://github.com/apache/iceberg/issues/3745)] - * Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)] + * For Spark >= 3.1, `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] + * Insert overwrite mode now skips partitions with 0 records instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] * Spark snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] * `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)] * Spark SQL statements containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)] @@ -132,8 +130,6 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. **Other notable changes:** * The community has finalized the long-term strategy of Spark, Flink and Hive support. See the [Multi-Engine Support](../multi-engine-support) page for more details. -* The Iceberg Python module is renamed as [python_legacy](https://github.com/apache/iceberg/tree/master/python_legacy) [[\#3074](https://github.com/apache/iceberg/pull/3074)]. A [new Python module](https://github.com/apache/iceberg/tree/master/python) is under development to provide a better user experience for the Python community. See the [Github Project](https://github.com/apache/iceberg/projects/7) for progress. 
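As a quick illustration of the `rewrite_data_files` and `add_files` call procedures described in these notes, a hedged Spark SQL sketch (the catalog name `my_catalog`, table `db.sample`, and the Parquet source path are placeholder assumptions; the parameter names follow the notes above):

```sql
-- Compact data files with the sort-based strategy (rewrite_data_files)
CALL my_catalog.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'id DESC NULLS LAST'
);

-- Import existing Parquet files into an Iceberg table;
-- duplicate-file checking is on by default and can be disabled
CALL my_catalog.system.add_files(
  table => 'db.sample',
  source_table => '`parquet`.`path/to/data`',
  check_duplicate_files => false
);
```

Both statements run against a Spark session with an Iceberg catalog configured; exact parameter support may vary by release.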
-* Iceberg starts to publish daily snapshots in the [Apache snapshot repository](https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/) [[\#3353](https://github.com/apache/iceberg/pull/3353)] for developers who would like to consume the latest unreleased artifact. * Iceberg website is now managed by a separate repository [iceberg-docs](https://github.com/apache/iceberg-docs/) with a new look. See [README](https://github.com/apache/iceberg-docs/blob/main/README.md) for contribution guidelines going forward. * An OpenAPI specification for Iceberg catalog is approved by the community, and a REST-based Iceberg catalog based on the specification is currently under development [[\#3770](https://github.com/apache/iceberg/pull/3770)] From 1de71eb5119e4070c85ffc01b494a43a0ec41f83 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Mon, 7 Feb 2022 20:22:34 -0800 Subject: [PATCH 8/8] address comments --- .../common/project/multi-engine-support.md | 61 ++++++++++++------- .../content/common/releases/release-notes.md | 51 ++++++++-------- 2 files changed, 64 insertions(+), 48 deletions(-) diff --git a/landing-page/content/common/project/multi-engine-support.md b/landing-page/content/common/project/multi-engine-support.md index 27e27be79..77dced433 100644 --- a/landing-page/content/common/project/multi-engine-support.md +++ b/landing-page/content/common/project/multi-engine-support.md @@ -22,19 +22,33 @@ url: multi-engine-support # Multi-Engine Support -Multi-engine support is a core tenant of Apache Iceberg. +Apache Iceberg is an open standard for huge analytic tables that can be used by any processing engine. The community continuously improves Iceberg core library components to enable integrations with different compute engines that power analytics, business intelligence, machine learning, etc. 
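To make the engine-integration model above concrete, a sketch of attaching a connector at engine launch time (the artifact coordinate assumes Spark 3.2 with Scala 2.12 and Iceberg 0.13.0; adjust for your versions):

```
# Pull the Iceberg Spark runtime from Maven Central when starting Spark
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0
```

In Hive, the equivalent step is adding the Hive runtime jar to the session with `ADD JAR`, as noted earlier in these release notes.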
-Support of [Apache Spark](../../../docs/spark-configuration), [Apache Flink](../../../docs/flink) and [Apache Hive](../../../docs/hive) are provided inside the Iceberg main repository. +Connectors for Spark, Flink and Hive are maintained in the main Iceberg repository. ## Multi-Version Support -Engines maintained within the Iceberg repository have multi-version support. -This means each new version of an engine that introduces backwards incompatible upgrade has its dedicated integration codebase and release artifacts. -For example, the code for Iceberg Spark 3.1 integration is under `/spark/v3.1`, and for Iceberg Spark 3.2 integration is under `/spark/v3.2`, +Processing engine connectors maintained in the Iceberg repository are built for multiple versions. + +For Spark and Flink, each new version that introduces a backwards-incompatible upgrade has its dedicated integration codebase and release artifacts. +For example, the code for Iceberg Spark 3.1 integration is under `/spark/v3.1` and the code for Iceberg Spark 3.2 integration is under `/spark/v3.2`. Different artifacts (`iceberg-spark-3.1_2.12` and `iceberg-spark-3.2_2.12`) are released for users to consume. -By doing this, changes across versions are isolated. New features in Iceberg could be developed against the latest features of an engine without breaking support of old APIs in past engine versions. +By doing this, changes across versions are isolated. +New features in Iceberg can be developed against the latest features of an engine without breaking support of old APIs in past engine versions. + +For Hive, Hive 2 uses the `iceberg-mr` package for Iceberg integration, and Hive 3 requires an additional dependency of the `iceberg-hive3` package. + +### Runtime Jar + +Iceberg provides a runtime connector Jar for each supported version of Spark, Flink and Hive. +When using Iceberg with these engines, the runtime jar is the only classpath addition needed beyond vendor dependencies. 
+For example, to use Iceberg with Spark 3.2 and AWS integrations, `iceberg-spark-runtime-3.2_2.12` and AWS SDK dependencies are needed for the Spark installation. -## Engine Version Lifecycle +Spark and Flink provide different runtime jars for each supported engine version. +Hive 2 and Hive 3 currently share the same runtime jar. +The runtime jar names and latest version download links are listed in [the tables below](./multi-engine-support/#current-engine-version-lifecycle-status). +### Engine Version Lifecycle Each engine version undergoes the following lifecycle stages: @@ -47,29 +61,32@@ Each engine version undergoes the following lifecycle stages: ### Apache Spark -| Version | Lifecycle Stage | -| ---------- | ------------------ | -| 2.4 | Deprecated | -| 3.0 | Maintained | -| 3.1 | Maintained | -| 3.2 | Beta | +Note that for backwards compatibility, the Spark 2.4 and 3.0 artifact names do not follow the naming convention of later versions. + +| Version | Lifecycle Stage | Runtime Artifact | +| ---------- | ------------------ | ---------------- | +| 2.4 | Deprecated | [iceberg-spark-runtime](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/{{% icebergVersion %}}/iceberg-spark-runtime-{{% icebergVersion %}}.jar) | +| 3.0 | Maintained | [iceberg-spark3-runtime](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/{{% icebergVersion %}}/iceberg-spark3-runtime-{{% icebergVersion %}}.jar) | +| 3.1 | Maintained | [iceberg-spark-runtime-3.1_2.12](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.1_2.12-{{% icebergVersion %}}.jar) | +| 3.2 | Maintained | [iceberg-spark-runtime-3.2_2.12](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.2_2.12-{{% icebergVersion %}}.jar) | ### Apache Flink Based on the 
guideline of the Flink community, only the latest 2 minor versions are actively maintained. Users should continuously upgrade their Flink version to stay up-to-date. -| Version | Lifecycle Stage | -| ---------- | ----------------- | -| 1.12 | Deprecated | -| 1.13 | Maintained | -| 1.14 | Maintained | +| Version | Lifecycle Stage | Runtime Artifact | +| ---------- | ----------------- | ---------------- | +| 1.12 | Deprecated | [iceberg-flink-runtime-1.12](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/{{% icebergVersion %}}/iceberg-flink-runtime-1.12-{{% icebergVersion %}}.jar) | +| 1.13 | Maintained | [iceberg-flink-runtime-1.13](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/{{% icebergVersion %}}/iceberg-flink-runtime-1.13-{{% icebergVersion %}}.jar) | +| 1.14 | Maintained | [iceberg-flink-runtime-1.14](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.14/{{% icebergVersion %}}/iceberg-flink-runtime-1.14-{{% icebergVersion %}}.jar) | + ### Apache Hive -| Version | Lifecycle Stage | -| ------------------------------- | ----------------- | -| 2 (recommended >= 2.3) | Maintained | -| 3 | Maintained | +| Version | Recommended minor version | Lifecycle Stage | Runtime Artifact | +| -------------- | ------------------------- | ----------------- | ---------------- | +| 2 | 2.3.8 | Maintained | [iceberg-hive-runtime](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar) | +| 3 | 3.1.2 | Maintained | [iceberg-hive-runtime](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar) | ## Developer Guide diff --git a/landing-page/content/common/releases/release-notes.md b/landing-page/content/common/releases/release-notes.md index 
8f24f057f..94803039e 100644 --- a/landing-page/content/common/releases/release-notes.md +++ b/landing-page/content/common/releases/release-notes.md @@ -36,7 +36,7 @@ The latest version of Iceberg is [{{% icebergVersion %}}](https://github.com/apa To use Iceberg in Spark or Flink, download the runtime JAR for your engine version and add it to the jars folder of your installation. -To use Iceberg in Hive, download the Hive runtime JAR and add it to Hive using `ADD JAR`. +To use Iceberg in Hive 2 or Hive 3, download the Hive runtime JAR and add it to Hive using `ADD JAR`. ### Gradle @@ -75,31 +75,31 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022. * **Core** * Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] * Catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)] - * Hadoop catalog now supports atomic commit using a lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] + * Hadoop catalog can be used with S3 and other file systems safely by using a lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] * **Vendor Integrations** - * Google Cloud Storage (GCS) `FileIO` support is added with optimized read and write using GCS streaming transfer [[\#3711](https://github.com/apache/iceberg/pull/3711)] - * Aliyun Object Storage Service (OSS) `FileIO` support is added [[\#3553](https://github.com/apache/iceberg/pull/3553)] + * Google Cloud Storage (GCS) `FileIO` is supported with optimized read and write using GCS streaming transfer [[\#3711](https://github.com/apache/iceberg/pull/3711)] + * Aliyun Object Storage Service (OSS) `FileIO` is supported [[\#3553](https://github.com/apache/iceberg/pull/3553)] * Any S3-compatible storage (e.g. 
MinIO) can now be accessed through AWS `S3FileIO` with custom endpoint and credential configurations [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)] * AWS `S3FileIO` now supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)] - * AWS `GlueCatalog` now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] - * `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)] -* **File Formats** - * Reading legacy Parquet file (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported to facilitate Hive to Iceberg table migration [[\#3723](https://github.com/apache/iceberg/pull/3723)] - * ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)] + * AWS `GlueCatalog` now displays more table information including table location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] + * Using multiple `FileIO`s based on file path scheme is supported by configuring a `ResolvingFileIO` [[\#3593](https://github.com/apache/iceberg/pull/3593)] * **Spark** - * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] - * `RewriteDataFiles` action now supports sort-based table optimization [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction 
[[\#3454](https://github.com/apache/iceberg/pull/3454)]. The corresponding Spark call procedure `rewrite_data_files` is also added [[\#3375](https://github.com/apache/iceberg/pull/3375)]
- * Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
+ * Spark 3.2 is supported [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
+ * `RewriteDataFiles` action now supports sort-based table optimization [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]. The corresponding Spark call procedure `rewrite_data_files` is also supported [[\#3375](https://github.com/apache/iceberg/pull/3375)]
+ * Time travel queries now use snapshot schema instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
  * Spark vectorized reads now support row-level deletes [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)]
  * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)]
* **Flink**
- * Flink 1.13 and 1.14 supports are added [[\#3116](https://github.com/apache/iceberg/pull/3116)] [[\#3434](https://github.com/apache/iceberg/pull/3434)]
- * Flink connector support is added [[\#2666](https://github.com/apache/iceberg/pull/2666)]
- * Upsert write option is added [[\#2863](https://github.com/apache/iceberg/pull/2863)]
+ *
Flink 1.13 and 1.14 are supported [[\#3116](https://github.com/apache/iceberg/pull/3116)] [[\#3434](https://github.com/apache/iceberg/pull/3434)]
+ * Flink connector is supported [[\#2666](https://github.com/apache/iceberg/pull/2666)]
+ * Upsert write option is supported [[\#2863](https://github.com/apache/iceberg/pull/2863)]
* **Hive**
  * Table listing in Hive catalog can now skip non-Iceberg tables by disabling flag `list-all-tables` [[\#3908](https://github.com/apache/iceberg/pull/3908)]
  * Hive tables imported to Iceberg can now be read by `IcebergInputFormat` [[\#3312](https://github.com/apache/iceberg/pull/3312)]
-
+* **File Formats**
+ * Reading legacy Parquet files (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported to facilitate Hive to Iceberg table migration [[\#3723](https://github.com/apache/iceberg/pull/3723)]
+ * ORC now supports writing delete files [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
+
**Important bug fixes:**

* **Core**
@@ -109,29 +109,28 @@ Apache Iceberg 0.13.0 was released on February 4th, 2022.
  * `history` and `snapshots` metadata tables can now query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)]
* **Vendor Integrations**
  * Using cloud service integrations such as AWS `GlueCatalog` and `S3FileIO` no longer fail when missing Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)]
- * AWS clients are now auto-closed when `FileIO` or `Catalog` is closed. There is no need to close the AWS clients separately [[\#2878](https://github.com/apache/iceberg/pull/2878)]
-* **File Formats**
- * Parquet file writing issue is fixed for string data with over 16 unparseable chars (e.g.
high/low surrogates) [[\#3760](https://github.com/apache/iceberg/pull/3760)]
- * ORC vectorized read is now configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)]
+ * AWS clients are now auto-closed when related `FileIO` or `Catalog` is closed. There is no need to close the AWS clients separately [[\#2878](https://github.com/apache/iceberg/pull/2878)]
* **Spark**
  * For Spark >= 3.1, `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
- * Insert overwrite mode now skips partition with 0 records instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
- * Spark snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)]
+ * Insert overwrite mode now skips partitions with 0 records instead of failing the write operation [[\#2895](https://github.com/apache/iceberg/issues/2895)]
+ * Spark snapshot expiration action now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)]
  * `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)]
- * Spark SQL statements containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)]
+ * Spark SQL filters containing binary or fixed literals can now be pushed down instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)]
* **Flink**
  * A `ValidationException` will be thrown if a user configures both `catalog-type` and `catalog-impl`.
Previously it chose to use `catalog-type`. The new behavior makes Flink consistent with Spark and Hive [[\#3308](https://github.com/apache/iceberg/issues/3308)]
  * Changelog tables can now be queried without `RowData` serialization issues [[\#3240](https://github.com/apache/iceberg/pull/3240)]
  * `java.sql.Time` data type can now be written without data overflow problem [[\#3740](https://github.com/apache/iceberg/pull/3740)]
+ * Avro position delete files can now be read without encountering `NullPointerException` [[\#3540](https://github.com/apache/iceberg/pull/3540)]
* **Hive**
- * Hive catalog can now be initialized with a `null` Hadoop configuration instead of throwing exception [[\3252](https://github.com/apache/iceberg/pull/3252)]
+ * Hive catalog can now be initialized with a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)]
  * Table creation can now succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)]
+* **File Formats**
+ * Parquet file writing issue is fixed for string data with over 16 unparseable chars (e.g. high/low surrogates) [[\#3760](https://github.com/apache/iceberg/pull/3760)]
+ * ORC vectorized read is now configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)]

**Other notable changes:**

* The community has finalized the long-term strategy of Spark, Flink and Hive support. See the [Multi-Engine Support](../multi-engine-support) page for more details.
-* Iceberg website is now managed by a separated repository [iceberg-docs](https://github.com/apache/iceberg-docs/) with a new look. See [README](https://github.com/apache/iceberg-docs/blob/main/README.md) for contribution guidelines going forward.
-* An OpenAPI specification for Iceberg catalog is approved by the community, and a REST-based Iceberg catalog based on the specification is currently under development [[\#3770](https://github.com/apache/iceberg/pull/3770)]

## Past releases