
Conversation

**@jackye1995:**

The latest version of Iceberg is [{{% icebergVersion %}}](https://github.com/apache/iceberg/releases/tag/apache-iceberg-{{% icebergVersion %}}).
You can download the [source code](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz)
and verify its [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc)
and [SHA512 checksum](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512).
**Author:**

I think it's better to have a Multi-engine Support page on the landing page (and as a highlighted feature on the home page), instead of tracking all the runtime jars here. Any thoughts?

* [{{% icebergVersion %}} Hive runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar)

To use Iceberg in Spark, download the runtime JAR and add it to the jars folder of your Spark install. Use iceberg-spark3-runtime for Spark 3, and iceberg-spark-runtime for Spark 2.4.
To use Iceberg in Spark/Flink, download the runtime JAR based on your Spark/Flink version and add it to the jars folder of your Spark/Flink install.
**Contributor:**

I think it's awkward to do this to save space. I'd start with "Spark or Flink" and then just be more generic in the rest of the sentence: "for your engine version" and "of your installation".

* S3-compatible cloud storages (e.g. MinIO) can now be accessed through AWS `S3FileIO` with custom endpoint and credential configurations [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)]
* **Spark**
* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)]
* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
**Contributor:**

Is "merge-on-read" a term people are familiar with? We could also say "Spark 3.2 DELETE supports logical deletes (v2) in addition to rewriting data files"

* **Spark**
* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)]
* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
**Contributor:**

I'd say "delete file compaction" here for the same reason.

* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)]
* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
* Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)]
**Contributor:**

"Stored procedure"?

* Partition spec ID (`spec_id`) is added to the `data_files` spec and can be queried in related metadata tables [[\#3015](https://github.com/apache/iceberg/pull/3015)]
* ORC delete file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
* Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)]
* Legacy Parquet tables (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)]
**Contributor:**

I don't think this is big enough to mention in release notes yet. 2-level lists aren't supported, so I think we should be careful about saying "fully supported" to avoid confusion. Maybe just remove this? It's not very high level.

* Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)]
* Legacy Parquet tables (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)]
* `NOT_STARTS_WITH` expression support is added to improve Iceberg predicate-pushdown query performance [[\#2062](https://github.com/apache/iceberg/pull/2062)]
* Hadoop catalog now supports atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)]
**Contributor:**

I wouldn't say the lock manager is pessimistic. Yes, it is an exclusive lock... but pessimistic/optimistic is more about how Iceberg behaves to most readers and we still commit optimistically and will retry if the commit fails.

* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
* Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)]
* Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
**Contributor:**

I think we just want to state that the snapshot schema is used instead of the latest. The SQL time travel support isn't something that we plan to expose this way long term.
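For reference, a hedged sketch of what the queries from [#3722] look like, assuming the identifier-suffix syntax (table name, snapshot ID, and timestamp are placeholders); in both forms, columns are resolved against the snapshot's schema rather than the table's latest schema:

```sql
-- Read the table as of a specific snapshot ID (placeholder value).
SELECT * FROM prod.db.sample.snapshot_id_10963874102873;

-- Read the table as of a timestamp in milliseconds (placeholder value).
SELECT * FROM prod.db.sample.at_timestamp_1646000000000;
```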

* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
* Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)]
* Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
* Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)]
**Contributor:**

Spark vectorized reads now support row-level deletes

* Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
* Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)]
* Call procedure `ancestors_of` is added to access snapshot ancestor information [[\#3444](https://github.com/apache/iceberg/pull/3444)]
* Truncate [[\#3708](https://github.com/apache/iceberg/pull/3708)] and bucket [[\#3368](https://github.com/apache/iceberg/pull/3368)] UDFs are added for calculating partition transform values
**Contributor:**

This didn't add the bucket support. Plus, this is a helper to register support for truncate. I would probably leave this out or be more specific about what was added.

* **Core**
* Iceberg new data file root path is configured through `write.data.path` going forward. `write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)]
* Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)]
* Metrics mode for sort order source columns now defaults to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)]
**Contributor:**

I think this is a feature to highlight, not a bug fix.

* Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)]
* Metrics mode for sort order source columns now defaults to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)]
* `RowDelta` transactions can commit delete files of multiple partition specs instead of just a single one [[\#2985](https://github.com/apache/iceberg/pull/2985)]
* Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)]
**Contributor:**

Is this notable enough to mention?

* Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)]
* ORC vectorized read can be configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)]
* Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)]
* Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)]
**Contributor:**

strained? I would just remove the last part of the sentence, from "instead" onward.

* Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)]
* Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)]
* `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)]
* `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)]
**Contributor:**

This is very specific and probably not notable enough to be included.

* Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)]
* Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)]
* Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)]
* `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)]
**Contributor:**

I don't think this was a common problem. Is it notable enough to include?

* Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)]
* `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)]
* `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)]
* Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)]
**Contributor:**

The deadlock was not in a released version, so we can remove this. Also, I don't recall the "immediately refreshed" one, but I don't see how that's possible. Do you have more information on what that is?

* `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)]
* Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)]
* `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)]
* Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)]
**Contributor:**

Is this notable?
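For context, a hedged sketch of the sequence that previously threw an exception (table and field names are placeholders; the DDL is from Iceberg's Spark SQL extensions):

```sql
-- Dropping a partition field and re-adding one under the same name used to fail;
-- with this fix, the old field is converted to a void transform with a different name.
ALTER TABLE prod.db.sample DROP PARTITION FIELD ts_day;
ALTER TABLE prod.db.sample ADD PARTITION FIELD days(ts) AS ts_day;
```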

* Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)]
* `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)]
* Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)]
* Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)]
**Contributor:**

This is definitely not something widespread enough to include.

* `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)]
* Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)]
* Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)]
* Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)]
**Contributor:**

This is important and should be moved to the top.

* Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)]
* **Vendor Integrations**
* AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)]
* AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
**Contributor:**

This isn't a bug, it's a minor improvement, right?

* AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)]
* AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
* **Spark**
* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
**Contributor:**

Minor, not a bug?

* AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
* **Spark**
* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
**Contributor:**

Which version does it work with?

**Author:**

added for Spark >= 3.1
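A quick hedged example for that case (the table name is a placeholder):

```sql
-- With Iceberg as the Spark session catalog (Spark 3.1+), this now invalidates
-- Spark's cached metadata for the table instead of throwing an exception.
REFRESH TABLE default.events;
```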

* **Spark**
* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
**Contributor:**

This isn't a bug and it's for Spark 2, so it probably isn't relevant enough to mention.

* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
**Contributor:**

This is confusing because Iceberg doesn't have empty partitions.

Looks like the problem was that dynamic partition overwrite would fail to write a dataframe with 0 records.

* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
* `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)]
**Contributor:**

I think this is an improvement, not a bug fix (though there are fixes). And probably notable enough to include above.
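For context, a hedged sketch of the procedure (catalog and table names are placeholders; `check_duplicate_files` is the flag from the note above):

```sql
-- Import files from a (placeholder) Parquet/Hive table into an Iceberg table.
CALL spark_catalog.system.add_files(
  table => 'db.iceberg_tbl',
  source_table => 'db.parquet_tbl'
);

-- Duplicate-file checking is on by default and can be disabled explicitly.
CALL spark_catalog.system.add_files(
  table => 'db.iceberg_tbl',
  source_table => 'db.parquet_tbl',
  check_duplicate_files => false
);
```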

* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
* `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)]
* Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)]
**Contributor:**

This is a minor fix, probably shouldn't include it.

* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
* `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)]
* Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)]
* Snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)]
**Contributor:**

Expiration action, right?

* Changelog tables can now be queried without `RowData` serialization issues [[\#3240](https://github.com/apache/iceberg/pull/3240)]
* Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)]
* **Hive**
* Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)]
**Contributor:**

I don't think this is a bug or notable enough to include.

* Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)]
* **Hive**
* Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)]
* Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)]
**Contributor:**

I think this is minor and also not user facing. I wouldn't include it.

* Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)]
* Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)]
* Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)]
* Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)]
**Contributor:**

Performance improvements aren't bug fixes. Maybe include this above, but I'm not sure.

* Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)]
* Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)]
* Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)]
* Read performance can now be improved by disabling `FileIO` serialization using Hadoop config `iceberg.mr.config.serialization.disabled` [[\#3752](https://github.com/apache/iceberg/pull/3752)]
**Contributor:**

This is also an optimization.


* [{{% icebergVersion %}} source tar.gz](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz) -- [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc) -- [sha512](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512)
* [{{% icebergVersion %}} Spark 3.2 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.2_2.12-{{% icebergVersion %}}.jar)
* [{{% icebergVersion %}} Spark 3.1 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.1_2.12-{{% icebergVersion %}}.jar)
**Member:**

Is there any plan to align the iceberg-spark runtime jar names among 2.4, 3.0, 3.1 and 3.2? I see that Spark 2.4 has the name iceberg-spark-runtime-0.13.0 and Spark 3.0 has the name iceberg-spark3-runtime-0.13.0, but Spark 3.1 and Spark 3.2 have names like iceberg-spark-runtime-3.1_2.12-0.13.0.

Why not just name all the runtime jars with the format iceberg-spark-runtime-$sparkVersion_$scalaVersion-$icebergVersion?

**Author:**

I don't know how important it is to keep backwards compatibility for jar names, but for now it seems that we are trying to keep the old names for 3.0 and 2.4. Any thoughts? @rdblue

**Member:**

I think this jar naming issue doesn't block this PR, but the inconsistent naming across Spark versions seems to increase the cost for downstream users of figuring out the correct groupId/artifactId to use. Personally, I would recommend using the same naming approach for Spark 2.4 & 3.0.

**Contributor:**

I think it's too late to do this for the 0.13.0 release. We may want to consider it for 0.14.0, but we're stuck with it for now.


Multi-engine support is a core tenet of Apache Iceberg.
The community continuously improves Iceberg core library components to enable integrations with different compute engines that power analytics, business intelligence, machine learning, etc.
Support of [Apache Spark](../../../docs/spark-configuration), [Apache Flink](../../../docs/flink) and [Apache Hive](../../../docs/hive) are provided inside the Iceberg main repository.
**Contributor:**

Nit: "Support for" is more natural than "Support of"

**Contributor:**

Or maybe be more specific: "Connectors for Spark, Flink, and Hive are maintained in the main Iceberg repository." Do we need to mention the Hive version so that it isn't confusing when support is built directly into Hive releases?

**Author:**

> Do we need to mention the Hive version so that it isn't confusing when support is built directly into Hive releases?

I thought about that and was a bit hesitant, because people have been using even Hive 2.1 with the Hive 3 jar and somehow things still work in various cases... I don't know how backwards compatible Hive will be in the future.


# Multi-Engine Support

Multi-engine support is a core tenet of Apache Iceberg.
**Contributor:**

I don't think it is clear what multi-engine support is. We should probably change this to something like "Apache Iceberg is an open standard for huge analytic tables that can be used by any processing engine"

**Author:**

updated


## Multi-Version Support

Engines maintained within the Iceberg repository have multi-version support.
**Contributor:**

Rather than "have multi-version support" I would recommend saying "Processing engine connectors maintained in the iceberg repository are built for multiple versions."

**Author:**

updated

| 2.4 | Deprecated |
| 3.0 | Maintained |
| 3.1 | Maintained |
| 3.2 | Beta |
**Contributor:**

Is this true? Why would this not be "Maintained"?

**Author:**

yes it's true, I forgot to update this line

## Multi-Version Support

Engines maintained within the Iceberg repository have multi-version support.
This means each new version of an engine that introduces backwards incompatible upgrade has its dedicated integration codebase and release artifacts.
**Contributor:**

I think this is only true for Spark and Flink. What about saying that "Iceberg provides a runtime connector Jar for each supported version of Spark and Flink."

We should also note that these are the only additions to the classpath needed. You don't have to add any other dependencies to get support.

**Author:**

I see, let me update this section title then and specify this is only for Spark and Flink. I was trying to also make a case for Hive in case it could do that in the future, but it seems unlikely as of today.

**Author (@jackye1995, Feb 8, 2022):**

Updated the doc to say that only Spark and Flink have versioned codebases and jars; Hive as of today continues to use the same runtime for Hive 2 and 3.

| 1.12 | Deprecated |
| 1.13 | Maintained |
| 1.14 | Maintained |
### Apache Hive
**Contributor:**

Newline above?

**Author:**

fixed

* `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)]
* **File Formats**
* Reading legacy Parquet file (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported to facilitate Hive to Iceberg table migration [[\#3723](https://github.com/apache/iceberg/pull/3723)]
* ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
**Contributor:**

I don't think this is accurate. This is support for writing delete files in ORC format. I don't think we should refer to those as "merge-on-read" files.

**Author:**

Sure, I was wondering whether readers would understand "delete file"; that's why I changed it.

* AWS `S3FileIO` now supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)]
* AWS `GlueCatalog` now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
* `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)]
* **File Formats**
**Contributor:**

Could we move this below the engines? I think those are more significant changes.

**Author:**

sure

* **Core**
* Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)]
* Catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)]
* Hadoop catalog now supports atomic commit using a lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)]
**Contributor:**

I wonder if we should be more clear that this is an alternative. Maybe "Hadoop catalog can be used with S3 and other file systems safely by using a lock manager"?

**Author:**

updated
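As a hedged sketch of that alternative, assuming the DynamoDB lock manager from the Iceberg AWS module (catalog name, warehouse path, and lock table are placeholders; catalog properties are normally passed as Spark conf at launch, so the `SET` form here is purely illustrative):

```sql
SET spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog;
SET spark.sql.catalog.hadoop_prod.type=hadoop;
SET spark.sql.catalog.hadoop_prod.warehouse=s3://my-bucket/warehouse;
-- The lock manager supplies the atomic commit that object stores like S3 cannot.
SET spark.sql.catalog.hadoop_prod.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager;
SET spark.sql.catalog.hadoop_prod.lock.table=my_iceberg_lock_table;
```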

* ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
* **Spark**
* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sort-based table optimization [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]. The corresponding Spark call procedure `rewrite_data_files` is also added [[\#3375](https://github.com/apache/iceberg/pull/3375)]
**Contributor:**

Using "is added" is a bit strange because it combines a present-tense verb ("is") with a past-tense verb ("added") that doesn't describe a state of being ("is supported" works because "supported" is a state). The sentence on this line reads much more naturally because it uses "now supports" and those forms agree. You could change the line above to "Spark 3.2 is now supported".

**Author:**

okay, I will change "added" to "supported" for all

**@rdblue (Contributor) left a comment:**

I left a few comments, but I think this is about ready to commit. While we should probably fix some of the phrasing and make things more clear on the multi-engine support page, I think it's close enough to get the update out and fix it later.

**@openinx (Member) left a comment:**

Looks good to me overall, just left several minor comments. Thanks @jackye1995 for the detailed release notes!

Engines maintained within the Iceberg repository have multi-version support.
This means each new version of an engine that introduces backwards incompatible upgrade has its dedicated integration codebase and release artifacts.
For example, the code for the Iceberg Spark 3.1 integration is under `/spark/v3.1`, and the code for the Spark 3.2 integration is under `/spark/v3.2`.
Different artifacts (`iceberg-spark-3.1_2.12` and `iceberg-spark-3.2_2.12`) are released for users to consume.
**Member:**

If Spark 2.4 & 3.0 also followed the 3.1 & 3.2 naming approach, then this sentence would be correct. That's why I raised this question before: https://github.com/apache/iceberg-docs/pull/27/files#r800297155

**Author:**

Yeah I agree, that's also why I only picked 3.1 and 3.2. Let me add the runtime artifact name to the table below; hopefully that clears up users' doubts.

To use Iceberg in Spark or Flink, download the runtime JAR for your engine version and add it to the jars folder of your installation.

To use Iceberg in Hive, download the iceberg-hive-runtime JAR and add it to Hive using `ADD JAR`.
To use Iceberg in Hive, download the Hive runtime JAR and add it to Hive using `ADD JAR`.
**Member:**

Should we add a sentence to show users that both Hive 2 and Hive 3 use the same Hive runtime jar? I see that both Spark & Flink provide version-specific runtime jars, but Hive only provides a single shared jar.

**Author:**

yes agree, let me add a sentence here and also in the multi-engine page.
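For example (the jar path is a placeholder; the same runtime jar serves both Hive 2 and Hive 3):

```sql
-- In a Hive CLI or Beeline session, before querying Iceberg tables:
ADD JAR /path/to/iceberg-hive-runtime-0.13.0.jar;
```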

* Spark snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)]
* `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)]
* Spark SQL statements containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)]
* **Flink**
**Member:**

I think we missed a critical bug fix here: apache/iceberg#3540

**Author:**

yes agree, I removed it thinking it might be too much detail to mention, let me add it back.

**@jackye1995 (Author):**

@openinx @rdblue @samredai thanks for the reviews and approvals, I will merge this for now to unblock the 0.13.0 release announcement. Let me know if there is any further change needed!
