
Conversation

**@jackye1995:**

The latest version of Iceberg is [{{% icebergVersion %}}](https://github.com/apache/iceberg/releases/tag/apache-iceberg-{{% icebergVersion %}}).
You can download the [source code](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz)
and verify its [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc)
and [SHA512 checksum](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512).
**Author:**

I think it's better to have a Multi-engine Support page on the landing page (and as a highlighted feature on the home page), instead of tracking all the runtime jars here. Any thoughts?

* [{{% icebergVersion %}} Hive runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar)

To use Iceberg in Spark, download the runtime JAR and add it to the jars folder of your Spark install. Use iceberg-spark3-runtime for Spark 3, and iceberg-spark-runtime for Spark 2.4.
To use Iceberg in Spark/Flink, download the runtime JAR based on your Spark/Flink version and add it to the jars folder of your Spark/Flink install.
**Contributor:**

I think it's awkward to do this to save space. I'd start with "Spark or Flink" and then just be more generic in the rest of the sentence: "for your engine version" and "of your installation".

* S3-compatible cloud storages (e.g. MinIO) can now be accessed through AWS `S3FileIO` with custom endpoint and credential configurations [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)]
* **Spark**
* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)]
* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
**Contributor:**

Is "merge-on-read" a term people are familiar with? We could also say "Spark 3.2 DELETE supports logical deletes (v2) in addition to rewriting data files"

* **Spark**
* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)]
* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
**Contributor:**

I'd say "delete file compaction" here for the same reason.

* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)]
* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
* Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)]
**Contributor:**

"Stored procedure"?

* Partition spec ID (`spec_id`) is added to the `data_files` spec and can be queried in related metadata tables [[\#3015](https://github.com/apache/iceberg/pull/3015)]
* ORC delete file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
* Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)]
* Legacy Parquet tables (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)]
**Contributor:**

I don't think this is big enough to mention in release notes yet. 2-level lists aren't supported, so I think we should be careful about saying "fully supported" to avoid confusion. Maybe just remove this? It's not very high level.

* Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)]
* Legacy Parquet tables (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)]
* `NOT_STARTS_WITH` expression support is added to improve Iceberg predicate-pushdown query performance [[\#2062](https://github.com/apache/iceberg/pull/2062)]
* Hadoop catalog now supports atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)]
**Contributor:**

I wouldn't say the lock manager is pessimistic. Yes, it is an exclusive lock... but pessimistic/optimistic is more about how Iceberg behaves to most readers and we still commit optimistically and will retry if the commit fails.

* Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
* Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)]
* Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
**Contributor:**

I think we just want to state that the snapshot schema is used instead of the latest. The SQL time travel support isn't something that we plan to expose this way long term.
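For reference, a hedged sketch of what the queries from [#3722] look like, assuming the identifier-suffix syntax (table name, snapshot ID, and timestamp are placeholders); in both forms, columns are resolved against the snapshot's schema rather than the table's latest schema:

```sql
-- Read the table as of a specific snapshot ID (placeholder value).
SELECT * FROM prod.db.sample.snapshot_id_10963874102873;

-- Read the table as of a timestamp in milliseconds (placeholder value).
SELECT * FROM prod.db.sample.at_timestamp_1646000000000;
```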

* `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]
* Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)]
* Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
* Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)]
**Contributor:**

Spark vectorized reads now support row-level deletes

* Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)]
* Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)]
* Call procedure `ancestors_of` is added to access snapshot ancestor information [[\#3444](https://github.com/apache/iceberg/pull/3444)]
* Truncate [[\#3708](https://github.com/apache/iceberg/pull/3708)] and bucket [[\#3368](https://github.com/apache/iceberg/pull/3368)] UDFs are added for calculating partition transform values
**Contributor:**

This didn't add the bucket support. Plus, this is a helper to register support for truncate. I would probably leave this out or be more specific about what was added.

* **Core**
* Iceberg new data file root path is configured through `write.data.path` going forward. `write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)]
* Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)]
* Metrics mode for sort order source columns now defaults to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)]
**Contributor:**

I think this is a feature to highlight, not a bug fix.

* Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)]
* Metrics mode for sort order source columns now defaults to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)]
* `RowDelta` transactions can commit delete files of multiple partition specs instead of just a single one [[\#2985](https://github.com/apache/iceberg/pull/2985)]
* Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)]
**Contributor:**

Is this notable enough to mention?

* Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)]
* ORC vectorized read can be configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)]
* Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)]
* Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)]
**Contributor:**

strained? I would just remove the last part of the sentence, from "instead" onward.

* Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)]
* Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)]
* `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)]
* `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)]
**Contributor:**

This is very specific and probably not notable enough to be included.

* Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)]
* Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)]
* Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)]
* `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)]
**Contributor:**

I don't think this was a common problem. Is it notable enough to include?

* Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)]
* `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)]
* `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)]
* Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)]
**Contributor:**

The deadlock was not in a released version, so we can remove this. Also, I don't recall the "immediately refreshed" one, but I don't see how that's possible. Do you have more information on what that is?

* `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)]
* Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)]
* `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)]
* Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)]
**Contributor:**

Is this notable?
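For context, a hedged sketch of the sequence that previously threw an exception (table and field names are placeholders; the DDL is from Iceberg's Spark SQL extensions):

```sql
-- Dropping a partition field and re-adding one under the same name used to fail;
-- with this fix, the old field is converted to a void transform with a different name.
ALTER TABLE prod.db.sample DROP PARTITION FIELD ts_day;
ALTER TABLE prod.db.sample ADD PARTITION FIELD days(ts) AS ts_day;
```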

* Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)]
* `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)]
* Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)]
* Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)]
**Contributor:**

This is definitely not something widespread enough to include.

* `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)]
* Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)]
* Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)]
* Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)]
**Contributor:**

This is important and should be moved to the top.

* Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)]
* **Vendor Integrations**
* AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)]
* AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
**Contributor:**

This isn't a bug, it's a minor improvement, right?

* AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)]
* AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
* **Spark**
* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
**Contributor:**

Minor, not a bug?

* AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
* **Spark**
* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
**Contributor:**

Which version does it work with?

**Author:**

added for Spark >= 3.1
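A quick hedged example for that case (the table name is a placeholder):

```sql
-- With Iceberg as the Spark session catalog (Spark 3.1+), this now invalidates
-- Spark's cached metadata for the table instead of throwing an exception.
REFRESH TABLE default.events;
```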

* **Spark**
* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
**Contributor:**

This isn't a bug and it's for Spark 2, so it probably isn't relevant enough to mention.

* `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)]
* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
**Contributor:**

This is confusing because Iceberg doesn't have empty partitions.

Looks like the problem was that dynamic partition overwrite would fail to write a dataframe with 0 records.

* `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)]
* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
* `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)]
**Contributor:**

I think this is an improvement, not a bug fix (though there are fixes). And probably notable enough to include above.
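For context, a hedged sketch of the procedure (catalog and table names are placeholders; `check_duplicate_files` is the flag from the note above):

```sql
-- Import files from a (placeholder) Parquet/Hive table into an Iceberg table.
CALL spark_catalog.system.add_files(
  table => 'db.iceberg_tbl',
  source_table => 'db.parquet_tbl'
);

-- Duplicate-file checking is on by default and can be disabled explicitly.
CALL spark_catalog.system.add_files(
  table => 'db.iceberg_tbl',
  source_table => 'db.parquet_tbl',
  check_duplicate_files => false
);
```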

* Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)]
* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
* `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)]
* Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)]
**Contributor:**

This is a minor fix, probably shouldn't include it.

* Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)]
* `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)]
* Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)]
* Snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)]
**Contributor:**

Expiration action, right?

* Changelog tables can now be queried without `RowData` serialization issues [[\#3240](https://github.com/apache/iceberg/pull/3240)]
* Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)]
* **Hive**
* Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)]
**Contributor:**

I don't think this is a bug or notable enough to include.

* Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)]
* **Hive**
* Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)]
* Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)]
**Contributor:**

I think this is minor and also not user facing. I wouldn't include it.

* Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)]
* Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)]
* Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)]
* Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)]
**Contributor:**

Performance improvements aren't bug fixes. Maybe include this above, but I'm not sure.

* Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)]
* Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)]
* Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)]
* Read performance can now be improved by disabling `FileIO` serialization using Hadoop config `iceberg.mr.config.serialization.disabled` [[\#3752](https://github.com/apache/iceberg/pull/3752)]
**Contributor:**

This is also an optimization.


* [{{% icebergVersion %}} source tar.gz](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz) -- [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc) -- [sha512](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512)
* [{{% icebergVersion %}} Spark 3.2 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.2_2.12-{{% icebergVersion %}}.jar)
* [{{% icebergVersion %}} Spark 3.1 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.1_2.12-{{% icebergVersion %}}.jar)
**Member:**

Is there any plan to align the iceberg-spark runtime jar names among 2.4, 3.0, 3.1 and 3.2? I see that Spark 2.4 has the name iceberg-spark-runtime-0.13.0 and Spark 3.0 has the name iceberg-spark3-runtime-0.13.0, but Spark 3.1 and Spark 3.2 have names like iceberg-spark-runtime-3.1_2.12-0.13.0.

Why not just name all the runtime jars with the format iceberg-spark-runtime-$sparkVersion_$scalaVersion-$icebergVersion?

**Author:**

I don't know how important it is to keep backwards compatibility for jar names, but for now it seems that we are trying to keep the old names for 3.0 and 2.4. Any thoughts? @rdblue

**Member:**

I think this jar naming issue doesn't block this PR, but the inconsistent naming across Spark versions seems to increase the cost for downstream users of figuring out the correct groupId/artifactId to use. Personally, I would recommend using the same naming approach for Spark 2.4 & 3.0.

**Contributor:**

I think it's too late to do this for the 0.13.0 release. We may want to consider it for 0.14.0, but we're stuck with it for now.


Multi-engine support is a core tenet of Apache Iceberg.
The community continuously improves Iceberg core library components to enable integrations with different compute engines that power analytics, business intelligence, machine learning, etc.
Support of [Apache Spark](../../../docs/spark-configuration), [Apache Flink](../../../docs/flink) and [Apache Hive](../../../docs/hive) are provided inside the Iceberg main repository.
**Contributor:**

Nit: "Support for" is more natural than "Support of"

**Contributor:**

Or maybe be more specific: "Connectors for Spark, Flink, and Hive are maintained in the main Iceberg repository." Do we need to mention the Hive version so that it isn't confusing when support is built directly into Hive releases?

**Author:**

> Do we need to mention the Hive version so that it isn't confusing when support is built directly into Hive releases?

I thought about that and was a bit hesitant, because people have been using even Hive 2.1 with the Hive 3 jar and somehow things still work in various cases... I don't know how backwards compatible Hive will be in the future.


# Multi-Engine Support

Multi-engine support is a core tenet of Apache Iceberg.
**Contributor:**

I don't think it is clear what multi-engine support is. We should probably change this to something like "Apache Iceberg is an open standard for huge analytic tables that can be used by any processing engine"

**Author:**

updated


## Multi-Version Support

Engines maintained within the Iceberg repository have multi-version support.
**Contributor:**

Rather than "have multi-version support" I would recommend saying "Processing engine connectors maintained in the iceberg repository are built for multiple versions."

**Author:**

updated

| 2.4 | Deprecated |
| 3.0 | Maintained |
| 3.1 | Maintained |
| 3.2 | Beta |
**Contributor:**

Is this true? Why would this not be "Maintained"?

**Author:**

yes it's true, I forgot to update this line

## Multi-Version Support

Engines maintained within the Iceberg repository have multi-version support.
This means each new version of an engine that introduces backwards incompatible upgrade has its dedicated integration codebase and release artifacts.
**Contributor:**

I think this is only true for Spark and Flink. What about saying that "Iceberg provides a runtime connector Jar for each supported version of Spark and Flink."

We should also note that these are the only additions to the classpath needed. You don't have to add any other dependencies to get support.

**Author:**

I see, let me update this section title then and specify this is only for Spark and Flink. I was trying to also make a case for Hive in case it could do that in the future, but it seems unlikely as of today.

**Author (@jackye1995, Feb 8, 2022):**

Updated the doc to say that only Spark and Flink have versioned codebases and jars; Hive as of today continues to use the same runtime for Hive 2 and 3.

| 1.12 | Deprecated |
| 1.13 | Maintained |
| 1.14 | Maintained |
### Apache Hive
**Contributor:**

Newline above?

**Author:**

fixed

* `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)]
* **File Formats**
* Reading legacy Parquet file (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported to facilitate Hive to Iceberg table migration [[\#3723](https://github.com/apache/iceberg/pull/3723)]
* ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
**Contributor:**

I don't think this is accurate. This is support for writing delete files in ORC format. I don't think we should refer to those as "merge-on-read" files.

**Author:**

Sure, I was wondering whether readers would understand "delete file"; that's why I changed it.

* AWS `S3FileIO` now supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)]
* AWS `GlueCatalog` now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
* `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)]
* **File Formats**
**Contributor:**

Could we move this below the engines? I think those are more significant changes.

**Author:**

sure

* **Core**
* Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)]
* Catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)]
* Hadoop catalog now supports atomic commit using a lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)]
**Contributor:**

I wonder if we should be more clear that this is an alternative. Maybe "Hadoop catalog can be used with S3 and other file systems safely by using a lock manager"?

**Author:**

updated
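As a hedged sketch of that alternative, assuming the DynamoDB lock manager from the Iceberg AWS module (catalog name, warehouse path, and lock table are placeholders; catalog properties are normally passed as Spark conf at launch, so the `SET` form here is purely illustrative):

```sql
SET spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog;
SET spark.sql.catalog.hadoop_prod.type=hadoop;
SET spark.sql.catalog.hadoop_prod.warehouse=s3://my-bucket/warehouse;
-- The lock manager supplies the atomic commit that object stores like S3 cannot.
SET spark.sql.catalog.hadoop_prod.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager;
SET spark.sql.catalog.hadoop_prod.lock.table=my_iceberg_lock_table;
```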

* ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
* **Spark**
* Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
* `RewriteDataFiles` action now supports sort-based table optimization [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]. The corresponding Spark call procedure `rewrite_data_files` is also added [[\#3375](https://github.com/apache/iceberg/pull/3375)]
**Contributor:**

Using "is added" is a bit strange because it combines a present-tense verb ("is") with a past-tense verb ("added") that doesn't describe a state of being ("is supported" works because "supported" is a state). The sentence on this line reads much more naturally because it uses "now supports" and those forms agree. You could change the line above to "Spark 3.2 is now supported".

**Author:**

okay, I will change "added" to "supported" for all

**@rdblue (Contributor) left a comment:**

I left a few comments, but I think this is about ready to commit. While we should probably fix some of the phrasing and make things more clear on the multi-engine support page, I think it's close enough to get the update out and fix it later.

**@openinx (Member) left a comment:**

Looks good to me overall, just left several minor comments. Thanks @jackye1995 for the detailed release notes!

Engines maintained within the Iceberg repository have multi-version support.
This means each new version of an engine that introduces backwards incompatible upgrade has its dedicated integration codebase and release artifacts.
For example, the code for the Iceberg Spark 3.1 integration is under `/spark/v3.1`, and the code for the Spark 3.2 integration is under `/spark/v3.2`.
Different artifacts (`iceberg-spark-3.1_2.12` and `iceberg-spark-3.2_2.12`) are released for users to consume.
**Member:**

If Spark 2.4 & 3.0 also followed the 3.1 & 3.2 naming approach, then this sentence would be correct. That's why I raised this question before: https://github.com/apache/iceberg-docs/pull/27/files#r800297155

**Author:**

Yeah I agree, that's also why I only picked 3.1 and 3.2. Let me add the runtime artifact name to the table below; hopefully that clears up users' doubts.

To use Iceberg in Spark or Flink, download the runtime JAR for your engine version and add it to the jars folder of your installation.

To use Iceberg in Hive, download the iceberg-hive-runtime JAR and add it to Hive using `ADD JAR`.
To use Iceberg in Hive, download the Hive runtime JAR and add it to Hive using `ADD JAR`.
**Member:**

Should we add a sentence to show users that both Hive 2 and Hive 3 use the same Hive runtime jar? I see that both Spark & Flink provide version-specific runtime jars, but Hive only provides a single shared jar.

**Author:**

yes agree, let me add a sentence here and also in the multi-engine page.
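For example (the jar path is a placeholder; the same runtime jar serves both Hive 2 and Hive 3):

```sql
-- In a Hive CLI or Beeline session, before querying Iceberg tables:
ADD JAR /path/to/iceberg-hive-runtime-0.13.0.jar;
```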

* Spark snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)]
* `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)]
* Spark SQL statements containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)]
* **Flink**
**Member:**

I think we missed a critical bug fix here: apache/iceberg#3540

**Author:**

yes agree, I removed it thinking it might be too much detail to mention, let me add it back.

**@jackye1995 (Author):**

@openinx @rdblue @samredai thanks for the reviews and approvals, I will merge this for now to unblock the 0.13.0 release announcement. Let me know if there is any further change needed!
