Add 0.13.0 release note #27
Conversation
| The latest version of Iceberg is [{{% icebergVersion %}}](https://github.com/apache/iceberg/releases/tag/apache-iceberg-{{% icebergVersion %}}). | ||
| You can download the [source code](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz) | ||
| and verify its [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc) | ||
| and [SHA512 checksum](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512). |
I think it's better to have a Multi-engine Support page on the landing page (and as a highlighted feature on the home page), instead of tracking all the runtime jars here. Any thoughts?
| * [{{% icebergVersion %}} Hive runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/{{% icebergVersion %}}/iceberg-hive-runtime-{{% icebergVersion %}}.jar) | ||
| To use Iceberg in Spark, download the runtime JAR and add it to the jars folder of your Spark install. Use iceberg-spark3-runtime for Spark 3, and iceberg-spark-runtime for Spark 2.4. | ||
| To use Iceberg in Spark/Flink, download the runtime JAR based on your Spark/Flink version and add it to the jars folder of your Spark/Flink install. |
I think it's awkward to do this to save space. I'd start with "Spark or Flink" and then just be more generic in the rest of the sentence: "for your engine version" and "of your installation".
| * S3-compatible cloud storages (e.g. MinIO) can now be accessed through AWS `S3FileIO` with custom endpoint and credential configurations [[\#3656](https://github.com/apache/iceberg/pull/3656)] [[\#3658](https://github.com/apache/iceberg/pull/3658)] | ||
| * **Spark** | ||
| * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] | ||
| * Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] |
Is "merge-on-read" a term people are familiar with? We could also say "Spark 3.2 DELETE supports logical deletes (v2) in addition to rewriting data files"
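To make the term concrete, here is an illustrative sketch (not Iceberg's actual reader API, and the file names are made up) of what merge-on-read `DELETE` means: the DELETE writes small delete files recording positions to drop, and readers filter those rows out at scan time instead of rewriting the data files.

```python
# Model of merge-on-read: data files stay untouched; position delete files
# are merged in at read time. This is a toy sketch, not Iceberg code.
def read_with_deletes(data_rows, position_deletes):
    """data_rows: (file, position, value) triples;
    position_deletes: set of (file, position) pairs to drop."""
    return [value for (file, pos, value) in data_rows
            if (file, pos) not in position_deletes]

data = [("f1.parquet", 0, "a"), ("f1.parquet", 1, "b"), ("f2.parquet", 0, "c")]
deletes = {("f1.parquet", 1)}  # written by a merge-on-read DELETE
```

A copy-on-write DELETE would instead rewrite `f1.parquet` without row 1; merge-on-read defers that work to compaction.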
| * **Spark** | ||
| * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] | ||
| * Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] | ||
| * `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] |
I'd say "delete file compaction" here for the same reason.
| * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] | ||
| * Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] | ||
| * `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] | ||
| * Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)] |
"Stored procedure"?
| * Partition spec ID (`spec_id`) is added to the `data_files` spec and can be queried in related metadata tables [[\#3015](https://github.com/apache/iceberg/pull/3015)] | ||
| * ORC delete file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)] | ||
| * Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] | ||
| * Legacy Parquet tables (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)] |
I don't think this is big enough to mention in release notes yet. 2-level lists aren't supported, so I think we should be careful about saying "fully supported" to avoid confusion. Maybe just remove this? It's not very high level.
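The `cache.expiration-interval-ms` property quoted in the hunk above can be pictured with a toy time-to-live cache (a sketch of the idea only, not Iceberg's actual caching catalog): entries older than the interval are reloaded on the next access.

```python
import time

# Toy sketch of catalog cache expiration: a loaded table is served from
# cache until the expiration interval elapses, then reloaded on access.
class ExpiringCatalogCache:
    def __init__(self, loader, expiration_interval_ms):
        self.loader = loader
        self.ttl = expiration_interval_ms / 1000.0
        self.entries = {}  # table name -> (table, loaded_at)

    def load_table(self, name):
        hit = self.entries.get(name)
        if hit is not None and time.monotonic() - hit[1] < self.ttl:
            return hit[0]  # still fresh, serve from cache
        table = self.loader(name)
        self.entries[name] = (table, time.monotonic())
        return table
```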
| * Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)] | ||
| * Legacy Parquet tables (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true` and migrated to Iceberg) are fully supported [[\#3723](https://github.com/apache/iceberg/pull/3723)] | ||
| * `NOT_STARTS_WITH` expression support is added to improve Iceberg predicate-pushdown query performance [[\#2062](https://github.com/apache/iceberg/pull/2062)] | ||
| * Hadoop catalog now supports atomic commit using a pessimistic lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)] |
I wouldn't say the lock manager is pessimistic. Yes, it is an exclusive lock, but pessimistic/optimistic is more about how Iceberg behaves toward most readers, and we still commit optimistically and retry if the commit fails.
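The `NOT_STARTS_WITH` pushdown in the hunk above works on column bounds. A hedged sketch of the idea (simplified, not Iceberg's actual evaluator): if a file's lower and upper bounds for a column both start with the prefix, every value in the file does too, so a notStartsWith predicate can skip the whole file.

```python
# File-pruning check for NOT_STARTS_WITH using per-file column bounds.
# Simplified model for illustration only.
def may_match_not_starts_with(prefix, lower_bound, upper_bound):
    all_values_share_prefix = (lower_bound.startswith(prefix)
                               and upper_bound.startswith(prefix))
    return not all_values_share_prefix  # False => the file can be skipped
```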
| * Spark 3.2 supports merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)] | ||
| * `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] | ||
| * Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)] | ||
| * Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] |
I think we just want to state that the snapshot schema is used instead of the latest. The SQL time travel support isn't something that we plan to expose this way long term.
| * `RewriteDataFiles` action now supports sorting [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)] | ||
| * Call procedure `rewrite_data_files` is added to perform Iceberg data file optimization and compaction [[\#3375](https://github.com/apache/iceberg/pull/3375)] | ||
| * Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] | ||
| * Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)] |
Spark vectorized reads now support row-level deletes
| * Spark SQL time travel support is added. Snapshot schema is now used instead of the table's latest schema [[\#3722](https://github.com/apache/iceberg/pull/3722)] | ||
| * Spark vectorized merge-on-read support is added [[\#3557](https://github.com/apache/iceberg/pull/3557)] [[\#3287](https://github.com/apache/iceberg/pull/3287)] | ||
| * Call procedure `ancestors_of` is added to access snapshot ancestor information [[\#3444](https://github.com/apache/iceberg/pull/3444)] | ||
| * Truncate [[\#3708](https://github.com/apache/iceberg/pull/3708)] and bucket [[\#3368](https://github.com/apache/iceberg/pull/3368)] UDFs are added for calculating partition transform values |
This didn't add the bucket support. Plus, this is a helper to register support for truncate. I would probably leave this out or be more specific about what was added.
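For readers unfamiliar with the transform itself, this is what a truncate-transform function computes per the Iceberg partition transform spec (a simplified Python sketch, not the Spark function the PR registers): integers floor to a multiple of the width, strings keep a prefix.

```python
# Iceberg truncate transform semantics, simplified for illustration.
def truncate(width, value):
    if isinstance(value, int):
        return value - (value % width)  # Python % floors toward -infinity
    if isinstance(value, str):
        return value[:width]
    raise TypeError(f"unsupported type: {type(value).__name__}")
```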
| * **Core** | ||
| * Iceberg new data file root path is configured through `write.data.path` going forward. `write.folder-storage.path` and `write.object-storage.path` are deprecated [[\#3094](https://github.com/apache/iceberg/pull/3094)] | ||
| * Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)] | ||
| * Metrics mode for sort order source columns now defaults to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)] |
I think this is a feature to highlight, not a bug fix.
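A hedged illustration of what the `truncate[16]` metrics mode above means: instead of storing full min/max values for a column, only a 16-character prefix is kept, which is still enough for prefix-based file pruning. The upper bound's last kept character is bumped so it remains a valid upper bound. (Simplified sketch: ignores character overflow and non-string types.)

```python
# Truncated column bounds for metrics: lower bound is a plain prefix,
# upper bound bumps its last kept character to stay >= all real values.
def truncated_lower_bound(value, length=16):
    return value[:length]

def truncated_upper_bound(value, length=16):
    if len(value) <= length:
        return value
    prefix = value[:length]
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)
```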
| * Catalog commit status is `UNKNOWN` instead of `FAILURE` when new metadata location cannot be found in snapshot history [[\#3717](https://github.com/apache/iceberg/pull/3717)] | ||
| * Metrics mode for sort order source columns now defaults to at least `truncate[16]` for better predicate pushdown performance [[\#2240](https://github.com/apache/iceberg/pull/2240)] | ||
| * `RowDelta` transactions can commit delete files of multiple partition specs instead of just a single one [[\#2985](https://github.com/apache/iceberg/pull/2985)] | ||
| * Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] |
Is this notable enough to mention?
| * Hadoop catalog now returns false when dropping a table that does not exist instead of returning true [[\#3097](https://github.com/apache/iceberg/pull/3097)] | ||
| * ORC vectorized read can be configured using `read.orc.vectorization.batch-size` instead of `read.parquet.vectorization.batch-size` [[\#3133](https://github.com/apache/iceberg/pull/3133)] | ||
| * Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] | ||
| * Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)] |
strained? I would just remove the last part of the sentence, from "instead" onward.
| * Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)] | ||
| * Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)] | ||
| * `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] | ||
| * `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)] |
This is very specific and probably not notable enough to be included.
| * Using `Catalog` and `FileIO` no longer requires Hadoop dependencies in the execution environment [[\#3590](https://github.com/apache/iceberg/pull/3590)] | ||
| * Dropping table now deletes old metadata files instead of leaving them strained [[\#3622](https://github.com/apache/iceberg/pull/3622)] | ||
| * Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)] | ||
| * `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] |
I don't think this was a common problem. Is it notable enough to include?
| * Iceberg thread pool now uses at least 2 threads for query planning (can be changed with the `iceberg.worker.num-threads` config) [[\#3811](https://github.com/apache/iceberg/pull/3811)] | ||
| * `history` and `snapshots` metadata tables can query tables with no current snapshot instead of returning empty [[\#3812](https://github.com/apache/iceberg/pull/3812)] | ||
| * `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)] | ||
| * Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)] |
The deadlock was not in a released version, so we can remove this. Also, I don't recall the "immediately refreshed" one, but I don't see how that's possible. Do you have more information on what that is?
| * `partition` metadata table supports tables with a partition column named `partition` [[\#3845](https://github.com/apache/iceberg/pull/3845)] | ||
| * Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)] | ||
| * `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)] | ||
| * Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)] |
Is this notable?
| * Potential deadlock risk in catalog caching is resolved [[\#3801](https://github.com/apache/iceberg/pull/3801)], and cache is immediately refreshed when table is reloaded in another program [[\#3873](https://github.com/apache/iceberg/pull/3873)] | ||
| * `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)] | ||
| * Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)] | ||
| * Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] |
This is definitely not something wide-spread enough to include.
| * `STARTS_WITH` expression now supports filtering `null` values instead of throwing exception [[\#3645](https://github.com/apache/iceberg/pull/3645)] | ||
| * Deleting and adding a partition field with the same name is supported instead of throwing exception (deleting and adding the same field is a noop) [[\#3632](https://github.com/apache/iceberg/pull/3632)] [[\#3954](https://github.com/apache/iceberg/pull/3954)] | ||
| * Parquet file writing issue is fixed for data with over 16 unparseable chars [[\#3760](https://github.com/apache/iceberg/pull/3760)] | ||
| * Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)] |
This is important and should be moved to the top.
| * Delete manifests with only existing files are now included in scan planning instead of being ignored [[\#3945](https://github.com/apache/iceberg/pull/3945)] | ||
| * **Vendor Integrations** | ||
| * AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)] | ||
| * AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] |
This isn't a bug, it's a minor improvement, right?
| * AWS related client connection resources are now properly closed when not used [[\#2878](https://github.com/apache/iceberg/pull/2878)] | ||
| * AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] | ||
| * **Spark** | ||
| * `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] |
Minor, not a bug?
| * AWS Glue catalog now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)] | ||
| * **Spark** | ||
| * `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] | ||
| * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] |
Which version does it work with?
added for Spark >= 3.1
| * **Spark** | ||
| * `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] | ||
| * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] | ||
| * Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)] |
This isn't a bug and it's for Spark 2, so it probably isn't relevant enough to mention.
| * `RewriteDataFiles` action is improved to produce files with more balanced output size [[\#3073](https://github.com/apache/iceberg/pull/3073)] [[\#3292](https://github.com/apache/iceberg/pull/3292)] | ||
| * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] | ||
| * Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)] | ||
| * Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] |
This is confusing because Iceberg doesn't have empty partitions.
Looks like the problem was that dynamic partition overwrite would fail to write a dataframe with 0 records.
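The intended semantics can be modeled in a few lines (a sketch only, not Spark/Iceberg code): dynamic partition overwrite derives the partitions to replace from the incoming rows, so an empty dataframe should replace nothing and succeed as a no-op rather than fail.

```python
# Dynamic partition overwrite model: only partitions touched by the new
# rows are replaced; an empty input touches no partitions.
def dynamic_partition_overwrite(table_rows, new_rows, partition_of):
    touched = {partition_of(row) for row in new_rows}
    kept = [row for row in table_rows if partition_of(row) not in touched]
    return kept + new_rows
```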
| * `REFRESH TABLE` can now be used with Spark session catalog instead of throwing exception [[\#3072](https://github.com/apache/iceberg/pull/3072)] | ||
| * Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)] | ||
| * Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] | ||
| * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2779](https://github.com/apache/iceberg/issues/2779)], skips folders without files [[\#3455](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#3778](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)] |
I think this is an improvement, not a bug fix (though there are fixes). And probably notable enough to include above.
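A hedged sketch of the duplicate handling the note describes (mirroring the release-note wording, not the actual procedure internals): files the table already references are skipped by default, and the check can be disabled with a flag like `check_duplicate_files`.

```python
# add_files-style selection: drop candidates already referenced by the
# table unless the duplicate check is explicitly disabled.
def select_files_to_add(referenced_files, candidate_files,
                        check_duplicate_files=True):
    if not check_duplicate_files:
        return list(candidate_files)
    referenced = set(referenced_files)
    return [f for f in candidate_files if f not in referenced]
```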
| * Read performance is improved using better table size estimation [[\#3134](https://github.com/apache/iceberg/pull/3134)] | ||
| * Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] | ||
| * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2895](https://github.com/apache/iceberg/issues/2779)], skips folder without file [[\#2895](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#2895](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)] | ||
| * Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)] |
This is a minor fix, probably shouldn't include it.
| * Insert overwrite mode now skips empty partition instead of throwing exception [[\#2895](https://github.com/apache/iceberg/issues/2895)] | ||
| * `add_files` procedure now skips duplicated files by default (can be turned off with the `check_duplicate_files` flag) [[\#2895](https://github.com/apache/iceberg/issues/2779)], skips folder without file [[\#2895](https://github.com/apache/iceberg/issues/3455)] and partitions with `null` values [[\#2895](https://github.com/apache/iceberg/issues/3778)] instead of throwing exception, and supports partition pruning for faster table import [[\#3745](https://github.com/apache/iceberg/issues/3745)] | ||
| * Reading unknown partition transform (e.g. old reader reading new transform type) will now throw `ValidationException` instead of causing unknown behavior downstream [[\#2992](https://github.com/apache/iceberg/issues/2992)] | ||
| * Snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)] |
Expiration action, right?
| * Changelog tables can now be queried without `RowData` serialization issues [[\#3240](https://github.com/apache/iceberg/pull/3240)] | ||
| * Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)] | ||
| * **Hive** | ||
| * Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)] |
I don't think this a bug or notable enough to include.
| * Data overflow problem is fixed when writing time data of type `java.sql.Time` [[\#3740](https://github.com/apache/iceberg/pull/3740)] | ||
| * **Hive** | ||
| * Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)] | ||
| * Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] |
I think this is minor and also not user facing. I wouldn't include it.
| * Hive metastore client retry logic is improved using `RetryingMetaStoreClient` [[\#3099](https://github.com/apache/iceberg/pull/3099)] | ||
| * Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] | ||
| * Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)] | ||
| * Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)] |
Performance improvements aren't bug fixes. Maybe include this above, but I'm not sure.
| * Hive catalog can now be initialized using a `null` Hadoop configuration instead of throwing exception [[\#3252](https://github.com/apache/iceberg/pull/3252)] | ||
| * Table creation can succeed instead of throwing exception when some columns do not have comments [[\#3531](https://github.com/apache/iceberg/pull/3531)] | ||
| * Vectorized read performance is improved by using split offset information in `OrcTail` [[\#3748](https://github.com/apache/iceberg/pull/3748)] | ||
| * Read performance can now be improved by disabling `FileIO` serialization using Hadoop config `iceberg.mr.config.serialization.disabled` [[\#3752](https://github.com/apache/iceberg/pull/3752)] |
This is also an optimization.
| * [{{% icebergVersion %}} source tar.gz](https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz) -- [signature](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.asc) -- [sha512](https://downloads.apache.org/iceberg/apache-iceberg-{{% icebergVersion %}}/apache-iceberg-{{% icebergVersion %}}.tar.gz.sha512)
| * [{{% icebergVersion %}} Spark 3.2 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.2_2.12-{{% icebergVersion %}}.jar)
| * [{{% icebergVersion %}} Spark 3.1 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.1_2.12-{{% icebergVersion %}}.jar)
Is there any plan to align the iceberg-spark runtime jar names among 2.4, 3.0, 3.1 and 3.2? I see Spark 2.4 has the name iceberg-spark-runtime-0.13.0, Spark 3.0 has the name iceberg-spark3-runtime-0.13.0, but Spark 3.1 and 3.2 have names like iceberg-spark-runtime-3.1_2.12-0.13.0.
Why not just name all the runtime jars using the format iceberg-spark-runtime-$sparkVersion_$scalaVersion-$icebergVersion?
I don't know how important it is to keep backwards compatibility for jar names, but for now it seems that we are trying to keep 3.0 and 2.4 with old names. Any thoughts? @rdblue
I don't think this jar naming issue blocks this PR, but the inconsistent naming approach across Spark versions seems to increase the cost for downstream users to understand which groupId/artifactId is the correct one to use. My personal recommendation is to use a similar naming approach for Spark 2.4 & 3.0.
I think it's too late to do this for the 0.13.0 release. We may want to consider it for 0.14.0, but we're stuck with it for now.
| Multi-engine support is a core tenet of Apache Iceberg.
| The community continuously improves Iceberg core library components to enable integrations with different compute engines that power analytics, business intelligence, machine learning, etc.
| Support of [Apache Spark](../../../docs/spark-configuration), [Apache Flink](../../../docs/flink) and [Apache Hive](../../../docs/hive) are provided inside the Iceberg main repository.
Nit: "Support for" is more natural than "Support of"
Or maybe be more specific: "Connectors for Spark, Flink, and Hive are maintained in the main Iceberg repository." Do we need to mention the Hive version so that it isn't confusing when support is built directly into Hive releases?
Do we need to mention the Hive version so that it isn't confusing when support is built directly into Hive releases?
I thought about that and was a bit hesitant, because people have been using even Hive 2.1 with the Hive 3 jar and somehow things still work in various cases... I don't know how backwards compatible Hive will be in the future.
| # Multi-Engine Support

| Multi-engine support is a core tenet of Apache Iceberg.
I don't think it is clear what multi-engine support is. We should probably change this to something like "Apache Iceberg is an open standard for huge analytic tables that can be used by any processing engine"
updated
| ## Multi-Version Support

| Engines maintained within the Iceberg repository have multi-version support.
Rather than "have multi-version support" I would recommend saying "Processing engine connectors maintained in the iceberg repository are built for multiple versions."
updated
| | 2.4 | Deprecated |
| | 3.0 | Maintained |
| | 3.1 | Maintained |
| | 3.2 | Beta |
Is this true? Why would this not be "Maintained"?
yes it's true, I forgot to update this line
| ## Multi-Version Support

| Engines maintained within the Iceberg repository have multi-version support.
| This means each new version of an engine that introduces backwards incompatible upgrade has its dedicated integration codebase and release artifacts.
I think this is only true for Spark and Flink. What about saying that "Iceberg provides a runtime connector Jar for each supported version of Spark and Flink."
We should also note that these are the only additions to the classpath needed. You don't have to add any other dependencies to get support.
I see, let me update this section title then and specify this is only for Spark and Flink. I was trying to also make a case for Hive in case it could do that in the future, but it seems unlikely as of today.
Updated the doc saying that only Spark and Flink have versioned codebase and jars, Hive as of today continues to use the same runtime for Hive 2 and 3.
| | 1.12 | Deprecated |
| | 1.13 | Maintained |
| | 1.14 | Maintained |
| ### Apache Hive
Newline above?
fixed
| * `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)]
| * **File Formats**
| * Reading legacy Parquet file (e.g. produced by `ParquetHiveSerDe` or Spark `spark.sql.parquet.writeLegacyFormat=true`) is now fully supported to facilitate Hive to Iceberg table migration [[\#3723](https://github.com/apache/iceberg/pull/3723)]
| * ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
I don't think this is accurate. This is support for writing delete files in ORC format. I don't think we should refer to those as "merge-on-read" files.
Sure, I was wondering whether readers would understand "delete file", which is why I changed it.
| * AWS `S3FileIO` now supports server-side checksum validation [[\#3813](https://github.com/apache/iceberg/pull/3813)]
| * AWS `GlueCatalog` now displays more table information including location, description [[\#3467](https://github.com/apache/iceberg/pull/3467)] and used columns [[\#3888](https://github.com/apache/iceberg/pull/3888)]
| * `ResolvingFileIO` is added to support using multiple `FileIO`s to access different storage providers based on file scheme. [[\#3593](https://github.com/apache/iceberg/pull/3593)]
| * **File Formats**
Could we move this below the engines? I think those are more significant changes.
sure
| * **Core**
| * Catalog caching now supports cache expiration through catalog property `cache.expiration-interval-ms` [[\#3543](https://github.com/apache/iceberg/pull/3543)]
| * Catalog now supports registration of Iceberg table from a given metadata file location [[\#3851](https://github.com/apache/iceberg/pull/3851)]
| * Hadoop catalog now supports atomic commit using a lock manager [[\#3663](https://github.com/apache/iceberg/pull/3663)]
I wonder if we should be more clear that this is an alternative. Maybe "Hadoop catalog can be used with S3 and other file systems safely by using a lock manager"?
updated
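To make the two new catalog knobs concrete, here is an illustrative Spark catalog configuration combining cache expiration (#3543) with a lock manager for safe Hadoop-catalog commits on object stores (#3663). The catalog name, warehouse path, and lock table are hypothetical, and the lock-manager class name is an assumption that should be verified against the AWS module documentation:

```properties
# Illustrative spark-defaults.conf fragment (names and paths are hypothetical)
spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type=hadoop
spark.sql.catalog.demo.warehouse=s3://example-bucket/warehouse
# expire cached table metadata after 5 minutes (#3543)
spark.sql.catalog.demo.cache-enabled=true
spark.sql.catalog.demo.cache.expiration-interval-ms=300000
# lock manager for atomic commits (#3663); class name is an assumption — verify before use
spark.sql.catalog.demo.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
spark.sql.catalog.demo.lock.table=demoLockTable
```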
| * ORC merge-on-read file write support is added [[\#3248](https://github.com/apache/iceberg/pull/3248)] [[\#3250](https://github.com/apache/iceberg/pull/3250)] [[\#3366](https://github.com/apache/iceberg/pull/3366)]
| * **Spark**
| * Spark 3.2 support is added [[\#3335](https://github.com/apache/iceberg/pull/3335)] with merge-on-read `DELETE` [[\#3970](https://github.com/apache/iceberg/pull/3970)]
| * `RewriteDataFiles` action now supports sort-based table optimization [[\#2829](https://github.com/apache/iceberg/pull/2829)] and merge-on-read delete compaction [[\#3454](https://github.com/apache/iceberg/pull/3454)]. The corresponding Spark call procedure `rewrite_data_files` is also added [[\#3375](https://github.com/apache/iceberg/pull/3375)]
Using "is added" is a bit strange because it combines a present-tense verb ("is") with a past-tense verb ("added") that doesn't describe a state of being ("is supported" works because "supported" is a state). The sentence on this line reads much more naturally because it uses "now supports" and those forms agree. You could change the line above to "Spark 3.2 is now supported".
okay I will change added to supported for all
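For illustration, merge-on-read `DELETE` in Spark 3.2 is opted into per table via table properties. The table name and predicate below are hypothetical, and the property names should be double-checked against the configuration docs:

```sql
-- Illustrative only: enable merge-on-read deletes on a format v2 table
ALTER TABLE demo.db.events SET TBLPROPERTIES (
  'format-version' = '2',
  'write.delete.mode' = 'merge-on-read'
);
-- Deletes are then written as delete files instead of rewriting data files
DELETE FROM demo.db.events WHERE event_ts < TIMESTAMP '2021-01-01 00:00:00';
```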
rdblue
left a comment
I left a few comments, but I think this is about ready to commit. While we should probably fix some of the phrasing and make things more clear on the multi-engine support page, I think it's close enough to get the update out and fix it later.
openinx
left a comment
Looks good to me overall, just left several minor comments. Thanks @jackye1995 for the detailed release notes!
| Engines maintained within the Iceberg repository have multi-version support.
| This means each new version of an engine that introduces backwards incompatible upgrade has its dedicated integration codebase and release artifacts.
| For example, the code for Iceberg Spark 3.1 integration is under `/spark/v3.1`, and for Iceberg Spark 3.2 integration is under `/spark/v3.2`.
| Different artifacts (`iceberg-spark-3.1_2.12` and `iceberg-spark-3.2_2.12`) are released for users to consume.
If spark 2.4 & 3.0 also follow the 3.1 & 3.2 naming approach, then this sentence should be correct. That's why I raise this question before: https://github.com/apache/iceberg-docs/pull/27/files#r800297155
Yeah I agree, that's also why I only picked 3.1 and 3.2. Let me add the runtime artifact name to the table below, hopefully that will clear up users' doubts.
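For example, the versioned Spark artifacts resolve to Maven coordinates like the following (shown for this release with Spark 3.2 / Scala 2.12; adjust the artifact id and version for other combinations):

```xml
<!-- Maven coordinates for the Spark 3.2 runtime of this release -->
<dependency>
  <groupId>org.apache.iceberg</groupId>
  <artifactId>iceberg-spark-runtime-3.2_2.12</artifactId>
  <version>0.13.0</version>
</dependency>
```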
| To use Iceberg in Spark or Flink, download the runtime JAR for your engine version and add it to the jars folder of your installation.

| To use Iceberg in Hive, download the iceberg-hive-runtime JAR and add it to Hive using `ADD JAR`.
| To use Iceberg in Hive, download the Hive runtime JAR and add it to Hive using `ADD JAR`.
Should we add a sentence to show users that both Hive 2 and Hive 3 use the same Hive runtime jar? I see both Spark & Flink provide version-specific runtime jars, but Hive only provides a single shared jar.
yes agree, let me add a sentence here and also in the multi-engine page.
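A one-line illustration could accompany that sentence; the jar path here is hypothetical, and the same `iceberg-hive-runtime` jar serves both Hive 2 and Hive 3:

```sql
-- In a Hive session (jar path is illustrative)
ADD JAR /path/to/iceberg-hive-runtime-0.13.0.jar;
```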
| * Spark snapshot expiration now supports custom `FileIO` instead of just `HadoopFileIO` [[\#3089](https://github.com/apache/iceberg/pull/3089)]
| * `REPLACE TABLE AS SELECT` can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [[\#3421](https://github.com/apache/iceberg/issues/3421)]
| * Spark SQL statements containing binary or fixed literals can now be parsed correctly instead of throwing exception [[\#3728](https://github.com/apache/iceberg/pull/3728)]
| * **Flink**
I think we missed a critical bug fix here: apache/iceberg#3540
yes agree, I removed it thinking it might be too much detail to mention, let me add it back.
70336aa to
1de71eb
Compare
@rdblue @samredai