diff --git a/1.5.0/docs/configuration.md b/1.5.0/docs/configuration.md index 8a548f51d81f..2c5b96f7e0bd 100644 --- a/1.5.0/docs/configuration.md +++ b/1.5.0/docs/configuration.md @@ -110,9 +110,9 @@ Iceberg tables support table properties to configure table behavior, like the de Reserved table properties are only used to control behaviors when creating or updating a table. The value of these properties are not persisted as a part of the table metadata. -| Property | Default | Description | -| -------------- | -------- | ------------------------------------------------------------- | -| format-version | 2 | Table's format version (can be 1 or 2) as defined in the [Spec](../../../spec/#format-versioning). Defaults to 2 since version 1.4.0. | +| Property | Default | Description | +| -------------- | -------- |--------------------------------------------------------------------------------------------------------------------------------------| +| format-version | 2 | Table's format version (can be 1 or 2) as defined in the [Spec](../../spec.md#format-versioning). Defaults to 2 since version 1.4.0. | ### Compatibility flags @@ -133,7 +133,7 @@ Iceberg catalogs support using catalog properties to configure catalog behaviors | clients | 2 | client pool size | | cache-enabled | true | Whether to cache catalog entries | | cache.expiration-interval-ms | 30000 | How long catalog entries are locally cached, in milliseconds; 0 disables caching, negative values disable expiration | -| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. See the [Metrics reporting](../metrics-reporting.md) section for additional details | +| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. 
See the [Metrics reporting](metrics-reporting.md) section for additional details |
`HadoopCatalog` and `HiveCatalog` can access the properties in their constructors. Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`.
diff --git a/1.5.0/docs/flink-actions.md b/1.5.0/docs/flink-actions.md
index fc1bdbbebd27..7abddb8c824d 100644
--- a/1.5.0/docs/flink-actions.md
+++ b/1.5.0/docs/flink-actions.md
@@ -22,7 +22,7 @@ search:
## Rewrite files action
-Iceberg provides API to rewrite small files into large files by submitting Flink batch jobs. The behavior of this Flink action is the same as Spark's [rewriteDataFiles](../maintenance.md#compact-data-files).
+Iceberg provides an API to rewrite small files into large files by submitting Flink batch jobs. The behavior of this Flink action is the same as Spark's [rewriteDataFiles](maintenance.md#compact-data-files).
```java
import org.apache.iceberg.flink.actions.Actions;
diff --git a/1.5.0/docs/flink-connector.md b/1.5.0/docs/flink-connector.md
index c14e73a15a02..99358685e8ba 100644
--- a/1.5.0/docs/flink-connector.md
+++ b/1.5.0/docs/flink-connector.md
@@ -31,13 +31,13 @@ To create the table in Flink SQL by using SQL syntax `CREATE TABLE test (..) WIT
* `connector`: Use the constant `iceberg`.
* `catalog-name`: User-specified catalog name. It's required because the connector doesn't have any default value.
* `catalog-type`: `hive` or `hadoop` for built-in catalogs (defaults to `hive`), or left unset for custom catalog implementations using `catalog-impl`.
-* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. See also [custom catalog](../flink.md#adding-catalogs) for more details.
+* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. See also [custom catalog](flink.md#adding-catalogs) for more details.
* `catalog-database`: The iceberg database name in the backend catalog; uses the current Flink database name by default.
* `catalog-table`: The iceberg table name in the backend catalog. Defaults to the table name in the Flink `CREATE TABLE` statement.
## Table managed in Hive catalog
-Before executing the following SQL, please make sure you've configured the Flink SQL client correctly according to the [quick start documentation](../flink.md).
+Before executing the following SQL, please make sure you've configured the Flink SQL client correctly according to the [quick start documentation](flink.md).
The following SQL will create a Flink table in the current Flink catalog, which maps to the iceberg table `default_database.flink_table` managed in iceberg catalog.
@@ -140,4 +140,4 @@ SELECT * FROM flink_table;
3 rows in set
```
-For more details, please refer to the Iceberg [Flink documentation](../flink.md).
+For more details, please refer to the Iceberg [Flink documentation](flink.md).
diff --git a/1.5.0/docs/flink-ddl.md b/1.5.0/docs/flink-ddl.md
index 096cc4987610..7b9e4b51a29c 100644
--- a/1.5.0/docs/flink-ddl.md
+++ b/1.5.0/docs/flink-ddl.md
@@ -152,7 +152,7 @@ Table create commands support the commonly used [Flink create clauses](https://n
* `PARTITION BY (column1, column2, ...)` to configure partitioning; Flink does not yet support hidden partitioning.
* `COMMENT 'table document'` to set a table description.
-* `WITH ('key'='value', ...)` to set [table configuration](../configuration.md) which will be stored in Iceberg table properties.
+* `WITH ('key'='value', ...)` to set [table configuration](configuration.md) which will be stored in Iceberg table properties.
Currently, it does not support computed columns, watermark definitions, etc.
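The connector properties listed above (`connector`, `catalog-name`, `catalog-database`, `catalog-table`) combine into a single Flink SQL statement. A minimal sketch follows; the catalog name, metastore URI, and warehouse path are illustrative placeholders, not values taken from this diff:

```sql
-- Sketch: a Flink SQL table backed by an Iceberg table in a Hive catalog.
-- 'hive_prod', the thrift URI, and the warehouse path are illustrative.
CREATE TABLE flink_table (
    id   BIGINT,
    data STRING
) WITH (
    'connector' = 'iceberg',
    'catalog-name' = 'hive_prod',
    'catalog-database' = 'default_database',
    'catalog-table' = 'flink_table',
    'uri' = 'thrift://localhost:9083',
    'warehouse' = 'hdfs://nn:8020/warehouse/path'
);
```

Because `catalog-database` and `catalog-table` default to the current Flink database and the Flink table name, the last four properties can often be omitted.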
diff --git a/1.5.0/docs/flink-queries.md b/1.5.0/docs/flink-queries.md
index cc3478ef5831..10a19a610e27 100644
--- a/1.5.0/docs/flink-queries.md
+++ b/1.5.0/docs/flink-queries.md
@@ -77,7 +77,7 @@ SET table.exec.iceberg.use-flip27-source = true;
### Reading branches and tags with SQL
Branches and tags can be read via SQL by specifying options. For more details
-refer to [Flink Configuration](../flink-configuration.md#read-options)
+refer to [Flink Configuration](flink-configuration.md#read-options)
```sql
--- Read from branch b1
diff --git a/1.5.0/docs/flink-writes.md b/1.5.0/docs/flink-writes.md
index 8bedda310d50..edf59776bdf1 100644
--- a/1.5.0/docs/flink-writes.md
+++ b/1.5.0/docs/flink-writes.md
@@ -69,7 +69,7 @@ Iceberg supports `UPSERT` based on the primary key when writing data into v2 tab
) with ('format-version'='2', 'write.upsert.enabled'='true');
```
-2. Enabling `UPSERT` mode using `upsert-enabled` in the [write options](#write-options) provides more flexibility than a table level config. Note that you still need to use v2 table format and specify the [primary key](../flink-ddl.md/#primary-key) or [identifier fields](../../spec.md#identifier-field-ids) when creating the table.
+2. Enabling `UPSERT` mode using `upsert-enabled` in the [write options](#write-options) provides more flexibility than a table-level config. Note that you still need to use the v2 table format and specify the [primary key](flink-ddl.md#primary-key) or [identifier fields](../../spec.md#identifier-field-ids) when creating the table.
```sql
INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */
@@ -187,7 +187,7 @@ FlinkSink.builderFor(
### Branch Writes
Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink`.
-For more information on branches please refer to [branches](../branching.md).
+For more information on branches, please refer to [branches](branching.md).
```java FlinkSink.forRowData(input) .tableLoader(tableLoader) @@ -264,13 +264,13 @@ INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */ ... ``` -Check out all the options here: [write-options](../flink-configuration.md#write-options) +Check out all the options here: [write-options](flink-configuration.md#write-options) ## Notes Flink streaming write jobs rely on snapshot summary to keep the last committed checkpoint ID, and -store uncommitted data as temporary files. Therefore, [expiring snapshots](../maintenance.md#expire-snapshots) -and [deleting orphan files](../maintenance.md#delete-orphan-files) could possibly corrupt +store uncommitted data as temporary files. Therefore, [expiring snapshots](maintenance.md#expire-snapshots) +and [deleting orphan files](maintenance.md#delete-orphan-files) could possibly corrupt the state of the Flink job. To avoid that, make sure to keep the last snapshot created by the Flink job (which can be identified by the `flink.job-id` property in the summary), and only delete orphan files that are old enough. diff --git a/1.5.0/docs/flink.md b/1.5.0/docs/flink.md index 82a73ebbb9ce..274a42e358a6 100644 --- a/1.5.0/docs/flink.md +++ b/1.5.0/docs/flink.md @@ -24,22 +24,22 @@ search: Apache Iceberg supports both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API. See the [Multi-Engine Support](../../multi-engine-support.md#apache-flink) page for the integration of Apache Flink. 
-| Feature support | Flink | Notes | -| ----------------------------------------------------------- |-------|----------------------------------------------------------------------------------------| -| [SQL create catalog](../flink-ddl.md#create-catalog) | ✔️ | | -| [SQL create database](../flink-ddl.md#create-database) | ✔️ | | -| [SQL create table](../flink-ddl.md#create-table) | ✔️ | | -| [SQL create table like](../flink-ddl.md#create-table-like) | ✔️ | | -| [SQL alter table](../flink-ddl.md#alter-table) | ✔️ | Only support altering table properties, column and partition changes are not supported | -| [SQL drop_table](../flink-ddl.md#drop-table) | ✔️ | | -| [SQL select](../flink-queries.md#reading-with-sql) | ✔️ | Support both streaming and batch mode | -| [SQL insert into](../flink-writes.md#insert-into) | ✔️ ️ | Support both streaming and batch mode | -| [SQL insert overwrite](../flink-writes.md#insert-overwrite) | ✔️ ️ | | -| [DataStream read](../flink-queries.md#reading-with-datastream) | ✔️ ️ | | -| [DataStream append](../flink-writes.md#appending-data) | ✔️ ️ | | -| [DataStream overwrite](../flink-writes.md#overwrite-data) | ✔️ ️ | | -| [Metadata tables](../flink-queries.md#inspecting-tables) | ✔️ | | -| [Rewrite files action](../flink-actions.md#rewrite-files-action) | ✔️ ️ | | +| Feature support | Flink | Notes | +| -------------------------------------------------------- |-------|----------------------------------------------------------------------------------------| +| [SQL create catalog](flink-ddl.md#create-catalog) | ✔️ | | +| [SQL create database](flink-ddl.md#create-database) | ✔️ | | +| [SQL create table](flink-ddl.md#create-table) | ✔️ | | +| [SQL create table like](flink-ddl.md#create-table-like) | ✔️ | | +| [SQL alter table](flink-ddl.md#alter-table) | ✔️ | Only support altering table properties, column and partition changes are not supported | +| [SQL drop_table](flink-ddl.md#drop-table) | ✔️ | | +| [SQL 
select](flink-queries.md#reading-with-sql) | ✔️ | Support both streaming and batch mode | +| [SQL insert into](flink-writes.md#insert-into) | ✔️ ️ | Support both streaming and batch mode | +| [SQL insert overwrite](flink-writes.md#insert-overwrite) | ✔️ ️ | | +| [DataStream read](flink-queries.md#reading-with-datastream) | ✔️ ️ | | +| [DataStream append](flink-writes.md#appending-data) | ✔️ ️ | | +| [DataStream overwrite](flink-writes.md#overwrite-data) | ✔️ ️ | | +| [Metadata tables](flink-queries.md#inspecting-tables) | ✔️ | | +| [Rewrite files action](flink-actions.md#rewrite-files-action) | ✔️ ️ | | ## Preparation when using Flink SQL Client @@ -71,6 +71,7 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` ./bin/start-cluster.sh ``` + Start the Flink SQL client. There is a separate `flink-runtime` module in the Iceberg project to generate a bundled jar, which could be loaded by Flink SQL client directly. To build the `flink-runtime` bundled jar manually, build the `iceberg` project, and it will generate the jar under `/flink-runtime/build/libs`. Or download the `flink-runtime` jar from the [Apache repository](https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.16/{{ icebergVersion }}/). ```bash @@ -273,7 +274,7 @@ env.execute("Test Iceberg DataStream"); ### Branch Writes Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink` -For more information on branches please refer to [branches](../branching.md). +For more information on branches please refer to [branches](branching.md). ```java FlinkSink.forRowData(input) .tableLoader(tableLoader) diff --git a/1.5.0/docs/spark-configuration.md b/1.5.0/docs/spark-configuration.md index ebaab54ebbf5..5b13ce8c7c93 100644 --- a/1.5.0/docs/spark-configuration.md +++ b/1.5.0/docs/spark-configuration.md @@ -80,7 +80,7 @@ Both catalogs are configured using properties nested under the catalog name. 
Com | spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Default Iceberg table property value for property key _propertyKey_, which will be set on tables created by this catalog if not overridden | | spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Enforced Iceberg table property value for property key _propertyKey_, which cannot be overridden by user | -Additional properties can be found in common [catalog configuration](../configuration.md#catalog-properties). +Additional properties can be found in common [catalog configuration](configuration.md#catalog-properties). ### Using catalogs @@ -187,7 +187,7 @@ df.write | fanout-enabled | false | Overrides this table's write.spark.fanout.enabled | | check-ordering | true | Checks if input schema and table schema are same | | isolation-level | null | Desired isolation level for Dataframe overwrite operations. `null` => no checks (for idempotent writes), `serializable` => check for concurrent inserts or deletes in destination partitions, `snapshot` => checks for concurrent deletes in destination partitions. | -| validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via [Table API](../api.md#table-metadata) or [Snapshots table](../spark-queries.md#snapshots). If null, the table's oldest known snapshot is used. | +| validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via [Table API](api.md#table-metadata) or [Snapshots table](spark-queries.md#snapshots). If null, the table's oldest known snapshot is used. 
| | compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write | | compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write | | compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write | diff --git a/1.5.0/docs/spark-ddl.md b/1.5.0/docs/spark-ddl.md index b0627c35e612..0c344715b55a 100644 --- a/1.5.0/docs/spark-ddl.md +++ b/1.5.0/docs/spark-ddl.md @@ -35,14 +35,14 @@ CREATE TABLE prod.db.sample ( USING iceberg; ``` -Iceberg will convert the column type in Spark to corresponding Iceberg type. Please check the section of [type compatibility on creating table](../spark-getting-started.md#spark-type-to-iceberg-type) for details. +Iceberg will convert the column type in Spark to corresponding Iceberg type. Please check the section of [type compatibility on creating table](spark-getting-started.md#spark-type-to-iceberg-type) for details. Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including: * `PARTITIONED BY (partition-expressions)` to configure partitioning * `LOCATION '(fully-qualified-uri)'` to set the table location * `COMMENT 'table documentation'` to set a table description -* `TBLPROPERTIES ('key'='value', ...)` to set [table configuration](../configuration.md) +* `TBLPROPERTIES ('key'='value', ...)` to set [table configuration](configuration.md) Create commands may also set the default format with the `USING` clause. This is only supported for `SparkCatalog` because Spark handles the `USING` clause differently for the built-in catalog. @@ -61,7 +61,7 @@ USING iceberg PARTITIONED BY (category); ``` -The `PARTITIONED BY` clause supports transform expressions to create [hidden partitions](../partitioning.md). 
+The `PARTITIONED BY` clause supports transform expressions to create [hidden partitions](partitioning.md). ```sql CREATE TABLE prod.db.sample ( @@ -88,7 +88,7 @@ Note: Old syntax of `years(ts)`, `months(ts)`, `days(ts)` and `hours(ts)` are al ## `CREATE TABLE ... AS SELECT` -Iceberg supports CTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration.md#catalog-configuration). CTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration.md#replacing-the-session-catalog). +Iceberg supports CTAS as an atomic operation when using a [`SparkCatalog`](spark-configuration.md#catalog-configuration). CTAS is supported, but is not atomic when using [`SparkSessionCatalog`](spark-configuration.md#replacing-the-session-catalog). ```sql CREATE TABLE prod.db.sample @@ -108,7 +108,7 @@ AS SELECT ... ## `REPLACE TABLE ... AS SELECT` -Iceberg supports RTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration.md#catalog-configuration). RTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration.md#replacing-the-session-catalog). +Iceberg supports RTAS as an atomic operation when using a [`SparkCatalog`](spark-configuration.md#catalog-configuration). RTAS is supported, but is not atomic when using [`SparkSessionCatalog`](spark-configuration.md#replacing-the-session-catalog). Atomic table replacement creates a new snapshot with the results of the `SELECT` query, but keeps table history. 
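The CTAS and RTAS behavior covered in this spark-ddl.md hunk can be sketched in a couple of statements; the table and column names are illustrative, and atomicity holds only under `SparkCatalog` as noted above:

```sql
-- Sketch: atomic CTAS under SparkCatalog; names are illustrative.
CREATE TABLE prod.db.sample_copy
USING iceberg
PARTITIONED BY (category)
TBLPROPERTIES ('format-version' = '2')
AS SELECT id, data, category FROM prod.db.sample;

-- RTAS replaces the table contents in a new snapshot but keeps table history.
REPLACE TABLE prod.db.sample_copy
USING iceberg
AS SELECT id, data, category FROM prod.db.sample WHERE category IS NOT NULL;
```

After a `REPLACE TABLE ... AS SELECT`, earlier snapshots remain queryable through time travel, which is the practical benefit of RTAS over drop-and-recreate.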
@@ -170,7 +170,7 @@ Iceberg has full `ALTER TABLE` support in Spark 3, including: * Widening the type of `int`, `float`, and `decimal` fields * Making required columns optional -In addition, [SQL extensions](../spark-configuration.md#sql-extensions) can be used to add support for partition evolution and setting a table's write order +In addition, [SQL extensions](spark-configuration.md#sql-extensions) can be used to add support for partition evolution and setting a table's write order ### `ALTER TABLE ... RENAME TO` @@ -186,7 +186,7 @@ ALTER TABLE prod.db.sample SET TBLPROPERTIES ( ); ``` -Iceberg uses table properties to control table behavior. For a list of available properties, see [Table configuration](../configuration.md). +Iceberg uses table properties to control table behavior. For a list of available properties, see [Table configuration](configuration.md). `UNSET` is used to remove properties: @@ -327,7 +327,7 @@ ALTER TABLE prod.db.sample DROP COLUMN point.z; ## `ALTER TABLE` SQL extensions -These commands are available in Spark 3 when using Iceberg [SQL extensions](../spark-configuration.md#sql-extensions). +These commands are available in Spark 3 when using Iceberg [SQL extensions](spark-configuration.md#sql-extensions). ### `ALTER TABLE ... ADD PARTITION FIELD` diff --git a/1.5.0/docs/spark-getting-started.md b/1.5.0/docs/spark-getting-started.md index 3db83aaa437f..149a8654c4ae 100644 --- a/1.5.0/docs/spark-getting-started.md +++ b/1.5.0/docs/spark-getting-started.md @@ -37,12 +37,13 @@ spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceb ``` !!! info + If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.5_2.12` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar) to Spark's `jars` folder. 
### Adding catalogs -Iceberg comes with [catalogs](../spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. +Iceberg comes with [catalogs](spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog: @@ -58,7 +59,7 @@ spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceber ### Creating a table -To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](../spark-ddl.md#create-table) command: +To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](spark-ddl.md#create-table) command: ```sql -- local is the path-based catalog defined above @@ -67,21 +68,21 @@ CREATE TABLE local.db.table (id bigint, data string) USING iceberg; Iceberg catalogs support the full range of SQL DDL commands, including: -* [`CREATE TABLE ... PARTITIONED BY`](../spark-ddl.md#create-table) -* [`CREATE TABLE ... AS SELECT`](../spark-ddl.md#create-table-as-select) -* [`ALTER TABLE`](../spark-ddl.md#alter-table) -* [`DROP TABLE`](../spark-ddl.md#drop-table) +* [`CREATE TABLE ... PARTITIONED BY`](spark-ddl.md#create-table) +* [`CREATE TABLE ... 
AS SELECT`](spark-ddl.md#create-table-as-select) +* [`ALTER TABLE`](spark-ddl.md#alter-table) +* [`DROP TABLE`](spark-ddl.md#drop-table) ### Writing -Once your table is created, insert data using [`INSERT INTO`](../spark-writes.md#insert-into): +Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#insert-into): ```sql INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c'); INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1; ``` -Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](../spark-writes.md#merge-into) and [`DELETE FROM`](../spark-writes.md#delete-from): +Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](spark-writes.md#merge-into) and [`DELETE FROM`](spark-writes.md#delete-from): ```sql MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id @@ -89,7 +90,7 @@ WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count WHEN NOT MATCHED THEN INSERT *; ``` -Iceberg supports writing DataFrames using the new [v2 DataFrame write API](../spark-writes.md#writing-with-dataframes): +Iceberg supports writing DataFrames using the new [v2 DataFrame write API](spark-writes.md#writing-with-dataframes): ```scala spark.table("source").select("id", "data") @@ -108,7 +109,7 @@ FROM local.db.table GROUP BY data; ``` -SQL is also the recommended way to [inspect tables](../spark-queries.md#inspecting-tables). To view all snapshots in a table, use the `snapshots` metadata table: +SQL is also the recommended way to [inspect tables](spark-queries.md#inspecting-tables). 
To view all snapshots in a table, use the `snapshots` metadata table: ```sql SELECT * FROM local.db.table.snapshots; ``` @@ -123,7 +124,7 @@ SELECT * FROM local.db.table.snapshots; +-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+ ``` -[DataFrame reads](../spark-queries.md#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`: +[DataFrame reads](spark-queries.md#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`: ```scala val df = spark.table("local.db.table") @@ -194,7 +195,7 @@ This type conversion table describes how Iceberg types are converted to the Spar Next, you can learn more about Iceberg tables in Spark: -* [DDL commands](../spark-ddl.md): `CREATE`, `ALTER`, and `DROP` -* [Querying data](../spark-queries.md): `SELECT` queries and metadata tables -* [Writing data](../spark-writes.md): `INSERT INTO` and `MERGE INTO` -* [Maintaining tables](../spark-procedures.md) with stored procedures +* [DDL commands](spark-ddl.md): `CREATE`, `ALTER`, and `DROP` +* [Querying data](spark-queries.md): `SELECT` queries and metadata tables +* [Writing data](spark-writes.md): `INSERT INTO` and `MERGE INTO` +* [Maintaining tables](spark-procedures.md) with stored procedures diff --git a/1.5.0/docs/spark-procedures.md b/1.5.0/docs/spark-procedures.md index e6a480264b6a..de66a428b0de 100644 --- a/1.5.0/docs/spark-procedures.md +++ b/1.5.0/docs/spark-procedures.md @@ -22,7 +22,7 @@ search: # Spark Procedures -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). Stored procedures are only available when using [Iceberg SQL extensions](../spark-configuration.md#sql-extensions) in Spark 3. +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). 
Stored procedures are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3. ## Usage @@ -274,7 +274,7 @@ the `expire_snapshots` procedure will never remove files which are still require | `stream_results` | | boolean | When true, deletion files will be sent to Spark driver by RDD partition (by default, all the files will be sent to Spark driver). This option is recommended to set to `true` to prevent Spark driver OOM from large file size | | `snapshot_ids` | | array of long | Array of snapshot IDs to expire. | -If `older_than` and `retain_last` are omitted, the table's [expiration properties](../configuration.md#table-behavior-properties) will be used. +If `older_than` and `retain_last` are omitted, the table's [expiration properties](configuration.md#table-behavior-properties) will be used. Snapshots that are still referenced by branches or tags won't be removed. By default, branches and tags never expire, but their retention policy can be changed with the table property `history.expire.max-ref-age-ms`. The `main` branch never expires. #### Output @@ -359,7 +359,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile | `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | | `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot | | `rewrite-job-order` | none | Force the rewrite job order based on the value. 
| -| `target-file-size-bytes` | 536870912 (512 MB, default value of `write.target-file-size-bytes` from [table properties](../configuration.md#write-properties)) | Target output file size | +| `target-file-size-bytes` | 536870912 (512 MB, default value of `write.target-file-size-bytes` from [table properties](configuration.md#write-properties)) | Target output file size | | `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | | `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | | `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | @@ -482,7 +482,7 @@ Dangling deletes are always filtered out during rewriting. | `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing | | `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | | `rewrite-job-order` | none | Force the rewrite job order based on the value. 
| -| `target-file-size-bytes` | 67108864 (64MB, default value of `write.delete.target-file-size-bytes` from [table properties](../configuration.md#write-properties)) | Target output file size | +| `target-file-size-bytes` | 67108864 (64MB, default value of `write.delete.target-file-size-bytes` from [table properties](configuration.md#write-properties)) | Target output file size | | `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | | `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | | `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | diff --git a/1.5.0/docs/spark-queries.md b/1.5.0/docs/spark-queries.md index e66fa9d0ae04..71f31a012841 100644 --- a/1.5.0/docs/spark-queries.md +++ b/1.5.0/docs/spark-queries.md @@ -22,11 +22,11 @@ search: # Spark Queries -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. ## Querying with SQL -In Spark 3, tables use identifiers that include a [catalog name](../spark-configuration.md#using-catalogs). +In Spark 3, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs). 
```sql
SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table
diff --git a/1.5.0/docs/spark-structured-streaming.md b/1.5.0/docs/spark-structured-streaming.md
index dcef85284d08..738221fbb1e9 100644
--- a/1.5.0/docs/spark-structured-streaming.md
+++ b/1.5.0/docs/spark-structured-streaming.md
@@ -70,7 +70,7 @@ Iceberg supports `append` and `complete` output modes:
* `append`: appends the rows of every micro-batch to the table
* `complete`: replaces the table contents every micro-batch
-Prior to starting the streaming query, ensure you created the table. Refer to the [SQL create table](../spark-ddl.md#create-table) documentation to learn how to create the Iceberg table.
+Prior to starting the streaming query, ensure you created the table. Refer to the [SQL create table](spark-ddl.md#create-table) documentation to learn how to create the Iceberg table.
Iceberg doesn't support experimental [continuous processing](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing), as it doesn't provide the interface to "commit" the output.
@@ -78,7 +78,7 @@ Iceberg doesn't support experimental [continuous processing](https://spark.apach
Iceberg requires sorting data by partition per task prior to writing the data against a partitioned table.
In Spark, tasks are split by Spark partition. For batch queries you're encouraged to do an explicit sort to fulfill the requirement
-(see [here](../spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as
+(see [here](spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as
repartition and sort are considered heavy operations for streaming workloads. To avoid additional latency, you can
enable the fanout writer to eliminate the requirement.
@@ -109,13 +109,13 @@ documents how to configure the interval.
### Expire old snapshots
-Each batch written to a table produces a new snapshot.
Iceberg tracks snapshots in table metadata until they are expired. Snapshots accumulate quickly with frequent commits, so it is highly recommended that tables written by streaming queries are [regularly maintained](../maintenance.md#expire-snapshots). [Snapshot expiration](../spark-procedures.md#expire_snapshots) is the procedure of removing the metadata and any data files that are no longer needed. By default, the procedure will expire the snapshots older than five days.
+Each batch written to a table produces a new snapshot. Iceberg tracks snapshots in table metadata until they are expired. Snapshots accumulate quickly with frequent commits, so it is highly recommended that tables written by streaming queries are [regularly maintained](maintenance.md#expire-snapshots). [Snapshot expiration](spark-procedures.md#expire_snapshots) is the procedure of removing the metadata and any data files that are no longer needed. By default, the procedure will expire snapshots older than five days.
### Compacting data files
-The amount of data written from a streaming process is typically small, which can cause the table metadata to track lots of small files. [Compacting small files into larger files](../maintenance.md#compact-data-files) reduces the metadata needed by the table, and increases query efficiency. Iceberg and Spark [comes with the `rewrite_data_files` procedure](../spark-procedures.md#rewrite_data_files).
+The amount of data written from a streaming process is typically small, which can cause the table metadata to track lots of small files. [Compacting small files into larger files](maintenance.md#compact-data-files) reduces the metadata needed by the table, and increases query efficiency. Iceberg and Spark [come with the `rewrite_data_files` procedure](spark-procedures.md#rewrite_data_files).
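The streaming-maintenance steps described above map onto the stored procedures documented in the spark-procedures.md hunks. A hedged sketch, assuming the SQL extensions are enabled; the catalog name, table name, timestamp, and retention count are illustrative:

```sql
-- Sketch: expire snapshots older than a cutoff while retaining the 10 most recent.
-- 'prod', 'db.table', the timestamp, and the counts are illustrative values.
CALL prod.system.expire_snapshots(
    table => 'db.table',
    older_than => TIMESTAMP '2024-01-01 00:00:00',
    retain_last => 10
);

-- Sketch: compact the small files produced by frequent micro-batch commits.
CALL prod.system.rewrite_data_files(table => 'db.table');
```

Keep the retention window long enough to protect the last snapshot committed by any still-running streaming job.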
### Rewrite manifests To optimize write latency on a streaming workload, Iceberg can write the new snapshot with a "fast" append that does not automatically compact manifests. -This could lead lots of small manifest files. Iceberg can [rewrite the number of manifest files to improve query performance](../maintenance.md#rewrite-manifests). Iceberg and Spark [come with the `rewrite_manifests` procedure](../spark-procedures.md#rewrite_manifests). +This could lead to lots of small manifest files. Iceberg can [rewrite the number of manifest files to improve query performance](maintenance.md#rewrite-manifests). Iceberg and Spark [come with the `rewrite_manifests` procedure](spark-procedures.md#rewrite_manifests). diff --git a/1.5.0/docs/spark-writes.md b/1.5.0/docs/spark-writes.md index cb75937ce7a7..8dce4b572ba4 100644 --- a/1.5.0/docs/spark-writes.md +++ b/1.5.0/docs/spark-writes.md @@ -22,9 +22,9 @@ search: # Spark Writes -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). -Some plans are only available when using [Iceberg SQL extensions](../spark-configuration.md#sql-extensions) in Spark 3. +Some plans are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3. Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions: @@ -202,7 +202,7 @@ Branch writes can also be performed as part of a write-audit-publish (WAP) workf Note WAP branch and branch identifier cannot both be specified. Also, the branch must exist before performing the write. The operation does **not** create the branch if it does not exist. -For more information on branches please refer to [branches](../branching.md). +For more information on branches, please refer to [branches](branching.md).
```sql -- INSERT (1,' a') (2, 'b') into the audit branch. @@ -366,7 +366,7 @@ There are 3 options for `write.distribution-mode` This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done automatically by Spark, the data must be *manually* sorted by partition value. The data must be sorted either within each spark task, or globally within the entire dataset. A global sort will minimize the number of output files. -A sort can be avoided by using the Spark [write fanout](../spark-configuration.md#write-options) property but this will cause all +A sort can be avoided by using the Spark [write fanout](spark-configuration.md#write-options) property, but this will cause all file handles to remain open until each write task has completed. * `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming write data before writing. @@ -387,7 +387,7 @@ sort-order. Further division and coalescing of tasks may take place because of When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark task and a file cannot span an Iceberg partition boundary. This means although Iceberg will always roll over a file -when it grows to [`write.target-file-size-bytes`](../configuration.md#write-properties), but unless the Spark task is +when it grows to [`write.target-file-size-bytes`](configuration.md#write-properties), unless the Spark task is large enough that will not happen. The size of the file created on disk will also be much smaller than the Spark task since the on disk data will be both compressed and in columnar format as opposed to Spark's uncompressed row representation.
This means a 100 megabyte Spark task will create a file much smaller than 100 megabytes even if that diff --git a/1.5.1/docs/configuration.md b/1.5.1/docs/configuration.md index 8a548f51d81f..2c5b96f7e0bd 100644 --- a/1.5.1/docs/configuration.md +++ b/1.5.1/docs/configuration.md @@ -110,9 +110,9 @@ Iceberg tables support table properties to configure table behavior, like the de Reserved table properties are only used to control behaviors when creating or updating a table. The value of these properties are not persisted as a part of the table metadata. -| Property | Default | Description | -| -------------- | -------- | ------------------------------------------------------------- | -| format-version | 2 | Table's format version (can be 1 or 2) as defined in the [Spec](../../../spec/#format-versioning). Defaults to 2 since version 1.4.0. | +| Property | Default | Description | +| -------------- | -------- |--------------------------------------------------------------------------------------------------------------------------------------| +| format-version | 2 | Table's format version (can be 1 or 2) as defined in the [Spec](../../spec.md#format-versioning). Defaults to 2 since version 1.4.0. | ### Compatibility flags @@ -133,7 +133,7 @@ Iceberg catalogs support using catalog properties to configure catalog behaviors | clients | 2 | client pool size | | cache-enabled | true | Whether to cache catalog entries | | cache.expiration-interval-ms | 30000 | How long catalog entries are locally cached, in milliseconds; 0 disables caching, negative values disable expiration | -| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. See the [Metrics reporting](../metrics-reporting.md) section for additional details | +| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. 
See the [Metrics reporting](metrics-reporting.md) section for additional details | `HadoopCatalog` and `HiveCatalog` can access the properties in their constructors. Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`. diff --git a/1.5.1/docs/flink-actions.md b/1.5.1/docs/flink-actions.md index fc1bdbbebd27..7abddb8c824d 100644 --- a/1.5.1/docs/flink-actions.md +++ b/1.5.1/docs/flink-actions.md @@ -22,7 +22,7 @@ search: ## Rewrite files action -Iceberg provides API to rewrite small files into large files by submitting Flink batch jobs. The behavior of this Flink action is the same as Spark's [rewriteDataFiles](../maintenance.md#compact-data-files). +Iceberg provides an API to rewrite small files into large files by submitting Flink batch jobs. The behavior of this Flink action is the same as Spark's [rewriteDataFiles](maintenance.md#compact-data-files). ```java import org.apache.iceberg.flink.actions.Actions; diff --git a/1.5.1/docs/flink-connector.md b/1.5.1/docs/flink-connector.md index c14e73a15a02..99358685e8ba 100644 --- a/1.5.1/docs/flink-connector.md +++ b/1.5.1/docs/flink-connector.md @@ -31,13 +31,13 @@ To create the table in Flink SQL by using SQL syntax `CREATE TABLE test (..) WIT * `connector`: Use the constant `iceberg`. * `catalog-name`: User-specified catalog name. It's required because the connector don't have any default value. * `catalog-type`: `hive` or `hadoop` for built-in catalogs (defaults to `hive`), or left unset for custom catalog implementations using `catalog-impl`. -* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. See also [custom catalog](../flink.md#adding-catalogs) for more details. +* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. See also [custom catalog](flink.md#adding-catalogs) for more details.
* `catalog-database`: The iceberg database name in the backend catalog, use the current flink database name by default. * `catalog-table`: The iceberg table name in the backend catalog. Default to use the table name in the flink `CREATE TABLE` sentence. ## Table managed in Hive catalog. -Before executing the following SQL, please make sure you've configured the Flink SQL client correctly according to the [quick start documentation](../flink.md). +Before executing the following SQL, please make sure you've configured the Flink SQL client correctly according to the [quick start documentation](flink.md). The following SQL will create a Flink table in the current Flink catalog, which maps to the iceberg table `default_database.flink_table` managed in iceberg catalog. @@ -140,4 +140,4 @@ SELECT * FROM flink_table; 3 rows in set ``` -For more details, please refer to the Iceberg [Flink documentation](../flink.md). +For more details, please refer to the Iceberg [Flink documentation](flink.md). diff --git a/1.5.1/docs/flink-ddl.md b/1.5.1/docs/flink-ddl.md index 096cc4987610..7b9e4b51a29c 100644 --- a/1.5.1/docs/flink-ddl.md +++ b/1.5.1/docs/flink-ddl.md @@ -152,7 +152,7 @@ Table create commands support the commonly used [Flink create clauses](https://n * `PARTITION BY (column1, column2, ...)` to configure partitioning, Flink does not yet support hidden partitioning. * `COMMENT 'table document'` to set a table description. -* `WITH ('key'='value', ...)` to set [table configuration](../configuration.md) which will be stored in Iceberg table properties. +* `WITH ('key'='value', ...)` to set [table configuration](configuration.md) which will be stored in Iceberg table properties. Currently, it does not support computed column and watermark definition etc. 
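Putting the connector properties above together, a table backed by a Hive catalog might be declared as follows; the metastore URI and all names here are illustrative placeholders:

```sql
CREATE TABLE flink_table (
    id   BIGINT,
    data STRING
) WITH (
    'connector' = 'iceberg',
    'catalog-name' = 'hive_prod',
    'catalog-type' = 'hive',
    'catalog-database' = 'default_database',
    'catalog-table' = 'flink_table',
    'uri' = 'thrift://localhost:9083'
);
```

Omitting `catalog-database` and `catalog-table` falls back to the current Flink database and the Flink table name, as described above.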
diff --git a/1.5.1/docs/flink-queries.md b/1.5.1/docs/flink-queries.md index cc3478ef5831..10a19a610e27 100644 --- a/1.5.1/docs/flink-queries.md +++ b/1.5.1/docs/flink-queries.md @@ -77,7 +77,7 @@ SET table.exec.iceberg.use-flip27-source = true; ### Reading branches and tags with SQL Branch and tags can be read via SQL by specifying options. For more details -refer to [Flink Configuration](../flink-configuration.md#read-options) +refer to [Flink Configuration](flink-configuration.md#read-options) ```sql --- Read from branch b1 diff --git a/1.5.1/docs/flink-writes.md b/1.5.1/docs/flink-writes.md index 8bedda310d50..edf59776bdf1 100644 --- a/1.5.1/docs/flink-writes.md +++ b/1.5.1/docs/flink-writes.md @@ -69,7 +69,7 @@ Iceberg supports `UPSERT` based on the primary key when writing data into v2 tab ) with ('format-version'='2', 'write.upsert.enabled'='true'); ``` -2. Enabling `UPSERT` mode using `upsert-enabled` in the [write options](#write-options) provides more flexibility than a table level config. Note that you still need to use v2 table format and specify the [primary key](../flink-ddl.md/#primary-key) or [identifier fields](../../spec.md#identifier-field-ids) when creating the table. +2. Enabling `UPSERT` mode using `upsert-enabled` in the [write options](#write-options) provides more flexibility than a table level config. Note that you still need to use v2 table format and specify the [primary key](flink-ddl.md/#primary-key) or [identifier fields](../../spec.md#identifier-field-ids) when creating the table. ```sql INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */ @@ -187,7 +187,7 @@ FlinkSink.builderFor( ### Branch Writes Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink` -For more information on branches please refer to [branches](../branching.md). +For more information on branches please refer to [branches](branching.md). 
```java FlinkSink.forRowData(input) .tableLoader(tableLoader) @@ -264,13 +264,13 @@ INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */ ... ``` -Check out all the options here: [write-options](../flink-configuration.md#write-options) +Check out all the options here: [write-options](flink-configuration.md#write-options) ## Notes Flink streaming write jobs rely on snapshot summary to keep the last committed checkpoint ID, and -store uncommitted data as temporary files. Therefore, [expiring snapshots](../maintenance.md#expire-snapshots) -and [deleting orphan files](../maintenance.md#delete-orphan-files) could possibly corrupt +store uncommitted data as temporary files. Therefore, [expiring snapshots](maintenance.md#expire-snapshots) +and [deleting orphan files](maintenance.md#delete-orphan-files) could possibly corrupt the state of the Flink job. To avoid that, make sure to keep the last snapshot created by the Flink job (which can be identified by the `flink.job-id` property in the summary), and only delete orphan files that are old enough. diff --git a/1.5.1/docs/flink.md b/1.5.1/docs/flink.md index 82a73ebbb9ce..274a42e358a6 100644 --- a/1.5.1/docs/flink.md +++ b/1.5.1/docs/flink.md @@ -24,22 +24,22 @@ search: Apache Iceberg supports both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API. See the [Multi-Engine Support](../../multi-engine-support.md#apache-flink) page for the integration of Apache Flink. 
-| Feature support | Flink | Notes | -| ----------------------------------------------------------- |-------|----------------------------------------------------------------------------------------| -| [SQL create catalog](../flink-ddl.md#create-catalog) | ✔️ | | -| [SQL create database](../flink-ddl.md#create-database) | ✔️ | | -| [SQL create table](../flink-ddl.md#create-table) | ✔️ | | -| [SQL create table like](../flink-ddl.md#create-table-like) | ✔️ | | -| [SQL alter table](../flink-ddl.md#alter-table) | ✔️ | Only support altering table properties, column and partition changes are not supported | -| [SQL drop_table](../flink-ddl.md#drop-table) | ✔️ | | -| [SQL select](../flink-queries.md#reading-with-sql) | ✔️ | Support both streaming and batch mode | -| [SQL insert into](../flink-writes.md#insert-into) | ✔️ ️ | Support both streaming and batch mode | -| [SQL insert overwrite](../flink-writes.md#insert-overwrite) | ✔️ ️ | | -| [DataStream read](../flink-queries.md#reading-with-datastream) | ✔️ ️ | | -| [DataStream append](../flink-writes.md#appending-data) | ✔️ ️ | | -| [DataStream overwrite](../flink-writes.md#overwrite-data) | ✔️ ️ | | -| [Metadata tables](../flink-queries.md#inspecting-tables) | ✔️ | | -| [Rewrite files action](../flink-actions.md#rewrite-files-action) | ✔️ ️ | | +| Feature support | Flink | Notes | +| -------------------------------------------------------- |-------|----------------------------------------------------------------------------------------| +| [SQL create catalog](flink-ddl.md#create-catalog) | ✔️ | | +| [SQL create database](flink-ddl.md#create-database) | ✔️ | | +| [SQL create table](flink-ddl.md#create-table) | ✔️ | | +| [SQL create table like](flink-ddl.md#create-table-like) | ✔️ | | +| [SQL alter table](flink-ddl.md#alter-table) | ✔️ | Only supports altering table properties; column and partition changes are not supported | +| [SQL drop_table](flink-ddl.md#drop-table) | ✔️ | | +| [SQL 
select](flink-queries.md#reading-with-sql) | ✔️ | Supports both streaming and batch modes | +| [SQL insert into](flink-writes.md#insert-into) | ✔️ ️ | Supports both streaming and batch modes | +| [SQL insert overwrite](flink-writes.md#insert-overwrite) | ✔️ ️ | | +| [DataStream read](flink-queries.md#reading-with-datastream) | ✔️ ️ | | +| [DataStream append](flink-writes.md#appending-data) | ✔️ ️ | | +| [DataStream overwrite](flink-writes.md#overwrite-data) | ✔️ ️ | | +| [Metadata tables](flink-queries.md#inspecting-tables) | ✔️ | | +| [Rewrite files action](flink-actions.md#rewrite-files-action) | ✔️ ️ | | ## Preparation when using Flink SQL Client @@ -71,6 +71,7 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` ./bin/start-cluster.sh ``` + Start the Flink SQL client. There is a separate `flink-runtime` module in the Iceberg project to generate a bundled jar, which could be loaded by Flink SQL client directly. To build the `flink-runtime` bundled jar manually, build the `iceberg` project, and it will generate the jar under `/flink-runtime/build/libs`. Or download the `flink-runtime` jar from the [Apache repository](https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.16/{{ icebergVersion }}/). ```bash @@ -273,7 +274,7 @@ env.execute("Test Iceberg DataStream"); ### Branch Writes Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink` -For more information on branches please refer to [branches](../branching.md). +For more information on branches, please refer to [branches](branching.md). ```java FlinkSink.forRowData(input) .tableLoader(tableLoader) diff --git a/1.5.1/docs/spark-configuration.md b/1.5.1/docs/spark-configuration.md index ebaab54ebbf5..5b13ce8c7c93 100644 --- a/1.5.1/docs/spark-configuration.md +++ b/1.5.1/docs/spark-configuration.md @@ -80,7 +80,7 @@ Both catalogs are configured using properties nested under the catalog name. 
Com | spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Default Iceberg table property value for property key _propertyKey_, which will be set on tables created by this catalog if not overridden | | spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Enforced Iceberg table property value for property key _propertyKey_, which cannot be overridden by user | -Additional properties can be found in common [catalog configuration](../configuration.md#catalog-properties). +Additional properties can be found in common [catalog configuration](configuration.md#catalog-properties). ### Using catalogs @@ -187,7 +187,7 @@ df.write | fanout-enabled | false | Overrides this table's write.spark.fanout.enabled | | check-ordering | true | Checks if input schema and table schema are same | | isolation-level | null | Desired isolation level for Dataframe overwrite operations. `null` => no checks (for idempotent writes), `serializable` => check for concurrent inserts or deletes in destination partitions, `snapshot` => checks for concurrent deletes in destination partitions. | -| validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via [Table API](../api.md#table-metadata) or [Snapshots table](../spark-queries.md#snapshots). If null, the table's oldest known snapshot is used. | +| validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via [Table API](api.md#table-metadata) or [Snapshots table](spark-queries.md#snapshots). If null, the table's oldest known snapshot is used. 
| | compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write | | compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write | | compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write | diff --git a/1.5.1/docs/spark-ddl.md b/1.5.1/docs/spark-ddl.md index b0627c35e612..0c344715b55a 100644 --- a/1.5.1/docs/spark-ddl.md +++ b/1.5.1/docs/spark-ddl.md @@ -35,14 +35,14 @@ CREATE TABLE prod.db.sample ( USING iceberg; ``` -Iceberg will convert the column type in Spark to corresponding Iceberg type. Please check the section of [type compatibility on creating table](../spark-getting-started.md#spark-type-to-iceberg-type) for details. +Iceberg will convert the column type in Spark to the corresponding Iceberg type. Please check the [type compatibility on creating table](spark-getting-started.md#spark-type-to-iceberg-type) section for details. Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including: * `PARTITIONED BY (partition-expressions)` to configure partitioning * `LOCATION '(fully-qualified-uri)'` to set the table location * `COMMENT 'table documentation'` to set a table description -* `TBLPROPERTIES ('key'='value', ...)` to set [table configuration](../configuration.md) +* `TBLPROPERTIES ('key'='value', ...)` to set [table configuration](configuration.md) Create commands may also set the default format with the `USING` clause. This is only supported for `SparkCatalog` because Spark handles the `USING` clause differently for the built-in catalog. @@ -61,7 +61,7 @@ USING iceberg PARTITIONED BY (category); ``` -The `PARTITIONED BY` clause supports transform expressions to create [hidden partitions](../partitioning.md). 
+The `PARTITIONED BY` clause supports transform expressions to create [hidden partitions](partitioning.md). ```sql CREATE TABLE prod.db.sample ( @@ -88,7 +88,7 @@ Note: Old syntax of `years(ts)`, `months(ts)`, `days(ts)` and `hours(ts)` are al ## `CREATE TABLE ... AS SELECT` -Iceberg supports CTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration.md#catalog-configuration). CTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration.md#replacing-the-session-catalog). +Iceberg supports CTAS as an atomic operation when using a [`SparkCatalog`](spark-configuration.md#catalog-configuration). CTAS is supported, but is not atomic when using [`SparkSessionCatalog`](spark-configuration.md#replacing-the-session-catalog). ```sql CREATE TABLE prod.db.sample @@ -108,7 +108,7 @@ AS SELECT ... ## `REPLACE TABLE ... AS SELECT` -Iceberg supports RTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration.md#catalog-configuration). RTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration.md#replacing-the-session-catalog). +Iceberg supports RTAS as an atomic operation when using a [`SparkCatalog`](spark-configuration.md#catalog-configuration). RTAS is supported, but is not atomic when using [`SparkSessionCatalog`](spark-configuration.md#replacing-the-session-catalog). Atomic table replacement creates a new snapshot with the results of the `SELECT` query, but keeps table history. 
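The CTAS and RTAS behavior described in this hunk can be sketched as follows, assuming a `SparkCatalog` named `prod` and an existing `prod.db.sample` table (illustrative names):

```sql
-- CTAS: atomically create a new table from a query
CREATE TABLE prod.db.sample_copy
USING iceberg
AS SELECT id, data FROM prod.db.sample;

-- RTAS: atomically replace the table's contents while keeping its history
REPLACE TABLE prod.db.sample_copy
USING iceberg
AS SELECT id, data FROM prod.db.sample WHERE id > 100;
```

With `SparkSessionCatalog` the same statements run, but as noted above the operations are not atomic.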
@@ -170,7 +170,7 @@ Iceberg has full `ALTER TABLE` support in Spark 3, including: * Widening the type of `int`, `float`, and `decimal` fields * Making required columns optional -In addition, [SQL extensions](../spark-configuration.md#sql-extensions) can be used to add support for partition evolution and setting a table's write order +In addition, [SQL extensions](spark-configuration.md#sql-extensions) can be used to add support for partition evolution and setting a table's write order ### `ALTER TABLE ... RENAME TO` @@ -186,7 +186,7 @@ ALTER TABLE prod.db.sample SET TBLPROPERTIES ( ); ``` -Iceberg uses table properties to control table behavior. For a list of available properties, see [Table configuration](../configuration.md). +Iceberg uses table properties to control table behavior. For a list of available properties, see [Table configuration](configuration.md). `UNSET` is used to remove properties: @@ -327,7 +327,7 @@ ALTER TABLE prod.db.sample DROP COLUMN point.z; ## `ALTER TABLE` SQL extensions -These commands are available in Spark 3 when using Iceberg [SQL extensions](../spark-configuration.md#sql-extensions). +These commands are available in Spark 3 when using Iceberg [SQL extensions](spark-configuration.md#sql-extensions). ### `ALTER TABLE ... ADD PARTITION FIELD` diff --git a/1.5.1/docs/spark-getting-started.md b/1.5.1/docs/spark-getting-started.md index 3db83aaa437f..149a8654c4ae 100644 --- a/1.5.1/docs/spark-getting-started.md +++ b/1.5.1/docs/spark-getting-started.md @@ -37,12 +37,13 @@ spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceb ``` !!! info + If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.5_2.12` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar) to Spark's `jars` folder. 
### Adding catalogs -Iceberg comes with [catalogs](../spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. +Iceberg comes with [catalogs](spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog: @@ -58,7 +59,7 @@ spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceber ### Creating a table -To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](../spark-ddl.md#create-table) command: +To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](spark-ddl.md#create-table) command: ```sql -- local is the path-based catalog defined above @@ -67,21 +68,21 @@ CREATE TABLE local.db.table (id bigint, data string) USING iceberg; Iceberg catalogs support the full range of SQL DDL commands, including: -* [`CREATE TABLE ... PARTITIONED BY`](../spark-ddl.md#create-table) -* [`CREATE TABLE ... AS SELECT`](../spark-ddl.md#create-table-as-select) -* [`ALTER TABLE`](../spark-ddl.md#alter-table) -* [`DROP TABLE`](../spark-ddl.md#drop-table) +* [`CREATE TABLE ... PARTITIONED BY`](spark-ddl.md#create-table) +* [`CREATE TABLE ... 
AS SELECT`](spark-ddl.md#create-table-as-select) +* [`ALTER TABLE`](spark-ddl.md#alter-table) +* [`DROP TABLE`](spark-ddl.md#drop-table) ### Writing -Once your table is created, insert data using [`INSERT INTO`](../spark-writes.md#insert-into): +Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#insert-into): ```sql INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c'); INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1; ``` -Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](../spark-writes.md#merge-into) and [`DELETE FROM`](../spark-writes.md#delete-from): +Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](spark-writes.md#merge-into) and [`DELETE FROM`](spark-writes.md#delete-from): ```sql MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id @@ -89,7 +90,7 @@ WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count WHEN NOT MATCHED THEN INSERT *; ``` -Iceberg supports writing DataFrames using the new [v2 DataFrame write API](../spark-writes.md#writing-with-dataframes): +Iceberg supports writing DataFrames using the new [v2 DataFrame write API](spark-writes.md#writing-with-dataframes): ```scala spark.table("source").select("id", "data") @@ -108,7 +109,7 @@ FROM local.db.table GROUP BY data; ``` -SQL is also the recommended way to [inspect tables](../spark-queries.md#inspecting-tables). To view all snapshots in a table, use the `snapshots` metadata table: +SQL is also the recommended way to [inspect tables](spark-queries.md#inspecting-tables). 
To view all snapshots in a table, use the `snapshots` metadata table: ```sql SELECT * FROM local.db.table.snapshots; ``` @@ -123,7 +124,7 @@ SELECT * FROM local.db.table.snapshots; +-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+ ``` -[DataFrame reads](../spark-queries.md#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`: +[DataFrame reads](spark-queries.md#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`: ```scala val df = spark.table("local.db.table") @@ -194,7 +195,7 @@ This type conversion table describes how Iceberg types are converted to the Spar Next, you can learn more about Iceberg tables in Spark: -* [DDL commands](../spark-ddl.md): `CREATE`, `ALTER`, and `DROP` -* [Querying data](../spark-queries.md): `SELECT` queries and metadata tables -* [Writing data](../spark-writes.md): `INSERT INTO` and `MERGE INTO` -* [Maintaining tables](../spark-procedures.md) with stored procedures +* [DDL commands](spark-ddl.md): `CREATE`, `ALTER`, and `DROP` +* [Querying data](spark-queries.md): `SELECT` queries and metadata tables +* [Writing data](spark-writes.md): `INSERT INTO` and `MERGE INTO` +* [Maintaining tables](spark-procedures.md) with stored procedures diff --git a/1.5.1/docs/spark-procedures.md b/1.5.1/docs/spark-procedures.md index e6a480264b6a..de66a428b0de 100644 --- a/1.5.1/docs/spark-procedures.md +++ b/1.5.1/docs/spark-procedures.md @@ -22,7 +22,7 @@ search: # Spark Procedures -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). Stored procedures are only available when using [Iceberg SQL extensions](../spark-configuration.md#sql-extensions) in Spark 3. +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). 
Stored procedures are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3. ## Usage @@ -274,7 +274,7 @@ the `expire_snapshots` procedure will never remove files which are still require | `stream_results` | | boolean | When true, deletion files will be sent to Spark driver by RDD partition (by default, all the files will be sent to Spark driver). This option is recommended to set to `true` to prevent Spark driver OOM from large file size | | `snapshot_ids` | | array of long | Array of snapshot IDs to expire. | -If `older_than` and `retain_last` are omitted, the table's [expiration properties](../configuration.md#table-behavior-properties) will be used. +If `older_than` and `retain_last` are omitted, the table's [expiration properties](configuration.md#table-behavior-properties) will be used. Snapshots that are still referenced by branches or tags won't be removed. By default, branches and tags never expire, but their retention policy can be changed with the table property `history.expire.max-ref-age-ms`. The `main` branch never expires. #### Output @@ -359,7 +359,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile | `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | | `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot | | `rewrite-job-order` | none | Force the rewrite job order based on the value. 
| -| `target-file-size-bytes` | 536870912 (512 MB, default value of `write.target-file-size-bytes` from [table properties](../configuration.md#write-properties)) | Target output file size | +| `target-file-size-bytes` | 536870912 (512 MB, default value of `write.target-file-size-bytes` from [table properties](configuration.md#write-properties)) | Target output file size | | `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | | `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | | `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | @@ -482,7 +482,7 @@ Dangling deletes are always filtered out during rewriting. | `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing | | `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | | `rewrite-job-order` | none | Force the rewrite job order based on the value. 
| -| `target-file-size-bytes` | 67108864 (64MB, default value of `write.delete.target-file-size-bytes` from [table properties](../configuration.md#write-properties)) | Target output file size | +| `target-file-size-bytes` | 67108864 (64MB, default value of `write.delete.target-file-size-bytes` from [table properties](configuration.md#write-properties)) | Target output file size | | `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | | `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | | `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | diff --git a/1.5.1/docs/spark-queries.md b/1.5.1/docs/spark-queries.md index e66fa9d0ae04..71f31a012841 100644 --- a/1.5.1/docs/spark-queries.md +++ b/1.5.1/docs/spark-queries.md @@ -22,11 +22,11 @@ search: # Spark Queries -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. ## Querying with SQL -In Spark 3, tables use identifiers that include a [catalog name](../spark-configuration.md#using-catalogs). +In Spark 3, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs). 
```sql SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table diff --git a/1.5.1/docs/spark-structured-streaming.md b/1.5.1/docs/spark-structured-streaming.md index dcef85284d08..738221fbb1e9 100644 --- a/1.5.1/docs/spark-structured-streaming.md +++ b/1.5.1/docs/spark-structured-streaming.md @@ -70,7 +70,7 @@ Iceberg supports `append` and `complete` output modes: * `append`: appends the rows of every micro-batch to the table * `complete`: replaces the table contents every micro-batch -Prior to starting the streaming query, ensure you created the table. Refer to the [SQL create table](../spark-ddl.md#create-table) documentation to learn how to create the Iceberg table. +Prior to starting the streaming query, ensure you created the table. Refer to the [SQL create table](spark-ddl.md#create-table) documentation to learn how to create the Iceberg table. Iceberg doesn't support experimental [continuous processing](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing), as it doesn't provide the interface to "commit" the output. @@ -78,7 +78,7 @@ Iceberg doesn't support experimental [continuous processing](https://spark.apach Iceberg requires sorting data by partition per task prior to writing the data. In Spark tasks are split by Spark partition. against partitioned table. For batch queries you're encouraged to do explicit sort to fulfill the requirement -(see [here](../spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as +(see [here](spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as repartition and sort are considered as heavy operations for streaming workload. To avoid additional latency, you can enable fanout writer to eliminate the requirement. @@ -109,13 +109,13 @@ documents how to configure the interval. ### Expire old snapshots -Each batch written to a table produces a new snapshot. 
Iceberg tracks snapshots in table metadata until they are expired. Snapshots accumulate quickly with frequent commits, so it is highly recommended that tables written by streaming queries are [regularly maintained](../maintenance.md#expire-snapshots). [Snapshot expiration](../spark-procedures.md#expire_snapshots) is the procedure of removing the metadata and any data files that are no longer needed. By default, the procedure will expire the snapshots older than five days. +Each batch written to a table produces a new snapshot. Iceberg tracks snapshots in table metadata until they are expired. Snapshots accumulate quickly with frequent commits, so it is highly recommended that tables written by streaming queries are [regularly maintained](maintenance.md#expire-snapshots). [Snapshot expiration](spark-procedures.md#expire_snapshots) is the procedure of removing the metadata and any data files that are no longer needed. By default, the procedure will expire the snapshots older than five days. ### Compacting data files -The amount of data written from a streaming process is typically small, which can cause the table metadata to track lots of small files. [Compacting small files into larger files](../maintenance.md#compact-data-files) reduces the metadata needed by the table, and increases query efficiency. Iceberg and Spark [comes with the `rewrite_data_files` procedure](../spark-procedures.md#rewrite_data_files). +The amount of data written from a streaming process is typically small, which can cause the table metadata to track lots of small files. [Compacting small files into larger files](maintenance.md#compact-data-files) reduces the metadata needed by the table, and increases query efficiency. Iceberg and Spark [comes with the `rewrite_data_files` procedure](spark-procedures.md#rewrite_data_files). 
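Per the paragraph above, every micro-batch commit adds a snapshot, and `expire_snapshots` removes snapshots older than five days by default. A toy sketch of that age-based filter, assuming only the default `older_than` window (real expiration also honors `retain_last` and branch/tag retention, which are omitted here):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_MAX_SNAPSHOT_AGE = timedelta(days=5)  # default older_than window

def snapshots_to_expire(commit_times, now, max_age=DEFAULT_MAX_SNAPSHOT_AGE):
    # Keep snapshots committed within the retention window; expire the rest.
    cutoff = now - max_age
    return [t for t in commit_times if t < cutoff]

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
history = [now - timedelta(days=d) for d in (0, 1, 6, 30)]
print(len(snapshots_to_expire(history, now)))   # two snapshots past the window
```

With frequent streaming commits, the list on the left of this filter grows quickly, which is why regular maintenance is recommended above.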
### Rewrite manifests To optimize write latency on a streaming workload, Iceberg can write the new snapshot with a "fast" append that does not automatically compact manifests. -This could lead lots of small manifest files. Iceberg can [rewrite the number of manifest files to improve query performance](../maintenance.md#rewrite-manifests). Iceberg and Spark [come with the `rewrite_manifests` procedure](../spark-procedures.md#rewrite_manifests). +This could lead lots of small manifest files. Iceberg can [rewrite the number of manifest files to improve query performance](maintenance.md#rewrite-manifests). Iceberg and Spark [come with the `rewrite_manifests` procedure](spark-procedures.md#rewrite_manifests). diff --git a/1.5.1/docs/spark-writes.md b/1.5.1/docs/spark-writes.md index cb75937ce7a7..8dce4b572ba4 100644 --- a/1.5.1/docs/spark-writes.md +++ b/1.5.1/docs/spark-writes.md @@ -22,9 +22,9 @@ search: # Spark Writes -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). -Some plans are only available when using [Iceberg SQL extensions](../spark-configuration.md#sql-extensions) in Spark 3. +Some plans are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3. Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions: @@ -202,7 +202,7 @@ Branch writes can also be performed as part of a write-audit-publish (WAP) workf Note WAP branch and branch identifier cannot both be specified. Also, the branch must exist before performing the write. The operation does **not** create the branch if it does not exist. -For more information on branches please refer to [branches](../branching.md). +For more information on branches please refer to [branches](branching.md). 
```sql -- INSERT (1,' a') (2, 'b') into the audit branch. @@ -366,7 +366,7 @@ There are 3 options for `write.distribution-mode` This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done automatically by Spark, the data must be *manually* sorted by partition value. The data must be sorted either within each spark task, or globally within the entire dataset. A global sort will minimize the number of output files. -A sort can be avoided by using the Spark [write fanout](../spark-configuration.md#write-options) property but this will cause all +A sort can be avoided by using the Spark [write fanout](spark-configuration.md#write-options) property but this will cause all file handles to remain open until each write task has completed. * `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming write data before writing. @@ -387,7 +387,7 @@ sort-order. Further division and coalescing of tasks may take place because of When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark task and a file cannot span an Iceberg partition boundary. This means although Iceberg will always roll over a file -when it grows to [`write.target-file-size-bytes`](../configuration.md#write-properties), but unless the Spark task is +when it grows to [`write.target-file-size-bytes`](configuration.md#write-properties), but unless the Spark task is large enough that will not happen. The size of the file created on disk will also be much smaller than the Spark task since the on disk data will be both compressed and in columnar format as opposed to Spark's uncompressed row representation. 
This means a 100 megabyte Spark task will create a file much smaller than 100 megabytes even if that diff --git a/1.5.2/docs/configuration.md b/1.5.2/docs/configuration.md index 8a548f51d81f..2c5b96f7e0bd 100644 --- a/1.5.2/docs/configuration.md +++ b/1.5.2/docs/configuration.md @@ -110,9 +110,9 @@ Iceberg tables support table properties to configure table behavior, like the de Reserved table properties are only used to control behaviors when creating or updating a table. The value of these properties are not persisted as a part of the table metadata. -| Property | Default | Description | -| -------------- | -------- | ------------------------------------------------------------- | -| format-version | 2 | Table's format version (can be 1 or 2) as defined in the [Spec](../../../spec/#format-versioning). Defaults to 2 since version 1.4.0. | +| Property | Default | Description | +| -------------- | -------- |--------------------------------------------------------------------------------------------------------------------------------------| +| format-version | 2 | Table's format version (can be 1 or 2) as defined in the [Spec](../../spec.md#format-versioning). Defaults to 2 since version 1.4.0. | ### Compatibility flags @@ -133,7 +133,7 @@ Iceberg catalogs support using catalog properties to configure catalog behaviors | clients | 2 | client pool size | | cache-enabled | true | Whether to cache catalog entries | | cache.expiration-interval-ms | 30000 | How long catalog entries are locally cached, in milliseconds; 0 disables caching, negative values disable expiration | -| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. See the [Metrics reporting](../metrics-reporting.md) section for additional details | +| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. 
See the [Metrics reporting](metrics-reporting.md) section for additional details | `HadoopCatalog` and `HiveCatalog` can access the properties in their constructors. Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`. diff --git a/1.5.2/docs/flink-actions.md b/1.5.2/docs/flink-actions.md index fc1bdbbebd27..7abddb8c824d 100644 --- a/1.5.2/docs/flink-actions.md +++ b/1.5.2/docs/flink-actions.md @@ -22,7 +22,7 @@ search: ## Rewrite files action -Iceberg provides API to rewrite small files into large files by submitting Flink batch jobs. The behavior of this Flink action is the same as Spark's [rewriteDataFiles](../maintenance.md#compact-data-files). +Iceberg provides API to rewrite small files into large files by submitting Flink batch jobs. The behavior of this Flink action is the same as Spark's [rewriteDataFiles](maintenance.md#compact-data-files). ```java import org.apache.iceberg.flink.actions.Actions; diff --git a/1.5.2/docs/flink-connector.md b/1.5.2/docs/flink-connector.md index c14e73a15a02..99358685e8ba 100644 --- a/1.5.2/docs/flink-connector.md +++ b/1.5.2/docs/flink-connector.md @@ -31,13 +31,13 @@ To create the table in Flink SQL by using SQL syntax `CREATE TABLE test (..) WIT * `connector`: Use the constant `iceberg`. * `catalog-name`: User-specified catalog name. It's required because the connector don't have any default value. * `catalog-type`: `hive` or `hadoop` for built-in catalogs (defaults to `hive`), or left unset for custom catalog implementations using `catalog-impl`. -* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. See also [custom catalog](../flink.md#adding-catalogs) for more details. +* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. See also [custom catalog](flink.md#adding-catalogs) for more details. 
* `catalog-database`: The iceberg database name in the backend catalog, use the current flink database name by default. * `catalog-table`: The iceberg table name in the backend catalog. Default to use the table name in the flink `CREATE TABLE` sentence. ## Table managed in Hive catalog. -Before executing the following SQL, please make sure you've configured the Flink SQL client correctly according to the [quick start documentation](../flink.md). +Before executing the following SQL, please make sure you've configured the Flink SQL client correctly according to the [quick start documentation](flink.md). The following SQL will create a Flink table in the current Flink catalog, which maps to the iceberg table `default_database.flink_table` managed in iceberg catalog. @@ -140,4 +140,4 @@ SELECT * FROM flink_table; 3 rows in set ``` -For more details, please refer to the Iceberg [Flink documentation](../flink.md). +For more details, please refer to the Iceberg [Flink documentation](flink.md). diff --git a/1.5.2/docs/flink-ddl.md b/1.5.2/docs/flink-ddl.md index 096cc4987610..7b9e4b51a29c 100644 --- a/1.5.2/docs/flink-ddl.md +++ b/1.5.2/docs/flink-ddl.md @@ -152,7 +152,7 @@ Table create commands support the commonly used [Flink create clauses](https://n * `PARTITION BY (column1, column2, ...)` to configure partitioning, Flink does not yet support hidden partitioning. * `COMMENT 'table document'` to set a table description. -* `WITH ('key'='value', ...)` to set [table configuration](../configuration.md) which will be stored in Iceberg table properties. +* `WITH ('key'='value', ...)` to set [table configuration](configuration.md) which will be stored in Iceberg table properties. Currently, it does not support computed column and watermark definition etc. 
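The connector properties above fall back to Flink-side names when unset: `catalog-database` defaults to the current Flink database and `catalog-table` to the name in the Flink `CREATE TABLE` statement. A toy resolver showing that fallback (the function name and option-dict shape are illustrative, not the connector's internals):

```python
def resolve_iceberg_target(options, flink_database, flink_table):
    # catalog-database defaults to the current Flink database;
    # catalog-table defaults to the Flink CREATE TABLE name.
    return (options.get("catalog-database", flink_database),
            options.get("catalog-table", flink_table))

print(resolve_iceberg_target({}, "default_database", "flink_table"))
# -> ('default_database', 'flink_table')
print(resolve_iceberg_target({"catalog-table": "t1"}, "default_database", "flink_table"))
# -> ('default_database', 't1')
```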
diff --git a/1.5.2/docs/flink-queries.md b/1.5.2/docs/flink-queries.md index cc3478ef5831..10a19a610e27 100644 --- a/1.5.2/docs/flink-queries.md +++ b/1.5.2/docs/flink-queries.md @@ -77,7 +77,7 @@ SET table.exec.iceberg.use-flip27-source = true; ### Reading branches and tags with SQL Branch and tags can be read via SQL by specifying options. For more details -refer to [Flink Configuration](../flink-configuration.md#read-options) +refer to [Flink Configuration](flink-configuration.md#read-options) ```sql --- Read from branch b1 diff --git a/1.5.2/docs/flink-writes.md b/1.5.2/docs/flink-writes.md index 8bedda310d50..edf59776bdf1 100644 --- a/1.5.2/docs/flink-writes.md +++ b/1.5.2/docs/flink-writes.md @@ -69,7 +69,7 @@ Iceberg supports `UPSERT` based on the primary key when writing data into v2 tab ) with ('format-version'='2', 'write.upsert.enabled'='true'); ``` -2. Enabling `UPSERT` mode using `upsert-enabled` in the [write options](#write-options) provides more flexibility than a table level config. Note that you still need to use v2 table format and specify the [primary key](../flink-ddl.md/#primary-key) or [identifier fields](../../spec.md#identifier-field-ids) when creating the table. +2. Enabling `UPSERT` mode using `upsert-enabled` in the [write options](#write-options) provides more flexibility than a table level config. Note that you still need to use v2 table format and specify the [primary key](flink-ddl.md/#primary-key) or [identifier fields](../../spec.md#identifier-field-ids) when creating the table. ```sql INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */ @@ -187,7 +187,7 @@ FlinkSink.builderFor( ### Branch Writes Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink` -For more information on branches please refer to [branches](../branching.md). +For more information on branches please refer to [branches](branching.md). 
```java FlinkSink.forRowData(input) .tableLoader(tableLoader) @@ -264,13 +264,13 @@ INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */ ... ``` -Check out all the options here: [write-options](../flink-configuration.md#write-options) +Check out all the options here: [write-options](flink-configuration.md#write-options) ## Notes Flink streaming write jobs rely on snapshot summary to keep the last committed checkpoint ID, and -store uncommitted data as temporary files. Therefore, [expiring snapshots](../maintenance.md#expire-snapshots) -and [deleting orphan files](../maintenance.md#delete-orphan-files) could possibly corrupt +store uncommitted data as temporary files. Therefore, [expiring snapshots](maintenance.md#expire-snapshots) +and [deleting orphan files](maintenance.md#delete-orphan-files) could possibly corrupt the state of the Flink job. To avoid that, make sure to keep the last snapshot created by the Flink job (which can be identified by the `flink.job-id` property in the summary), and only delete orphan files that are old enough. diff --git a/1.5.2/docs/flink.md b/1.5.2/docs/flink.md index 82a73ebbb9ce..274a42e358a6 100644 --- a/1.5.2/docs/flink.md +++ b/1.5.2/docs/flink.md @@ -24,22 +24,22 @@ search: Apache Iceberg supports both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API. See the [Multi-Engine Support](../../multi-engine-support.md#apache-flink) page for the integration of Apache Flink. 
-| Feature support | Flink | Notes | -| ----------------------------------------------------------- |-------|----------------------------------------------------------------------------------------| -| [SQL create catalog](../flink-ddl.md#create-catalog) | ✔️ | | -| [SQL create database](../flink-ddl.md#create-database) | ✔️ | | -| [SQL create table](../flink-ddl.md#create-table) | ✔️ | | -| [SQL create table like](../flink-ddl.md#create-table-like) | ✔️ | | -| [SQL alter table](../flink-ddl.md#alter-table) | ✔️ | Only support altering table properties, column and partition changes are not supported | -| [SQL drop_table](../flink-ddl.md#drop-table) | ✔️ | | -| [SQL select](../flink-queries.md#reading-with-sql) | ✔️ | Support both streaming and batch mode | -| [SQL insert into](../flink-writes.md#insert-into) | ✔️ ️ | Support both streaming and batch mode | -| [SQL insert overwrite](../flink-writes.md#insert-overwrite) | ✔️ ️ | | -| [DataStream read](../flink-queries.md#reading-with-datastream) | ✔️ ️ | | -| [DataStream append](../flink-writes.md#appending-data) | ✔️ ️ | | -| [DataStream overwrite](../flink-writes.md#overwrite-data) | ✔️ ️ | | -| [Metadata tables](../flink-queries.md#inspecting-tables) | ✔️ | | -| [Rewrite files action](../flink-actions.md#rewrite-files-action) | ✔️ ️ | | +| Feature support | Flink | Notes | +| -------------------------------------------------------- |-------|----------------------------------------------------------------------------------------| +| [SQL create catalog](flink-ddl.md#create-catalog) | ✔️ | | +| [SQL create database](flink-ddl.md#create-database) | ✔️ | | +| [SQL create table](flink-ddl.md#create-table) | ✔️ | | +| [SQL create table like](flink-ddl.md#create-table-like) | ✔️ | | +| [SQL alter table](flink-ddl.md#alter-table) | ✔️ | Only support altering table properties, column and partition changes are not supported | +| [SQL drop_table](flink-ddl.md#drop-table) | ✔️ | | +| [SQL 
select](flink-queries.md#reading-with-sql) | ✔️ | Support both streaming and batch mode | +| [SQL insert into](flink-writes.md#insert-into) | ✔️ ️ | Support both streaming and batch mode | +| [SQL insert overwrite](flink-writes.md#insert-overwrite) | ✔️ ️ | | +| [DataStream read](flink-queries.md#reading-with-datastream) | ✔️ ️ | | +| [DataStream append](flink-writes.md#appending-data) | ✔️ ️ | | +| [DataStream overwrite](flink-writes.md#overwrite-data) | ✔️ ️ | | +| [Metadata tables](flink-queries.md#inspecting-tables) | ✔️ | | +| [Rewrite files action](flink-actions.md#rewrite-files-action) | ✔️ ️ | | ## Preparation when using Flink SQL Client @@ -71,6 +71,7 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` ./bin/start-cluster.sh ``` + Start the Flink SQL client. There is a separate `flink-runtime` module in the Iceberg project to generate a bundled jar, which could be loaded by Flink SQL client directly. To build the `flink-runtime` bundled jar manually, build the `iceberg` project, and it will generate the jar under `/flink-runtime/build/libs`. Or download the `flink-runtime` jar from the [Apache repository](https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.16/{{ icebergVersion }}/). ```bash @@ -273,7 +274,7 @@ env.execute("Test Iceberg DataStream"); ### Branch Writes Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink` -For more information on branches please refer to [branches](../branching.md). +For more information on branches please refer to [branches](branching.md). ```java FlinkSink.forRowData(input) .tableLoader(tableLoader) diff --git a/1.5.2/docs/spark-configuration.md b/1.5.2/docs/spark-configuration.md index ebaab54ebbf5..5b13ce8c7c93 100644 --- a/1.5.2/docs/spark-configuration.md +++ b/1.5.2/docs/spark-configuration.md @@ -80,7 +80,7 @@ Both catalogs are configured using properties nested under the catalog name. 
Com | spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Default Iceberg table property value for property key _propertyKey_, which will be set on tables created by this catalog if not overridden | | spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Enforced Iceberg table property value for property key _propertyKey_, which cannot be overridden by user | -Additional properties can be found in common [catalog configuration](../configuration.md#catalog-properties). +Additional properties can be found in common [catalog configuration](configuration.md#catalog-properties). ### Using catalogs @@ -187,7 +187,7 @@ df.write | fanout-enabled | false | Overrides this table's write.spark.fanout.enabled | | check-ordering | true | Checks if input schema and table schema are same | | isolation-level | null | Desired isolation level for Dataframe overwrite operations. `null` => no checks (for idempotent writes), `serializable` => check for concurrent inserts or deletes in destination partitions, `snapshot` => checks for concurrent deletes in destination partitions. | -| validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via [Table API](../api.md#table-metadata) or [Snapshots table](../spark-queries.md#snapshots). If null, the table's oldest known snapshot is used. | +| validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via [Table API](api.md#table-metadata) or [Snapshots table](spark-queries.md#snapshots). If null, the table's oldest known snapshot is used. 
| | compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write | | compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write | | compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write | diff --git a/1.5.2/docs/spark-ddl.md b/1.5.2/docs/spark-ddl.md index b0627c35e612..0c344715b55a 100644 --- a/1.5.2/docs/spark-ddl.md +++ b/1.5.2/docs/spark-ddl.md @@ -35,14 +35,14 @@ CREATE TABLE prod.db.sample ( USING iceberg; ``` -Iceberg will convert the column type in Spark to corresponding Iceberg type. Please check the section of [type compatibility on creating table](../spark-getting-started.md#spark-type-to-iceberg-type) for details. +Iceberg will convert the column type in Spark to corresponding Iceberg type. Please check the section of [type compatibility on creating table](spark-getting-started.md#spark-type-to-iceberg-type) for details. Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including: * `PARTITIONED BY (partition-expressions)` to configure partitioning * `LOCATION '(fully-qualified-uri)'` to set the table location * `COMMENT 'table documentation'` to set a table description -* `TBLPROPERTIES ('key'='value', ...)` to set [table configuration](../configuration.md) +* `TBLPROPERTIES ('key'='value', ...)` to set [table configuration](configuration.md) Create commands may also set the default format with the `USING` clause. This is only supported for `SparkCatalog` because Spark handles the `USING` clause differently for the built-in catalog. @@ -61,7 +61,7 @@ USING iceberg PARTITIONED BY (category); ``` -The `PARTITIONED BY` clause supports transform expressions to create [hidden partitions](../partitioning.md). 
+The `PARTITIONED BY` clause supports transform expressions to create [hidden partitions](partitioning.md). ```sql CREATE TABLE prod.db.sample ( @@ -88,7 +88,7 @@ Note: Old syntax of `years(ts)`, `months(ts)`, `days(ts)` and `hours(ts)` are al ## `CREATE TABLE ... AS SELECT` -Iceberg supports CTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration.md#catalog-configuration). CTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration.md#replacing-the-session-catalog). +Iceberg supports CTAS as an atomic operation when using a [`SparkCatalog`](spark-configuration.md#catalog-configuration). CTAS is supported, but is not atomic when using [`SparkSessionCatalog`](spark-configuration.md#replacing-the-session-catalog). ```sql CREATE TABLE prod.db.sample @@ -108,7 +108,7 @@ AS SELECT ... ## `REPLACE TABLE ... AS SELECT` -Iceberg supports RTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration.md#catalog-configuration). RTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration.md#replacing-the-session-catalog). +Iceberg supports RTAS as an atomic operation when using a [`SparkCatalog`](spark-configuration.md#catalog-configuration). RTAS is supported, but is not atomic when using [`SparkSessionCatalog`](spark-configuration.md#replacing-the-session-catalog). Atomic table replacement creates a new snapshot with the results of the `SELECT` query, but keeps table history. 
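The `PARTITIONED BY` hunk above mentions transform expressions for hidden partitioning. Toy versions of two transforms, `truncate` and `day`, showing how partition values are derived from column values; these sketches are illustrative, not Iceberg's implementation (`bucket`, which uses a hash, is omitted):

```python
from datetime import datetime, timezone

def truncate(width: int, value: int) -> int:
    # Floor the value to a multiple of `width`; Python's % floors toward
    # negative infinity, so this also handles negative values.
    return value - value % width

def day(ts: datetime) -> str:
    # Derive a calendar-day partition value from a timestamp.
    return ts.strftime("%Y-%m-%d")

print(truncate(10, 37))    # -> 30
print(truncate(10, -3))    # -> -10
print(day(datetime(2024, 6, 10, 13, 45, tzinfo=timezone.utc)))  # -> 2024-06-10
```

Because the transform is applied by the table, readers never need to know the derived partition column exists, which is what makes the partitioning "hidden".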
@@ -170,7 +170,7 @@ Iceberg has full `ALTER TABLE` support in Spark 3, including: * Widening the type of `int`, `float`, and `decimal` fields * Making required columns optional -In addition, [SQL extensions](../spark-configuration.md#sql-extensions) can be used to add support for partition evolution and setting a table's write order +In addition, [SQL extensions](spark-configuration.md#sql-extensions) can be used to add support for partition evolution and setting a table's write order ### `ALTER TABLE ... RENAME TO` @@ -186,7 +186,7 @@ ALTER TABLE prod.db.sample SET TBLPROPERTIES ( ); ``` -Iceberg uses table properties to control table behavior. For a list of available properties, see [Table configuration](../configuration.md). +Iceberg uses table properties to control table behavior. For a list of available properties, see [Table configuration](configuration.md). `UNSET` is used to remove properties: @@ -327,7 +327,7 @@ ALTER TABLE prod.db.sample DROP COLUMN point.z; ## `ALTER TABLE` SQL extensions -These commands are available in Spark 3 when using Iceberg [SQL extensions](../spark-configuration.md#sql-extensions). +These commands are available in Spark 3 when using Iceberg [SQL extensions](spark-configuration.md#sql-extensions). ### `ALTER TABLE ... ADD PARTITION FIELD` diff --git a/1.5.2/docs/spark-getting-started.md b/1.5.2/docs/spark-getting-started.md index 3db83aaa437f..149a8654c4ae 100644 --- a/1.5.2/docs/spark-getting-started.md +++ b/1.5.2/docs/spark-getting-started.md @@ -37,12 +37,13 @@ spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceb ``` !!! info + If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.5_2.12` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar) to Spark's `jars` folder. 
### Adding catalogs -Iceberg comes with [catalogs](../spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. +Iceberg comes with [catalogs](spark-configuration.md#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog: @@ -58,7 +59,7 @@ spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ iceber ### Creating a table -To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](../spark-ddl.md#create-table) command: +To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](spark-ddl.md#create-table) command: ```sql -- local is the path-based catalog defined above @@ -67,21 +68,21 @@ CREATE TABLE local.db.table (id bigint, data string) USING iceberg; Iceberg catalogs support the full range of SQL DDL commands, including: -* [`CREATE TABLE ... PARTITIONED BY`](../spark-ddl.md#create-table) -* [`CREATE TABLE ... AS SELECT`](../spark-ddl.md#create-table-as-select) -* [`ALTER TABLE`](../spark-ddl.md#alter-table) -* [`DROP TABLE`](../spark-ddl.md#drop-table) +* [`CREATE TABLE ... PARTITIONED BY`](spark-ddl.md#create-table) +* [`CREATE TABLE ... 
AS SELECT`](spark-ddl.md#create-table-as-select) +* [`ALTER TABLE`](spark-ddl.md#alter-table) +* [`DROP TABLE`](spark-ddl.md#drop-table) ### Writing -Once your table is created, insert data using [`INSERT INTO`](../spark-writes.md#insert-into): +Once your table is created, insert data using [`INSERT INTO`](spark-writes.md#insert-into): ```sql INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c'); INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1; ``` -Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](../spark-writes.md#merge-into) and [`DELETE FROM`](../spark-writes.md#delete-from): +Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](spark-writes.md#merge-into) and [`DELETE FROM`](spark-writes.md#delete-from): ```sql MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id @@ -89,7 +90,7 @@ WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count WHEN NOT MATCHED THEN INSERT *; ``` -Iceberg supports writing DataFrames using the new [v2 DataFrame write API](../spark-writes.md#writing-with-dataframes): +Iceberg supports writing DataFrames using the new [v2 DataFrame write API](spark-writes.md#writing-with-dataframes): ```scala spark.table("source").select("id", "data") @@ -108,7 +109,7 @@ FROM local.db.table GROUP BY data; ``` -SQL is also the recommended way to [inspect tables](../spark-queries.md#inspecting-tables). To view all snapshots in a table, use the `snapshots` metadata table: +SQL is also the recommended way to [inspect tables](spark-queries.md#inspecting-tables). 
To view all snapshots in a table, use the `snapshots` metadata table: ```sql SELECT * FROM local.db.table.snapshots; ``` @@ -123,7 +124,7 @@ SELECT * FROM local.db.table.snapshots; +-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+ ``` -[DataFrame reads](../spark-queries.md#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`: +[DataFrame reads](spark-queries.md#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`: ```scala val df = spark.table("local.db.table") @@ -194,7 +195,7 @@ This type conversion table describes how Iceberg types are converted to the Spar Next, you can learn more about Iceberg tables in Spark: -* [DDL commands](../spark-ddl.md): `CREATE`, `ALTER`, and `DROP` -* [Querying data](../spark-queries.md): `SELECT` queries and metadata tables -* [Writing data](../spark-writes.md): `INSERT INTO` and `MERGE INTO` -* [Maintaining tables](../spark-procedures.md) with stored procedures +* [DDL commands](spark-ddl.md): `CREATE`, `ALTER`, and `DROP` +* [Querying data](spark-queries.md): `SELECT` queries and metadata tables +* [Writing data](spark-writes.md): `INSERT INTO` and `MERGE INTO` +* [Maintaining tables](spark-procedures.md) with stored procedures diff --git a/1.5.2/docs/spark-procedures.md b/1.5.2/docs/spark-procedures.md index e6a480264b6a..de66a428b0de 100644 --- a/1.5.2/docs/spark-procedures.md +++ b/1.5.2/docs/spark-procedures.md @@ -22,7 +22,7 @@ search: # Spark Procedures -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). Stored procedures are only available when using [Iceberg SQL extensions](../spark-configuration.md#sql-extensions) in Spark 3. +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). 
Stored procedures are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3. ## Usage @@ -274,7 +274,7 @@ the `expire_snapshots` procedure will never remove files which are still require | `stream_results` | | boolean | When true, deletion files will be sent to Spark driver by RDD partition (by default, all the files will be sent to Spark driver). This option is recommended to set to `true` to prevent Spark driver OOM from large file size | | `snapshot_ids` | | array of long | Array of snapshot IDs to expire. | -If `older_than` and `retain_last` are omitted, the table's [expiration properties](../configuration.md#table-behavior-properties) will be used. +If `older_than` and `retain_last` are omitted, the table's [expiration properties](configuration.md#table-behavior-properties) will be used. Snapshots that are still referenced by branches or tags won't be removed. By default, branches and tags never expire, but their retention policy can be changed with the table property `history.expire.max-ref-age-ms`. The `main` branch never expires. #### Output @@ -359,7 +359,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile | `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | | `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot | | `rewrite-job-order` | none | Force the rewrite job order based on the value.
  • If rewrite-job-order=bytes-asc, then rewrite the smallest job groups first.
  • If rewrite-job-order=bytes-desc, then rewrite the largest job groups first.
  • If rewrite-job-order=files-asc, then rewrite the job groups with the least files first.
  • If rewrite-job-order=files-desc, then rewrite the job groups with the most files first.
  • If rewrite-job-order=none, then rewrite job groups in the order they were planned (no specific ordering).
| -| `target-file-size-bytes` | 536870912 (512 MB, default value of `write.target-file-size-bytes` from [table properties](../configuration.md#write-properties)) | Target output file size | +| `target-file-size-bytes` | 536870912 (512 MB, default value of `write.target-file-size-bytes` from [table properties](configuration.md#write-properties)) | Target output file size | | `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | | `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | | `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | @@ -482,7 +482,7 @@ Dangling deletes are always filtered out during rewriting. | `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing | | `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | | `rewrite-job-order` | none | Force the rewrite job order based on the value.
  • If rewrite-job-order=bytes-asc, then rewrite the smallest job groups first.
  • If rewrite-job-order=bytes-desc, then rewrite the largest job groups first.
  • If rewrite-job-order=files-asc, then rewrite the job groups with the least files first.
  • If rewrite-job-order=files-desc, then rewrite the job groups with the most files first.
  • If rewrite-job-order=none, then rewrite job groups in the order they were planned (no specific ordering).
| -| `target-file-size-bytes` | 67108864 (64MB, default value of `write.delete.target-file-size-bytes` from [table properties](../configuration.md#write-properties)) | Target output file size | +| `target-file-size-bytes` | 67108864 (64MB, default value of `write.delete.target-file-size-bytes` from [table properties](configuration.md#write-properties)) | Target output file size | | `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | | `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | | `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | diff --git a/1.5.2/docs/spark-queries.md b/1.5.2/docs/spark-queries.md index e66fa9d0ae04..71f31a012841 100644 --- a/1.5.2/docs/spark-queries.md +++ b/1.5.2/docs/spark-queries.md @@ -22,11 +22,11 @@ search: # Spark Queries -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. ## Querying with SQL -In Spark 3, tables use identifiers that include a [catalog name](../spark-configuration.md#using-catalogs). +In Spark 3, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs). 
```sql SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table diff --git a/1.5.2/docs/spark-structured-streaming.md b/1.5.2/docs/spark-structured-streaming.md index dcef85284d08..738221fbb1e9 100644 --- a/1.5.2/docs/spark-structured-streaming.md +++ b/1.5.2/docs/spark-structured-streaming.md @@ -70,7 +70,7 @@ Iceberg supports `append` and `complete` output modes: * `append`: appends the rows of every micro-batch to the table * `complete`: replaces the table contents every micro-batch -Prior to starting the streaming query, ensure you created the table. Refer to the [SQL create table](../spark-ddl.md#create-table) documentation to learn how to create the Iceberg table. +Prior to starting the streaming query, ensure you created the table. Refer to the [SQL create table](spark-ddl.md#create-table) documentation to learn how to create the Iceberg table. Iceberg doesn't support experimental [continuous processing](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing), as it doesn't provide the interface to "commit" the output. @@ -78,7 +78,7 @@ Iceberg doesn't support experimental [continuous processing](https://spark.apach Iceberg requires sorting data by partition per task prior to writing the data. In Spark tasks are split by Spark partition. against partitioned table. For batch queries you're encouraged to do explicit sort to fulfill the requirement -(see [here](../spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as +(see [here](spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as repartition and sort are considered as heavy operations for streaming workload. To avoid additional latency, you can enable fanout writer to eliminate the requirement. @@ -109,13 +109,13 @@ documents how to configure the interval. ### Expire old snapshots -Each batch written to a table produces a new snapshot. 
Iceberg tracks snapshots in table metadata until they are expired. Snapshots accumulate quickly with frequent commits, so it is highly recommended that tables written by streaming queries are [regularly maintained](../maintenance.md#expire-snapshots). [Snapshot expiration](../spark-procedures.md#expire_snapshots) is the procedure of removing the metadata and any data files that are no longer needed. By default, the procedure will expire the snapshots older than five days. +Each batch written to a table produces a new snapshot. Iceberg tracks snapshots in table metadata until they are expired. Snapshots accumulate quickly with frequent commits, so it is highly recommended that tables written by streaming queries are [regularly maintained](maintenance.md#expire-snapshots). [Snapshot expiration](spark-procedures.md#expire_snapshots) is the procedure of removing the metadata and any data files that are no longer needed. By default, the procedure will expire the snapshots older than five days. ### Compacting data files -The amount of data written from a streaming process is typically small, which can cause the table metadata to track lots of small files. [Compacting small files into larger files](../maintenance.md#compact-data-files) reduces the metadata needed by the table, and increases query efficiency. Iceberg and Spark [comes with the `rewrite_data_files` procedure](../spark-procedures.md#rewrite_data_files). +The amount of data written from a streaming process is typically small, which can cause the table metadata to track lots of small files. [Compacting small files into larger files](maintenance.md#compact-data-files) reduces the metadata needed by the table, and increases query efficiency. Iceberg and Spark [come with the `rewrite_data_files` procedure](spark-procedures.md#rewrite_data_files). 
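The five-day default described above amounts to a timestamp cutoff. A minimal Python sketch, assembling the `expire_snapshots` call shape from spark-procedures.md as a string — `prod` and `db.table` are placeholder names, not part of the patch above:

```python
from datetime import datetime, timedelta, timezone

# Default retention from the docs above: expire snapshots older than five days.
default_retention = timedelta(days=5)
older_than = datetime.now(timezone.utc) - default_retention

# Shape of the stored-procedure call; catalog and table names are placeholders.
call = (
    "CALL prod.system.expire_snapshots("
    f"table => 'db.table', older_than => TIMESTAMP '{older_than:%Y-%m-%d %H:%M:%S}')"
)
print(call)
```

In Spark this statement would be submitted with `spark.sql(call)`, assuming the Iceberg SQL extensions are configured.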
### Rewrite manifests To optimize write latency on a streaming workload, Iceberg can write the new snapshot with a "fast" append that does not automatically compact manifests. -This could lead lots of small manifest files. Iceberg can [rewrite the number of manifest files to improve query performance](../maintenance.md#rewrite-manifests). Iceberg and Spark [come with the `rewrite_manifests` procedure](../spark-procedures.md#rewrite_manifests). +This could lead to lots of small manifest files. Iceberg can [rewrite the number of manifest files to improve query performance](maintenance.md#rewrite-manifests). Iceberg and Spark [come with the `rewrite_manifests` procedure](spark-procedures.md#rewrite_manifests). diff --git a/1.5.2/docs/spark-writes.md b/1.5.2/docs/spark-writes.md index cb75937ce7a7..8dce4b572ba4 100644 --- a/1.5.2/docs/spark-writes.md +++ b/1.5.2/docs/spark-writes.md @@ -22,9 +22,9 @@ search: # Spark Writes -To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration.md). +To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). -Some plans are only available when using [Iceberg SQL extensions](../spark-configuration.md#sql-extensions) in Spark 3. +Some plans are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3. Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions: @@ -202,7 +202,7 @@ Branch writes can also be performed as part of a write-audit-publish (WAP) workf Note WAP branch and branch identifier cannot both be specified. Also, the branch must exist before performing the write. The operation does **not** create the branch if it does not exist. -For more information on branches please refer to [branches](../branching.md). +For more information on branches, please refer to [branches](branching.md). 
```sql -- INSERT (1,' a') (2, 'b') into the audit branch. @@ -366,7 +366,7 @@ There are 3 options for `write.distribution-mode` This mode does not request any shuffles or sort to be performed automatically by Spark. Because no work is done automatically by Spark, the data must be *manually* sorted by partition value. The data must be sorted either within each spark task, or globally within the entire dataset. A global sort will minimize the number of output files. -A sort can be avoided by using the Spark [write fanout](../spark-configuration.md#write-options) property but this will cause all +A sort can be avoided by using the Spark [write fanout](spark-configuration.md#write-options) property but this will cause all file handles to remain open until each write task has completed. * `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming write data before writing. @@ -387,7 +387,7 @@ sort-order. Further division and coalescing of tasks may take place because of When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark task and a file cannot span an Iceberg partition boundary. This means although Iceberg will always roll over a file -when it grows to [`write.target-file-size-bytes`](../configuration.md#write-properties), but unless the Spark task is +when it grows to [`write.target-file-size-bytes`](configuration.md#write-properties), unless the Spark task is large enough that will not happen. The size of the file created on disk will also be much smaller than the Spark task since the on disk data will be both compressed and in columnar format as opposed to Spark's uncompressed row representation. This means a 100 megabyte Spark task will create a file much smaller than 100 megabytes even if that