8 changes: 4 additions & 4 deletions docs/docs/spark-getting-started.md
@@ -26,17 +26,17 @@ Spark is currently the most feature-rich compute engine for Iceberg operations.
We recommend getting started with Spark to understand Iceberg concepts and features through examples.
You can also view documentation on using Iceberg with other compute engines on the [Multi-Engine Support](../../multi-engine-support.md) page.

## Using Iceberg in Spark 3
## Using Iceberg in Spark

To use Iceberg in a Spark shell, use the `--packages` option:

```sh
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:{{ icebergVersion }}
```

!!! info
<!-- markdown-link-check-disable-next-line -->
If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.5_2.12` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar) to Spark's `jars` folder.
If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-4.0_2.13` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-4.0_2.13/{{ icebergVersion }}/iceberg-spark-runtime-4.0_2.13-{{ icebergVersion }}.jar) to Spark's `jars` folder.


### Adding catalogs
@@ -46,7 +46,7 @@ Iceberg comes with [catalogs](spark-configuration.md#catalogs) that enable SQL c
This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:{{ icebergVersion }}\
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
2 changes: 1 addition & 1 deletion docs/docs/spark-procedures.md
@@ -20,7 +20,7 @@ title: "Procedures"

# Spark Procedures

To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). Stored procedures are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3.
To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). Stored procedures are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark.

Member:
configuring the extension is not necessary for Spark 4.
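
For reference, a minimal sketch of the stored-procedure CALL syntax being discussed; the catalog name, table, and snapshot ID below are illustrative, not taken from this PR:

```sql
-- Illustrative only: roll a table back to an earlier snapshot.
-- `prod` is an assumed catalog name and the snapshot ID is a placeholder.
CALL prod.system.rollback_to_snapshot('db.sample', 10963874102873)
```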

Contributor Author:
Oh, yes, we have used Spark's CALL syntax since Spark 4.0, thanks for pointing that out.

Member:
It would also be better to document the Spark 4 behavior change in CALL syntax resolution; see #13106 and SPARK-53523.

Contributor Author:
As for SPARK-53523, we can ignore it until Spark 4.1.0 is released and supported in Iceberg.


## Usage

9 changes: 2 additions & 7 deletions docs/docs/spark-queries.md
@@ -24,7 +24,7 @@ To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md

## Querying with SQL

In Spark 3, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs).
In Spark, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs).

```sql
SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table
@@ -45,7 +45,7 @@ SELECT * FROM prod.db.table.files;
| 0 | s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 03} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] | null | null |

### Time travel Queries with SQL
Spark 3.3 and later supports time travel in SQL queries using `TIMESTAMP AS OF` or `VERSION AS OF` clauses.
Spark supports time travel in SQL queries using `TIMESTAMP AS OF` or `VERSION AS OF` clauses.
The `VERSION AS OF` clause can contain a long snapshot ID or a string branch or tag name.
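
For example, time travel reads might look like the following; the table name, timestamp, snapshot ID, and tag are placeholders:

```sql
-- Illustrative identifiers only
SELECT * FROM prod.db.table TIMESTAMP AS OF '2024-01-01 00:00:00';  -- read as of a timestamp
SELECT * FROM prod.db.table VERSION AS OF 10963874102873;           -- read a specific snapshot ID
SELECT * FROM prod.db.table VERSION AS OF 'historical-tag';         -- read a branch or tag by name
```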

!!! info
@@ -184,11 +184,6 @@ spark.read
.load("path/to/table")
```

!!! info
Spark 3.0 and earlier versions do not support using `option` with `table` in DataFrameReader commands. All options will be silently
ignored. Do not use `table` when attempting to time-travel or use other options. See [SPARK-32592](https://issues.apache.org/jira/browse/SPARK-32592).



### Incremental read

8 changes: 4 additions & 4 deletions docs/docs/spark-structured-streaming.md
@@ -73,11 +73,10 @@ data.writeStream
.outputMode("append")
.trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
.option("checkpointLocation", checkpointPath)
.toTable("database.table_name")
.option("path", "database.table_name")
.start()
```

If you're using Spark 3.0 or earlier, you need to use `.option("path", "database.table_name").start()`, instead of `.toTable("database.table_name")`.

In the case of the directory-based Hadoop catalog:

```scala
@@ -114,7 +113,8 @@ data.writeStream
.trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
.option("fanout-enabled", "true")
.option("checkpointLocation", checkpointPath)
.toTable("database.table_name")
.option("path", "database.table_name")
.start()
```

The fanout writer opens a file per partition value and doesn't close these files until the write task finishes. Avoid using the fanout writer for batch writing, as explicitly sorting the output rows is cheap for batch workloads.
34 changes: 17 additions & 17 deletions docs/docs/spark-writes.md
@@ -22,26 +22,26 @@ title: "Writes"

To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md).

Some plans are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3.
Some plans are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions).

Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions:

| Feature support | Spark 3 | Notes |
|--------------------------------------------------|-----------|-----------------------------------------------------------------------------|
| [SQL insert into](#insert-into) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL merge into](#merge-into) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [SQL insert overwrite](#insert-overwrite) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL delete from](#delete-from) | ✔️ | ⚠ Row-level delete requires Iceberg Spark extensions |
| [SQL update](#update) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [DataFrame append](#appending-data) | ✔️ | |
| [DataFrame overwrite](#overwriting-data) | ✔️ | |
| [DataFrame CTAS and RTAS](#creating-tables) | ✔️ | ⚠ Requires DSv2 API |
| [DataFrame merge into](#merging-data) | ✔️ | ⚠ Requires DSv2 API (Spark 4.0 and later) |
| Feature support | Spark | Notes |
|--------------------------------------------------|---------|-----------------------------------------------------------------------------|
| [SQL insert into](#insert-into) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL merge into](#merge-into) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [SQL insert overwrite](#insert-overwrite) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL delete from](#delete-from) | ✔️ | ⚠ Row-level delete requires Iceberg Spark extensions |
| [SQL update](#update) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [DataFrame append](#appending-data) | ✔️ | |
| [DataFrame overwrite](#overwriting-data) | ✔️ | |
| [DataFrame CTAS and RTAS](#creating-tables) | ✔️ | ⚠ Requires DSv2 API |
| [DataFrame merge into](#merging-data) | ✔️ | ⚠ Requires DSv2 API (Spark 4.0 and later) |


## Writing with SQL

Spark 3 supports SQL `INSERT INTO`, `MERGE INTO`, and `INSERT OVERWRITE`, as well as the new `DataFrameWriterV2` API.
Spark supports SQL `INSERT INTO`, `MERGE INTO`, and `INSERT OVERWRITE`, as well as the new `DataFrameWriterV2` API.

### `INSERT INTO`

@@ -56,7 +56,7 @@ INSERT INTO prod.db.table SELECT ...

### `MERGE INTO`

Spark 3 added support for `MERGE INTO` queries that can express row-level updates.
Spark supports `MERGE INTO` queries that can express row-level updates.

Iceberg supports `MERGE INTO` by rewriting data files that contain rows that need to be updated in an `overwrite` commit.
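
As a minimal sketch (the target table, source query, and column names below are illustrative), a `MERGE INTO` might look like:

```sql
-- Illustrative only: upsert rows from a source query into an Iceberg table
MERGE INTO prod.db.target t
USING (SELECT * FROM prod.db.updates) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.count = t.count + s.count
WHEN NOT MATCHED THEN INSERT *
```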

@@ -163,7 +163,7 @@ Note that this mode cannot replace hourly partitions like the dynamic example qu

### `DELETE FROM`

Spark 3 added support for `DELETE FROM` queries to remove data from tables.
Spark supports `DELETE FROM` queries to remove data from tables.

Delete queries accept a filter to match rows to delete.
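
For instance, a filtered delete might look like the following; the table and the `ts` column are placeholders:

```sql
-- Illustrative only: delete one month of data
DELETE FROM prod.db.table
WHERE ts >= '2020-05-01 00:00:00' AND ts < '2020-06-01 00:00:00'
```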

@@ -255,7 +255,7 @@ data.writeTo("prod.db.table.branch_audit").overwritePartitions()

## Writing with DataFrames

Spark 3 introduced the new `DataFrameWriterV2` API for writing to tables using data frames. The v2 API is recommended for several reasons:
Spark introduced the new `DataFrameWriterV2` API for writing to tables using data frames. The v2 API is recommended for several reasons:

* CTAS, RTAS, and overwrite by filter are supported
* All operations consistently write columns to a table by name
@@ -270,7 +270,7 @@ Spark 3 introduced the new `DataFrameWriterV2` API for writing to tables using d
The v1 DataFrame `write` API is still supported, but is not recommended.

!!! danger
When writing with the v1 DataFrame API in Spark 3, use `saveAsTable` or `insertInto` to load tables with a catalog.
When writing with the v1 DataFrame API in Spark, use `saveAsTable` or `insertInto` to load tables with a catalog.
Using `format("iceberg")` loads an isolated table reference that will not automatically refresh tables used by queries.


12 changes: 6 additions & 6 deletions site/docs/spark-quickstart.md
@@ -277,7 +277,7 @@ This configuration creates a path-based catalog named `local` for tables under `
=== "CLI"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:{{ icebergVersion }}\
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
@@ -290,7 +290,7 @@ This configuration creates a path-based catalog named `local` for tables under `
=== "spark-defaults.conf"

```sh
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:{{ icebergVersion }}
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
@@ -312,27 +312,27 @@ If you already have a Spark environment, you can add Iceberg, using the `--packa
=== "SparkSQL"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:{{ icebergVersion }}
```

=== "Spark-Shell"

```sh
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:{{ icebergVersion }}
```

=== "PySpark"

```sh
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:{{ icebergVersion }}
```

!!! note
If you want to include Iceberg in your Spark installation, add the Iceberg Spark runtime to Spark's `jars` folder.
You can download the runtime by visiting the [Releases](releases.md) page.

<!-- markdown-link-check-disable-next-line -->
[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar
[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-4.0_2.13/{{ icebergVersion }}/iceberg-spark-runtime-4.0_2.13-{{ icebergVersion }}.jar

#### Learn More
