This repository was archived by the owner on Feb 6, 2024. It is now read-only.
Merged
Changes from all commits
21 commits
- 77bf1b5: https://github.com/apache/iceberg/pull/3723 (Feb 8, 2022)
- 610659c: https://github.com/apache/iceberg/pull/3732 (Feb 8, 2022)
- 42e7079: https://github.com/apache/iceberg/pull/3749 (Feb 8, 2022)
- 1303032: https://github.com/apache/iceberg/pull/3766 (Feb 8, 2022)
- 834b4ac: https://github.com/apache/iceberg/pull/3787 (Feb 8, 2022)
- 5eb37be: https://github.com/apache/iceberg/pull/3796 (Feb 8, 2022)
- 12d1168: https://github.com/apache/iceberg/pull/3809 (Feb 8, 2022)
- 3ef3370: https://github.com/apache/iceberg/pull/3820 (Feb 8, 2022)
- 06486d4: https://github.com/apache/iceberg/pull/3878 (Feb 8, 2022)
- 5952cc5: https://github.com/apache/iceberg/pull/3890 (Feb 8, 2022)
- 64fb58f: https://github.com/apache/iceberg/pull/3892 (Feb 8, 2022)
- 09db0ed: https://github.com/apache/iceberg/pull/3944 (Feb 8, 2022)
- 8773620: https://github.com/apache/iceberg/pull/3976 (Feb 8, 2022)
- 5967ea1: https://github.com/apache/iceberg/pull/3993 (Feb 8, 2022)
- c084cf6: https://github.com/apache/iceberg/pull/3996 (Feb 8, 2022)
- 6f3aa0b: https://github.com/apache/iceberg/pull/4008 (Feb 8, 2022)
- c19a4d0: https://github.com/apache/iceberg/pull/3758 and 3856 (Feb 8, 2022)
- d34b2a1: https://github.com/apache/iceberg/pull/3761 (Feb 8, 2022)
- 9803cb6: https://github.com/apache/iceberg/pull/2062 (Feb 8, 2022)
- 7719819: https://github.com/apache/iceberg/pull/3422 (Feb 8, 2022)
- d5d00cc: remove restriction related to legacy parquet file list (Feb 8, 2022)

2 changes: 1 addition & 1 deletion docs/config.toml
@@ -7,6 +7,6 @@ theme= "hugo-book"
BookTheme = 'auto'
BookLogo = "img/iceberg-logo-icon.png"
versions.iceberg = "" # This is populated by the github deploy workflow and is equal to the branch name
versions.nessie = "0.17.0"
versions.nessie = "0.18.0"
latestVersions.iceberg = "0.13.0" # This is used for the version badge on the "latest" site version
BookSection='docs' # This determines which directory will inform the left navigation menu
6 changes: 4 additions & 2 deletions docs/content/docs/api/java-api.md
@@ -176,8 +176,9 @@ StructType struct = Struct.of(
```java
// map<1 key: int, 2 value: optional string>
MapType map = MapType.ofOptional(
1, Types.IntegerType.get(),
2, Types.StringType.get()
1, 2,
Types.IntegerType.get(),
Types.StringType.get()
)
```
```java
@@ -203,6 +204,7 @@ Supported predicate expressions are:
* `in`
* `notIn`
* `startsWith`
* `notStartsWith`

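For example, a filter combining the new `notStartsWith` predicate with another expression might look like the following sketch; the column names and values are illustrative only:

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

// keep rows whose "event" column does not start with "debug_"
// and whose "level" column is at least 2
Expression filter = Expressions.and(
    Expressions.notStartsWith("event", "debug_"),
    Expressions.greaterThanOrEqual("level", 2));
```
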
Supported expression operations are:

23 changes: 23 additions & 0 deletions docs/content/docs/dremio/_index.md
@@ -0,0 +1,23 @@
---
title: "Dremio"
bookIconImage: ../img/dremio-logo.png
bookFlatSection: true
weight: 430
bookExternalUrlNewWindow: https://docs.dremio.com/data-formats/apache-iceberg/
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->
38 changes: 19 additions & 19 deletions docs/content/docs/flink/flink-getting-started.md
@@ -22,25 +22,25 @@ url: flink

# Flink

Apache Iceberg supports both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API to write records into an Iceberg table. Currently,
we only integrate Iceberg with Apache Flink 1.11.x.

| Feature support | Flink 1.11.0 | Notes |
|------------------------------------------------------------------------|--------------------|--------------------------------------------------------|
| [SQL create catalog](#creating-catalogs-and-using-catalogs) | ✔️ | |
| [SQL create database](#create-database) | ✔️ | |
| [SQL create table](#create-table) | ✔️ | |
| [SQL create table like](#create-table-like) | ✔️ | |
| [SQL alter table](#alter-table) | ✔️ | Only support altering table properties, Columns/PartitionKey changes are not supported now|
| [SQL drop_table](#drop-table) | ✔️ | |
| [SQL select](#querying-with-sql) | ✔️ | Support both streaming and batch mode |
| [SQL insert into](#insert-into) | ✔️ ️ | Support both streaming and batch mode |
| [SQL insert overwrite](#insert-overwrite) | ✔️ ️ | |
| [DataStream read](#reading-with-datastream) | ✔️ ️ | |
| [DataStream append](#appending-data) | ✔️ ️ | |
| [DataStream overwrite](#overwrite-data) | ✔️ ️ | |
| [Metadata tables](#inspecting-tables) | | Support Java API but does not support Flink SQL |
| [Rewrite files action](#rewrite-files-action) | ✔️ ️ | |
Apache Iceberg supports both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API. Currently,
Iceberg integration for Apache Flink is available for Flink versions 1.12, 1.13, and 1.14. Previous versions of Iceberg also support Flink 1.11.

| Feature support | Flink | Notes |
| ----------------------------------------------------------- | ----- | ------------------------------------------------------------ |
| [SQL create catalog](#creating-catalogs-and-using-catalogs) | ✔️ | |
| [SQL create database](#create-database) | ✔️ | |
| [SQL create table](#create-table) | ✔️ | |
| [SQL create table like](#create-table-like) | ✔️ | |
| [SQL alter table](#alter-table)                              | ✔️    | Only supports altering table properties; column and partition changes are not supported |
| [SQL drop_table](#drop-table) | ✔️ | |
| [SQL select](#querying-with-sql) | ✔️ | Support both streaming and batch mode |
| [SQL insert into](#insert-into) | ✔️ ️ | Support both streaming and batch mode |
| [SQL insert overwrite](#insert-overwrite) | ✔️ ️ | |
| [DataStream read](#reading-with-datastream) | ✔️ ️ | |
| [DataStream append](#appending-data) | ✔️ ️ | |
| [DataStream overwrite](#overwrite-data) | ✔️ ️ | |
| [Metadata tables](#inspecting-tables) | ️ | Support Java API but does not support Flink SQL |
| [Rewrite files action](#rewrite-files-action) | ✔️ ️ | |

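As a rough sketch of the SQL features listed above, a catalog, database, and table can be created from the Flink SQL client; the metastore URI and warehouse path below are placeholders and should match your environment:

```sql
CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'warehouse'='hdfs://nn:8020/warehouse/path'
);

USE CATALOG hive_catalog;
CREATE DATABASE IF NOT EXISTS sample_db;
CREATE TABLE sample_db.sample (id BIGINT, data STRING);
```
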
## Preparation when using Flink SQL Client

8 changes: 8 additions & 0 deletions docs/content/docs/hive/_index.md
@@ -79,6 +79,14 @@ catalog.createTable(tableId, schema, spec, tableProperties);

The table level configuration overrides the global Hadoop configuration.

#### Hive on Tez configuration

To use the Tez engine on Hive `3.1.2` or later, Tez needs to be upgraded to `0.10.1` or later, which contains the necessary fix [TEZ-4248](https://issues.apache.org/jira/browse/TEZ-4248).

To use the Tez engine on Hive `2.3.x`, you will need to manually build Tez from the `branch-0.9` branch due to a backwards incompatibility issue with Tez `0.10.1`.

You will also need to set the following property in the Hive configuration: `tez.mrreader.config.update.properties=hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids`.

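For example, the settings can be applied per session from Beeline or the Hive CLI (they can also be placed in `hive-site.xml`); this is only a sketch of the configuration described above:

```sql
SET hive.execution.engine=tez;
SET tez.mrreader.config.update.properties=hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids;
```
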
## Catalog Management

### Global Hive catalog
9 changes: 8 additions & 1 deletion docs/content/docs/integrations/aws.md
@@ -405,6 +405,11 @@ If for any reason you have to use S3A, here are the instructions:
3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine.
4. Configure AWS settings based on [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure you check the version, S3A configuration varies a lot based on the version you use).

### S3 Write Checksum Verification

To ensure the integrity of uploaded objects, checksum validation for S3 writes can be turned on by setting the catalog property `s3.checksum-enabled` to `true`.
This is turned off by default.

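For example, a Spark catalog named `my_catalog` (a placeholder) could enable checksum validation alongside `S3FileIO` with configuration similar to this sketch; the warehouse path is also a placeholder:

```sh
spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.my_catalog.s3.checksum-enabled=true
```
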
## AWS Client Customization

Many organizations have customized their way of configuring AWS clients with their own credential provider, access proxy, retry strategy, etc.
@@ -448,8 +453,10 @@ spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:{{% icebergVersio
[Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html), [Flink](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html),
[Trino](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that can run Iceberg.

You can use a [bootstrap action](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html) similar to the following to pre-install all necessary dependencies:
Starting with EMR version 6.5.0, EMR clusters can be configured to have the necessary Apache Iceberg dependencies installed without requiring bootstrap actions.
Please refer to the [official documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-create-cluster.html) on how to create a cluster with Iceberg installed.

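For example, assuming the `iceberg-defaults` classification described in the EMR release guide, a cluster might be created with a command similar to the following sketch; the instance settings are placeholders:

```sh
# create an EMR 6.5.0 cluster with the Iceberg classification enabled
aws emr create-cluster --release-label emr-6.5.0 \
    --applications Name=Spark \
    --configurations '[{"Classification":"iceberg-defaults","Properties":{"iceberg.enabled":"true"}}]' \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles
```
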
For versions before 6.5.0, you can use a [bootstrap action](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html) similar to the following to pre-install all necessary dependencies:
```sh
#!/bin/bash

2 changes: 2 additions & 0 deletions docs/content/docs/spark/spark-configuration.md
@@ -67,6 +67,7 @@ Both catalogs are configured using properties nested under the catalog name. Com
| spark.sql.catalog._catalog-name_.uri | thrift://host:port | Metastore connect URI; default from `hive-site.xml` |
| spark.sql.catalog._catalog-name_.warehouse | hdfs://nn:8020/warehouse/path | Base path for the warehouse directory |
| spark.sql.catalog._catalog-name_.cache-enabled | `true` or `false` | Whether to enable catalog cache, default value is `true` |
| spark.sql.catalog._catalog-name_.cache.expiration-interval-ms | `30000` (30 seconds) | Duration after which cached catalog entries are expired; only effective if `cache-enabled` is `true`. `-1` disables cache expiration and `0` disables caching entirely, irrespective of `cache-enabled`. Default is `30000` (30 seconds) |

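For example, both caching properties can be set together for a catalog named `my_catalog` (a placeholder); the 60-second expiration below is only illustrative:

```sh
spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.cache-enabled=true \
    --conf spark.sql.catalog.my_catalog.cache.expiration-interval-ms=60000
```
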
Additional properties can be found in common [catalog configuration](../configuration#catalog-properties).

Expand Down Expand Up @@ -162,6 +163,7 @@ spark.read
| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost |
| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | (none) | A timestamp in milliseconds to stream from; if before the oldest known ancestor snapshot, the oldest will be used |

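For example, `stream-from-timestamp` is passed as a string of epoch milliseconds when reading with Structured Streaming; the table name and the one-hour lookback below are illustrative:

```scala
// start streaming from roughly one hour ago; if that timestamp is before the
// oldest known ancestor snapshot, the oldest snapshot is used instead
val startMillis = System.currentTimeMillis() - 60 * 60 * 1000L

val df = spark.readStream
    .format("iceberg")
    .option("stream-from-timestamp", startMillis.toString)
    .load("db.table")
```
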
### Write options

28 changes: 28 additions & 0 deletions docs/content/docs/spark/spark-ddl.md
@@ -370,3 +370,31 @@ ALTER TABLE prod.db.sample WRITE ORDERED BY category ASC NULLS LAST, id DESC NUL
{{< hint info >}}
Table write order does not guarantee data order for queries. It only affects how data is written to the table.
{{< /hint >}}

`WRITE ORDERED BY` sets a global ordering where rows are ordered across tasks, like using `ORDER BY` in an `INSERT` command:

```sql
INSERT INTO prod.db.sample
SELECT id, data, category, ts FROM another_table
ORDER BY ts, category
```

To order within each task, not across tasks, use `LOCALLY ORDERED BY`:

```sql
ALTER TABLE prod.db.sample WRITE LOCALLY ORDERED BY category, id
```

### `ALTER TABLE ... WRITE DISTRIBUTED BY PARTITION`

`WRITE DISTRIBUTED BY PARTITION` will request that each partition is handled by one writer; the default implementation is hash distribution.

```sql
ALTER TABLE prod.db.sample WRITE DISTRIBUTED BY PARTITION
```

`DISTRIBUTED BY PARTITION` and `LOCALLY ORDERED BY` may be used together to distribute by partition and locally order rows within each task.

```sql
ALTER TABLE prod.db.sample WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY category, id
```
28 changes: 22 additions & 6 deletions docs/content/docs/spark/spark-procedures.md
@@ -55,7 +55,9 @@ Roll back a table to a specific snapshot ID.

To roll back to a specific time, use [`rollback_to_timestamp`](#rollback_to_timestamp).

**Note** this procedure invalidates all cached Spark plans that reference the affected table.
{{< hint info >}}
This procedure invalidates all cached Spark plans that reference the affected table.
{{< /hint >}}

#### Usage

@@ -83,7 +85,9 @@ CALL catalog_name.system.rollback_to_snapshot('db.sample', 1)

Roll back a table to the snapshot that was current at some time.

**Note** this procedure invalidates all cached Spark plans that reference the affected table.
{{< hint info >}}
This procedure invalidates all cached Spark plans that reference the affected table.
{{< /hint >}}

#### Usage

@@ -112,7 +116,9 @@ Sets the current snapshot ID for a table.

Unlike rollback, the snapshot is not required to be an ancestor of the current table state.

**Note** this procedure invalidates all cached Spark plans that reference the affected table.
{{< hint info >}}
This procedure invalidates all cached Spark plans that reference the affected table.
{{< /hint >}}

#### Usage

@@ -143,7 +149,9 @@ Cherry-picking creates a new snapshot from an existing snapshot without altering

Only append and dynamic overwrite snapshots can be cherry-picked.

**Note** this procedure invalidates all cached Spark plans that reference the affected table.
{{< hint info >}}
This procedure invalidates all cached Spark plans that reference the affected table.
{{< /hint >}}

#### Usage

@@ -192,6 +200,9 @@ the `expire_snapshots` procedure will never remove files which are still require
| `table` | ✔️ | string | Name of the table to update |
| `older_than` | ️ | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
| `retain_last` | | int | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
| `max_concurrent_deletes` | | int | Size of the thread pool used for delete file actions (by default, no thread pool is used) |

If `older_than` and `retain_last` are omitted, the table's [expiration properties](./configuration/#table-behavior-properties) will be used.

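For example, the delete thread pool can be combined with the other arguments in a call similar to this sketch; the table name, timestamp, and sizes are illustrative:

```sql
CALL catalog_name.system.expire_snapshots(table => 'db.sample', older_than => TIMESTAMP '2021-06-30 00:00:00.000', retain_last => 10, max_concurrent_deletes => 8)
```
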
#### Output

@@ -227,6 +238,7 @@ Used to remove files which are not referenced in any metadata files of an Iceber
| `older_than` | ️ | timestamp | Remove orphan files created before this timestamp (Defaults to 3 days ago) |
| `location` | | string | Directory to look for files in (defaults to the table's location) |
| `dry_run` | | boolean | When true, don't actually remove files (defaults to false) |
| `max_concurrent_deletes` | | int | Size of the thread pool used for delete file actions (by default, no thread pool is used) |

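For example, a dry run that also sizes the delete thread pool could be requested with a call similar to this sketch; the table name and pool size are illustrative:

```sql
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', dry_run => true, max_concurrent_deletes => 4)
```
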
#### Output

@@ -308,7 +320,9 @@ Data files in manifests are sorted by fields in the partition spec. This procedu
See the [`RewriteManifestsAction` Javadoc](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/actions/RewriteManifestsAction.html)
to see more configuration options.

**Note** this procedure invalidates all cached Spark plans that reference the affected table.
{{< hint info >}}
This procedure invalidates all cached Spark plans that reference the affected table.
{{< /hint >}}

#### Usage

@@ -350,11 +364,13 @@ When inserts or overwrites run on the snapshot, new files are placed in the snap

When finished testing a snapshot table, clean it up by running `DROP TABLE`.

**Note** Because tables created by `snapshot` are not the sole owners of their data files, they are prohibited from
{{< hint info >}}
Because tables created by `snapshot` are not the sole owners of their data files, they are prohibited from
actions like `expire_snapshots` which would physically delete data files. Iceberg deletes, which only affect metadata,
are still allowed. In addition, any operations which affect the original data files will disrupt the Snapshot's
integrity. DELETE statements executed against the original Hive table will remove original data files and the
`snapshot` table will no longer be able to access them.
{{< /hint >}}

See [`migrate`](#migrate) to replace an existing table with an Iceberg table.
