How to query Iceberg topics using Snowflake and Open Catalog (#957)
Co-authored-by: Joyce Fee <[email protected]>
kbatuigas and Feediver1 authored Feb 7, 2025
1 parent f585b97 commit 1ca009e
Showing 4 changed files with 229 additions and 3 deletions.
4 changes: 3 additions & 1 deletion modules/ROOT/nav.adoc
@@ -170,12 +170,14 @@
*** xref:manage:security/iam-roles.adoc[]
** xref:manage:tiered-storage-linux/index.adoc[Tiered Storage]
*** xref:manage:tiered-storage.adoc[]
*** xref:manage:topic-iceberg-integration.adoc[Iceberg topics]
*** xref:manage:fast-commission-decommission.adoc[]
*** xref:manage:mountable-topics.adoc[]
*** xref:manage:remote-read-replicas.adoc[Remote Read Replicas]
*** xref:manage:topic-recovery.adoc[Topic Recovery]
*** xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore]
** xref:manage:iceberg/index.adoc[Iceberg]
*** xref:manage:iceberg/topic-iceberg-integration.adoc[Iceberg topics]
*** xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[Query Iceberg topics with Snowflake]
** xref:manage:schema-reg/index.adoc[Schema Registry]
*** xref:manage:schema-reg/schema-reg-overview.adoc[Overview]
*** xref:manage:schema-reg/manage-schema-reg.adoc[]
3 changes: 3 additions & 0 deletions modules/manage/pages/iceberg/index.adoc
@@ -0,0 +1,3 @@
= Integrate Redpanda with Iceberg
:description: Generate Iceberg tables for your Redpanda topics for data lakehouse access.
:page-layout: index
@@ -0,0 +1,214 @@
= Query Iceberg Topics using Snowflake and Open Catalog
:description: Add Redpanda topics as Iceberg tables that you can query in Snowflake using an Open Catalog integration.
:page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration
:page-beta: true

[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====

This guide walks you through querying Redpanda topics as Iceberg tables in https://docs.snowflake.com/en/user-guide/tables-iceberg[Snowflake^], with AWS S3 as object storage and a catalog integration using https://other-docs.snowflake.com/en/opencatalog/overview[Open Catalog^].

== Prerequisites

* xref:manage:tiered-storage.adoc#configure-object-storage[Object storage configured] for your cluster and xref:manage:tiered-storage.adoc#enable-tiered-storage[Tiered Storage enabled] for the topics for which you want to generate Iceberg tables.
** Note the S3 bucket URI so that you can configure it as external storage for Open Catalog.
* A Snowflake account.
* An Open Catalog account. To https://other-docs.snowflake.com/en/opencatalog/create-open-catalog-account[create an Open Catalog account], you require ORGADMIN access in Snowflake.
* An internal catalog created in Open Catalog with your Tiered Storage AWS S3 bucket configured as external storage.
** Follow this guide to https://other-docs.snowflake.com/en/opencatalog/create-catalog#create-a-catalog-using-amazon-simple-storage-service-amazon-s3[create a catalog] with the S3 bucket configured as external storage. You require admin permissions to carry out these steps in AWS:
. If you don't already have one, create an IAM policy that gives Open Catalog read and write access to your S3 bucket.
. Create an IAM role and attach the IAM policy to the role.
. After creating a new catalog in Open Catalog, grant the catalog's AWS IAM user access to the S3 bucket.
* A Snowflake https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume[external volume] set up using the Tiered Storage bucket.
** Follow this guide to https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume-s3[configure the external volume with S3]. You can use the same IAM policy as the catalog for the external volume's IAM role and user.

== Set up catalog integration using Open Catalog

=== Create a new Open Catalog service connection for Redpanda

To create a new service connection to integrate the Iceberg-enabled topics into Open Catalog:

. In Open Catalog, select *Connections*, then *+ Connection*.
. In *Configure Service Connection*, provide a name. Open Catalog creates a new principal with this name.
. Make sure *Create new principal role* is toggled on.
. Enter a name for the principal role. Then, click *Create*.

After you create the connection, Open Catalog displays the client ID and client secret. Save these credentials so that you can add them to your cluster configuration in a later step.

=== Create a catalog role

Grant privileges to the principal created in the previous step:

. In Open Catalog, select *Catalogs*, and select your catalog.
. On the *Roles* tab of your catalog, click *+ Catalog Role*.
. Give the catalog role a name.
. Under *Privileges*, select `CATALOG_MANAGE_CONTENT`. This provides full management https://other-docs.snowflake.com/en/opencatalog/access-control#catalog-privileges[privileges] for the catalog. Then, click *Create*.
. On the *Roles* tab of the catalog, click *Grant to Principal Role*.
. Select the catalog role you just created.
. Select the principal role you created earlier. Click *Grant*.

=== Update cluster configuration

To configure your Redpanda cluster to enable Iceberg on a topic, as well as set up the integration with Open Catalog:

. Run `rpk cluster config edit` to set the `iceberg_enabled` property to `true`, along with the catalog integration properties shown in the following example. If you change this configuration on a running cluster, you must restart the cluster for the changes to take effect:
+
[,yaml]
----
iceberg_enabled: true
iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://<snowflake-orgname>-<open-catalog-account-name>.snowflakecomputing.com/polaris/api/catalog
iceberg_rest_catalog_client_id: <open-catalog-connection-client-id>
iceberg_rest_catalog_client_secret: <open-catalog-connection-client-secret>
iceberg_rest_catalog_prefix: <open-catalog-name>
# Optional
iceberg_translation_interval_ms_default: 1000
iceberg_catalog_commit_interval_ms: 1000
----
+
Use your own values for the following placeholders:
+
--
- `<snowflake-orgname>` and `<open-catalog-account-name>`: Your https://docs.snowflake.com/en/sql-reference/sql/create-catalog-integration-open-catalog#required-parameters[Open Catalog account URI] is composed of these values.
+
TIP: In Snowflake, navigate to **Admin**, then **Accounts**. Click the ellipsis near your Open Catalog account name, and select **Manage URLs**. The **Current URL** contains `<snowflake-orgname>` and `<open-catalog-account-name>`.
- `<open-catalog-connection-client-id>`: The client ID of the service connection you created in an earlier step.
- `<open-catalog-connection-client-secret>`: The client secret of the service connection you created in an earlier step.
- `<open-catalog-name>`: The name of your catalog in Open Catalog.
--
+
If the update succeeds, the command prints a confirmation:
+
[,bash,role=no-copy]
----
Successfully updated configuration. New configuration version is 2.
----

. You must restart your cluster so that the configuration changes take effect.

. Enable the integration for a topic by setting the topic property `redpanda.iceberg.mode`. The `key_value` mode used in this guide creates an Iceberg table for the topic with two columns: one for the record metadata, including the key, and a binary column for the record's value. See xref:manage:iceberg/topic-iceberg-integration.adoc#enable-iceberg-integration[Enable Iceberg integration] for more details on Iceberg modes. The following examples show how to use xref:get-started:rpk-install.adoc[`rpk`] to either create a new topic or alter the configuration of an existing topic so that the Iceberg mode is `key_value`.
+
.Create a new topic and set `redpanda.iceberg.mode`:
[,bash]
----
rpk topic create <topic-name> --topic-config=redpanda.iceberg.mode=key_value
----
+
.Set `redpanda.iceberg.mode` for an existing topic:
[,bash]
----
rpk topic alter-config <topic-name> --set redpanda.iceberg.mode=key_value
----

. Produce records to the topic. For example:
+
[,bash]
----
printf 'hello world\nfoo bar\nbaz qux\n' | rpk topic produce <topic-name> --format='%k %v\n'
----

The topic should now appear as a table in Open Catalog. To verify:

. In Open Catalog, select *Catalogs*, then open your catalog.
. Under your catalog, you should see the `redpanda` namespace, and a table with the name of your topic. The `redpanda` namespace and the table are automatically added for you.
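
The endpoint and the catalog properties from the cluster configuration step can also be assembled in a script before you paste them into `rpk cluster config edit`. The following Python sketch is illustrative only. The function names and the sample values (`myorg`, `mycatalogacct`, `my_catalog`) are hypothetical, and the code only formats the properties listed in this guide; it does not apply them to a cluster.

```python
def open_catalog_endpoint(org_name: str, account_name: str) -> str:
    """Compose the Open Catalog REST endpoint from the two account identifiers."""
    return (
        f"https://{org_name}-{account_name}"
        ".snowflakecomputing.com/polaris/api/catalog"
    )


def iceberg_rest_properties(org_name, account_name, client_id, client_secret, catalog_name):
    """Return the cluster properties from this guide as an ordered mapping."""
    return {
        "iceberg_enabled": "true",
        "iceberg_catalog_type": "rest",
        "iceberg_rest_catalog_endpoint": open_catalog_endpoint(org_name, account_name),
        "iceberg_rest_catalog_client_id": client_id,
        "iceberg_rest_catalog_client_secret": client_secret,
        "iceberg_rest_catalog_prefix": catalog_name,
    }


if __name__ == "__main__":
    # Hypothetical sample values; substitute your own identifiers and credentials.
    props = iceberg_rest_properties(
        "myorg", "mycatalogacct", "<client-id>", "<client-secret>", "my_catalog"
    )
    for name, value in props.items():
        print(f"{name}: {value}")
```

Keeping the endpoint in one helper makes it easier to reuse the same URI later as the `CATALOG_URI` value in Snowflake.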

== Query Iceberg table in Snowflake

To query the topic in Snowflake, you must create a https://docs.snowflake.com/en/user-guide/tables-iceberg#catalog-integration[catalog integration^] so that Snowflake has access to the table data and metadata.

=== Configure catalog integration with Snowflake

. Run the https://docs.snowflake.com/sql-reference/sql/create-catalog-integration-open-catalog[`CREATE CATALOG INTEGRATION`] command in Snowflake:
+
[,sql]
----
CREATE CATALOG INTEGRATION <catalog-integration-name>
CATALOG_SOURCE = POLARIS
TABLE_FORMAT = ICEBERG
CATALOG_NAMESPACE = 'redpanda'
REST_CONFIG = (
CATALOG_URI = '<open-catalog-uri>'
WAREHOUSE = '<open-catalog-name>'
)
REST_AUTHENTICATION = (
TYPE = OAUTH
OAUTH_CLIENT_ID = '<open-catalog-connection-client-id>'
OAUTH_CLIENT_SECRET = '<open-catalog-connection-client-secret>'
OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
)
REFRESH_INTERVAL_SECONDS = 30
ENABLED = TRUE;
----
+
Use your own values for the following placeholders:
+
- `<catalog-integration-name>`: Provide a name for your Iceberg catalog integration in Snowflake.
- `<open-catalog-uri>`: Your https://docs.snowflake.com/en/sql-reference/sql/create-catalog-integration-open-catalog#required-parameters[Open Catalog account URI] (`https://<snowflake-orgname>-<open-catalog-account-name>.snowflakecomputing.com/polaris/api/catalog`).
- `<open-catalog-name>`: The name of your catalog in Open Catalog.
- `<open-catalog-connection-client-id>`: The client ID of the service connection you created in an earlier step.
- `<open-catalog-connection-client-secret>`: The client secret of the service connection you created in an earlier step.

. Run the following command to verify that the catalog is integrated correctly:
+
[,sql]
----
SELECT SYSTEM$LIST_ICEBERG_TABLES_FROM_CATALOG('<catalog-integration-name>');
----
+
[,bash,role="no-copy no-placeholders"]
----
# Example result for redpanda.iceberg.mode=key_value
+-----------------------------------------------------------------------+
| SYSTEM$LIST_ICEBERG_TABLES_FROM_CATALOG('<catalog_integration_name>') |
+-----------------------------------------------------------------------+
| [{"namespace":"redpanda","name":"<table_name>"}] |
+-----------------------------------------------------------------------+
----
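
If you script this check, you can parse the JSON array returned by `SYSTEM$LIST_ICEBERG_TABLES_FROM_CATALOG` to confirm that your topic's table is listed. The following is a minimal Python sketch, assuming the result has the shape shown in the example output. The helper name and the topic name `orders` are hypothetical.

```python
import json


def catalog_has_table(result_json: str, topic_name: str, namespace: str = "redpanda") -> bool:
    """Return True if the listing contains a table for the topic in the namespace."""
    tables = json.loads(result_json)
    return any(
        entry.get("namespace") == namespace and entry.get("name") == topic_name
        for entry in tables
    )


if __name__ == "__main__":
    # Hypothetical result string copied from the Snowflake query output.
    sample = '[{"namespace":"redpanda","name":"orders"}]'
    print(catalog_has_table(sample, "orders"))
```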

=== Create Iceberg table in Snowflake

After creating the catalog integration, create an externally managed table in Snowflake. You run your Snowflake queries against this table.

In your Snowflake database, run the https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table-rest[`CREATE ICEBERG TABLE`] command. The following example also specifies that the table should automatically refresh metadata:

[,sql]
----
CREATE ICEBERG TABLE <table-name>
CATALOG = '<catalog-integration-name>'
EXTERNAL_VOLUME = '<iceberg-external-volume-name>'
CATALOG_TABLE_NAME = '<topic-name>'
AUTO_REFRESH = TRUE;
----

Use your own values for the following placeholders:

- `<table-name>`: Provide a name for your table in Snowflake.
- `<catalog-integration-name>`: The name of the catalog integration you configured in an earlier step.
- `<iceberg-external-volume-name>`: The name of the external volume you configured using the Tiered Storage bucket.
- `<topic-name>`: The name of the table in your catalog, which is the same as your Redpanda topic name.

=== Query table

To verify that Snowflake has successfully created the table containing the topic data, run the following:

[,sql]
----
SELECT * FROM <table-name>;
----

Your query results should look like the following:

[,bash,role=no-copy]
----
# Example for redpanda.iceberg.mode=key_value with 3 records produced to topic
+--------------------------------------------------------------------------------------------------------------+------------+
| REDPANDA                                                                                                     | VALUE      |
+--------------------------------------------------------------------------------------------------------------+------------+
| { "partition": 0, "offset": 0, "timestamp": "2025-02-07 16:29:50.122", "headers": null, "key": "68656C6C6F"} | 776F726C64 |
| { "partition": 0, "offset": 1, "timestamp": "2025-02-07 16:29:50.122", "headers": null, "key": "666F6F"}     | 626172     |
| { "partition": 0, "offset": 2, "timestamp": "2025-02-07 16:29:50.122", "headers": null, "key": "62617A"}     | 717578     |
+--------------------------------------------------------------------------------------------------------------+------------+
----
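
The `VALUE` column is binary, so the example output shows it, along with the record key, as hexadecimal strings. As a quick check, the following Python sketch decodes the hex values from the example output back into the UTF-8 text that was produced to the topic. The helper name is hypothetical.

```python
def decode_hex(hex_string: str) -> str:
    """Decode a hexadecimal string into UTF-8 text."""
    return bytes.fromhex(hex_string).decode("utf-8")


# (key, value) pairs copied from the example query results.
rows = [
    ("68656C6C6F", "776F726C64"),
    ("666F6F", "626172"),
    ("62617A", "717578"),
]

for key_hex, value_hex in rows:
    print(decode_hex(key_hex), decode_hex(value_hex))
# prints:
# hello world
# foo bar
# baz qux
```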
@@ -1,7 +1,8 @@
= Iceberg Topics
:description: Learn how to integrate Redpanda topics with Apache Iceberg.
:page-context-links: [{"name": "Linux", "to": "manage:topic-iceberg-integration.adoc" } ]
:page-context-links: [{"name": "Linux", "to": "manage:iceberg/topic-iceberg-integration.adoc" } ]
:page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration
:page-aliases: manage:topic-iceberg-integration.adoc
:page-beta: true

[NOTE]
@@ -272,7 +273,7 @@ Set the cluster configuration property `iceberg_catalog_type` with one of the fo

Once you have enabled the Iceberg integration for a topic and selected a catalog type, you cannot switch to another catalog type.

For production use cases, Redpanda recommends the `rest` option with REST-enabled Iceberg catalog services such as https://docs.tabular.io/[Tabular^], https://docs.databricks.com/en/data-governance/unity-catalog/index.html[Databricks Unity^] and https://github.com/apache/polaris[Apache Polaris^].
For production use cases, Redpanda recommends the `rest` option with REST-enabled Iceberg catalog services such as https://docs.tabular.io/[Tabular^], https://docs.databricks.com/en/data-governance/unity-catalog/index.html[Databricks Unity^] and https://other-docs.snowflake.com/en/opencatalog/overview[Snowflake Open Catalog^].

For an Iceberg REST catalog, set the following additional cluster configuration properties:

@@ -323,6 +324,8 @@ SELECT * FROM streaming.redpanda.ClickEvent;

Spark can use the REST catalog to automatically discover the topic's Iceberg table.

See also: xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[]

==== File system-based catalog (`object_storage`)

If you are using the `object_storage` catalog type, you must set up the catalog integration in your processing engine accordingly. For example, you can configure Spark to use a file system-based catalog with at least the following properties, if using AWS S3 for object storage:
@@ -432,6 +435,10 @@ FROM <catalog-name>.ClickEvent_key_value;
+------------------------------------------------------------------------------+
----

== Next steps

* xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[]

== Suggested reading

* xref:manage:schema-reg/schema-id-validation.adoc[]