Skip to content

Conversation

@kbatuigas
Copy link
Contributor

@kbatuigas kbatuigas commented Aug 22, 2025

Description

This pull request updates the AWS Glue integration documentation for Redpanda Iceberg tables. The changes clarify partitioning limitations, add instructions for manual table deletion, enhance IAM policy requirements, and provide more explicit configuration guidance for credentials and deletion behavior.

Documentation improvements for AWS Glue and Iceberg:

  • Added a section explaining that AWS Glue does not support partitioning on nested fields and that an empty partition spec can cause issues with Amazon Redshift queries.
  • Added instructions for manual deletion of Iceberg tables in AWS Glue, including the requirement to set iceberg_delete to false before deleting tables and topics.
  • Updated sample IAM policy to include required S3 permissions (s3:PutObject, s3:PutObjectAcl, s3:DeleteObject, s3:ListBucket) for Iceberg catalog operations.

Configuration guidance updates:

  • Added and clarified the iceberg_rest_catalog_credentials_source property, specifying when to set it to config_file, and updated example configuration blocks accordingly. [1] [2] [3] [4]
  • Updated Athena example query to use limit 10 instead of limit 1 for better demonstration.

Resolves https://redpandadata.atlassian.net/browse/
Review deadline:

Page previews

Query Iceberg Topics using AWS Glue

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

@kbatuigas kbatuigas requested a review from a team as a code owner August 22, 2025 00:35
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 22, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Documentation updates for AWS Glue and Iceberg integration:

  • Added a “Manual deletion of Iceberg tables” section, noting manual cleanup steps and the iceberg_delete property (default false).
  • Introduced and documented iceberg_rest_catalog_credentials_source with guidance to set it to config_file, especially for cloud deployments.
  • Updated cluster configuration examples and commands to include iceberg_rest_catalog_credentials_source=config_file.
  • Added environment-specific notes in cluster-properties for cloud vs non-cloud behavior and fallbacks.

Sequence Diagram(s)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Assessment against linked issues

Objective Addressed Explanation
Require setting iceberg_rest_catalog_credentials_source to config_file for Glue in cloud (DOC-1567)

Assessment against linked issues: Out-of-scope changes

Code Change Explanation
Added “Manual deletion of Iceberg tables” section and documentation of iceberg_delete property (modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc) Not related to setting iceberg_rest_catalog_credentials_source to config_file; introduces separate operational guidance and a different property.

Possibly related PRs

Suggested reviewers

  • simon0191
  • mattschumpert
  • paulohtb6
  • micheleRP
✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch DOC-1567-to-configure-glue-in-cloud-customers-need-to-set-i

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@netlify
Copy link

netlify bot commented Aug 22, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 73fb869
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/68ae7f7ea38a1300085aac10
😎 Deploy Preview https://deploy-preview-1336--redpanda-docs-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modules/reference/pages/properties/cluster-properties.adoc (1)

2225-2245: Inconsistency: SigV4 referenced elsewhere but not listed as an accepted authentication mode

Multiple properties (for example, iceberg_rest_catalog_aws_access_key, iceberg_rest_catalog_aws_region) explicitly reference “aws_sigv4 authentication mode”, but iceberg_rest_catalog_authentication_mode documents only none, bearer, and oauth2. This omission can lead to misconfiguration for Glue. Please include aws_sigv4 in the description (and preferably add an explicit Accepted values list).

Suggested amendment:

-=== iceberg_rest_catalog_authentication_mode
-
-The authentication mode for client requests made to the Iceberg catalog. Choose from: `none`, `bearer`, and `oauth2`. In `oauth2` mode, the credentials specified in `iceberg_rest_catalog_client_id` and `iceberg_rest_catalog_client_secret` are used to obtain a bearer token from the URI defined by `iceberg_rest_catalog_oauth2_server_uri.`. In `bearer` mode, the token specified in `iceberg_rest_catalog_token` is used unconditionally, and no attempts are made to refresh the token. 
+=== iceberg_rest_catalog_authentication_mode
+
+The authentication mode for client requests made to the Iceberg catalog. Choose from: `none`, `bearer`, `oauth2`, and `aws_sigv4`.
+In `oauth2` mode, the credentials specified in `iceberg_rest_catalog_client_id` and `iceberg_rest_catalog_client_secret` are used to obtain a bearer token from the URI defined by `iceberg_rest_catalog_oauth2_server_uri.`. In `bearer` mode, the token specified in `iceberg_rest_catalog_token` is used unconditionally, and no attempts are made to refresh the token. In `aws_sigv4` mode (for example, when using AWS Glue), requests are signed with AWS SigV4 using the credentials source configured in `iceberg_rest_catalog_credentials_source`.
+
+*Accepted values:* `none`, `bearer`, `oauth2`, `aws_sigv4`
modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc (1)

161-173: Broken multiline shell command due to missing continuations

In the Cloud rpk cluster config set example, the line for iceberg_rest_catalog_base_location (Line 166) lacks a trailing backslash, so everything after that becomes separate commands. Also add trailing backslashes to all continued property lines except the last.

Apply this fix:

 rpk cluster config set \
   iceberg_enabled=true \
   iceberg_catalog_type=rest \
   iceberg_rest_catalog_endpoint=https://glue.<glue-region>.amazonaws.com/iceberg \
   iceberg_rest_catalog_authentication_mode=aws_sigv4 \
-  iceberg_rest_catalog_base_location=s3://<bucket-name>/<warehouse-path>
+  iceberg_rest_catalog_base_location=s3://<bucket-name>/<warehouse-path> \
   # Set credentials source to config_file if using 
   # iceberg_rest_catalog_aws_access_key and iceberg_rest_catalog_aws_secret_key
-  iceberg_rest_catalog_credentials_source=config_file
-  iceberg_rest_catalog_aws_region=<glue-region>
-  iceberg_rest_catalog_aws_access_key=<glue-access-key>
+  iceberg_rest_catalog_credentials_source=config_file \
+  iceberg_rest_catalog_aws_region=<glue-region> \
+  iceberg_rest_catalog_aws_access_key=<glue-access-key> \
   iceberg_rest_catalog_aws_secret_key=${secrets.<glue-secret-key-name>}
🧹 Nitpick comments (4)
modules/reference/pages/properties/cluster-properties.adoc (1)

2263-2269: Environment-specific guidance looks good; add a precision about when config_file is required

The non-cloud/cloud split is clear. For Cloud, the sentence currently reads as a blanket requirement. To avoid implying that IAM role or STS-based auth is unsupported, tighten the scope to “when providing explicit access/secret keys”.

Proposed tweak:

-ifdef::env-cloud[]
-Source of AWS credentials for Iceberg REST catalog SigV4 authentication. If using `iceberg_rest_catalog_aws_access_key` and `iceberg_rest_catalog_aws_secret_key` for Glue catalog authentication, you must set this property to `config_file`.
-endif::[]
+ifdef::env-cloud[]
+Source of AWS credentials for Iceberg REST catalog SigV4 authentication. When you provide explicit keys with
+`iceberg_rest_catalog_aws_access_key` and `iceberg_rest_catalog_aws_secret_key` for Glue catalog authentication, set this property to `config_file`. If you use IAM roles or STS, choose the corresponding credentials source instead.
+endif::[]
modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc (3)

47-52: Manual deletion section: call out topic-level overrides explicitly

The guidance to set the cluster property iceberg_delete to false is correct for Glue. Add a one-liner noting that a topic-level override (redpanda.iceberg.delete) set to true will take precedence, so users should unset or set it to false on affected topics.

Proposed addition:

 When `iceberg_delete` is set to `false`, you can delete the Redpanda topic, and then delete the table in AWS Glue and the table data (Parquet) files in the S3 bucket. You are required to delete the table data entirely if you delete an Iceberg topic and then create a new Iceberg topic with the same name.
+If any topic explicitly sets the `redpanda.iceberg.delete` topic property to `true`, override it to `false` (or remove the override) to ensure manual deletion behavior applies.

107-113: Scope “config_file” requirement to explicit key usage; mention alternatives

Good to emphasize config_file for Cloud when keys are provided. To prevent over-constraining Cloud users, add a qualifier and reference other supported sources (for IAM roles/STS).

Proposed tweak:

-You must configure credentials for the AWS Glue Data Catalog integration using the following properties:
+Configure credentials for the AWS Glue Data Catalog integration as follows. When providing explicit access and secret keys, set:

and add a follow-up sentence after the list:

+If you instead use IAM roles or STS in your environment, omit the access/secret keys and set `iceberg_rest_catalog_credentials_source` to the appropriate value (for example, `sts`).

229-234: Minor mismatch: text says 10 rows, query uses LIMIT 1

Either change the prose to “1 row” or update the query to LIMIT 10 to match the description. Prefer aligning to 10 for consistency with typical preview queries.

-SELECT * FROM "AwsDataCatalog"."redpanda"."<table-name>" limit 1;
+SELECT * FROM "AwsDataCatalog"."redpanda"."<table-name>" LIMIT 10;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5aac622 and 598234b.

📒 Files selected for processing (2)
  • modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc (4 hunks)
  • modules/reference/pages/properties/cluster-properties.adoc (1 hunks)
🔇 Additional comments (1)
modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc (1)

126-129: Non-cloud example: good inclusion of iceberg_delete=false for Glue

Including iceberg_delete: false upfront is helpful and aligns with the manual deletion requirement. No changes requested.


The AWS Glue catalog integration requires Redpanda Iceberg tables to be manually deleted. To manually delete Iceberg tables, you must first set the cluster property config_ref:iceberg_delete,true,properties/cluster-properties[`iceberg_delete`] to `false` when you configure the catalog integration.

When `iceberg_delete` is set to `false`, you can delete the Redpanda topic, and then delete the table in AWS Glue and the table data (Parquet) files in the S3 bucket. You are required to delete the table data entirely if you delete an Iceberg topic and then create a new Iceberg topic with the same name.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then delete the table in AWS Glue and the table data (Parquet) files in the S3 bucket

maybe not too useful to call out parquet specifically, and also applies to metadata files too, so something like:

"...and then delete the table in AWS Glue and the Iceberg metadata and data files in the S3 bucket."

You are required to delete the table data entirely if you delete an Iceberg topic and then create a new Iceberg topic with the same name.

maybe:

"If your intent is to recreate the topic, you are required to delete the table data entirely if you delete an Iceberg topic before recreating the topic."

@kbatuigas kbatuigas force-pushed the DOC-1567-to-configure-glue-in-cloud-customers-need-to-set-i branch from 87f43cc to 23b00ea Compare August 25, 2025 04:18
@kbatuigas kbatuigas requested a review from andrwng August 25, 2025 15:32

If you want to use partitioning, you must specify a custom partition specification using your own partition columns (columns that are not nested).

NOTE: An empty partition spec `()` can cause a known issue in Amazon Redshift that prevents queries from running successfully on the table.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should doc this unless we have a way to doc that it's true only for older versions -- there's a fix for it in Redpanda 25.2.2 and 25.1.10 (redpanda-data/redpanda#27205 mentioned in the releases https://github.com/redpanda-data/redpanda/releases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattschumpert @andrwng Should we add something along the lines of

NOTE: In Redpanda versions 25.2.1, and 25.1.9 and earlier, an empty partition spec () can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, upgrade Redpanda to versions 25.2.2 or 25.1.10 and later.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me if that's what we do for other bugs (vs version specific docs).

It's also probably worth instructing users to not use an empty partition spec with Glue in these older versions. (or upgrade)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @kbatuigas We also need to add the guidance about the delete policy for the table. Glue users (for the foreseeable future), must configure each topic's policy or the cluster property to NOT delete the table when the topic is deleted, and then take care to delete the table data if desired and not later recreate a new topic of the same name containing different data. https://docs.redpanda.com/current/reference/properties/cluster-properties/#iceberg_delete

Do I have that right @andrwng ?

Copy link

@andrwng andrwng Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat, and sorry this is likely gonna need another tweak: yesterday we just merged this change redpanda-data/redpanda#27367 that will make Redpanda not use purgeRequested=true with Glue, meaning users will be able to leave iceberg_delete=true and have Redpanda drop the table on topic deletion, though they will need to manually delete the data.

This is going to be backported into the next minors (likely 25.2.3 and 25.1.11).

Copy link
Contributor

@Feediver1 Feediver1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing one build error:
12:40:15 AM: [04:40:15.327] ERROR (asciidoctor): target of xref not found: reference:properties/broker-properties.adoc#schema_registry_auth_method
but it may have come in via the base branch you used.

@kbatuigas kbatuigas force-pushed the DOC-1567-to-configure-glue-in-cloud-customers-need-to-set-i branch from f6e7691 to e767290 Compare August 25, 2025 23:54

[NOTE]
====
In Redpanda versions 25.2.1, and 25.1.9 and earlier, an empty partition spec () can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, specify custom partitioning, or upgrade Redpanda to versions 25.2.2 or 25.1.10 and later.
Copy link
Contributor

@micheleRP micheleRP Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In Redpanda versions 25.2.1, and 25.1.9 and earlier, an empty partition spec () can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, specify custom partitioning, or upgrade Redpanda to versions 25.2.2 or 25.1.10 and later.
In Redpanda versions 25.2.1, 25.1.9, and earlier, an empty partition spec `()` can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, specify custom partitioning or upgrade Redpanda to versions 25.2.2, 25.1.10, or later.

This is confusing. Is there a version in between 25.1.10 and 25.2.1? Seems possible to upgrade from 25.1.9 or an earlier version to 25.2.1 and find yourself in the same problem. To make it safe and simple, can we just say "In version 25.2.1 and earlier... specify custom partitioning or upgrade Redpanda to version 25.2.2 or later."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@micheleRP There currently aren't any versions between 25.1.10 and 25.2.1, but I don't know if that will change. You're right, it would theoretically be possible to upgrade from <=25.1.9 to 25.2.1 and still run into this issue. I'm not sure if it's more typical to upgrade from a 25.1.x up to 25.2.x, or just keep the upgrade within minor/patch versions... maybe @andrwng has insight. But I think it's probably safest to go with your suggestion for now, so I'll add that in.

endif::[]

ifdef::env-cloud[]
You must allow Redpanda access to AWS Glue services in your AWS account. It is recommended to create a new IAM policy or role that manages access to AWS Glue, allowing all AWS Glue API actions (`"glue:*"`) on the following resources:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
You must allow Redpanda access to AWS Glue services in your AWS account. It is recommended to create a new IAM policy or role that manages access to AWS Glue, allowing all AWS Glue API actions (`"glue:*"`) on the following resources:
You must allow Redpanda access to AWS Glue services in your AWS account. Create a new IAM policy or role that manages access to AWS Glue, allowing all AWS Glue API actions (`"glue:*"`) on the following resources:

[,json]
----
{
"Version": "2012-10-17",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a comment that this seems very old?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just part of the IAM policy syntax by AWS

@kbatuigas kbatuigas force-pushed the DOC-1567-to-configure-glue-in-cloud-customers-need-to-set-i branch from bbc62e3 to 73fb869 Compare August 27, 2025 03:46
@kbatuigas kbatuigas merged commit f12c6c8 into main Aug 27, 2025
7 checks passed
@kbatuigas kbatuigas deleted the DOC-1567-to-configure-glue-in-cloud-customers-need-to-set-i branch August 27, 2025 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants