-
Notifications
You must be signed in to change notification settings - Fork 48
Additional guidance for Iceberg cluster configs for Glue catalog #1336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional guidance for Iceberg cluster configs for Glue catalog #1336
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughDocumentation updates for AWS Glue and Iceberg integration:
Sequence Diagram(s)Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Assessment against linked issues
Assessment against linked issues: Out-of-scope changes
Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
✅ Deploy Preview for redpanda-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modules/reference/pages/properties/cluster-properties.adoc (1)
2225-2245: Inconsistency: SigV4 referenced elsewhere but not listed as an accepted authentication modeMultiple properties (for example,
iceberg_rest_catalog_aws_access_key,iceberg_rest_catalog_aws_region) explicitly reference “aws_sigv4 authentication mode”, buticeberg_rest_catalog_authentication_modedocuments onlynone,bearer, andoauth2. This omission can lead to misconfiguration for Glue. Please includeaws_sigv4in the description (and preferably add an explicit Accepted values list).Suggested amendment:
-=== iceberg_rest_catalog_authentication_mode - -The authentication mode for client requests made to the Iceberg catalog. Choose from: `none`, `bearer`, and `oauth2`. In `oauth2` mode, the credentials specified in `iceberg_rest_catalog_client_id` and `iceberg_rest_catalog_client_secret` are used to obtain a bearer token from the URI defined by `iceberg_rest_catalog_oauth2_server_uri.`. In `bearer` mode, the token specified in `iceberg_rest_catalog_token` is used unconditionally, and no attempts are made to refresh the token. +=== iceberg_rest_catalog_authentication_mode + +The authentication mode for client requests made to the Iceberg catalog. Choose from: `none`, `bearer`, `oauth2`, and `aws_sigv4`. +In `oauth2` mode, the credentials specified in `iceberg_rest_catalog_client_id` and `iceberg_rest_catalog_client_secret` are used to obtain a bearer token from the URI defined by `iceberg_rest_catalog_oauth2_server_uri.`. In `bearer` mode, the token specified in `iceberg_rest_catalog_token` is used unconditionally, and no attempts are made to refresh the token. In `aws_sigv4` mode (for example, when using AWS Glue), requests are signed with AWS SigV4 using the credentials source configured in `iceberg_rest_catalog_credentials_source`. + +*Accepted values:* `none`, `bearer`, `oauth2`, `aws_sigv4`modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc (1)
161-173: Broken multiline shell command due to missing continuationsIn the Cloud
rpk cluster config setexample, the line foriceberg_rest_catalog_base_location(Line 166) lacks a trailing backslash, so everything after that becomes separate commands. Also add trailing backslashes to all continued property lines except the last.Apply this fix:
rpk cluster config set \ iceberg_enabled=true \ iceberg_catalog_type=rest \ iceberg_rest_catalog_endpoint=https://glue.<glue-region>.amazonaws.com/iceberg \ iceberg_rest_catalog_authentication_mode=aws_sigv4 \ - iceberg_rest_catalog_base_location=s3://<bucket-name>/<warehouse-path> + iceberg_rest_catalog_base_location=s3://<bucket-name>/<warehouse-path> \ # Set credentials source to config_file if using # iceberg_rest_catalog_aws_access_key and iceberg_rest_catalog_aws_secret_key - iceberg_rest_catalog_credentials_source=config_file - iceberg_rest_catalog_aws_region=<glue-region> - iceberg_rest_catalog_aws_access_key=<glue-access-key> + iceberg_rest_catalog_credentials_source=config_file \ + iceberg_rest_catalog_aws_region=<glue-region> \ + iceberg_rest_catalog_aws_access_key=<glue-access-key> \ iceberg_rest_catalog_aws_secret_key=${secrets.<glue-secret-key-name>}
🧹 Nitpick comments (4)
modules/reference/pages/properties/cluster-properties.adoc (1)
2263-2269: Environment-specific guidance looks good; add a precision about when config_file is requiredThe non-cloud/cloud split is clear. For Cloud, the sentence currently reads as a blanket requirement. To avoid implying that IAM role or STS-based auth is unsupported, tighten the scope to “when providing explicit access/secret keys”.
Proposed tweak:
-ifdef::env-cloud[] -Source of AWS credentials for Iceberg REST catalog SigV4 authentication. If using `iceberg_rest_catalog_aws_access_key` and `iceberg_rest_catalog_aws_secret_key` for Glue catalog authentication, you must set this property to `config_file`. -endif::[] +ifdef::env-cloud[] +Source of AWS credentials for Iceberg REST catalog SigV4 authentication. When you provide explicit keys with +`iceberg_rest_catalog_aws_access_key` and `iceberg_rest_catalog_aws_secret_key` for Glue catalog authentication, set this property to `config_file`. If you use IAM roles or STS, choose the corresponding credentials source instead. +endif::[]modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc (3)
47-52: Manual deletion section: call out topic-level overrides explicitlyThe guidance to set the cluster property
iceberg_deletetofalseis correct for Glue. Add a one-liner noting that a topic-level override (redpanda.iceberg.delete) set totruewill take precedence, so users should unset or set it tofalseon affected topics.Proposed addition:
When `iceberg_delete` is set to `false`, you can delete the Redpanda topic, and then delete the table in AWS Glue and the table data (Parquet) files in the S3 bucket. You are required to delete the table data entirely if you delete an Iceberg topic and then create a new Iceberg topic with the same name. +If any topic explicitly sets the `redpanda.iceberg.delete` topic property to `true`, override it to `false` (or remove the override) to ensure manual deletion behavior applies.
107-113: Scope “config_file” requirement to explicit key usage; mention alternativesGood to emphasize
config_filefor Cloud when keys are provided. To prevent over-constraining Cloud users, add a qualifier and reference other supported sources (for IAM roles/STS).Proposed tweak:
-You must configure credentials for the AWS Glue Data Catalog integration using the following properties: +Configure credentials for the AWS Glue Data Catalog integration as follows. When providing explicit access and secret keys, set:and add a follow-up sentence after the list:
+If you instead use IAM roles or STS in your environment, omit the access/secret keys and set `iceberg_rest_catalog_credentials_source` to the appropriate value (for example, `sts`).
229-234: Minor mismatch: text says 10 rows, query uses LIMIT 1Either change the prose to “1 row” or update the query to
LIMIT 10to match the description. Prefer aligning to 10 for consistency with typical preview queries.-SELECT * FROM "AwsDataCatalog"."redpanda"."<table-name>" limit 1; +SELECT * FROM "AwsDataCatalog"."redpanda"."<table-name>" LIMIT 10;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc(4 hunks)modules/reference/pages/properties/cluster-properties.adoc(1 hunks)
🔇 Additional comments (1)
modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc (1)
126-129: Non-cloud example: good inclusion of iceberg_delete=false for GlueIncluding
iceberg_delete: falseupfront is helpful and aligns with the manual deletion requirement. No changes requested.
|
|
||
| The AWS Glue catalog integration requires Redpanda Iceberg tables to be manually deleted. To manually delete Iceberg tables, you must first set the cluster property config_ref:iceberg_delete,true,properties/cluster-properties[`iceberg_delete`] to `false` when you configure the catalog integration. | ||
|
|
||
| When `iceberg_delete` is set to `false`, you can delete the Redpanda topic, and then delete the table in AWS Glue and the table data (Parquet) files in the S3 bucket. You are required to delete the table data entirely if you delete an Iceberg topic and then create a new Iceberg topic with the same name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then delete the table in AWS Glue and the table data (Parquet) files in the S3 bucket
maybe not too useful to call out parquet specifically, and also applies to metadata files too, so something like:
"...and then delete the table in AWS Glue and the Iceberg metadata and data files in the S3 bucket."
You are required to delete the table data entirely if you delete an Iceberg topic and then create a new Iceberg topic with the same name.
maybe:
"If your intent is to recreate the topic, you are required to delete the table data entirely if you delete an Iceberg topic before recreating the topic."
87f43cc to
23b00ea
Compare
|
|
||
| If you want to use partitioning, you must specify a custom partition specification using your own partition columns (columns that are not nested). | ||
|
|
||
| NOTE: An empty partition spec `()` can cause a known issue in Amazon Redshift that prevents queries from running successfully on the table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should doc this unless we have a way to doc that it's true only for older versions -- there's a fix for it in Redpanda 25.2.2 and 25.1.10 (redpanda-data/redpanda#27205 mentioned in the releases https://github.com/redpanda-data/redpanda/releases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattschumpert @andrwng Should we add something along the lines of
NOTE: In Redpanda versions 25.2.1, and 25.1.9 and earlier, an empty partition spec
()can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, upgrade Redpanda to versions 25.2.2 or 25.1.10 and later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable to me if that's what we do for other bugs (vs version specific docs).
It's also probably worth instructing users to not use an empty partition spec with Glue in these older versions. (or upgrade)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @kbatuigas We also need to add the guidance about the delete policy for the table. Glue users (for the foreseeable future), must configure each topic's policy or the cluster property to NOT delete the table when the topic is deleted, and then take care to delete the table data if desired and not later recreate a new topic of the same name containing different data. https://docs.redpanda.com/current/reference/properties/cluster-properties/#iceberg_delete
Do I have that right @andrwng ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat, and sorry this is likely gonna need another tweak: yesterday we just merged this change redpanda-data/redpanda#27367 that will make Redpanda not use purgeRequested=true with Glue, meaning users will be able to leave iceberg_delete=true and have Redpanda drop the table on topic deletion, though they will need to manually delete the data.
This is going to be backported into the next minors (likely 25.2.3 and 25.1.11).
Feediver1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing one build error:
12:40:15 AM: [04:40:15.327] ERROR (asciidoctor): target of xref not found: reference:properties/broker-properties.adoc#schema_registry_auth_method
but it may have come in via the base branch you used.
f6e7691 to
e767290
Compare
|
|
||
| [NOTE] | ||
| ==== | ||
| In Redpanda versions 25.2.1, and 25.1.9 and earlier, an empty partition spec () can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, specify custom partitioning, or upgrade Redpanda to versions 25.2.2 or 25.1.10 and later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| In Redpanda versions 25.2.1, and 25.1.9 and earlier, an empty partition spec () can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, specify custom partitioning, or upgrade Redpanda to versions 25.2.2 or 25.1.10 and later. | |
| In Redpanda versions 25.2.1, 25.1.9, and earlier, an empty partition spec `()` can cause a known issue that prevents certain engines like Amazon Redshift from successfully querying the table. To resolve this issue, specify custom partitioning or upgrade Redpanda to versions 25.2.2, 25.1.10, or later. |
This is confusing. Is there a version in between 25.1.10 and 25.2.1? Seems possible to upgrade from 25.1.9 or an earlier version to 25.2.1 and find yourself in the same problem. To make it safe and simple, can we just say "In version 25.2.1 and earlier... specify custom partitioning or upgrade Redpanda to version 25.2.2 or later."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@micheleRP There currently aren't any versions between 25.1.10 and 25.2.1, but I don't know if that will change. You're right, it would theoretically be possible to upgrade from <=25.1.9 to 25.2.1 and still run into this issue. I'm not sure if it's more typical to upgrade from a 25.1.x up to 25.2.x, or just keep the upgrade within minor/patch versions... maybe @andrwng has insight. But I think it's probably safest to go with your suggestion for now, so I'll add that in.
| endif::[] | ||
|
|
||
| ifdef::env-cloud[] | ||
| You must allow Redpanda access to AWS Glue services in your AWS account. It is recommended to create a new IAM policy or role that manages access to AWS Glue, allowing all AWS Glue API actions (`"glue:*"`) on the following resources: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| You must allow Redpanda access to AWS Glue services in your AWS account. It is recommended to create a new IAM policy or role that manages access to AWS Glue, allowing all AWS Glue API actions (`"glue:*"`) on the following resources: | |
| You must allow Redpanda access to AWS Glue services in your AWS account. Create a new IAM policy or role that manages access to AWS Glue, allowing all AWS Glue API actions (`"glue:*"`) on the following resources: |
| [,json] | ||
| ---- | ||
| { | ||
| "Version": "2012-10-17", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a comment that this seems very old?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just part of the IAM policy syntax by AWS
…t_catalog_credentials_source to config_file
…lse at the cluster level
Apply suggestion from doc review
Co-authored-by: Michele Cyran <[email protected]>
bbc62e3 to
73fb869
Compare
Description
This pull request updates the AWS Glue integration documentation for Redpanda Iceberg tables. The changes clarify partitioning limitations, add instructions for manual table deletion, enhance IAM policy requirements, and provide more explicit configuration guidance for credentials and deletion behavior.
Documentation improvements for AWS Glue and Iceberg:
iceberg_deletetofalsebefore deleting tables and topics.s3:PutObject,s3:PutObjectAcl,s3:DeleteObject,s3:ListBucket) for Iceberg catalog operations.Configuration guidance updates:
iceberg_rest_catalog_credentials_sourceproperty, specifying when to set it toconfig_file, and updated example configuration blocks accordingly. [1] [2] [3] [4]limit 10instead oflimit 1for better demonstration.Resolves https://redpandadata.atlassian.net/browse/
Review deadline:
Page previews
Query Iceberg Topics using AWS Glue
Checks