(NOT_READY) Fixed IcebergCatalog stale FileIO after metadata refresh #3494
MonkeyCanCode wants to merge 1 commit into apache:main
Conversation
tableFileIO =
    loadFileIOForTableLike(
        tableIdentifier,
        StorageUtil.getLocationsUsedByTable(currentMetadata),
        resolvedEntities,
        new HashMap<>(currentMetadata.properties()),
        Set.of(
            PolarisStorageActions.READ,
            PolarisStorageActions.WRITE,
            PolarisStorageActions.LIST));
Why do we request R/W access for FileIO in a "refresh" method?
Having to reset tableFileIO in this context sounds like a design issue to me. I'd think access level (Read / Write / List) should be decided per use case (load vs. commit, etc.).
WDYT?
Yeah, resetting the permissions here does feel a bit odd, so we may need to do some refactoring to fix this bug. Based on my understanding, the issue the reporter faced was that the AWS assumeRole credentials became stale after table creation due to the cached FileIO, so a quick "workaround" here was to reset it to full access, which then allowed the reporter to continue the test and validate the root cause. If we do want to keep the FileIO cached, we may need separate reader/writer FileIO instances to follow the least-privilege pattern. Another alternative would be creating a new FileIO per operation with the right permissions for each (rough sketch of that second option below). Any preferred route or better solutions?
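Something along these lines is what I mean by the per-operation option. This is only a sketch against the existing loadFileIOForTableLike call; the refreshIO/commitIO split and where exactly these calls would live are hypothetical:

// Hypothetical sketch: build a short-lived FileIO per operation, requesting
// only the storage actions that operation needs, instead of keeping one
// cached READ/WRITE/LIST FileIO on the catalog.

// Metadata refresh / load path: read-only access is enough.
FileIO refreshIO =
    loadFileIOForTableLike(
        tableIdentifier,
        StorageUtil.getLocationsUsedByTable(currentMetadata),
        resolvedEntities,
        new HashMap<>(currentMetadata.properties()),
        Set.of(PolarisStorageActions.READ));

// Commit / write path: request write access only here, so fresh credentials
// are resolved at the point they are actually used.
FileIO commitIO =
    loadFileIOForTableLike(
        tableIdentifier,
        StorageUtil.getLocationsUsedByTable(currentMetadata),
        resolvedEntities,
        new HashMap<>(currentMetadata.properties()),
        Set.of(
            PolarisStorageActions.READ,
            PolarisStorageActions.WRITE,
            PolarisStorageActions.LIST));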
I personally need to think more about this. I'm not totally familiar with this code.
If someone else is available for a review - please comment.
@fabio-rizzo-01 do you mind taking a look?
I'm leaning towards option 2. Do we know how often this FileIO instance is actually reused at runtime?
Sorry, I'm not very familiar with this area of the code myself, but I imagine every new REST request gets a new FileIO anyway, so reuse is limited to one request, I guess 🤔
I think so. I am also a bit new to this specific code path, but the cached FileIO appears to be the problem in this case. Let me update it to use option 2 and avoid the cache.
So I tried to implement the fix, but it ends up being a major refactor, and we may need a delegating FileIO class.
Based on my understanding, the issue is specific to that setup (I also tried to reproduce a similar setup locally for validation, but no luck so far since neither the setup details nor a minimal reproducible case has been shared): after Spark uses its read-only credential with assume-role privilege to assume the role associated with the catalog, it is unable to use the STS token for the insert and falls back to the client id/secret associated with the Spark client. In this case, that can only happen after the metadata refresh (which, as far as I understand, has to be called before the insert can happen). Spark then uses the FileIO returned from io(), which may have lost write access (which is why the initial fix worked).
However, for the setup I have on AWS, the client role never has direct write access to the bucket and always uses assume-role with the IAM role associated with the catalog for all operations. So I believe this issue may be very setup-specific.
I am a bit hesitant to do the major refactor for a delegating FileIO (so permissions can be determined based on context, e.g. read-only for io().newInputFile() and read/write for io().newOutputFile()), as I don't have a setup I can test against to validate it, and future refactors could silently break this without proper test cases/setups. A rough sketch of what I have in mind is below.
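To make the delegating idea concrete, it would be roughly along these lines. PermissionScopedFileIO and the two suppliers are hypothetical names, not existing classes, and a real version would need to cover the rest of the FileIO surface:

import java.util.function.Supplier;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Hypothetical sketch of a delegating FileIO: input files go through a
// read-only FileIO, while output files and deletes go through a read/write
// FileIO, so each code path only carries the privileges it actually needs.
class PermissionScopedFileIO implements FileIO {
  private final Supplier<FileIO> readIO;   // e.g. loaded with READ only
  private final Supplier<FileIO> writeIO;  // e.g. loaded with READ/WRITE/LIST

  PermissionScopedFileIO(Supplier<FileIO> readIO, Supplier<FileIO> writeIO) {
    this.readIO = readIO;
    this.writeIO = writeIO;
  }

  @Override
  public InputFile newInputFile(String path) {
    return readIO.get().newInputFile(path);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return writeIO.get().newOutputFile(path);
  }

  @Override
  public void deleteFile(String path) {
    writeIO.get().deleteFile(path);
  }
}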
Maybe we should have someone who is more familiar with this code path take another look in case I missed something here. cc @dimas-b @evindj @flyrain
@MonkeyCanCode : Thanks for the investigation!
From my side, I'm willing to look deeper into refactoring this code, but I do not really understand the problem (yet) 😅
Do you have repro instructions handy? Could you post them here or summarize in #3440 ?
Unfortunately, we don't have a reproducible case, hence the hesitation about a major refactor. Details added in #3440. Thanks for looking into this. I will set this as not ready for review.
This PR addresses one of the issues reported in #3440, where the Spark client is able to use the STS temporary credential to create a table but does not use it for other S3 I/O, such as the insert on the following call, when authenticated via AssumeRole. It doesn't appear we have any setup or test cases for actually validating this workflow, so the validation was done by the reporter.
Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed)