2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -84,6 +84,8 @@ request adding CHANGELOG notes for breaking (!) changes and possibly other secti

### Fixes

- Fixed IcebergCatalog stale FileIO after metadata refresh, resolving AWS assumeRole permission errors.

### Commits

## [1.2.0-incubating]
@@ -1409,6 +1409,22 @@ public void doRefresh() {
Set.of(PolarisStorageActions.READ, PolarisStorageActions.LIST));
return TableMetadataParser.read(fileIO, metadataLocation);
});

// After a refresh, re-load the FileIO with the new table metadata properties to
// ensure the right permissions are present for subsequent file system interactions.
if (currentMetadata != null) {
tableFileIO =
loadFileIOForTableLike(
tableIdentifier,
StorageUtil.getLocationsUsedByTable(currentMetadata),
resolvedEntities,
new HashMap<>(currentMetadata.properties()),
Set.of(
PolarisStorageActions.READ,
PolarisStorageActions.WRITE,
PolarisStorageActions.LIST));
Contributor commented on lines +1416 to +1425:
Why do we request R/W access for FileIO in a "refresh" method?

Having to reset tableFileIO in this context sounds like a design issue to me. I'd think access level (Read / Write / List) should be decided per use case (load vs. commit, etc.).

WDYT?

Contributor (author) replied:

Yeah, resetting the permissions here does feel a bit odd, so we may need some refactoring to fix this bug properly. Based on my understanding, the issue the reporter faced was that the AWS assumeRole credentials became stale after table creation because of the cached FileIO; the quick "workaround" here is to reset it to full access, which allowed the reporter to continue the test and validated the root cause. If we do want to keep the FileIO cached, we may need separate reader/writer FileIO instances to follow the least-privilege pattern. Another alternative would be creating a new FileIO per operation with the right permissions for each. Any preferred route, or better solutions?
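The least-privilege split could be sketched roughly like this; `StorageAction` mirrors the `PolarisStorageActions` values in the diff above, while `FileIOScopes` is a hypothetical helper name used only for illustration, not an actual Polaris class:

```java
import java.util.EnumSet;
import java.util.Set;

// StorageAction mirrors PolarisStorageActions from the diff; FileIOScopes is a
// hypothetical helper illustrating per-use-case scopes, not an actual Polaris class.
enum StorageAction { READ, WRITE, LIST }

final class FileIOScopes {
  // Read paths (doRefresh, loadTable) only need READ + LIST.
  static Set<StorageAction> forRead() {
    return EnumSet.of(StorageAction.READ, StorageAction.LIST);
  }

  // Commit paths (doCommit) additionally need WRITE.
  static Set<StorageAction> forCommit() {
    return EnumSet.of(StorageAction.READ, StorageAction.WRITE, StorageAction.LIST);
  }
}
```

With scopes derived from the call site like this, the refresh path would never need to request WRITE in the first place.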

Contributor replied:

I personally need to think more about this. I'm not totally familiar with this code.

If someone else is available for a review - please comment.

Contributor (author) replied:

@fabio-rizzo-01 do you mind taking a look?

Contributor replied:

@dennishuo : WDYT?

Contributor @dimas-b replied (Jan 23, 2026):

I'm leaning towards option 2. Do we know how often this FileIO instance is actually reused at runtime?

Sorry, I'm not very familiar with this area of the code myself, but I imagine every new REST request gets a new FileIO anyway, so reuse is limited to one request, I guess 🤔

Contributor (author) replied:

I think so. I am also a bit new to this specific code path, but the cached FileIO appears to be the problem here. Let me update to use option 2 and avoid the cache.

Contributor (author) replied:

I tried to implement the fix, but it ends up being a major refactor, and we may need a delegate FileIO class.

Based on my understanding, the issue for that specific setup is this (I also tried to reproduce a similar setup locally for validation, but no luck so far, since the setup details were not shared and there is no minimal reproducer): after Spark uses its read-only credential with assume-role privileges to assume the role associated with the catalog, it cannot use the STS token for the insert, so it falls back to the client id/secret associated with the Spark client. In this case, that can only happen after the metadata refresh (which needs to be called before the insert can happen, based on my understanding). Spark then uses the FileIO returned from io(), which can have lost write access (which is why the initial fix worked).

However, in the setup I have on AWS, the client role never has write access to the bucket directly; it uses assume-role with the IAM role associated with the catalog for all operations. So I believe this issue may be very setup-specific.

I am a bit hesitant to do the major refactor for a delegate FileIO (so permissions can be determined based on context, e.g. read-only for io().newInputFile() and read-write for io().newOutputFile()), because I don't have a setup I can use to validate it, and future refactors could silently break this again without proper test cases/setups.

Maybe someone who is more familiar with this code path should take another look in case I missed something here. cc @dimas-b @evindj @flyrain
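For illustration, the delegate-FileIO idea could look roughly like the sketch below. The `FileIO`/`InputFile`/`OutputFile` interfaces are simplified stand-ins for Iceberg's (so the example is self-contained), and `PermissionDelegatingFileIO`/`TaggedFileIO` are hypothetical names:

```java
// Simplified stand-ins for Iceberg's FileIO interfaces, to keep the sketch
// self-contained. Real code would implement org.apache.iceberg.io.FileIO.
interface InputFile { String location(); }
interface OutputFile { String location(); }

interface FileIO {
  InputFile newInputFile(String path);
  OutputFile newOutputFile(String path);
}

// Hypothetical delegate: route input-file calls to a read-scoped FileIO and
// output-file calls to a write-scoped one, so privileges follow the operation.
final class PermissionDelegatingFileIO implements FileIO {
  private final FileIO readIo;  // loaded with READ/LIST storage actions
  private final FileIO writeIo; // loaded with READ/WRITE/LIST storage actions

  PermissionDelegatingFileIO(FileIO readIo, FileIO writeIo) {
    this.readIo = readIo;
    this.writeIo = writeIo;
  }

  @Override
  public InputFile newInputFile(String path) {
    // Reads never require write credentials.
    return readIo.newInputFile(path);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    // Writes get the broader scope, vended only when actually writing.
    return writeIo.newOutputFile(path);
  }
}

// Trivial stub used to demonstrate the routing.
final class TaggedFileIO implements FileIO {
  private final String tag;
  TaggedFileIO(String tag) { this.tag = tag; }
  @Override public InputFile newInputFile(String path) { return () -> tag + ":" + path; }
  @Override public OutputFile newOutputFile(String path) { return () -> tag + ":" + path; }
}
```

This keeps a single cached `io()` instance from the caller's point of view while still honoring least privilege per call, at the cost of vending two credential scopes.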

Contributor replied:

@MonkeyCanCode : Thanks for the investigation!

From my side, I'm willing to look deeper into refactoring this code, but I do not really understand the problem (yet) 😅

Do you have repro instructions handy? Could you post them here or summarize in #3440 ?

Contributor (author) replied:

> @MonkeyCanCode : Thanks for the investigation!
>
> From my side, I'm willing to look deeper into refactoring this code, but I do not really understand the problem (yet) 😅
>
> Do you have repro instructions handy? Could you post them here or summarize in #3440 ?

Unfortunately, we don't have a reproducer, hence the hesitation about a major refactor. Details added in #3440. Thanks for looking into this. I will set this as not ready for review.

}

polarisEventListener.onEvent(
new PolarisEvent(
PolarisEventType.AFTER_REFRESH_TABLE,
@@ -114,6 +114,7 @@
import org.apache.polaris.core.identity.provider.ServiceIdentityProvider;
import org.apache.polaris.core.persistence.MetaStoreManagerFactory;
import org.apache.polaris.core.persistence.PolarisMetaStoreManager;
import org.apache.polaris.core.persistence.PolarisResolvedPathWrapper;
import org.apache.polaris.core.persistence.cache.EntityCache;
import org.apache.polaris.core.persistence.dao.entity.BaseResult;
import org.apache.polaris.core.persistence.dao.entity.EntityResult;
@@ -124,6 +125,7 @@
import org.apache.polaris.core.persistence.resolver.ResolverFactory;
import org.apache.polaris.core.secrets.UserSecretsManager;
import org.apache.polaris.core.storage.CredentialVendingContext;
import org.apache.polaris.core.storage.PolarisStorageActions;
import org.apache.polaris.core.storage.PolarisStorageIntegration;
import org.apache.polaris.core.storage.PolarisStorageIntegrationProvider;
import org.apache.polaris.core.storage.StorageAccessConfig;
@@ -2503,4 +2505,66 @@ public void testPaginatedListNamespaces() {
}
}
}

@Test
public void testFileIOIsRefreshedOnTableRefresh() {
// Build the catalog with a spied provider so storage-access-config calls can be verified
StorageAccessConfigProvider spiedProvider = spy(storageAccessConfigProvider);
PolarisPassthroughResolutionView passthroughView =
new PolarisPassthroughResolutionView(
resolutionManifestFactory, authenticatedRoot, CATALOG_NAME);
TaskExecutor taskExecutor = Mockito.mock(TaskExecutor.class);
IcebergCatalog catalog =
new IcebergCatalog(
diagServices,
resolverFactory,
metaStoreManager,
polarisContext,
passthroughView,
authenticatedRoot,
taskExecutor,
spiedProvider,
fileIOFactory,
polarisEventListener,
eventMetadataFactory);
catalog.setCatalogFileIo(new InMemoryFileIO());
catalog.initialize(
CATALOG_NAME,
Map.of(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.inmemory.InMemoryFileIO"));

// Create a table
TableIdentifier tableIdentifier = TableIdentifier.of(NS, "refresh_test_table");
if (requiresNamespaceCreate()) {
catalog.createNamespace(NS);
}
catalog.buildTable(tableIdentifier, SCHEMA).create();

// Get the table operations
IcebergCatalog.BasePolarisTableOperations operations =
(IcebergCatalog.BasePolarisTableOperations)
((BaseTable) catalog.loadTable(tableIdentifier)).operations();

// Verify initial state
assertThat(operations.io()).isNotNull();

// refresh the table
operations.refresh();

// Verify that getStorageAccessConfig was called with WRITE permissions at least twice:
// 1. during table creation
// 2. during table refresh, to update the FileIO with WRITE permissions
Mockito.verify(spiedProvider, Mockito.atLeast(2))
.getStorageAccessConfig(
Mockito.eq(tableIdentifier),
Mockito.anySet(),
Mockito.argThat(
actions ->
actions.containsAll(
Set.of(
PolarisStorageActions.READ,
PolarisStorageActions.WRITE,
PolarisStorageActions.LIST))),
Mockito.any(),
Mockito.any(PolarisResolvedPathWrapper.class));
}
}
@@ -177,9 +177,10 @@ public void testLoadFileIOForCleanupTask(String scheme) {
.isInstanceOf(InMemoryFileIO.class);

  // 1. BasePolarisCatalog:doCommit: for writing the table during the creation
- // 2. BasePolarisCatalog:doRefresh: for reading the table during the drop
+ // 2. BasePolarisCatalog:doRefresh: for reading the table during the drop (2 calls: 1 to read, 1
+ // to reload)
  // 3. TaskFileIOSupplier:apply: for clean up metadata files and merge files
- Mockito.verify(testServices.fileIOFactory(), Mockito.times(3))
+ Mockito.verify(testServices.fileIOFactory(), Mockito.times(4))
  .loadFileIO(Mockito.any(), Mockito.any(), Mockito.any());
}
