Contributor

@hsiang-c hsiang-c commented Jun 30, 2025

Related to: #1533 and #14083

Context

  • We were registering existing Iceberg tables to HiveCatalog and realized that the metadata.json files being registered are deleted when the target namespace doesn't exist.
  • When the target namespace doesn't exist, a ValidationException is thrown from the doCommit() method of HiveTableOperations, as issue #1533 ("Error when creating a table InvalidObjectException") indicates.
  • As a result, the finally block of doCommit() deletes the metadata.json of the existing Iceberg table, which corrupts that table.
    finally {
      HiveOperationsBase.cleanupMetadataAndUnlock(io(), commitStatus, newMetadataLocation, lock);
    }
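
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use Iceberg classes: FakeFileIO, the doCommit signature, and the file path are hypothetical stand-ins that only mirror the shape of the finally block above.

```java
import java.util.HashSet;
import java.util.Set;

public class RegisterCleanupDemo {
  // Hypothetical stand-in for FileIO: only tracks which paths exist.
  static final class FakeFileIO {
    private final Set<String> files = new HashSet<>();

    void write(String path) { files.add(path); }
    void delete(String path) { files.remove(path); }
    boolean exists(String path) { return files.contains(path); }
  }

  // Same shape as the doCommit() described above: cleanup runs in finally
  // whenever the commit did not succeed, regardless of who wrote the file.
  static void doCommit(
      FakeFileIO io, Set<String> namespaces, String namespace, String metadataLocation) {
    boolean committed = false;
    try {
      if (!namespaces.contains(namespace)) {
        throw new IllegalStateException("Namespace does not exist: " + namespace);
      }
      committed = true;
    } finally {
      if (!committed) {
        io.delete(metadataLocation);
      }
    }
  }

  // Registers a pre-existing metadata file into a missing namespace and
  // reports whether the file is still there afterwards.
  static boolean metadataSurvivesFailedRegister() {
    FakeFileIO io = new FakeFileIO();
    String location = "warehouse/db/tbl/metadata/v1.metadata.json"; // hypothetical path
    io.write(location); // written earlier by another process or catalog
    try {
      doCommit(io, new HashSet<>(), "missing_ns", location);
    } catch (IllegalStateException expected) {
      // the commit fails because the namespace is missing
    }
    return io.exists(location);
  }

  public static void main(String[] args) {
    System.out.println("metadata survives: " + metadataSurvivesFailedRegister());
  }
}
```

In this sketch the file does not survive: the cleanup cannot tell a file the commit wrote itself from one that already belonged to another table.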

Proposed changes

  • In BaseMetastoreCatalog and RESTSessionCatalog, the registerTable method now throws NoSuchNamespaceException if the target namespace doesn't exist.
  • The one exception is JdbcCatalog. When configured with jdbc.strict-mode=false, it skips namespace existence checks for createTable and createView, so registerTable follows the same behavior.
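
As a rough illustration of the proposed behavior (not the actual patch), the sketch below checks the namespace before anything is committed and skips the check when strict mode is off. SketchCatalog and its fields are hypothetical, and NoSuchNamespaceException here is a local stand-in for the Iceberg exception of the same name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RegisterTableCheck {
  // Local stand-in for Iceberg's NoSuchNamespaceException.
  static class NoSuchNamespaceException extends RuntimeException {
    NoSuchNamespaceException(String message) { super(message); }
  }

  // Hypothetical catalog that keeps namespaces and tables in memory.
  static final class SketchCatalog {
    private final Set<String> namespaces = new HashSet<>();
    private final Map<String, String> tables = new HashMap<>();
    private final boolean strictMode;

    SketchCatalog(boolean strictMode) { this.strictMode = strictMode; }

    void createNamespace(String namespace) { namespaces.add(namespace); }

    boolean tableExists(String namespace, String table) {
      return tables.containsKey(namespace + "." + table);
    }

    void registerTable(String namespace, String table, String metadataLocation) {
      // Fail before any commit is attempted, so no cleanup path can run.
      if (strictMode && !namespaces.contains(namespace)) {
        throw new NoSuchNamespaceException("Namespace does not exist: " + namespace);
      }
      tables.put(namespace + "." + table, metadataLocation);
    }
  }

  public static void main(String[] args) {
    SketchCatalog strict = new SketchCatalog(true);
    try {
      strict.registerTable("missing_ns", "tbl", "v1.metadata.json");
    } catch (NoSuchNamespaceException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With strictMode set to false, the same call registers the table without requiring the namespace, which matches the non-strict behavior described for createTable and createView.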

Tests

| Catalog Impl. | Covered By |
| --- | --- |
| BigQueryMetastoreCatalog | CatalogTests |
| InMemoryCatalog | CatalogTests |
| RESTCatalog | CatalogTests |
| HiveCatalog | CatalogTests |
| NessieCatalog | CatalogTests |
| JdbcCatalog | TestJdbcCatalog |
| DynamoDbCatalog | TestDynamoDbCatalog |
| EcsCatalog | TestEcsCatalog |
| GlueCatalog | TestGlueCatalog |
| HadoopCatalog | TestHadoopCatalog |
| SnowflakeCatalog | SnowflakeCatalogTest |

  • RESTCompatibilityKitCatalogTests uses a combination of RESTCatalog (client) and JdbcCatalog (server). Therefore, in the iceberg-open-api module only, I turned on strict mode for JdbcCatalog so that client and server behave consistently.
  • The behavior is also tested from the perspective of Spark's RegisterTableProcedure.

@github-actions github-actions bot added the spark label Jun 30, 2025
@hsiang-c hsiang-c changed the title Table registration to nonexistent target namespace leads to metadata deletion in HiveCatalog Spark: Table registration to nonexistent target namespace leads to metadata deletion in HiveCatalog Jun 30, 2025
@hsiang-c hsiang-c changed the title Spark: Table registration to nonexistent target namespace leads to metadata deletion in HiveCatalog Spark: Registering tables to nonexistent target namespace leads to metadata deletion in HiveCatalog Jun 30, 2025
"Cannot handle an empty argument metadata_file");

Catalog icebergCatalog = ((HasIcebergCatalog) tableCatalog()).icebergCatalog();
if (tableName.hasNamespace() && icebergCatalog instanceof SupportsNamespaces) {
Contributor

I don't think this check should be part of the procedure. Instead, such a check should be added in registerTable() in BaseMetastoreCatalog / RESTSessionCatalog. Tests can then be added to CatalogTests for RESTSessionCatalog. Tests for registerTable in BaseMetastoreCatalog are spread across different catalog implementations, so tests should be added there as well.

Contributor Author

@nastra Thank you for your feedback; it makes sense to me.

Contributor Author

@nastra I've addressed your feedback. Please take another look when you have time, thanks.

@github-actions github-actions bot added the build label Jul 7, 2025
@hsiang-c hsiang-c changed the title Spark: Registering tables to nonexistent target namespace leads to metadata deletion in HiveCatalog Core: Registering tables to nonexistent target namespace leads to metadata deletion in HiveCatalog Jul 7, 2025
return PropertyUtil.propertyAsBoolean(
restCatalog.properties(),
RESTCompatibilityKitSuite.RCK_REQUIRES_NAMESPACE_CREATE,
super.requiresNamespaceCreate());
Contributor Author

super.requiresNamespaceCreate() returns false in the parent class.

&& !JdbcUtil.namespaceExists(catalogName, connections, namespace)) {
throw new NoSuchNamespaceException(
- "Cannot create table %s in catalog %s. Namespace %s does not exist",
+ "Cannot create table %s in catalog %s. Namespace does not exist: %s",
Contributor Author

I changed the error message because CatalogTests.tableCreationWithoutNamespace() expects this format.

    assertThatThrownBy(
            () ->
                catalog().buildTable(TableIdentifier.of("non-existing", "table"), SCHEMA).create())
        .isInstanceOf(NoSuchNamespaceException.class)
        .hasMessageContaining("Namespace does not exist: non-existing");

Contributor

dramaticlly commented Jul 16, 2025

> We're registering existing Iceberg tables to HiveCatalog and realize that the metadata.json files used are deleted when the target namespace doesn't exist.

I think another way to look at this problem might be to disable the cleanup for failed commits on table registration.

Today, the TableOperations.commit() method assumes the caller is the one that authored the metadata.json, so it is safe to clean it up on commit failure. For register table, however, the metadata.json is usually written ahead of time (by a separate process, or it may even be in use by another catalog), so cleanup might not be appropriate. There is more on this problem in https://lists.apache.org/thread/b5k7vdng904zr3n3q8wv83y8l30rnd4c

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 16, 2025
@hsiang-c
Contributor Author

Keep alive

@github-actions github-actions bot removed the stale label Aug 19, 2025
return new BaseTable(ops, fullTableName(name(), identifier), metricsReporter());
}

protected void targetNamespaceExists(TableIdentifier identifier) {
Contributor

@nastra nastra Aug 20, 2025

We're adding the namespace check at a central place that now affects all catalogs extending this class. I think the main issue is that not every catalog actually requires an explicit namespace creation before e.g. creating a table inside a namespace or registering a table. That's also why we introduced the requiresNamespaceCreate() flag in the tests.
That being said, I don't think we can actually perform this check here as it should be specific to the respective catalog implementation and depending on whether that catalog implementation actually requires a namespace to be created or not.
For example, the default behavior of the JDBC catalog is to not require a namespace to exist when you create a table. That means registering a table should also not require that namespace to exist beforehand.
However, if you configure strict-mode, then namespace existence is required, meaning that the targetNamespaceExists check should only be performed when strict-mode is on.

Contributor Author

@hsiang-c hsiang-c Aug 20, 2025

@nastra Thank you for the feedback; I agree with you.

> That's also why we introduced the requiresNamespaceCreate() flag in the tests.

I liked this idea when I was working on the tests.

How about we make the requiresNamespaceCreate() a default method (return false by default) in SupportsNamespaces?

Doing so:

  1. Makes the requirement explicit per implementation, i.e. promotes these semantics from src/test to src/main.
  2. Skips targetNamespaceExists check if the catalog implementation doesn't require it.
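
A minimal sketch of this proposal, assuming a hypothetical NamespaceAware interface in place of the real SupportsNamespaces (whose other methods are omitted). The mayRegister and emptyCatalog helpers are illustrative names, not existing Iceberg APIs.

```java
public class NamespaceRequirementSketch {
  // Hypothetical slice of SupportsNamespaces with the proposed default method.
  interface NamespaceAware {
    boolean namespaceExists(String namespace);

    // Proposed default: a catalog does not require explicit namespace creation.
    default boolean requiresNamespaceCreate() {
      return false;
    }
  }

  // Returns true when registration may proceed for the given namespace:
  // either the catalog does not require the namespace, or it exists.
  static boolean mayRegister(NamespaceAware catalog, String namespace) {
    return !catalog.requiresNamespaceCreate() || catalog.namespaceExists(namespace);
  }

  // Builds a catalog with no namespaces and the given requirement flag.
  static NamespaceAware emptyCatalog(boolean requiresCreate) {
    return new NamespaceAware() {
      @Override
      public boolean namespaceExists(String namespace) {
        return false;
      }

      @Override
      public boolean requiresNamespaceCreate() {
        return requiresCreate;
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(mayRegister(emptyCatalog(false), "a"));
    System.out.println(mayRegister(emptyCatalog(true), "a"));
  }
}
```

Because the default returns false, existing implementations would keep their current behavior unless they opt in by overriding the method.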


@Test
public void testRegisterTable() {
ecsCatalog.createNamespace(Namespace.of("a"));
Contributor

Can you elaborate on why this change is needed?

Contributor Author

@hsiang-c hsiang-c Aug 20, 2025

This is because:

  1. EcsCatalog implements SupportsNamespaces, and the targetNamespaceExists check I introduced in this PR requires an existing namespace.
  2. TestEcsCatalog doesn't extend CatalogTests<EcsCatalog>, so it doesn't know about the requiresNamespaceCreate method. Unlike testRegisterTable in CatalogTests, this test didn't create a namespace beforehand, so I create it here.

// From CatalogTests
  @Test
  public void testRegisterTable() {
    C catalog = catalog();

    if (requiresNamespaceCreate()) {
      catalog.createNamespace(TABLE.namespace());
    }
    // omitted
  }

@stevenzwu
Contributor

It seems reasonable to throw NoSuchNamespaceException in this case for metastore catalogs.

I am also thinking about @dramaticlly's suggestion of not cleaning up metadata files for registerTable. It could probably help avoid corruption for other types of failures too?

@hsiang-c
Contributor Author

@stevenzwu Thank you for your review.

> not cleaning up metadata files for registerTable. It can probably help avoid corruption for other types of failures?

I agree with you and @dramaticlly; doing so would be helpful during catalog migration.

@hantangwangd
Contributor

@hsiang-c thanks for the link to PR #14083! It allowed me to catch up on the discussion here. I encountered a similar issue in a concurrent environment where the HiveCatalog is configured with locking disabled. While one operation is committing a registerTable, another concurrent operation happens to commit a create-table for the same target table name. At this point, the registerTable operation fails with an AlreadyExistsException, and the source table becomes corrupted. This situation (perhaps along with some other cases) does not seem to be entirely avoidable through pre-checks.

In PR #14083, I implemented a solution aligned with the approach mentioned above. That is, the metadata file is only deleted upon failure if it was newly created by the register table operation. Could you please take a look when you have a moment @nastra @dramaticlly @stevenzwu @hsiang-c, thanks a lot!
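
The idea described here (delete the metadata file on failure only if this operation created it) can be sketched as follows. This is not the code from PR #14083: FakeFileIO, the commit signature, and the writtenByThisCommit flag are hypothetical names chosen for illustration, and the failure is simulated.

```java
import java.util.HashSet;
import java.util.Set;

public class GuardedCleanup {
  // Hypothetical stand-in for FileIO: only tracks which paths exist.
  static final class FakeFileIO {
    private final Set<String> files = new HashSet<>();

    void write(String path) { files.add(path); }
    void delete(String path) { files.remove(path); }
    boolean exists(String path) { return files.contains(path); }
  }

  // Attempts a commit; on failure, deletes the metadata file only when this
  // operation wrote it. Returns whether the commit succeeded.
  static boolean commit(
      FakeFileIO io, String metadataLocation, boolean writtenByThisCommit, boolean simulateFailure) {
    boolean committed = false;
    try {
      if (simulateFailure) {
        // Stands in for a failure such as a concurrent create of the same table.
        throw new IllegalStateException("simulated commit failure");
      }
      committed = true;
    } catch (IllegalStateException e) {
      // Swallowed in this sketch so the caller can inspect the file state.
    } finally {
      if (!committed && writtenByThisCommit) {
        io.delete(metadataLocation);
      }
    }
    return committed;
  }

  public static void main(String[] args) {
    FakeFileIO io = new FakeFileIO();
    String existing = "warehouse/db/tbl/metadata/v1.metadata.json"; // hypothetical path
    io.write(existing);
    // registerTable case: the file was not written by this commit.
    commit(io, existing, false, true);
    System.out.println("registered metadata kept: " + io.exists(existing));
  }
}
```

A guard like this would cover failures that a namespace pre-check cannot anticipate, such as the concurrent create-table case above.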

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 17, 2025
@hsiang-c
Contributor Author

keep alive

@github-actions github-actions bot removed the stale label Oct 18, 2025
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 17, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Nov 24, 2025