Skip to content

Conversation

@omarsmak
Copy link
Member

Currently if the user has set both configs in Hive iceberg.catalog.<catalog_name\>.type=hive|hadoop and iceberg.catalog.<catalog_name\>.catalog-impl=CustomCatalog, Iceberg will regardless use the type to perform some operations in HiveIcebergMetaHook that may introduce inconsistencies that are incompatible with the set custom catalog. For example, if type is set to hive and we have catalog-impl set to GlueCatalog,HiveIcebergMetaHook will perform operations based on hive catalog which are incompatible with the custom catalog GlueCatalog.
In order to mediate the issue, in this PR I am returning null in iceberg.catalog.<catalog_name\>.type in case we have iceberg.catalog.<catalog_name\>.catalog-impl is set in order to avoid such cases as described above.

Note: The behavior in regards to legacy configs is remained unchanged.

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes make sense to me, but I'm not a Hive expert, so maybe @rdblue or @pvary could review this one here as well?

@pvary
Copy link
Contributor

pvary commented Sep 21, 2021

For me if someone sets both iceberg.catalog.<catalog_name>.type and iceberg.catalog.<catalog_name>.catalog-impl is a wrong configuration. I think we should throw an exception instead of trying to define a precedence between the two.

What do you think?

@omarsmak
Copy link
Member Author

For me if someone sets both iceberg.catalog.<catalog_name>.type and iceberg.catalog.<catalog_name>.catalog-impl is a wrong configuration. I think we should throw an exception instead of trying to define a precedence between the two.

What do you think?

I thought about it for a bit and I think it is actually a good idea as well to throw an exception, it has an advantage of being explicit to user and thus, the user is educated on how these config work. I will update the logic shortly. Thanks for the feedback!

@omarsmak
Copy link
Member Author

@pvary I have changed the logic a bit to throw an exception instead. Please take a look at it and let me know what you think.

| --------------------------------------------- | ------------------------------------------------------ |
| iceberg.catalog.<catalog_name\>.type | type of catalog: `hive` or `hadoop` |
| iceberg.catalog.<catalog_name\>.catalog-impl | catalog implementation, must not be null if type is null |
| iceberg.catalog.<catalog_name\>.type | type of catalog: `hive`, `hadoop` or empty if `catalog-impl` will be set. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're throwing an exception, I think this change can be rolled back

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is necessary since this is the original intention to leave the type empty in case the catalog-impl will be set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking notes:

  1. I agree that this can mostly be rolled back. I would say that type could be "hive or hadoop, or left unset if using a custom catalog".

@kbendick
Copy link
Contributor

This brings up a larger issue (that is outside of the scope of this PR) that @kainoa21 ecently brought up recently on Slack and elsewhere, that we also experience when using GlueCatalog with Flink - that the custom catalog-impl's don't necessarily neatly fall into the Hadoop or Hive mindset.

In the case he brought up, he wanted to be able to use GlueCatalog in an AWS managed environment without having to have hadoop dependencies on the classpath. The only reason he needed them there was to satisfy some unused code constraints.

He felt that possibly it was a bit of a leaky abstraction in some places. His feeling was that maybe there should be a GlueCatalogFileIO or possibly even multiple possible GlueCatalogFileIO types - depending if one wanted to use Hive or not etc.

As this seems to be an offshoot of the same problem, tagging him and linking the issue for future reference: #3044

This is outside of the scope of this PR and I don't mean to block this PR, but this might be something we want to consider during V3 planning / as part of a larger scope of things as we do seem to be encountering some issues with what I'll call the well known "custom" catalogs? cc @rdblue (in case Jason doesn't come on Github as much) and @jackye1995 who was also working on this problem at the time.

We can continue this conversation elsewhere, sorry to thread-jack / PR-jack the discussion!

} else {
String name = catalogName == null ? ICEBERG_DEFAULT_CATALOG_NAME : catalogName;

String catalogImpl = conf.get(InputFormatConfig.catalogPropertyConfigKey(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of doing the check here it's simpler to do the check in addCatalogPropertiesIfMissing before we set CatalogUtil.ICEBERG_CATALOG_TYPE to something. By doing that, you do not need to get the actual impl like this and you can just check catalogProperties.containsKey(CatalogProperties.CATALOG_IMPL). This also allows compatibility with legacy catalog impl Hadoop config key.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! I will change this then and add one more test for legacy configs

@jackye1995
Copy link
Contributor

I also agree that this should throw an exception instead of setting null. Could you update the title to reflect that?

I think we should also do a similar check in Spark and Flink code path to make sure we do not have the same issue, could you also add those checks?

@omarsmak omarsmak changed the title Hive: Set catalog type to null if catalog-impl is set Hive: Thrown an exception if both catalog type and catalog-impl are set Sep 22, 2021
@omarsmak omarsmak changed the title Hive: Thrown an exception if both catalog type and catalog-impl are set Hive: Throw an exception if both catalog type and catalog-impl are set Sep 22, 2021
@omarsmak
Copy link
Member Author

I also agree that this should throw an exception instead of setting null. Could you update the title to reflect that?

I think we should also do a similar check in Spark and Flink code path to make sure we do not have the same issue, could you also add those checks?

I could do that, but would it make sense to keep this PR focused on Hive and raise a new one for Spark and Flink?

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@omarsmak, can the check be done in CatalogUtil instead of here? That way it would apply everywhere and not just for Hive.

@omarsmak omarsmak force-pushed the fix-catalog-type-if-custom-set branch 2 times, most recently from 24680bc to 819374c Compare October 11, 2021 15:01
@github-actions github-actions bot added the core label Oct 11, 2021
@omarsmak omarsmak force-pushed the fix-catalog-type-if-custom-set branch from 819374c to 947a952 Compare October 11, 2021 15:09
@omarsmak omarsmak changed the title Hive: Throw an exception if both catalog type and catalog-impl are set Core: Throw an exception if both catalog type and catalog-impl are set Oct 11, 2021
@omarsmak
Copy link
Member Author

@rdblue I have moved the logic to CatalogUtil instead of Catalog. This will work for both Spark and Hive. For Flink logic will be very similar to CatalogUtil but in FlinkCatalogFactory. If you think what I did here it makes sense for Hive and Spark, I will add Flink as well with the same logic .

@omarsmak omarsmak force-pushed the fix-catalog-type-if-custom-set branch from 947a952 to 45a301b Compare October 11, 2021 16:58
@omarsmak omarsmak requested a review from rdblue October 11, 2021 16:59
@omarsmak
Copy link
Member Author

@rdblue please take a look at it once again, I hope things now are in place :)

@omarsmak
Copy link
Member Author

Some tests are failing, I will need to check tomorrow why they are failing

@omarsmak omarsmak force-pushed the fix-catalog-type-if-custom-set branch from 45a301b to 79f0d46 Compare October 11, 2021 17:49
Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now! I'll merge when tests are passing.

@omarsmak omarsmak force-pushed the fix-catalog-type-if-custom-set branch from 79f0d46 to 0a3666c Compare October 12, 2021 10:28
@omarsmak omarsmak force-pushed the fix-catalog-type-if-custom-set branch from 0a3666c to 90ecfdd Compare October 12, 2021 11:52
@omarsmak
Copy link
Member Author

@rdblue CI now looks green.

@rdblue rdblue merged commit c5eeb97 into apache:master Oct 12, 2021
@rdblue
Copy link
Contributor

rdblue commented Oct 12, 2021

Thanks, @omarsmak!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants