-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Allow table defaults to be configured and/ or enforced at catalog level using catalog properties. #4011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jackye1995 @rdblue @anuragmantri can you help review this, thanks |
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
|
I took a quick look. I think the idea is reasonable, but the implementation needs to be changed:
|
…el using catalog properties.
5c10b26 to
add90e0
Compare
|
@rdblue thanks for the review, updated to keep changes to hive and hadoop catalogs. |
|
Thanks for the quick turn-around @SinghAsDev! I'll take another look, hopefully today. |
|
Hi @rdblue just wanted to bump this and see if you would have cycles to review this. Thanks! |
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java
Outdated
Show resolved
Hide resolved
|
@SinghAsDev As discussed over Slack, this feature would benefit many EMR cluster users. I wanted to check with you if we can fast track this PR. In case you are keeping busy, can I collaborate with you to close this PR (as a co-author)? |
|
Hey @rajarshisarkar so sorry for not being able to spend time on this 😞. I think if you have time I would appreciate the help to help close this, thanks. If you don't get time either, I can try to get to it by end of this week. |
|
Hi @SinghAsDev, I am happy to take this PR to closure. Can you please give access to your fork so that I can push new commits in your branch. |
|
Thanks, @SinghAsDev for providing the access! I will resolve the pending review comments. |
|
Ping @rdblue @kbendick @singhpk234 for review. |
| public void testTablePropsDefinedAtCatalogLevel() { | ||
| String namespace = createNamespace(); | ||
| String tableName = getRandomName(); | ||
| TableIdentifier tableIdent = TableIdentifier.of(namespace, tableName); | ||
| ImmutableMap<String, String> catalogProps = ImmutableMap.of( | ||
| "table-default.key1", "catalog-default-key1", | ||
| "table-default.key2", "catalog-default-key2", | ||
| "table-default.key3", "catalog-default-key3", | ||
| "table-override.key3", "catalog-override-key3", | ||
| "table-override.key4", "catalog-override-key4", | ||
| "warehouse", "s3://" + testBucketName + "/" + testPathPrefix); | ||
|
|
||
| glueCatalog.initialize("glue", catalogProps); | ||
|
|
||
| Schema schema = new Schema( | ||
| NestedField.required(3, "id", Types.IntegerType.get(), "unique ID"), | ||
| NestedField.required(4, "data", Types.StringType.get()) | ||
| ); | ||
|
|
||
| org.apache.iceberg.Table table = glueCatalog.buildTable(tableIdent, schema) | ||
| .withProperty("key2", "table-key2") | ||
| .withProperty("key3", "table-key3") | ||
| .withProperty("key5", "table-key5") | ||
| .create(); | ||
|
|
||
| Assert.assertEquals( | ||
| "Table defaults set for the catalog must be added to the table properties.", | ||
| "catalog-default-key1", | ||
| table.properties().get("key1")); | ||
| Assert.assertEquals( | ||
| "Table property must override table default properties set at catalog level.", | ||
| "table-key2", | ||
| table.properties().get("key2")); | ||
| Assert.assertEquals( | ||
| "Table property override set at catalog level must override table default" + | ||
| " properties set at catalog level and table property specified.", | ||
| "catalog-override-key3", | ||
| table.properties().get("key3")); | ||
| Assert.assertEquals( | ||
| "Table override not in table props or defaults should be added to table properties", | ||
| "catalog-override-key4", | ||
| table.properties().get("key4")); | ||
| Assert.assertEquals( | ||
| "Table properties without any catalog level default or override should be added to table" + | ||
| " properties.", | ||
| "table-key5", | ||
| table.properties().get("key5")); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move this test to TestGlueCatalog instead, as at present we don't run integ test as part of our GA. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently in TestGlueCatalog we mock objects of GlueCatalog and GlueClient. The class is mainly used for creating/listing/dropping/renaming namespace/tables by mocking the response. Also, it looks like we cannot mock S3 Client in S3FileIO directly from TestGlueCatalog. Do let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally really don't like working with mocks, so I understand what you mean. I would consider adding some test that does get run through the CI test suite, to ensure that any changes that are made in the API are made against the Glue catalog (outside of compiler related errors).
As it is true that the AWS integration tests aren't run on any schedule within the open source community (that I'm aware of) due to lack of infra.
That said, I wouldn't consider it a hard requirement given that HadoopCatalog and HiveCatalog both have tests. Maybe adding GlueCatalog to the existing CatalogTests suite would be beneficial in the long run? Something to consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally really don't like working with mocks, so I understand what you mean
Yes, one more limitation here with mocking the glue client is we cannot mock the glue table response with a path for the BaseMetastoreTableOperations.METADATA_LOCATION_PROP property (as it starts checking for the existence of the file while refreshing the metadata in BaseMetastoreTableOperations).
I would consider adding some test that does get run through the CI test suite, to ensure that any changes that are made in the API are made against the Glue catalog
I have added a test which asserts that the properties API should not return an empty map from Glue Catalog when the catalog properties are initialised. The processing of the properties in BaseMetastoreCatalog is validated by HadoopCatalog and HiveCatalog. Meanwhile, TestGlueCatalogTable should check the actual behaviour for GlueCatalog.
Maybe adding GlueCatalog to the existing CatalogTests suite would be beneficial in the long run? Something to consider.
Yes, I think adding GlueCatalog to the existing CatalogTests suite would be beneficial. I can publish a separate PR for the same.
Please let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, one more limitation here with mocking the glue client is we cannot mock the glue table response with a path for the BaseMetastoreTableOperations.METADATA_LOCATION_PROP property (as it starts checking for the existence of the file while refreshing the metadata in BaseMetastoreTableOperations)
+1 understood. Wondering if we should comment this somewhere. I suppose a standalone comment on this PR is sufficient (or maybe in the final squashed git PR summarty a small mention of that in the extended message?). But overall +1 understood. Mocks can't solve everything.
I have added a test which asserts that the properties API should not return an empty map from Glue Catalog when the catalog properties are initialised. The processing of the properties in BaseMetastoreCatalog is validated by HadoopCatalog and HiveCatalog. Meanwhile, TestGlueCatalogTable should check the actual behaviour for GlueCatalog.
+1
Yes, I think adding GlueCatalog to the existing CatalogTests suite would be beneficial. I can publish a separate PR for the same.
Major +1. That would be really beneficial. Even if it involves some mocks (right now there's backing catalog, which for REST we use JDBC via the CatalogAdaptor.
Even just having some mocks there one time would make it easy for people who aren't super knowledgable in Glue to add tests that hit multiple catalogs without knowing the details of all of them (to some degree or with review).
Definitely for a separate PR though =)
| private LockManager lockManager; | ||
| private CloseableGroup closeableGroup; | ||
| private Map<String, String> catalogProperties; | ||
| private Map<String, String> catalogProperties = Collections.emptyMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Unless we need null values in the map, we almost always prefer shaded-guava's ImmutableMap.of().
kbendick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SinghAsDev for pinging me again!
Left some more comments. I will do more review shortly, but I have some open-ended questions:
Do we have any worries about catalog defaults accidentally affecting a table that a user didn't intend it to? I know some properties have been called out as possibly better suited as per-job definitions (like write.upsert.enabled for Flink Iceberg Sink).
This might be a non-concern, particularly if it's well documented, but we might want to at least log the property keys for which we pulled a default value from the catalog level?
Would love to hear from others if they have any similar concerns, and if there's any proposed solution.
In the REST catalog, we do allow for a catalog-level override via the /config endpoint, so there's definitely some level of precedence, but this covers a bit more surface area than that.
|
|
||
| @Override | ||
| public void initialize(String name, Map<String, String> properties) { | ||
| this.catalogProperties = properties; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Might want to wrapt his in ImmutableMap.copyOf(properties) (which is a no-op if the underlying map is an ImmutableMap. Though again if you need null-values then that won't work for you.
|
|
||
| @VisibleForTesting | ||
| void initialize(String name, String path, AwsProperties properties, GlueClient client, | ||
| LockManager lock, FileIO io, Map<String, String> catalogProps) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit / non-blocking: The indentation for this seems a little weird to me.
If all of the properties are on the next line, will it fit? I do see conflicting things at times so it's not a huge deal, but I believe the preferred way would be something more like:
@VisibleForTesting
void initialize(String name, String path, AwsProperties properties, GlueClient client,
LockManager lock, FileIO io, Map<String, String> catalogProps) {But again, I see different formats so without clarification, I'd either go with the above, a single line (the next line) if they all fit, or just leave it unless somebody states otherwise. I'll follow up for more clarification, but this is absolutley non-blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The starting indent of all argument lines should match. That could be starting after initialize( or using a continuation indent on the following line. But the important thing is that all lines with arguments indent arguments at the same place.
| } | ||
|
|
||
| protected Map<String, String> properties() { | ||
| return Collections.emptyMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: If it's empty, usually we use ImmutableMap.of(). I see you updated from ImmutableMap.builder so I assume you need null values, but I think even still we'd normally use ImmutableMap.of() in the empty case.
| private final TableIdentifier identifier; | ||
| private final Schema schema; | ||
| private final ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builder(); | ||
| private final Map<String, String> tableProperties = Maps.newHashMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed for null-values? If so, can we ensure that we have tests that include null values? Otherwise it's likely to get refactored at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we will need null values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mainly because the map gets altered with tableProperties.putAll(tableDefaultProperties()); and tableProperties.putAll(tableOverrideProperties());
| private FileIO fileIO; | ||
| private LockManager lockManager; | ||
| private boolean suppressPermissionError = false; | ||
| private Map<String, String> catalogProperties = Collections.emptyMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit / non-blocking : Does this need to be initialized? I'm assuming this is to suppress potential not-initialized errors?
| public static final String TABLE_DEFAULT_PREFIX = "table-default."; | ||
| public static final String TABLE_OVERRIDE_PREFIX = "table-override."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit / open-question: How do people feel about these names? I'd kind of like them to have properties or something in them, but this might be overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with this.
|
|
||
| @Override | ||
| protected Map<String, String> properties() { | ||
| return catalogProperties; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: For here, _if it will remove the not-initialized warning from catalogProperties, we often do the following:
return catalogProperies != null ? ImmutableMap.of() : catalogProperties;| } | ||
|
|
||
| @VisibleForTesting | ||
| void initialize(String name, String path, AwsProperties properties, GlueClient client, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't clear why this method is needed. @SinghAsDev, is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary, it is just used in test to check properties map is not empty when catalog properties is initialised.
|
Since all that was left were nits, I went ahead and merged this. Thanks for the contribution, @SinghAsDev! |
|
Thanks, @SinghAsDev for the contribution - this is a really important feature for the EMR customers. I'll shall address the nitpicks and the GlueCatalog test improvements in a follow-up PR. |
@kbendick I don't think we need to worry as only the catalog properties with a certain prefix are considered, +1 for logging. iceberg/core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java Lines 226 to 233 in 13b1a73
|
Allow table defaults to be configured and/ or enforced at catalog level using catalog properties. This resolves #3994