Skip to content

Conversation

@SinghAsDev
Copy link
Contributor

Allow table defaults to be configured and/ or enforced at catalog level using catalog properties. This resolves #3994

@SinghAsDev
Copy link
Contributor Author

@jackye1995 @rdblue @anuragmantri can you help review this, thanks

@rdblue
Copy link
Contributor

rdblue commented Jan 30, 2022

I took a quick look. I think the idea is reasonable, but the implementation needs to be changed:

  • I think table-default and table-override are better property prefixes
  • This should not modify the Catalog interface or CachingCatalog
  • The best place to add this is in BaseMetastoreCatalog or other catalogs you need to update
  • Subclasses should pass properties using super(...) in the constructor, not set parent fields
  • Please reduce the footprint of this change by not modifying classes unnecessarily; add defaults to avoid touching so many files

@SinghAsDev
Copy link
Contributor Author

@rdblue thanks for the review, updated to keep changes to hive and hadoop catalogs.

@rdblue
Copy link
Contributor

rdblue commented Jan 31, 2022

Thanks for the quick turn-around @SinghAsDev! I'll take another look, hopefully today.

@SinghAsDev
Copy link
Contributor Author

Hi @rdblue just wanted to bump this and see if you would have cycles to review this. Thanks!

@rajarshisarkar
Copy link
Contributor

@SinghAsDev As discussed over Slack, this feature would benefit many EMR cluster users. I wanted to check with you if we can fast track this PR. In case you are keeping busy, can I collaborate with you to close this PR (as a co-author)?

@SinghAsDev
Copy link
Contributor Author

SinghAsDev commented May 9, 2022

Hey @rajarshisarkar so sorry for not being able to spend time on this 😞. I think if you have time I would appreciate the help to help close this, thanks. If you don't get time either, I can try to get to it by end of this week.

@rajarshisarkar
Copy link
Contributor

Hi @SinghAsDev, I am happy to take this PR to closure. Can you please give access to your fork so that I can push new commits in your branch.

@rajarshisarkar
Copy link
Contributor

Thanks, @SinghAsDev for providing the access! I will resolve the pending review comments.

@rajarshisarkar
Copy link
Contributor

Ping @rdblue @kbendick @singhpk234 for review.

Comment on lines +374 to +421
public void testTablePropsDefinedAtCatalogLevel() {
String namespace = createNamespace();
String tableName = getRandomName();
TableIdentifier tableIdent = TableIdentifier.of(namespace, tableName);
ImmutableMap<String, String> catalogProps = ImmutableMap.of(
"table-default.key1", "catalog-default-key1",
"table-default.key2", "catalog-default-key2",
"table-default.key3", "catalog-default-key3",
"table-override.key3", "catalog-override-key3",
"table-override.key4", "catalog-override-key4",
"warehouse", "s3://" + testBucketName + "/" + testPathPrefix);

glueCatalog.initialize("glue", catalogProps);

Schema schema = new Schema(
NestedField.required(3, "id", Types.IntegerType.get(), "unique ID"),
NestedField.required(4, "data", Types.StringType.get())
);

org.apache.iceberg.Table table = glueCatalog.buildTable(tableIdent, schema)
.withProperty("key2", "table-key2")
.withProperty("key3", "table-key3")
.withProperty("key5", "table-key5")
.create();

Assert.assertEquals(
"Table defaults set for the catalog must be added to the table properties.",
"catalog-default-key1",
table.properties().get("key1"));
Assert.assertEquals(
"Table property must override table default properties set at catalog level.",
"table-key2",
table.properties().get("key2"));
Assert.assertEquals(
"Table property override set at catalog level must override table default" +
" properties set at catalog level and table property specified.",
"catalog-override-key3",
table.properties().get("key3"));
Assert.assertEquals(
"Table override not in table props or defaults should be added to table properties",
"catalog-override-key4",
table.properties().get("key4"));
Assert.assertEquals(
"Table properties without any catalog level default or override should be added to table" +
" properties.",
"table-key5",
table.properties().get("key5"));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move this test to TestGlueCatalog instead, as at present we don't run integ test as part of our GA. WDYT ?

Copy link
Contributor

@rajarshisarkar rajarshisarkar May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently in TestGlueCatalog we mock objects of GlueCatalog and GlueClient. The class is mainly used for creating/listing/dropping/renaming namespace/tables by mocking the response. Also, it looks like we cannot mock S3 Client in S3FileIO directly from TestGlueCatalog. Do let me know your thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally really don't like working with mocks, so I understand what you mean. I would consider adding some test that does get run through the CI test suite, to ensure that any changes that are made in the API are made against the Glue catalog (outside of compiler related errors).

As it is true that the AWS integration tests aren't run on any schedule within the open source community (that I'm aware of) due to lack of infra.

That said, I wouldn't consider it a hard requirement given that HadoopCatalog and HiveCatalog both have tests. Maybe adding GlueCatalog to the existing CatalogTests suite would be beneficial in the long run? Something to consider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally really don't like working with mocks, so I understand what you mean

Yes, one more limitation here with mocking the glue client is we cannot mock the glue table response with a path for the BaseMetastoreTableOperations.METADATA_LOCATION_PROP property (as it starts checking for the existence of the file while refreshing the metadata in BaseMetastoreTableOperations).

I would consider adding some test that does get run through the CI test suite, to ensure that any changes that are made in the API are made against the Glue catalog

I have added a test which asserts that the properties API should not return an empty map from Glue Catalog when the catalog properties are initialised. The processing of the properties in BaseMetastoreCatalog is validated by HadoopCatalog and HiveCatalog. Meanwhile, TestGlueCatalogTable should check the actual behaviour for GlueCatalog.

Maybe adding GlueCatalog to the existing CatalogTests suite would be beneficial in the long run? Something to consider.

Yes, I think adding GlueCatalog to the existing CatalogTests suite would be beneficial. I can publish a separate PR for the same.

Please let me know your thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, one more limitation here with mocking the glue client is we cannot mock the glue table response with a path for the BaseMetastoreTableOperations.METADATA_LOCATION_PROP property (as it starts checking for the existence of the file while refreshing the metadata in BaseMetastoreTableOperations)

+1 understood. Wondering if we should comment this somewhere. I suppose a standalone comment on this PR is sufficient (or maybe in the final squashed git PR summarty a small mention of that in the extended message?). But overall +1 understood. Mocks can't solve everything.

I have added a test which asserts that the properties API should not return an empty map from Glue Catalog when the catalog properties are initialised. The processing of the properties in BaseMetastoreCatalog is validated by HadoopCatalog and HiveCatalog. Meanwhile, TestGlueCatalogTable should check the actual behaviour for GlueCatalog.

+1

Yes, I think adding GlueCatalog to the existing CatalogTests suite would be beneficial. I can publish a separate PR for the same.

Major +1. That would be really beneficial. Even if it involves some mocks (right now there's backing catalog, which for REST we use JDBC via the CatalogAdaptor.

Even just having some mocks there one time would make it easy for people who aren't super knowledgable in Glue to add tests that hit multiple catalogs without knowing the details of all of them (to some degree or with review).

Definitely for a separate PR though =)

@rajarshisarkar
Copy link
Contributor

Ping @rdblue @kbendick for review.

private LockManager lockManager;
private CloseableGroup closeableGroup;
private Map<String, String> catalogProperties;
private Map<String, String> catalogProperties = Collections.emptyMap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Unless we need null values in the map, we almost always prefer shaded-guava's ImmutableMap.of().

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SinghAsDev for pinging me again!

Left some more comments. I will do more review shortly, but I have some open-ended questions:

Do we have any worries about catalog defaults accidentally affecting a table that a user didn't intend it to? I know some properties have been called out as possibly better suited as per-job definitions (like write.upsert.enabled for Flink Iceberg Sink).

This might be a non-concern, particularly if it's well documented, but we might want to at least log the property keys for which we pulled a default value from the catalog level?

Would love to hear from others if they have any similar concerns, and if there's any proposed solution.

In the REST catalog, we do allow for a catalog-level override via the /config endpoint, so there's definitely some level of precedence, but this covers a bit more surface area than that.


@Override
public void initialize(String name, Map<String, String> properties) {
this.catalogProperties = properties;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Might want to wrapt his in ImmutableMap.copyOf(properties) (which is a no-op if the underlying map is an ImmutableMap. Though again if you need null-values then that won't work for you.


@VisibleForTesting
void initialize(String name, String path, AwsProperties properties, GlueClient client,
LockManager lock, FileIO io, Map<String, String> catalogProps) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / non-blocking: The indentation for this seems a little weird to me.

If all of the properties are on the next line, will it fit? I do see conflicting things at times so it's not a huge deal, but I believe the preferred way would be something more like:

  @VisibleForTesting
  void initialize(String name, String path, AwsProperties properties, GlueClient client,
                          LockManager lock, FileIO io, Map<String, String> catalogProps) {

But again, I see different formats so without clarification, I'd either go with the above, a single line (the next line) if they all fit, or just leave it unless somebody states otherwise. I'll follow up for more clarification, but this is absolutley non-blocking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The starting indent of all argument lines should match. That could be starting after initialize( or using a continuation indent on the following line. But the important thing is that all lines with arguments indent arguments at the same place.

}

protected Map<String, String> properties() {
return Collections.emptyMap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If it's empty, usually we use ImmutableMap.of(). I see you updated from ImmutableMap.builder so I assume you need null values, but I think even still we'd normally use ImmutableMap.of() in the empty case.

private final TableIdentifier identifier;
private final Schema schema;
private final ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builder();
private final Map<String, String> tableProperties = Maps.newHashMap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed for null-values? If so, can we ensure that we have tests that include null values? Otherwise it's likely to get refactored at some point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will need null values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mainly because the map gets altered with tableProperties.putAll(tableDefaultProperties()); and tableProperties.putAll(tableOverrideProperties());

private FileIO fileIO;
private LockManager lockManager;
private boolean suppressPermissionError = false;
private Map<String, String> catalogProperties = Collections.emptyMap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / non-blocking : Does this need to be initialized? I'm assuming this is to suppress potential not-initialized errors?

Comment on lines +32 to +33
public static final String TABLE_DEFAULT_PREFIX = "table-default.";
public static final String TABLE_OVERRIDE_PREFIX = "table-override.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / open-question: How do people feel about these names? I'd kind of like them to have properties or something in them, but this might be overkill.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this.


@Override
protected Map<String, String> properties() {
return catalogProperties;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: For here, _if it will remove the not-initialized warning from catalogProperties, we often do the following:

return catalogProperies != null ? ImmutableMap.of() : catalogProperties;

}

@VisibleForTesting
void initialize(String name, String path, AwsProperties properties, GlueClient client,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear why this method is needed. @SinghAsDev, is this necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary, it is just used in test to check properties map is not empty when catalog properties is initialised.

@rdblue rdblue merged commit 06fcdbb into apache:master May 17, 2022
@rdblue
Copy link
Contributor

rdblue commented May 17, 2022

Since all that was left were nits, I went ahead and merged this. Thanks for the contribution, @SinghAsDev!

@rajarshisarkar
Copy link
Contributor

Thanks, @SinghAsDev for the contribution - this is a really important feature for the EMR customers.
Thanks, @rdblue and @kbendick for the reviews!

I'll shall address the nitpicks and the GlueCatalog test improvements in a follow-up PR.

@rajarshisarkar
Copy link
Contributor

Do we have any worries about catalog defaults accidentally affecting a table that a user didn't intend it to? I know some properties have been called out as possibly better suited as per-job definitions (like write.upsert.enabled for Flink Iceberg Sink).

This might be a non-concern, particularly if it's well documented, but we might want to at least log the property keys for which we pulled a default value from the catalog level?

Would love to hear from others if they have any similar concerns, and if there's any proposed solution.

@kbendick I don't think we need to worry as only the catalog properties with a certain prefix are considered, +1 for logging.

/**
* Get default table properties set at Catalog level through catalog properties.
*
* @return default table properties specified in catalog properties
*/
private Map<String, String> tableDefaultProperties() {
return PropertyUtil.propertiesWithPrefix(properties(), CatalogProperties.TABLE_DEFAULT_PREFIX);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow table properties defaults to be configured at catalog level

5 participants