Skip to content

Limit session state during metadata queries in Iceberg#19757

Merged
findepi merged 4 commits intomasterfrom
findepi/iceberg-oom-limit
Nov 16, 2023
Merged

Limit session state during metadata queries in Iceberg#19757
findepi merged 4 commits intomasterfrom
findepi/iceberg-oom-limit

Conversation

@findepi
Copy link
Copy Markdown
Member

@findepi findepi commented Nov 15, 2023

Metadata queries such as information_schema.columns,
system.jdbc.columns or system.metadata.table_comments may end up
loading arbitrary number of relations within single query (transaction).
It is important to bound memory usage for such queries.

In case of Iceberg Hive metastore based catalog, this is already done in
TrinoHiveCatalogFactory bu means of configuring per-query
CachingHiveMetastore. However, catalogs with explicit caching need
something similar.

@cla-bot cla-bot bot added the cla-signed label Nov 15, 2023
@github-actions github-actions bot added the iceberg Iceberg connector label Nov 15, 2023
@findepi findepi force-pushed the findepi/iceberg-oom-limit branch from a1ab9d5 to 2c3e111 Compare November 15, 2023 21:23
Metadata queries such as `information_schema.columns`,
`system.jdbc.columns` or `system.metadata.table_comments` may end up
loading arbitrary number of relations within single query (transaction).
It is important to bound memory usage for such queries.

In case of Iceberg Hive metastore based catalog, this is already done in
`TrinoHiveCatalogFactory` bu means of configuring per-query
`CachingHiveMetastore`. However, catalogs with explicit caching need
something similar.
@findepi findepi force-pushed the findepi/iceberg-oom-limit branch from 2c3e111 to ad51682 Compare November 15, 2023 21:40
}
String tableLocation = metadataLocation.replaceFirst("/metadata/[^/]*$", "");
deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, tableLocation);
invalidateTableCache(schemaTableName);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved to dropTableFromMetastore or even to deleteTable to simplify code and prevent from omitting in case new functions in future?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought about it. technically it would work, but i considered dropTableFromMetastore being just a technical operation, which may or may not be invoked, or be the last operation as part of the drop flow

public void unregisterTable(ConnectorSession session, SchemaTableName schemaTableName)
{
dropTableFromMetastore(schemaTableName);
invalidateTableCache(schemaTableName);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this and folowing be moved to dropTableFromMetastore ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Nov 16, 2023

CI #16315 (#8920), #15187
retrying

@findepi findepi merged commit 8d4f26b into master Nov 16, 2023
@findepi findepi deleted the findepi/iceberg-oom-limit branch November 16, 2023 15:25
@github-actions github-actions bot added this to the 434 milestone Nov 16, 2023
@mosabua
Copy link
Copy Markdown
Member

mosabua commented Nov 17, 2023

No release note entry @findepi ?

@findepi findepi added the no-release-notes This pull request does not require release notes entry label Nov 17, 2023
@alexjo2144
Copy link
Copy Markdown
Member

My understanding is that the caching here is important for ensuring that queries use the same snapshot in different phases of query execution. It's unlikely that a regular select would fill this cache but it'd be nice if we could have them be unbounded when necessary.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Nov 17, 2023

I see your point and agree.
I allowed myself to be a bit lazy here.
The pattern exists in Hive connector since long time. It's configurable there (hive.per-transaction-metastore-cache-maximum-size) but i don't recall people needing to configure this. Of course I agree it would be better to have a configuration toggle for this.

We probably should also update the JDBC connector. DefaultJdbcMetadataFactory creates CachingJdbcClient with unbounded cache, so it can OOM for large metadata queries.
cc @hashhar @kokosing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed iceberg Iceberg connector no-release-notes This pull request does not require release notes entry

Development

Successfully merging this pull request may close these issues.

4 participants