Skip to content

Add ability to disable partitions caching in hive metastore#12343

Merged
findepi merged 1 commit intotrinodb:masterfrom
s2lomon:hive-metastore-partitions-cache
May 24, 2022
Merged

Add ability to disable partitions caching in hive metastore#12343
findepi merged 1 commit intotrinodb:masterfrom
s2lomon:hive-metastore-partitions-cache

Conversation

@s2lomon
Copy link
Member

@s2lomon s2lomon commented May 12, 2022

Add ability to disable partitions caching in hive metastore

Thanks to this, one can disable caching of partitions, that
should fix issues with missing partitions during query
execution.

Description

Is this change a fix, improvement, new feature, refactoring, or other?

It's a new feature that helps fixing some issues with inconsistencies in hive metastore caching

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

It's a change to a connector

How would you describe this change to a non-technical end user or system administrator?

It allows you to pick whether you want to cache ALL, DOWN TO TABLE, or PARTITION metadata for hive metastore.

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing tests. But please

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need TABLE now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a method to enum instead

scope.matches(otherScope);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would enable this only when ALL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would enable this only when ALL

@findepi
Copy link
Member

findepi commented May 13, 2022

It's a new feature that helps fixing some issues with inconsistencies in hive metastore caching

What is the use-case scenario when a user would want/need to use this?

@Praveen2112
Copy link
Member

What is the use-case scenario when a user would want/need to use this?

It allows user to partially discard a cache - like we would like to cache the table details but not the partition information (if the user adds/drops a partition frequently)

@s2lomon s2lomon force-pushed the hive-metastore-partitions-cache branch from 2aeaa77 to 3040352 Compare May 16, 2022 12:10
@s2lomon s2lomon changed the title Add MetastoreCacheScope to caching hive metastore config Add ability to disable partitions caching in hive metastore May 16, 2022
@findepi
Copy link
Member

findepi commented May 16, 2022

What is the use-case scenario when a user would want/need to use this?

It allows user to partially discard a cache - like we would like to cache the table details but not the partition information (if the user adds/drops a partition frequently)

I know what the change does on a technical level.

It does not answer my question about a use-case scenario when a user would want/need to use this.

@Praveen2112
Copy link
Member

It does not answer my question about a use-case scenario when a user would want/need to use this.

In case of new partitions being added or dropped at some irregular interval - would make the system unusable at times at it fails (if partitions are dropped). We need to either disable the cache fully - but we might hit the underlying metastore too many times. This would be kind of a sweet spot between both the worlds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this to a next line ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change required as a part of this commit ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, it's my mistake

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there is any expansion for sutAction ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sut is a shortcut for system under test. It's a way to mark this as a holder for all actions that are actually being tested. I could rename it to cachingHiveMetastoreInteractions or metastoreInteractions/Actions. For me sut is ok, but maybe the fact that you need to know what it means makes it worst than one of the above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sut is a shortcut for system under test.

would never guess

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hive.metastore-cache-partitions

however, we should have hive.metastore-cache.* prefix, so maybe hive.metastore-cache.cache-partitions and rename existing configs in a follow-up.

cc @losipiuk @electrum

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suggest hive.metastore-cache.cache-partitions and rename existing configs in a follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok we can do it this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put getter above the setter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use actual default: new ..Config().is...CacheaPartitions()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so when I've wanted to change this it hit me as a very bad practice. I understand that the reasoning for keeping the value automated on the test makes it easier to maintain changes in the default value of this field. Still I feel that making it so defeats the purpose of having these tests anyway (and makes it harder to read). I would argue that in the test we maintain automatically checked contract of the configuration of Trino, so that whenever we change defaults like this, we need to at least make it intentionally and this tests guards us from doing such changes by mistake.

Otherwise we could derive all our defaults from the inlined instance of the class, which would mean that we are writing tests for defaults only because we have a checkstyle for that...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@s2lomon
if this tests's purpose is to test things related to partitions being cached or not, then i agree, the config should be set explicitly.
however, it's not the goal here. you supplied the config only because the compiler ordered you to (missing parameter), not that because you really wanted to enable or disable partition caching.
new ..Config().is...CacheaPartitions() conveys this intent, and neither true nor false conveys that intent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch to new ..Config().is...CacheaPartitions() please

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omg you are right, for some reason I've thought that this comment was about the default test. Good point and thanks for being patient :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You chose to add the field as the last one in the config class, so add it as a last one in the test class too.
(or move declaration in the config class)

(same below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmt msg

Thanks to this, one can disable caching of partitions, that
should fix issues with missing partitions during query
execution.

Mention the fact this is for the case where table partitions are modified externally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing.

@s2lomon
Copy link
Member Author

s2lomon commented May 19, 2022

It does not answer my question about a use-case scenario when a user would want/need to use this.

In case of new partitions being added or dropped at some irregular interval - would make the system unusable at times at it fails (if partitions are dropped). We need to either disable the cache fully - but we might hit the underlying metastore too many times. This would be kind of a sweet spot between both the worlds.

I guess it should fix issues like this one. #6286 Or at least would be a good try for such scenarios. Alternatively we could add fine tuning parameters, but I guess it's easier to have it cached as always or not at all.

@s2lomon s2lomon force-pushed the hive-metastore-partitions-cache branch from 3040352 to 7cf7803 Compare May 19, 2022 09:44
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the executor (refresh executor) is useless for a "never cache", drop the param

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Optional.empty()? ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suggest hive.metastore-cache.cache-partitions and rename existing configs in a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch to new ..Config().is...CacheaPartitions() please

Comment on lines 739 to 740
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do ordinary for

Comment on lines 748 to 749
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do ordinary for

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it last (as everywhere else)

@s2lomon s2lomon force-pushed the hive-metastore-partitions-cache branch from 7cf7803 to 4e4945e Compare May 23, 2022 17:50
Thanks to this, one can disable caching of partitions, that
should fix issues with missing partitions during query
execution. This can happen when partitions are often replaced
or changed by some automated external processes.
@s2lomon s2lomon force-pushed the hive-metastore-partitions-cache branch from 4e4945e to f89babb Compare May 24, 2022 10:17
@findepi
Copy link
Member

findepi commented May 24, 2022

CI #11275 (reopened)

@findepi findepi merged commit 7834be9 into trinodb:master May 24, 2022
@findepi findepi mentioned this pull request May 24, 2022
@github-actions github-actions bot added this to the 382 milestone May 24, 2022
@mosabua
Copy link
Member

mosabua commented May 24, 2022

@colebow can you verify if we need docs and if yes .. raise a PR asap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants