
Conversation

sopel39 (Member) commented Apr 27, 2023

# Section
* Reduce coordinator memory usage when caching of file metadata is enabled ({issue}`issuenumber`)

sopel39 (Member, Author) commented Apr 27, 2023

fyi @romanvainb

github-actions bot added the hive (Hive connector) and tests:hive labels Apr 27, 2023
Member commented:

I think we generally avoid using weak-valued caches because they make memory management less predictable. So this small single line is a big change to me.

However, what about scoping the interner in the caller, e.g. in InternalHiveSplitFactory?
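For context, the weak-interner pattern under discussion looks roughly like this with Guava (a sketch only; the PR's actual diff is not shown here, and the `HostInterning` wrapper class is illustrative):

```java
import com.google.common.collect.Interner;
import com.google.common.collect.Interners;

// Illustrative wrapper; only the weak interner itself is what the PR debates.
public final class HostInterning
{
    // Weak interner: returns a canonical instance for equal strings, but holds
    // entries weakly, so GC can reclaim a host name once no split references it.
    private static final Interner<String> HOST_INTERNER = Interners.newWeakInterner();

    private HostInterning() {}

    public static String intern(String host)
    {
        return HOST_INTERNER.intern(host);
    }
}
```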

sopel39 (Author) commented Apr 27, 2023

> I think we generally avoid using weak-valued caches because they make memory management less predictable. So this small single line is a big change to me.

The assumption is there won't be many hosts in the cluster, so it should be negligible in practice.

> However, what about scoping the interner in the caller, e.g. in InternalHiveSplitFactory?

I'm not sure what that would change. For example, CachingDirectoryLister has global scope, and it stores BlockLocations indirectly.
I wanted to avoid extra complexity here since the number of hosts should be small anyway.

Member commented:

> The assumption is there won't be many hosts in the cluster, so it should be negligible in practice.

Good point. Can you add this as a code comment?

Member commented:

Stupid question: what about deployments that scale up and down over time? The number of interned hosts will keep increasing. What causes it to release host names that are no longer in use? GC?

Member commented:

Not a stupid question, since I was about to ask the same one :P

sopel39 (Author) commented:

> Stupid question: what about deployments that scale up and down over time? The number of interned hosts will keep increasing.

Yes. Even in extreme cases, I'd assume something like 1000 hosts. They cannot go up and down very frequently, so the size of this structure will remain relatively stable over time (and small, too).

GC will collect weak references eagerly (as soon as nothing references them anymore).
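To illustrate that claim, a weakly referenced object becomes collectible as soon as its last strong reference is dropped. A minimal sketch (the host name is made up, and `System.gc()` is only a hint to the JVM, so the exact timing is not guaranteed by the spec):

```java
import java.lang.ref.WeakReference;

public class WeakCollectionDemo
{
    public static void main(String[] args)
    {
        // Hypothetical host name; new String(...) forces a fresh heap object
        // distinct from the interned string literal in the constant pool.
        String host = new String("worker-42.example.invalid");
        WeakReference<String> ref = new WeakReference<>(host);

        host = null;   // drop the last strong reference
        System.gc();   // request a collection (a hint, not a guarantee)

        // Typically prints null: the weakly reachable string has been reclaimed
        System.out.println(ref.get());
    }
}
```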

sopel39 requested a review from findepi April 28, 2023 09:00
findepi (Member) left a comment:

% add a comment: we assume the HOST_INTERNER will intern only a very small, limited number of names over time.
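The requested comment presumably ended up looking something like this (a sketch of the intent, not the wording of the merged commit; it assumes the Guava declaration sketched earlier in this thread):

```java
// We assume HOST_INTERNER only ever interns a very small, limited number of
// host names over time: clusters have few hosts and scale up/down slowly.
// Weak values let GC drop names no longer referenced by any split.
private static final Interner<String> HOST_INTERNER = Interners.newWeakInterner();
```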

sopel39 merged commit ee4146a into trinodb:master Apr 28, 2023
sopel39 deleted the ks/intern_host branch April 28, 2023 20:56
sopel39 mentioned this pull request Apr 28, 2023
github-actions bot added this to the 416 milestone Apr 28, 2023
