Skip to content

Provide memory tracking capabilities for connector splits#10273

Merged
losipiuk merged 9 commits intotrinodb:masterfrom
arhimondr:split-memory-tracking-master
Dec 23, 2021
Merged

Provide memory tracking capabilities for connector splits#10273
losipiuk merged 9 commits intotrinodb:masterfrom
arhimondr:split-memory-tracking-master

Conversation

@arhimondr
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call openTable more times with this refactor. If so is it a problem?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KuduRecordSet is created once per split by KuduRecordSetProvider#getRecordSet. Caching the kuduTable on the KuduRecordSet should be functionally equivalent of caching it at the KuduSplit level.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that called just once per for split processing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is called once per split when AccumuloRecordSet is created

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop serialized from arg names a field name. Just leave phoenixInputSplit

Copy link
Copy Markdown
Contributor Author

@arhimondr arhimondr Dec 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named it with the serialized prefix to avoid name clash on getters (getSerializedPhoenixInputSplit and getPhoenixInputSplit, the first one is needed to mark it as jackson serializable)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is based on implementation detail of other component. Also cache is size bound so technically it is not sure that we will get cached instance here (though very probable).
How big the type objects are? Maybe it is better to still account them here to be on the safe side.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory accounting primitives for Type is not very complex to implement, but also non trivial. To implement proper memory accounting the TypeSignature class and TypeSignatureParameter class will have to provide memory accounting capabilities as well. In case of TypeSignatureParameter which stores the value as Object in adds additional implementation complexity, as based on the type of parameter the value of a parameter can be of 4 different types that would all have to be accounted for.

From one perspective the complexity added is not that high. But from the practical perspective I'm not sure if it is worth it.

As I already mentioned in the comment the types are cached by TypeRegistry. Many simple types are singletons (such as BigintType). The only cache in TypeRegistry that is bounded is parametricTypeCache. But it accommodates 1000 types, and in reality it is rather unlikely to overflow (as it would mean that at the current moment the system processes queries over more than 1000 columns that all have distinct types). But even so - types are usually reused across all splits as the column handles are usually resolved only once before the split enumeration starts. Thus accounting memory for splits is more likely than not to cause double counting for the memory used by those types.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some more stuff in Properties (defaults, map) - not sure if set in our case.

Copy link
Copy Markdown
Contributor Author

@arhimondr arhimondr Dec 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should account memory for all key-value pairs stored in the Properties. However as you pointed out it may not account for all internal data structure maintained by the map (what duplicates the Properties map for efficiency) and defaults (which may shadow some of the keys). It is hard to estimate memory usage of Properties precisely, but the current estimate should be close.

Ideally the schema field should be removed completely, as it duplicates a lot of data already present in the split (prestodb/presto#13453). But that's probably a topic for a follow up.

@arhimondr arhimondr force-pushed the split-memory-tracking-master branch from 4a729bb to 37ebf5b Compare December 13, 2021 19:10
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change from URI to String? Similarly for others parts (e.g. change to serialized page), is it for accurate memory tracking? How much of a difference does it make?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to say. As i tried to explain in the commit message it is difficult to predict how much memory does the URI use, as it has many different fields where it caches certain URI parts as well as decoded / encoded representation for some of those.

Store task location as String to make split memory accounting simpler as
accounting memory used by the URI object is rather complex due to many
"caching" fields for various URI parts inside the URI object
Store remote URI as String to make split memory accounting simpler as
accounting memory used by the URI object is rather complex due to many
"caching" fields for various URI parts inside the URI object
Store remote URI as String to make split memory accounting simpler as
accounting memory used by the URI object is rather complex due to many
"caching" fields for various URI parts inside the URI object
Remove KuduTableHandle from the KuduSplit to simplify memory accounting.

KuduTableHandle contains KuduTable, a class defined by the Kudu client
library. Implementing memory accounting correctly for that object is
challenging.
Store time zone id as String to simplify memory accounting
Replace WrappedRange with SerializedRange to simplify memory accounting
Convert sheets values to String early and store them as String in the
split to simplify memory accounting
Replace WrappedPhoenixInputSplit with SerializedPhoenixInputSplit to
simplify memory accounting
@arhimondr arhimondr force-pushed the split-memory-tracking-master branch from 37ebf5b to 0915954 Compare December 22, 2021 19:22
@losipiuk losipiuk merged commit a4898a7 into trinodb:master Dec 23, 2021
@github-actions github-actions bot added this to the 368 milestone Dec 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants