Provide memory tracking capabilities for connector splits#10273
Provide memory tracking capabilities for connector splits#10273losipiuk merged 9 commits intotrinodb:masterfrom
Conversation
There was a problem hiding this comment.
can we call openTable more times with this refactor. If so is it a problem?
There was a problem hiding this comment.
KuduRecordSet is created once per split by KuduRecordSetProvider#getRecordSet. Caching the kuduTable on the KuduRecordSet should be functionally equivalent of caching it at the KuduSplit level.
There was a problem hiding this comment.
Is that called just once per for split processing?
There was a problem hiding this comment.
It is called once per split when AccumuloRecordSet is created
There was a problem hiding this comment.
nit: drop serialized from arg names a field name. Just leave phoenixInputSplit
There was a problem hiding this comment.
I named it with the serialized prefix to avoid name clash on getters (getSerializedPhoenixInputSplit and getPhoenixInputSplit, the first one is needed to mark it as jackson serializable)
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryColumnHandle.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
It is based on implementation detail of other component. Also cache is size bound so technically it is not sure that we will get cached instance here (though very probable).
How big the type objects are? Maybe it is better to still account them here to be on the safe side.
There was a problem hiding this comment.
The memory accounting primitives for Type is not very complex to implement, but also non trivial. To implement proper memory accounting the TypeSignature class and TypeSignatureParameter class will have to provide memory accounting capabilities as well. In case of TypeSignatureParameter which stores the value as Object in adds additional implementation complexity, as based on the type of parameter the value of a parameter can be of 4 different types that would all have to be accounted for.
From one perspective the complexity added is not that high. But from the practical perspective I'm not sure if it is worth it.
As I already mentioned in the comment the types are cached by TypeRegistry. Many simple types are singletons (such as BigintType). The only cache in TypeRegistry that is bounded is parametricTypeCache. But it accommodates 1000 types, and in reality it is rather unlikely to overflow (as it would mean that at the current moment the system processes queries over more than 1000 columns that all have distinct types). But even so - types are usually reused across all splits as the column handles are usually resolved only once before the split enumeration starts. Thus accounting memory for splits is more likely than not to cause double counting for the memory used by those types.
There was a problem hiding this comment.
there is some more stuff in Properties (defaults, map) - not sure if set in our case.
There was a problem hiding this comment.
This method should account memory for all key-value pairs stored in the Properties. However as you pointed out it may not account for all internal data structure maintained by the map (what duplicates the Properties map for efficiency) and defaults (which may shadow some of the keys). It is hard to estimate memory usage of Properties precisely, but the current estimate should be close.
Ideally the schema field should be removed completely, as it duplicates a lot of data already present in the split (prestodb/presto#13453). But that's probably a topic for a follow up.
4a729bb to
37ebf5b
Compare
There was a problem hiding this comment.
Why change from URI to String? Similarly for others parts (e.g. change to serialized page), is it for accurate memory tracking? How much of a difference does it make?
There was a problem hiding this comment.
It's hard to say. As i tried to explain in the commit message it is difficult to predict how much memory does the URI use, as it has many different fields where it caches certain URI parts as well as decoded / encoded representation for some of those.
Store task location as String to make split memory accounting simpler as accounting memory used by the URI object is rather complex due to many "caching" fields for various URI parts inside the URI object
Store remote URI as String to make split memory accounting simpler as accounting memory used by the URI object is rather complex due to many "caching" fields for various URI parts inside the URI object
Store remote URI as String to make split memory accounting simpler as accounting memory used by the URI object is rather complex due to many "caching" fields for various URI parts inside the URI object
Remove KuduTableHandle from the KuduSplit to simplify memory accounting. KuduTableHandle contains KuduTable, a class defined by the Kudu client library. Implementing memory accounting correctly for that object is challenging.
Store time zone id as String to simplify memory accounting
Replace WrappedRange with SerializedRange to simplify memory accounting
Convert sheets values to String early and store them as String in the split to simplify memory accounting
Replace WrappedPhoenixInputSplit with SerializedPhoenixInputSplit to simplify memory accounting
37ebf5b to
0915954
Compare
No description provided.