Provide memory tracking capabilities for connector splits by arhimondr · Pull Request #10273 · trinodb/trino

arhimondr · 2021-12-10T21:01:38Z

No description provided.

losipiuk · 2021-12-13T08:48:26Z

plugin/trino-kudu/src/main/java/io/trino/plugin/kudu/KuduRecordSet.java

Can we end up calling openTable more times with this refactor? If so, is that a problem?

KuduRecordSet is created once per split by KuduRecordSetProvider#getRecordSet. Caching the kuduTable on the KuduRecordSet should be functionally equivalent to caching it at the KuduSplit level.

losipiuk · 2021-12-13T08:54:29Z

plugin/trino-accumulo/src/main/java/io/trino/plugin/accumulo/model/AccumuloSplit.java

Is that called just once per split during processing?

It is called once per split, when the AccumuloRecordSet is created.

losipiuk · 2021-12-13T08:58:51Z

plugin/trino-phoenix/src/main/java/io/trino/plugin/phoenix/PhoenixSplit.java

nit: drop serialized from the argument name and the field name. Just leave phoenixInputSplit.

I named it with the serialized prefix to avoid a name clash between the getters (getSerializedPhoenixInputSplit and getPhoenixInputSplit; the first one is needed to mark it as Jackson serializable).
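The naming choice above can be illustrated with a minimal sketch in plain Java. It is not the code from this PR: KeyRange, SerializedRangeSplit, the byte layout, and the overhead constants are all hypothetical, and the Jackson annotations are omitted so the example has no dependencies.

```java
import java.nio.ByteBuffer;

// Hypothetical payload standing in for a connector-specific input split
final class KeyRange
{
    final long start;
    final long end;

    KeyRange(long start, long end)
    {
        this.start = start;
        this.end = end;
    }

    byte[] serialize()
    {
        return ByteBuffer.allocate(2 * Long.BYTES).putLong(start).putLong(end).array();
    }

    static KeyRange deserialize(byte[] bytes)
    {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new KeyRange(buffer.getLong(), buffer.getLong());
    }
}

// Hypothetical split that keeps only the serialized form, which is easy to size
final class SerializedRangeSplit
{
    // Assumed overheads; real values depend on the JVM
    private static final long INSTANCE_OVERHEAD_BYTES = 16;
    private static final long ARRAY_OVERHEAD_BYTES = 16;

    private final byte[] serializedKeyRange;

    SerializedRangeSplit(KeyRange keyRange)
    {
        this.serializedKeyRange = keyRange.serialize();
    }

    // In a Jackson-based split this getter would carry the JSON property annotation
    byte[] getSerializedKeyRange()
    {
        return serializedKeyRange;
    }

    // A distinct name avoids clashing with the getter above; decodes on demand
    KeyRange getKeyRange()
    {
        return KeyRange.deserialize(serializedKeyRange);
    }

    long getRetainedSizeInBytes()
    {
        return INSTANCE_OVERHEAD_BYTES + ARRAY_OVERHEAD_BYTES + serializedKeyRange.length;
    }
}
```

Keeping both getters lets the split expose the raw bytes for serialization while callers still get the decoded object.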

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryColumnHandle.java

losipiuk · 2021-12-13T10:14:16Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveColumnProjectionInfo.java

This relies on an implementation detail of another component. Also, the cache is size-bounded, so technically we are not guaranteed to get a cached instance here (though it is very probable).
How big are the type objects? Maybe it is better to still account for them here to be on the safe side.

Memory accounting for Type is not very complex to implement, but it is also non-trivial. To account for memory properly, the TypeSignature and TypeSignatureParameter classes would have to provide memory accounting capabilities as well. TypeSignatureParameter stores its value as an Object, which adds implementation complexity: depending on the kind of parameter, the value can be one of 4 different types, all of which would have to be accounted for.

From one perspective, the added complexity is not that high. But from a practical perspective, I'm not sure it is worth it.

As I already mentioned in the comment, the types are cached by TypeRegistry. Many simple types are singletons (such as BigintType). The only bounded cache in TypeRegistry is parametricTypeCache. It accommodates 1000 types, so in practice it is rather unlikely to overflow (that would mean the system is currently processing queries over more than 1000 columns that all have distinct types). Even so, types are usually reused across all splits, as the column handles are usually resolved only once, before split enumeration starts. Thus, accounting for type memory in splits is more likely than not to double count the memory used by those types.
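The double counting concern can be shown with a small, self-contained sketch. CachedType, TypedSplit, and the sizes used here are hypothetical stand-ins, not Trino classes; the only point is that a shared instance is charged once per split if every split accounts for it.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a type instance shared through a cache
final class CachedType
{
    final String name;
    final long sizeInBytes;

    CachedType(String name, long sizeInBytes)
    {
        this.name = name;
        this.sizeInBytes = sizeInBytes;
    }
}

// Hypothetical split that references a (possibly shared) type instance
final class TypedSplit
{
    final long ownSizeInBytes;
    final CachedType columnType;

    TypedSplit(long ownSizeInBytes, CachedType columnType)
    {
        this.ownSizeInBytes = ownSizeInBytes;
        this.columnType = columnType;
    }
}

final class TypeDoubleCounting
{
    private TypeDoubleCounting() {}

    // Each split charges the full size of its type, even when the instance is shared
    static long perSplitTotal(List<TypedSplit> splits)
    {
        long total = 0;
        for (TypedSplit split : splits) {
            total += split.ownSizeInBytes + split.columnType.sizeInBytes;
        }
        return total;
    }

    // Charges each distinct type instance once, which is what the heap actually holds
    static long actualTotal(List<TypedSplit> splits)
    {
        Set<CachedType> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        long total = 0;
        for (TypedSplit split : splits) {
            total += split.ownSizeInBytes;
            if (seen.add(split.columnType)) {
                total += split.columnType.sizeInBytes;
            }
        }
        return total;
    }
}
```

With three splits sharing one type, the per-split total overstates the real usage by two copies of the type.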

losipiuk · 2021-12-13T10:16:29Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplit.java

There is some more stuff in Properties (defaults, map); not sure if it is set in our case.

This method should account for the memory of all key-value pairs stored in the Properties. However, as you pointed out, it may not account for all the internal data structures, such as the map (which duplicates the Properties map for efficiency) and defaults (which may shadow some of the keys). It is hard to estimate the memory usage of Properties precisely, but the current estimate should be close.

Ideally the schema field should be removed completely, as it duplicates a lot of data already present in the split (prestodb/presto#13453). But that's probably a topic for a follow-up.
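The key-value accounting described in this reply could look roughly like the sketch below. It is an illustration, not the method from the PR: PropertiesSizeEstimator and the per-String overhead constant are assumptions, and, as discussed above, entries held only in defaults and the map's internal structures are deliberately left out.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper that estimates the memory held by the entries of a Properties object
final class PropertiesSizeEstimator
{
    // Assumed rough per-String overhead (object header, fields, array header); JVM dependent
    private static final long STRING_OVERHEAD_BYTES = 40;

    private PropertiesSizeEstimator() {}

    static long estimatedSizeOf(String value)
    {
        // Assumes one byte per character (compact Latin-1 strings)
        return STRING_OVERHEAD_BYTES + value.length();
    }

    static long estimatedSizeOf(Properties properties)
    {
        long size = 0;
        // entrySet() covers only the object's own entries, not the ones inherited from defaults
        for (Map.Entry<Object, Object> entry : properties.entrySet()) {
            size += estimatedSizeOf(String.valueOf(entry.getKey()));
            size += estimatedSizeOf(String.valueOf(entry.getValue()));
        }
        return size;
    }
}
```

The result is a lower bound: hash table buckets and any defaults chain add to the real footprint.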

linzebing · 2021-12-13T22:46:47Z

core/trino-main/src/main/java/io/trino/split/RemoteSplit.java

Why change from URI to String? Similarly for other parts (e.g. the change to serialized page): is it for accurate memory tracking? How much of a difference does it make?

It's hard to say. As I tried to explain in the commit message, it is difficult to predict how much memory a URI uses, as it has many different fields where it caches certain URI parts, as well as decoded and encoded representations of some of them.

Store task location as String to make split memory accounting simpler as accounting memory used by the URI object is rather complex due to many "caching" fields for various URI parts inside the URI object
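A minimal sketch of the idea in this commit message: keep the location as a String, whose footprint is easy to estimate, and build a URI only when one is needed. RemoteLocationSplit and the overhead constants are hypothetical and are not the RemoteSplit implementation.

```java
import java.net.URI;

// Hypothetical split that stores its location as a String instead of a URI
final class RemoteLocationSplit
{
    // Assumed overheads; real values depend on the JVM
    private static final long INSTANCE_OVERHEAD_BYTES = 16;
    private static final long STRING_OVERHEAD_BYTES = 40;

    private final String location;

    RemoteLocationSplit(URI location)
    {
        // Converting early avoids retaining the URI and its internal cached fields
        this.location = location.toString();
    }

    String getLocation()
    {
        return location;
    }

    // Parses on demand; the resulting URI is not retained by the split
    URI getLocationUri()
    {
        return URI.create(location);
    }

    long getRetainedSizeInBytes()
    {
        // Assumes one byte per character (compact Latin-1 strings)
        return INSTANCE_OVERHEAD_BYTES + STRING_OVERHEAD_BYTES + location.length();
    }
}
```

The trade-off is a small parsing cost each time the URI is requested, in exchange for a size that depends only on the string length.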


cla-bot bot added the cla-signed label Dec 10, 2021

arhimondr requested review from linzebing, losipiuk and martint December 10, 2021 21:01

losipiuk reviewed Dec 13, 2021

plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryColumnHandle.java (outdated)

losipiuk reviewed Dec 13, 2021

losipiuk approved these changes Dec 13, 2021

arhimondr force-pushed the split-memory-tracking-master branch from 4a729bb to 37ebf5b Compare December 13, 2021 19:10

linzebing approved these changes Dec 13, 2021

arhimondr added 9 commits December 22, 2021 14:13

Refactor RemoteSplit

20e3d8c

Store task location as String to make split memory accounting simpler as accounting memory used by the URI object is rather complex due to many "caching" fields for various URI parts inside the URI object

Refactor ExampleSplit

c7bc710

Store remote URI as String to make split memory accounting simpler as accounting memory used by the URI object is rather complex due to many "caching" fields for various URI parts inside the URI object

Refactor PrometheusSplit

849a63f

Store remote URI as String to make split memory accounting simpler as accounting memory used by the URI object is rather complex due to many "caching" fields for various URI parts inside the URI object

Refactor KuduSplit

a04bb06

Remove KuduTableHandle from the KuduSplit to simplify memory accounting. KuduTableHandle contains KuduTable, a class defined by the Kudu client library. Implementing memory accounting correctly for that object is challenging.

Refactor AtopSplit

7d98635

Store time zone id as String to simplify memory accounting

Refactor AccumuloSplit

bc44d70

Replace WrappedRange with SerializedRange to simplify memory accounting

Refactor SheetsSplit

ea56b04

Convert sheets values to String early and store them as String in the split to simplify memory accounting

Refactor PhoenixSplit

0d84280

Replace WrappedPhoenixInputSplit with SerializedPhoenixInputSplit to simplify memory accounting

Add ConnectorSplit#getRetainedSizeInBytes method

0915954
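As an illustration of what a per-split retained size makes possible, the sketch below sums the sizes reported by a batch of splits. SizedSplit, FixedSizeSplit, and SplitMemoryTracker are hypothetical names; only the method name getRetainedSizeInBytes comes from the commit title, and how the engine actually consumes it is not shown in this conversation.

```java
import java.util.List;

// Hypothetical stand-in for a split that reports its retained size
interface SizedSplit
{
    long getRetainedSizeInBytes();
}

// Hypothetical split with a fixed, pre-computed size
final class FixedSizeSplit
        implements SizedSplit
{
    private final long sizeInBytes;

    FixedSizeSplit(long sizeInBytes)
    {
        this.sizeInBytes = sizeInBytes;
    }

    @Override
    public long getRetainedSizeInBytes()
    {
        return sizeInBytes;
    }
}

final class SplitMemoryTracker
{
    private SplitMemoryTracker() {}

    // Sums the retained sizes of all splits, failing loudly on overflow
    static long totalRetainedSizeInBytes(List<? extends SizedSplit> splits)
    {
        long total = 0;
        for (SizedSplit split : splits) {
            total = Math.addExact(total, split.getRetainedSizeInBytes());
        }
        return total;
    }
}
```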

arhimondr force-pushed the split-memory-tracking-master branch from 37ebf5b to 0915954 Compare December 22, 2021 19:22

losipiuk approved these changes Dec 22, 2021

losipiuk merged commit a4898a7 into trinodb:master Dec 23, 2021

github-actions bot added this to the 368 milestone Dec 23, 2021

mosabua mentioned this pull request Dec 30, 2021

Add Trino 368 release notes #10433

Merged

losipiuk merged 9 commits into trinodb:master from arhimondr:split-memory-tracking-master
