Optimize HiveSplit serialization #13453

Merged
arhimondr merged 2 commits into prestodb:master from arhimondr:optimize_hive_split on Oct 1, 2019

Conversation

Member

@arhimondr arhimondr commented Sep 26, 2019

Serializing the "schema" for every split is very expensive. Instead of creating it on the coordinator, it can be reconstructed on a worker.

If release note is NOT required, use:

== RELEASE NOTES ==

Hive Changes
* Reduce CPU load on the coordinator by reducing the cost of serializing ``HiveSplit``s

@arhimondr arhimondr changed the title from "[WIP] Optimize HiveSplit serialization" to "Optimize HiveSplit serialization" Sep 26, 2019
@arhimondr arhimondr assigned wenleix and rschlussel and unassigned wenleix Sep 26, 2019
@rschlussel
Contributor

haven't read the PR yet, but this should get a release note since it fixes a performance issue.

Contributor

Why don't you account for the size of hiveTypeName?

Contributor

After an in-person conversation: it's because the values are shared from a cache. So the plan is:

  1. add a note explaining.
  2. get rid of dead code in HiveTypeName if HiveTypeName.getEstimatedSizeInBytes isn't used anywhere else

Member Author

The class has two fields:

private final HiveTypeName hiveTypeName;
private final TypeInfo typeInfo;

hiveTypeName is set from typeInfo

TypeInfo objects are always created by the TypeInfoFactory, which has a static singleton cache. Thus it doesn't make much sense to account for the memory of these objects here, as they are shared.

Member Author

> hiveTypeName is set from typeInfo

Sorry, that is not true. HiveTypeName is created based on TypeInfo. We still need to account for it. Let me fix it.

Contributor

Wow, does that mean Column has not been counted for memory usage for years?

BTW: why is the method name not getEstimatedRetainedSizeInBytes?

Member Author

> Wow, does that mean Column has not been counted for memory usage for years?

In this PR, the Map<Integer, HiveTypeName> columnCoercion is replaced with a Map<Integer, Column>; that's why the size of the Column has to be accounted for.

> BTW: why is the method name not getEstimatedRetainedSizeInBytes?

I believe the question is why it is not getEstimatedSizeInBytes. I wanted the name to be somewhat descriptive of the fact that the size of the TypeInfo is not accounted for, as it is not "retained" by the Column object.
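
The accounting rule described in this thread can be illustrated with a minimal sketch. The classes `SketchTypeInfo` and `SketchColumnType`, the `INSTANCE_SIZE` constant, and the two-bytes-per-character estimate are all assumptions made for illustration; this is not the code from the PR. The point is only that the owned type name is counted, while the cached and shared type info is not.

```java
// Hypothetical stand-ins for illustration; these are not the Presto classes.
final class SketchTypeInfo
{
    // Imagine instances come from a static cache and are shared by many columns.
    final String typeName;

    SketchTypeInfo(String typeName)
    {
        this.typeName = typeName;
    }
}

final class SketchColumnType
{
    // Assumed fixed overhead of one instance (object header plus two references).
    private static final long INSTANCE_SIZE = 24;

    private final String hiveTypeName;      // owned by this object, so it is counted
    private final SketchTypeInfo typeInfo;  // shared from a cache, so it is not counted

    SketchColumnType(String hiveTypeName, SketchTypeInfo typeInfo)
    {
        this.hiveTypeName = hiveTypeName;
        this.typeInfo = typeInfo;
    }

    SketchTypeInfo getTypeInfo()
    {
        return typeInfo;
    }

    // "Retained" here means: only memory that would be freed with this object.
    long getEstimatedRetainedSizeInBytes()
    {
        // Rough estimate of the string payload: 2 bytes per character.
        return INSTANCE_SIZE + 2L * hiveTypeName.length();
    }
}
```

With this shape, two columns that share one cached type info each report only their own overhead, so the shared object is never counted twice.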

Contributor

@rschlussel rschlussel left a comment

Feel free to merge after addressing the remaining comments

@arhimondr
Member Author

@rschlussel comments addressed

Contributor

@wenleix wenleix left a comment

"Refactor HiveTableLayoutHandle" LGTM.

Contributor

@wenleix wenleix left a comment

"Optimize imports in ParquetPageSourceFactory": This is accidentally done in #13473 so I think you can drop it ;)

Contributor

@wenleix wenleix left a comment

I made an initial pass and it generally looks good. I will make another pass next week.

I have two questions:

  1. The idea is that we don't need to have the partition schema in each HiveSplit; instead, we can recompute it via table schema + column coercion. Is it guaranteed to be the same?

  2. Should we use the old name (columnCoercion), or use the new name (partitionSchemaDifference or partitionSchemaOverride)? The former is more compatible with the Hive context and other existing enum/variables (e.g. CoercionPolicy). The latter seems to be more descriptive.

Contributor

FWIW, sd comes from Hive, where it is the abbreviation for StorageDescriptor.

Member Author

Yeah. That makes sense. However, I found that storage would be a better name for a variable of type Storage, so that it is not confused with the StorageDescriptor.

Contributor

What about calling the third parameter partitionSchemaOverride? -- basically it defines "column overrides" at partition level, right?

Member Author

Let me call it partitionSchemaDifference, to be consistent with the field name in the HiveSplit

Contributor

nit: is this just an inline? -- do we need this change?

Member Author

The Properties schema is no longer in the parameters. Now there's Storage instead.

Contributor

nit:

        int partitionDataColumnCount = partition.getPartition()
                .map(p -> p.getColumns().size())
                .orElse(table.getDataColumns().size());

Contributor

Not quite sure I understand this method.

Member Author

This is needed in the BackgroundHiveSplitLoader to avoid creating the whole schema, as it only needs the custom properties from the storage descriptor, with a possible override by table properties. The semantics are weird, but I'm just copying the existing behaviour.
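
A minimal sketch of the "storage properties with a possible override by table properties" behaviour mentioned above. The class `StoragePropertiesSketch`, the method `mergeProperties`, and both parameter names are hypothetical and only illustrate the precedence rule; the method in the PR may be shaped differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the method from the PR.
final class StoragePropertiesSketch
{
    private StoragePropertiesSketch() {}

    // Starts from the storage-level parameters and lets table-level
    // parameters win whenever both define the same key.
    static Map<String, String> mergeProperties(
            Map<String, String> storageParameters,
            Map<String, String> tableParameters)
    {
        Map<String, String> result = new HashMap<>(storageParameters);
        result.putAll(tableParameters);
        return result;
    }
}
```

Copying into a fresh map keeps both inputs untouched, which matters if they are shared across splits.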

Contributor

nit: What about naming this partitionSchemaOverride? Ditto for other places.

Member Author

Then it should be something like partitionColumnSchemaOverrides, as this doesn't override the whole partition schema, but only some columns. Thus I like partitionSchemaDifference more. I would like to keep it that way if you don't mind.

Contributor

I will let @rschlussel decide whether this variable should be named coercionFrom or coerceFrom 😃

Member Author

I didn't mean to rename it. Renamed it back.

Contributor

Curious: why do we need to pass Storage now? Can serdeParameter be a huge map?

Contributor

Also, what's supposed to be stored in schema when it's a Properties?

Member Author

> Curious: why do we need to pass Storage now? Can serdeParameter be a huge map?

This object contains partition-specific storage information. Unfortunately, if a partition has custom storage parameters, we have to pass them, as they are needed to initialize the input reader / serde.

Member Author

> Also, what's supposed to be stored in schema when it's a Properties?

Yeah, the Properties schema used to contain them before.

@wenleix wenleix assigned arhimondr and unassigned wenleix and rschlussel Sep 28, 2019
Generate table layout name outside of the HiveTableLayoutHandle constructor.
@arhimondr
Member Author

@wenleix

> The idea is that we don't need to have the partition schema in each HiveSplit; instead, we can recompute it via table schema + column coercion. Is it guaranteed to be the same?

Unless there's a bug it should be the same

> Should we use the old name (columnCoercion), or use the new name (partitionSchemaDifference or partitionSchemaOverride)? The former is more compatible with the Hive context and other existing enum/variables (e.g. CoercionPolicy). The latter seems to be more descriptive.

Since the partition schema has to be recreated now, the Map<Integer, HiveTypeName> columnCoercion is not enough, as it contains information only about the difference in types. To accurately recreate the partition schema, the difference in column names should also be tracked. That's why the Map<Integer, HiveTypeName> got replaced with the Map<Integer, Column>. And that's why the Map<Integer, Column> also contains the information about the extra columns present in the partition (the Map<Integer, HiveTypeName> didn't).

Then based on the partitionSchemaDifference the column mappings for coercion policy are created.
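
The reconstruction described above can be sketched roughly as follows. `SketchColumn` is a hypothetical stand-in for the metastore Column, and `reconstructPartitionColumns` with its explicit `partitionColumnCount` argument is an assumed shape, not the method from the PR. The sketch shows why a sparse index-to-column map is enough to rebuild renamed, retyped, and extra columns from the table schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the metastore Column (name and type only).
final class SketchColumn
{
    final String name;
    final String type;

    SketchColumn(String name, String type)
    {
        this.name = name;
        this.type = type;
    }

    @Override
    public String toString()
    {
        return name + ":" + type;
    }
}

final class PartitionSchemaSketch
{
    private PartitionSchemaSketch() {}

    // Rebuilds the partition's data columns from the table's data columns plus
    // a sparse map of differences keyed by column index. Indexes present in the
    // map replace the table column or add an extra partition-only column.
    static List<SketchColumn> reconstructPartitionColumns(
            List<SketchColumn> tableColumns,
            Map<Integer, SketchColumn> partitionSchemaDifference,
            int partitionColumnCount)
    {
        List<SketchColumn> result = new ArrayList<>();
        for (int i = 0; i < partitionColumnCount; i++) {
            SketchColumn difference = partitionSchemaDifference.get(i);
            if (difference != null) {
                result.add(difference);
            }
            else if (i < tableColumns.size()) {
                result.add(tableColumns.get(i));
            }
            else {
                throw new IllegalArgumentException("No column known for index " + i);
            }
        }
        return result;
    }
}
```

Only the differences travel with each split; the table columns are already known on the worker, so the per-split payload stays small when partitions match the table.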

Contributor

@wenleix wenleix left a comment

LGTM.

private static Properties getHiveSchema(
Storage sd,
List<Column> dataColumns,
public static Properties getHiveSchema(
Contributor

We might want to rename this method into something like getHiveTableParameters in a separate PR.

Member Author

That is actually not going to be precise, as the Properties getHiveSchema follows this weird semantic of replacing null values with empty strings. This method is trying to mimic the behaviour of the original getHiveSchema.
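
A small sketch of the null-to-empty-string semantic mentioned here. `NullToEmptySketch` and `toProperties` are hypothetical names used only for illustration. One likely reason for such a rule is that `java.util.Properties` extends `Hashtable`, which rejects null values, so a null parameter has to be mapped to something before it can be stored.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper for illustration; not the method from the PR.
final class NullToEmptySketch
{
    private NullToEmptySketch() {}

    // Copies parameters into a Properties object, storing "" wherever the
    // source map holds a null value.
    static Properties toProperties(Map<String, String> parameters)
    {
        Properties properties = new Properties();
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            String value = entry.getValue();
            properties.setProperty(entry.getKey(), value == null ? "" : value);
        }
        return properties;
    }
}
```

A plain map of parameters would keep the null as is, which is why a map-returning variant has to apply the same substitution to behave like the Properties-based method.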

@arhimondr arhimondr merged commit 1d61bc7 into prestodb:master Oct 1, 2019
@arhimondr arhimondr deleted the optimize_hive_split branch October 1, 2019 00:00