Optimize HiveSplit serialization#13453
Conversation
4020ac5 to
c1389a7
Compare
|
haven't read the PR yet, but this should get a release note since it fixes a performance issue. |
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveTableLayoutHandle.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitPartitionInfo.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
why don't you account for the size of hiveTypeName
There was a problem hiding this comment.
after in person conversation it's because the values are shared from a cache. So the plan is
- add a note explaining.
- get rid of dead code in HiveTypeName if HiveTypeName.getEstimatedSizeInBytes isn't used anywhere else
There was a problem hiding this comment.
The class has two filelds
private final HiveTypeName hiveTypeName;
private final TypeInfo typeInfo;
hiveTypeName is set from typeInfo
TypeInfo objects are always created by the TypeInfoFactory that has a static singleton cache. Thus it doesn't make much sense to account memory for these objects here, as those are shared.
There was a problem hiding this comment.
hiveTypeName is set from typeInfo
Sorry, that is not true. HiveTypeName is created based on TypeInfo. We still need to account for it. Let me fix it.
There was a problem hiding this comment.
Wow, does that mean Column is not counted for memory usage for years?
BTW: why the method name is not getEstimatedRetainedSizeInBytes?
There was a problem hiding this comment.
Wow, does that mean Column is not counted for memory usage for years?
In this PR the Map<Integer, HiveTypeName> columnCoercion is replaced with the Map<Integer, Column>, that's why the size of the Column has to be accounted.
BTW: why the method name is not getEstimatedRetainedSizeInBytes?
I believe the question is why it is not the getEstimatedSizeInBytes. I wanted the name to be kinda descriptive of the fact that the size of the TypeInfo is not accounted, as it is not "retained" by the Column object.
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
rschlussel
left a comment
There was a problem hiding this comment.
Feel free to merge after addressing the remaining comments
c1389a7 to
4559325
Compare
|
@rschlussel comments addressed |
wenleix
left a comment
There was a problem hiding this comment.
"Refactor HiveTableLayoutHandle" LGTM.
wenleix
left a comment
There was a problem hiding this comment.
I made an initial pass and generally looks good. I will made another pass next week.
I have two questions:
-
The idea is we don't need to have partition schema in each HiveSplit, instead, we can recompute it via table schema + column coercion. Is it guaranteed to be the same ?
-
Should we use the old name (
columnCoercion), or use the new name (partitionSchemaDifferenceorpartitionSchemaOverride). The former is more compatible with the Hive context and other existing enum/variables (e.g.CoercionPolicy). The latter seems to be more descriptive.
There was a problem hiding this comment.
Wow, does that mean Column is not counted for memory usage for years?
BTW: why the method name is not getEstimatedRetainedSizeInBytes?
There was a problem hiding this comment.
FWIW, sd comes from Hive which is the abbreviation for StorageDescriptor
There was a problem hiding this comment.
Yeah. That makes sense. However i found that storage would be a better name for the variable of type Storage, so it is not confused with the StorageDescriptor.
There was a problem hiding this comment.
What about calling the third parameter partitionSchmeaOverride? -- basically it defines "column overrides" at partition level right?
There was a problem hiding this comment.
Let me call it partitionSchemaDifference, to be consistent with the field name in the HiveSplit
There was a problem hiding this comment.
nit: is this just an inline? -- do we need this change?
There was a problem hiding this comment.
The Properties schema is no longer in the parameters. Now there's Storage instead.
There was a problem hiding this comment.
nit:
int partitionDataColumnCount = partition.getPartition()
.map(p -> p.getColumns().size())
.orElse(table.getDataColumns().size());There was a problem hiding this comment.
Not quite sure understand this method.
There was a problem hiding this comment.
This is needed in the BackgroundHiveSplitLoader to avoid creating the whole schema, as it only needs custom properties from the storage descriptor with a possible override by table properties. The semantic is weird, but I'm just copying the existing behaviour.
There was a problem hiding this comment.
nit: What about name this partitionSchmeaOverride ? Ditto for other places.
There was a problem hiding this comment.
Than it should be something like partitionColumnSchemaOverrides. As this doesn't override the whole partition schema, but only some columns. Thus i like the partitionSchemaDifference more. I would like to keep it that way if you don't mind.
There was a problem hiding this comment.
I will let @rschlussel decide whether this variable should be named coercionFrom or coerceFrom 😃
There was a problem hiding this comment.
I didn't mean to rename it. Renamed it back.
There was a problem hiding this comment.
Curious: why we need to pass Storage now? Can serdeParameter be a huge map?
There was a problem hiding this comment.
Also what's suppose to be stored in schema when it's a Properties
There was a problem hiding this comment.
Curious: why we need to pass Storage now? Can serdeParameter be a huge map?
This object contains partition specific storage information. Unfortunately if partition has some custom storage parameter - we should pass it, as those are needed to initialize the Input reader / serde.
There was a problem hiding this comment.
Also what's suppose to be stored in schema when it's a Properties
Yeah, the Properties schema used to contain them before.
Generate table layout name outside of the HiveTableLayoutHandle constructor.
Unless there's a bug it should be the same
Since the partition schema has to be recreated now, the Then based on the |
4559325 to
5c822fa
Compare
| private static Properties getHiveSchema( | ||
| Storage sd, | ||
| List<Column> dataColumns, | ||
| public static Properties getHiveSchema( |
There was a problem hiding this comment.
We might want to rename this method into something like getHiveTableParameters in a separate PR.
There was a problem hiding this comment.
This is actually not gonna be precise. As the Properties getHiveSchema follows this weird semantic of replacing null values with empty strings. This method is trying to mimic the behaviour of the original getHiveSchema
Serializing "schema" for every splits is very expensive. Instead of creating it on the coordinator it can be re-constructed on a worker.
If release note is NOT required, use: