Benchmark for Hive Projection Pushdown #2307
Conversation
How does Hive projection pushdown relate to lazy nested blocks? From a high-level perspective, they both seem to do the same thing.
Optimizer rule application:
i. PushdownDereferences rule (in the plan, #1435)
ii. PushProjectionIntoTableScan rule (creates virtual column handle for …)
iii. PredicatePushdown optimizer
iv. PushPredicateIntoTableScan

The optimization around skipping a stripe/file has to be done with the help of pushdowns during planning, and is not something that can be achieved by loading blocks lazily. Another approach could be predicate pushdowns by supporting dereference projections in …
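As a rough illustration of the idea behind a rule like PushProjectionIntoTableScan: a dereference expression such as `a.b` can be modeled as a base column plus a field path, so the connector can read just the projected subfield. This is a sketch with made-up types (`ProjectedColumn`, `parse`), not Presto's actual handle classes:

```java
// Hypothetical sketch (not Presto's real classes) of a "virtual column":
// a dereference like a.b.c becomes base column "a" plus the path [b, c].
import java.util.List;

public class DereferenceSketch
{
    // Stand-in for a connector's projected column handle
    record ProjectedColumn(String baseColumn, List<String> path) {}

    static ProjectedColumn parse(String dottedName)
    {
        String[] parts = dottedName.split("\\.");
        return new ProjectedColumn(parts[0], List.of(parts).subList(1, parts.length));
    }

    public static void main(String[] args)
    {
        // a.b.c -> base column "a", dereference path [b, c]
        System.out.println(parse("a.b.c"));
    }
}
```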
@phd3 Thank you for the thorough explanation. Indeed, the ability to push down predicates for dereferenced expressions is really valuable.
there are build failures
0f877d4 to 7e06391
@sopel39 the failures seem to be unrelated.
7e06391 to 5af2d5d
5af2d5d to 5d8bd0a
@sopel39 thanks for the review, addressed your comments.
```java
{
    private BenchmarkHiveFileFormatUtil() {}

    protected static class TestData
```
make the class top level and call it BenchmarkHiveFileTestData. Also, why does it have to be a top-level class?
Since we're using TestData in multiple independent classes, I thought it might be better to extract it out. I hadn't thought about making it top-level, but it makes sense with the name you suggested, rather than putting it in a random util class.
```diff
     }

-    private static ConnectorPageSource createPageSource(
+    private static ConnectorPageSource createPageSourceForBaseColumns(
```
What are baseColumns? Undo this rename?
I had added ForBaseColumns to indicate that the column names provided to this method will always correspond to top-level (or base) columns, not projected ones. For example, createFileFormatReader should not be given a projected column (like "a.b"), since the createPageSource method is only capable of reading "a", and will leave it up to the hive page source wrapper to extract the field. createGenericReader uses the column handles constructed for such nested fields.
However, this method doesn't have a way to provide projected handles, so undoing the rename.
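To illustrate the split described here (with stand-in types and data, not the connector's real API, which works on pages/blocks): a base-column reader can only produce the whole struct `a`, and a wrapper then dereferences the requested field, roughly:

```java
// Toy model of reading "a.b" via a base-column reader plus a wrapper.
// The struct contents and method names here are hypothetical.
import java.util.Map;

public class BaseColumnReadSketch
{
    // Stand-in for a format reader that only understands base columns:
    // asking for "a" yields the whole struct value.
    static Map<String, Object> readBaseColumn(String name)
    {
        // pretend the file stores a := row(b: 1, c: 2)
        return Map.of("b", 1, "c", 2);
    }

    // Stand-in for the wrapper that extracts the projected field
    static Object readProjected(String baseColumn, String field)
    {
        return readBaseColumn(baseColumn).get(field);
    }

    public static void main(String[] args)
    {
        System.out.println(readProjected("a", "b"));
    }
}
```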
```java
        return Optional.empty();
    }

    public final ConnectorPageSource createFileFormatReader(
```
Actually, do we need this method if we have createGenericReader? Would createGenericReader create a page source differently than this method?
createHiveReader returns a format-specific reader with the hive connector's wrapper page source.
HivePageSourceProvider should first try to instantiate the native reader.
> HivePageSourceProvider should first try to instantiate native reader
Currently HivePageSourceProvider::createPageSource does that implicitly, when provided with the factories. So simulating that flow requires providing a factory, without instantiating the reader. Not sure if I understand your point correctly.
> Would createGenericReader create a page source differently than this method?
The method is the same; the page source returned by createGenericReader may incur some overhead in addition to the processing done by the delegate readers returned by createFileFormatReader. I kept createFileFormatReader thinking that the existing benchmarks testing the file-format specific readers would want numbers without the HivePageSource's overhead.
```java
@Measurement(iterations = 50)
@Warmup(iterations = 20)
@Fork(3)
public class BenchmarkProjectionPushdownHive
```
Are there some parts of the code we could unify with BenchmarkHiveFileFormat?
Hmm, I thought about this, but I feel that keeping the two classes separate would be cleaner. BenchmarkHiveFileFormat is more about looking at reader performance across different types/compressions/formats, whereas BenchmarkProjectionPushdownHive documents the performance of the adaptations.
Code-wise, I don't think there is much duplication between the two classes.
5d8bd0a to 8a53d6f
```diff
@@ -504,44 +504,6 @@ private static <E extends TpchEntity> TestData createTpchDataSet(FileFormat form
        return new TestData(columnNames, columnTypes, pages.build());
```
Change the commit title. It now extracts TestData as a top-level class.
small comment + CI fails
8a53d6f to 0024e49
merged, thanks!

This PR adds benchmarks for the Hive Projection pushdown changes in #1720. The change in #1720 enables reading only the projected subfields for row type columns. cc @martint
This benchmark can be run to validate the dereference projection pushdown performance for different file formats, as they are implemented. I think the following comparisons can help us be confident about the anticipated benefits of projection pushdown.
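One way to sanity-check the anticipated benefit: if decoding cost is roughly proportional to the bytes materialized, projecting one of `N` equally sized fields should cut the work by about `N`. A back-of-the-envelope sketch with toy numbers, not a real reader:

```java
// Toy cost model for projection pushdown: without pushdown the reader decodes
// every field of the struct column; with pushdown it decodes only the one
// projected field. Field count and sizes are made-up parameters.
public class PushdownSpeedupSketch
{
    static long speedup(int fieldCount, int bytesPerField)
    {
        long withoutPushdown = (long) fieldCount * bytesPerField; // decode all of "a"
        long withPushdown = bytesPerField;                        // decode only "a.b"
        return withoutPushdown / withPushdown;
    }

    public static void main(String[] args)
    {
        System.out.println("expected speedup ~ " + speedup(10, 100) + "x");
    }
}
```

This ignores per-page fixed costs, which is why the real benchmark below is needed to confirm the effect per file format.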
If column `a` is of row type containing `N` `varchar` fields, then accessing the virtual column for "a.b" should be ~N times faster than accessing column `a`. With a nested row type, the throughput increase should be even more drastic.

Benchmark Setup and parameters
The benchmark writes some data in the setup phase and then measures the throughput of operations while reading pages. Here's some explanation of the parameters used in the benchmark.
- `columnTypeString`: type of `STRUCT` column to write.
- `writeStrategy`: decides the writing schema
  - `STRUCT`: the schema is a single column with the `columnTypeString` type
  - `TOP_LEVEL`: the schema consists of columns with types given by the fields of `columnTypeString`.
- `readStrategy`: decides whether we project columns or not. No difference in case of the `TOP_LEVEL` write strategy
  - `WITH_PUSHDOWN`: read projected columns
  - `WITHOUT_PUSHDOWN`: read the struct column
- `readColumnCount`: number of columns/fields to read

Some benchmark Results
Say `a` represents the struct, `a.x` represents the primitive field, and `b` represents the primitive top-level column.

ORC
Parquet