Improve error handling for partial aggregation pushdown #22011
rschlussel merged 5 commits into prestodb:master
I'm still updating the tests. Don't review yet.
This is ready for review (failing tests are flaky/unrelated).
abhiseksaikia
left a comment
LGTM % minor nit and a question
presto-hive/src/main/java/com/facebook/presto/hive/orc/DwrfAggregatedPageSourceFactory.java
Question: I noticed that some parts of the aggregated page source factory have logic similar to that of the corresponding non-aggregated page source factory. Does it make sense to refactor this duplicated code, or is it better to leave it as is and avoid introducing more complexity?
vivek-bharathan
left a comment
Thanks for improving on the original implementation. Overall lgtm
nit: this function signature and the one below
private static boolean shouldSkipPartition(TypeManager typeManager, HiveTableLayoutHandle hiveLayout, DateTimeZone hiveStorageTimeZone, HiveSplit hiveSplit, SplitContext
Suggested change:
private static boolean shouldSkipPartition(TypeManager typeManager,
        HiveTableLayoutHandle hiveLayout,
        DateTimeZone hiveStorageTimeZone,
        HiveSplit hiveSplit,
        SplitContext splitContext)
The test needs to drop the views at the end.
Thanks for the review @abhiseksaikia and @ClarenceThreepwood. I've addressed your comments. I also split out the commits a bit, as requested by @ajaygeorge.
ajaygeorge
left a comment
Consolidate error handling for ParquetPageSourceFactory a8c2a38 looks good % a nit
ajaygeorge
left a comment
Remove unneeded error handling from page source factories f7fae20 looks good % some comments.
Curious: where does this check move to after the refactoring? I wasn't able to find it. Is it not needed anymore?
Tagged you where this check is moved to. Instead of adding error handling for every file format, we do it all in one place. That's why it's not needed here anymore.
presto-hive/src/main/java/com/facebook/presto/hive/HivePageSourceProvider.java
@ajaygeorge this is where the check is moved to. If our columns are aggregated, we try to create an aggregatedPageSource by looping through all the aggregatedPageSourceFactories and returning when we get an aggregated page source (it's a weird way to do things, but it's how the selective and batch page sources work too), but if the file format doesn't support it (i.e. we finish looping through without returning), then we throw an exception.
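As an illustration only, the loop-then-throw pattern described in this comment could look roughly like the sketch below. The interfaces, the string-based storage format, and the exception type are hypothetical stand-ins chosen to keep the example self-contained; they are not the actual Presto classes or signatures.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified stand-ins; these are NOT the real Presto interfaces.
final class AggregatedPageSourceSketch
{
    interface PageSource {}

    interface AggregatedPageSourceFactory
    {
        // Returns empty when this factory does not handle the given storage format.
        Optional<PageSource> createPageSource(String storageFormat);
    }

    static PageSource createAggregatedPageSource(List<AggregatedPageSourceFactory> factories, String storageFormat)
    {
        // Ask each factory in turn and return the first page source that is produced.
        for (AggregatedPageSourceFactory factory : factories) {
            Optional<PageSource> pageSource = factory.createPageSource(storageFormat);
            if (pageSource.isPresent()) {
                return pageSource.get();
            }
        }
        // No factory matched: fail once, here, rather than in every file format reader.
        // (The exception type is an assumption made for this sketch.)
        throw new UnsupportedOperationException(
                "Partial aggregation pushdown is not supported for storage format: " + storageFormat);
    }

    public static void main(String[] args)
    {
        AggregatedPageSourceFactory orcOnly =
                format -> format.equals("ORC") ? Optional.of(new PageSource() {}) : Optional.empty();
        List<AggregatedPageSourceFactory> factories = List.of(orcOnly);

        System.out.println(createAggregatedPageSource(factories, "ORC") != null);
        try {
            createAggregatedPageSource(factories, "TEXTFILE");
        }
        catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the pattern is that a new file format only needs to return an empty result when it cannot handle a split; the single throw at the end covers every format that has no aggregated page source support.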
ajaygeorge
left a comment
The rest of the commits look good. LGTM
nit. arguments on separate lines for readability.
Improve error handling for partial aggregation pushdown and prevent returning wrong results when footer stats should not be relied on. This covers the following cases:
1. Aggregations have been pushed down but partition file format does not support aggregation pushdown (can occur if table is declared with a supported storage format, but partition has a different storage format). Previously, page source providers for some file formats had special handling for this case, but not all.
2. Always throw an exception if aggregations have been pushed down but partition footer stats are unreliable. Previously, if filter pushdown was enabled (used OrcSelectivePageSourceFactory), we wouldn't create an AggregatedPageSource, so you would get an error somewhere on read. If it was disabled (OrcBatchPageSourceFactory), we would create an AggregatedPageSource and the query would silently give wrong results.
3. Unexpected state where some but not all columns are of AGGREGATED type.

Error handling is still going to be reader dependent if both the table and partition format support partial aggregation pushdown, but the partition format does not support as many types (e.g. parquet vs. orc).
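A rough sketch of how the "mixed column types" and "unreliable footer stats" cases from this commit message might be checked in one place. Everything here is an assumption made for illustration: the enum, the method name, the order of the checks, and the exception type are hypothetical and do not come from the Presto code base.

```java
import java.util.List;

// Hypothetical sketch; names, check order, and exception types are assumptions.
final class AggregationPushdownChecks
{
    enum ColumnType { REGULAR, PARTITION_KEY, AGGREGATED }

    // Returns true when an aggregated page source should be used,
    // false when the split should go through the regular read path.
    static boolean shouldUseAggregatedPageSource(List<ColumnType> columns, boolean footerStatsUnreliable)
    {
        long aggregated = columns.stream()
                .filter(type -> type == ColumnType.AGGREGATED)
                .count();
        if (aggregated == 0) {
            return false;
        }
        // Some but not all columns are AGGREGATED: unexpected state.
        if (aggregated != columns.size()) {
            throw new IllegalStateException("Unexpected mix of AGGREGATED and non-AGGREGATED columns");
        }
        // Footer stats cannot be trusted: fail loudly instead of returning wrong results.
        if (footerStatsUnreliable) {
            throw new IllegalStateException(
                    "Partial aggregation pushdown cannot be used when footer stats are unreliable");
        }
        return true;
    }

    public static void main(String[] args)
    {
        System.out.println(shouldUseAggregatedPageSource(List.of(ColumnType.AGGREGATED), false));
        System.out.println(shouldUseAggregatedPageSource(List.of(ColumnType.REGULAR), false));
    }
}
```

The remaining case, a partition format without any aggregation pushdown support, would be caught separately when no factory produces an aggregated page source.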
Remove error handling for aggregated columns from individual page source factories, as these errors are now handled in a consolidated place. This commit is separate from the main commit that consolidated the error handling for easier review.
Create a utility method so we can share the error handling code between aggregated and batch page source factories.
        DwrfEncryptionProvider dwrfEncryptionProvider,
        boolean appendRowNumberEnabled)
{
    OrcDataSource orcDataSource = getOrcDataSource(session, fileSplit, hdfsEnvironment, configuration, hiveFileContext, stats);
@rschlussel this is a resource leak because we don't close the orcDataSource in the happy case
oh good catch. Thank you!
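For readers unfamiliar with this class of bug, here is a minimal sketch of one way to make sure a data source is closed on the success path as well as on failure. It assumes the helper only inspects the file and does not hand the data source to a reader that takes ownership of it; `FakeDataSource` and `readRowCount` are hypothetical names, not the actual fix applied in this PR.

```java
import java.io.Closeable;

// Hypothetical sketch; not the actual fix from this PR.
final class DataSourceCloseSketch
{
    // Stand-in for a real data source; records whether close() ran.
    static final class FakeDataSource implements Closeable
    {
        boolean closed;

        long readFooterRowCount()
        {
            return 42;
        }

        @Override
        public void close()
        {
            closed = true;
        }
    }

    static long readRowCount(FakeDataSource dataSource)
    {
        // try-with-resources closes the source whether the body returns normally or throws.
        try (FakeDataSource source = dataSource) {
            return source.readFooterRowCount();
        }
    }

    public static void main(String[] args)
    {
        FakeDataSource dataSource = new FakeDataSource();
        System.out.println(readRowCount(dataSource));
        System.out.println(dataSource.closed);
    }
}
```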
Description
Previously, AggregatedPageSources (which support the execution side of partial aggregation pushdown) were created from within the selective and batch page source factories of supported file formats. Similarly, error handling for unsupported file formats had to be repeated in the PageSourceFactory of each unsupported format. This resulted in a fragmented implementation in which some unsupported file formats lacked proper error handling.
Additionally, partial aggregation pushdown cannot be used when footer stats are unreliable. However, handling for this was only added to one of the supported file format factories (OrcSelectivePageSourceFactory), while others (the ORC and Parquet batch factories) could silently return wrong results. Furthermore, the handling in OrcSelectivePageSourceFactory prevented wrong results by not creating an aggregated page source, but it didn't produce a clear error message because it went on to try to create a selective page source.
This PR makes HiveAggregatedPageSourceFactories into a top-level concept similar to HiveSelectivePageSourceFactories and HiveBatchPageSourceFactories so that we can unify all the error handling and prevent bugs from creeping in as new file format page source factories are added.
The main logic of the change is in HivePageSourceProvider. A lot of the rest of it is scaffolding to support that.
Motivation and Context
This gap was discovered as part of an audit to make sure we were not assuming that partition file formats will always match table file formats.
Impact
Fix a potential wrong results bug when footer stats are marked as unreliable and aggregation pushdown is enabled. Ensure all file formats that don't support aggregation pushdown will return a clear error to the user.
Test Plan
new unit tests for HivePageSourceProvider
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.