Disable S3Select pushdown when query not filters data#13477
Disable S3Select pushdown when query not filters data#13477arhimondr merged 1 commit intotrinodb:masterfrom
Conversation
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Fuat Basik.
|
| return isEquivalentColumns(projectedColumnNames, schema) && isEquivalentColumnTypes(projectedColumnTypes, schema); | ||
| } | ||
|
|
||
| private boolean isEquivalentColumns(Set<String> projectedColumnNames, Properties schema) |
There was a problem hiding this comment.
Minor: do you think these utility methods can be static?
There was a problem hiding this comment.
Sure, i will fix this in the next revision
|
I would like to call-out that our Docker tests are relying on retrieving the entire table content, so pushing down non-filtering queries to Select. We need to add a mechanism to pass filtering queries before merging this PR. |
|
You mean these tests right (https://github.com/trinodb/trino/blob/master/plugin/trino-hive-hadoop2/src/test/java/io/trino/plugin/hive/s3select/TestHiveFileSystemS3SelectPushdown.java)? You are right and it is a really good catch. With this change, none of these tests will be pushed down to Select. Let me create a new test utility method, ScanAndFilterTable that accepts Table and ColumnHandles, and uses Select to return only the relevant data. Next, i can add tests in the aforementioned class that uses this method, instead of readTable method. This way, we can check correctness and completeness of the Select Filtering too, which should be a good addition to test coverage. |
| } | ||
| else { | ||
| final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ","); | ||
| columnNames = new HashSet<>(asList(columnNameProperty.split(columnNameDelimiter))); |
There was a problem hiding this comment.
This can be simplified to Set.of(columnNameProperty.split(columnNameDelimiter))
There was a problem hiding this comment.
Thanks. Fixed this.
| Set<String> columnNames; | ||
| String columnNameProperty = schema.getProperty(LIST_COLUMNS); | ||
| if (columnNameProperty.length() == 0) { | ||
| columnNames = new HashSet<>(); |
There was a problem hiding this comment.
Should we use Collections.emptySet()?
There was a problem hiding this comment.
Same for the other occurrences - lines: 165, 179
There was a problem hiding this comment.
Thanks. Fixed this.
| columnNames = new HashSet<>(); | ||
| } | ||
| else { | ||
| final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ","); |
There was a problem hiding this comment.
Thanks. Fixed this.
| Set<String> columnNames; | ||
| String columnNameProperty = schema.getProperty(LIST_COLUMNS); | ||
| if (columnNameProperty.length() == 0) { | ||
| columnNames = new HashSet<>(); |
There was a problem hiding this comment.
Prefer Set.of or ImmutableSet.of
There was a problem hiding this comment.
Thanks. Fixed this.
| String columnTypeProperty = schema.getProperty(LIST_COLUMN_TYPES); | ||
| Set<String> columnTypes; | ||
| if (columnTypeProperty.length() == 0) { | ||
| columnTypes = new HashSet<>(); |
There was a problem hiding this comment.
Thanks. Fixed this.
| private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper) | ||
| { | ||
| if (readerColumns == null || readerColumns.isEmpty()) { | ||
| return new HashSet<>(); |
There was a problem hiding this comment.
Thanks. Fixed this.
|
|
||
| private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper) | ||
| { | ||
| if (readerColumns == null || readerColumns.isEmpty()) { |
There was a problem hiding this comment.
Can readerColumns ever be null?
There was a problem hiding this comment.
cannot be null but can be empty. Dropping the null check.
| } | ||
| return readerColumns.stream() | ||
| .map(mapper) | ||
| .collect(Collectors.toSet()); |
| public void shouldReturnSelectRecordCursor() | ||
| { | ||
| List<HiveColumnHandle> columnHandleList = new ArrayList<>(); | ||
| s3SelectPushdownEnabled = true; |
There was a problem hiding this comment.
This is not thread safe (multiple tests can run in parallel)
There was a problem hiding this comment.
I made s3SelectPushdownEnabled a method-local variable for the tests.
|
|
||
| public class TestS3SelectRecordCursorProvider | ||
| { | ||
| private static final Configuration CONFIGURATION = ConfigurationInstantiator.newEmptyConfiguration(); |
There was a problem hiding this comment.
I would recommend inlining all mutable fields for thread safety (CONFIGURATION, HDFS_ENVIRONMENT, S3_SELECT_RECORD_CURSOR_PROVIDER, SCHEMA)
There was a problem hiding this comment.
I am not sure i understood this comment. Should I re-create these objects in each Test method?
There was a problem hiding this comment.
I am not sure i understood this comment. Should I re-create these objects in each Test method?
Yeah. Those objects should be cheap to create and it would make the tests thread safe allowing multiple tests to be executed in parallel.
|
|
||
| private static String buildPropertyFromColumns(List<HiveColumnHandle> columns, Function<HiveColumnHandle, String> mapper) | ||
| { | ||
| if (columns == null || columns.isEmpty()) { |
There was a problem hiding this comment.
can columns ever be null?
There was a problem hiding this comment.
it shouldn't be, i am dropping the null check. But they can be empty.
| { | ||
| s3SelectPushdownEnabled = true; | ||
| TupleDomain<HiveColumnHandle> effectivePredicate = TupleDomain.all(); | ||
| final List<HiveColumnHandle> readerColumns = ImmutableList.of(QUANTITY_COLUMN, AUTHOR_COLUMN, ARTICLE_COLUMN); |
There was a problem hiding this comment.
nit: drop final (here an in shouldNotReturnSelectRecordCursorWhenProjectionOrderIsDifferent)
There was a problem hiding this comment.
Thanks. Fixing it in second revision.
a1b0841 to
49aa6a4
Compare
arhimondr
left a comment
There was a problem hiding this comment.
LGTM % remaining comments
There was a problem hiding this comment.
nit: we usually prefer the toImmutableList collector from Guava
There was a problem hiding this comment.
Thanks, updating in the next commit.
There was a problem hiding this comment.
nit: I would probably simply inline it
There was a problem hiding this comment.
Thanks, I have inlined the parameter.
There was a problem hiding this comment.
nit: I would recommend putting each argument on a separate line (here and in other places)
There was a problem hiding this comment.
Done, i encapsulated the createRecordCursor in a method and there I am putting each argument on a separate line.
49aa6a4 to
e11bfb2
Compare
Enabling S3 Select pushdown for filtering improves performance of queries by reducing the data on the wire. It is most effective when queries pushed down to Select is filtering out significant portion of the data residing on S3. This commits disables Select pushdown when query has no predicate, or projection. In these cases using GET is both cheaper and faster.
e11bfb2 to
bc8e2fc
Compare
|
If this improves performance/reduces cost, it would be good to include it in the release notes. Could you please propose a potential release note? cc @arhimondr |
|
Potential release note: Improve efficiency for queries over tables in CSV and JSON formats stored on S3 when no filtering or projection is needed by automatically disabling S3 Select pushdown. |
Enabling S3 Select pushdown for filtering improves performance of queries by reducing the data on the wire. It is most effective when queries pushed down to Select is filtering out some portion of the data residing on S3. This commits disables Select pushdown when query has no predicate, or projection. To retrieve entire object, S3 GetObject is a cheaper option.
Description
Enabling S3 Select pushdown for filtering improves performance of queries by reducing the data on the wire. It is most effective when queries pushed down to Select is filtering out significant portion of the data residing on S3. This commit disables Select pushdown when query has no predicate, or projection. In these cases using S3 GetObject is both cheaper and faster.
Improvement
To Trino-Hive connector
This commit disables use of S3 Select, when it is not going to improve the performance.
Related issues, pull requests, and links
Documentation
(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(X) No release notes entries required.
( ) Release notes entries required with the following suggested text: