Disable S3Select pushdown when query not filters data by fuatbasik · Pull Request #13477 · trinodb/trino

fuatbasik · 2022-08-03T08:56:41Z

Enabling S3 Select pushdown for filtering improves query performance by reducing the data on the wire. It is most effective when the queries pushed down to Select filter out some portion of the data residing on S3. This commit disables Select pushdown when the query has no predicate or projection. To retrieve an entire object, S3 GetObject is a cheaper option.

Description

Enabling S3 Select pushdown for filtering improves query performance by reducing the data on the wire. It is most effective when the queries pushed down to Select filter out a significant portion of the data residing on S3. This commit disables Select pushdown when the query has no predicate or projection. In these cases, using S3 GetObject is both cheaper and faster.
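
The rule described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class and method names are hypothetical, it compares column names only (the reviewed snippets also compare column types), and it treats the projected columns as a set, mirroring the Set-based comparison visible in the review threads.

```java
import java.util.Set;

// Hypothetical, simplified model of the decision; these are not Trino classes.
final class SelectPushdownDecision
{
    private SelectPushdownDecision() {}

    // hasPredicate: the query filters rows (its predicate is not "all rows").
    // projectedColumns: columns the query reads.
    // tableColumns: every column declared in the table schema.
    static boolean shouldUseS3Select(
            boolean pushdownEnabled,
            boolean hasPredicate,
            Set<String> projectedColumns,
            Set<String> tableColumns)
    {
        if (!pushdownEnabled) {
            return false;
        }
        // Reading every column of the table means no projection is needed.
        boolean hasProjection = !projectedColumns.equals(tableColumns);
        // Without a predicate or a projection, Select would return the whole
        // object, so a plain GetObject is the cheaper choice.
        return hasPredicate || hasProjection;
    }

    public static void main(String[] args)
    {
        Set<String> table = Set.of("quantity", "author", "article");
        System.out.println(shouldUseS3Select(true, false, table, table)); // false: full scan
        System.out.println(shouldUseS3Select(true, true, table, table)); // true: filtered
        System.out.println(shouldUseS3Select(true, false, Set.of("author"), table)); // true: projected
    }
}
```

Because the comparison is set-based, reading all columns in a different order still counts as a full scan in this sketch.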

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

To Trino-Hive connector

How would you describe this change to a non-technical end user or system administrator?

This commit disables the use of S3 Select when it is not going to improve performance.

Related issues, pull requests, and links

Documentation

(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(X) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

cla-bot · 2022-08-03T08:56:43Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Fuat Basik.
This is most likely caused by a git client misconfiguration; please make sure to:

Check if your git client is configured with an email to sign commits: git config --list | grep email
If not, set it up using: git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

preethiratnam · 2022-08-03T09:21:11Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+        return isEquivalentColumns(projectedColumnNames, schema) && isEquivalentColumnTypes(projectedColumnTypes, schema);
+    }
+
+    private boolean isEquivalentColumns(Set<String> projectedColumnNames, Properties schema)


Minor: do you think these utility methods can be static?

Sure, I will fix this in the next revision.

dnanuti · 2022-08-03T13:05:55Z

I would like to call out that our Docker tests rely on retrieving the entire table content, so they push non-filtering queries down to Select. We need to add a mechanism for these tests to pass filtering queries before merging this PR.

fuatbasik · 2022-08-03T13:27:31Z

You mean these tests, right (https://github.com/trinodb/trino/blob/master/plugin/trino-hive-hadoop2/src/test/java/io/trino/plugin/hive/s3select/TestHiveFileSystemS3SelectPushdown.java)? You are right, and it is a really good catch. With this change, none of these tests will be pushed down to Select.

Let me create a new test utility method, ScanAndFilterTable, that accepts a Table and ColumnHandles and uses Select to return only the relevant data. Next, I can add tests in the aforementioned class that use this method instead of the readTable method. This way, we can also check the correctness and completeness of the Select filtering, which should be a good addition to test coverage.

dnanuti · 2022-08-03T14:28:42Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+        }
+        else {
+            final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ",");
+            columnNames = new HashSet<>(asList(columnNameProperty.split(columnNameDelimiter)));


This can be simplified to Set.of(columnNameProperty.split(columnNameDelimiter))

Thanks. Fixed this.
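
A minimal sketch of the suggested simplification, using only the JDK. The class and method names are hypothetical. Two behaviors are worth keeping in mind: "".split(",") returns a one-element array containing the empty string, which is why the empty property is handled separately, and Set.of throws IllegalArgumentException on duplicate elements, which new HashSet<>(asList(...)) would silently collapse. The sketch also quotes the delimiter because String.split interprets its argument as a regular expression; the reviewed snippet passes the delimiter unquoted.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper; not the code from this PR.
final class ColumnNameParsing
{
    private ColumnNameParsing() {}

    static Set<String> parseColumnNames(String columnNameProperty, String delimiter)
    {
        if (columnNameProperty.isEmpty()) {
            // Immutable empty set, as suggested in the review.
            return Set.of();
        }
        // Pattern.quote keeps delimiters such as "|" or "." from being
        // interpreted as regex metacharacters by String.split.
        return Set.of(columnNameProperty.split(Pattern.quote(delimiter)));
    }

    public static void main(String[] args)
    {
        System.out.println(parseColumnNames("quantity,author,article", ",").size()); // 3
        System.out.println(parseColumnNames("", ",").isEmpty()); // true
    }
}
```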

dnanuti · 2022-08-03T14:29:10Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+        Set<String> columnNames;
+        String columnNameProperty = schema.getProperty(LIST_COLUMNS);
+        if (columnNameProperty.length() == 0) {
+            columnNames = new HashSet<>();


Should we use Collections.emptySet()?

Same for the other occurrences - lines: 165, 179

Thanks. Fixed this.

arhimondr · 2022-08-08T21:13:30Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+            columnNames = new HashSet<>();
+        }
+        else {
+            final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ",");


nit: drop final

Thanks. Fixed this.

arhimondr · 2022-08-08T21:13:48Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+        Set<String> columnNames;
+        String columnNameProperty = schema.getProperty(LIST_COLUMNS);
+        if (columnNameProperty.length() == 0) {
+            columnNames = new HashSet<>();


Prefer Set.of or ImmutableSet.of

Thanks. Fixed this.

arhimondr · 2022-08-08T21:14:11Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+        String columnTypeProperty = schema.getProperty(LIST_COLUMN_TYPES);
+        Set<String> columnTypes;
+        if (columnTypeProperty.length() == 0) {
+            columnTypes = new HashSet<>();


Same comments here

Thanks. Fixed this.

arhimondr · 2022-08-08T21:14:33Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+    private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper)
+    {
+        if (readerColumns == null || readerColumns.isEmpty()) {
+            return new HashSet<>();


Thanks. Fixed this.

arhimondr · 2022-08-08T21:15:17Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+
+    private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper)
+    {
+        if (readerColumns == null || readerColumns.isEmpty()) {


Can readerColumns ever be null?

It cannot be null, but it can be empty. Dropping the null check.

arhimondr · 2022-08-08T21:15:31Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

+        }
+        return readerColumns.stream()
+                .map(mapper)
+                .collect(Collectors.toSet());


toImmutableSet
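
To illustrate the outcome of the last few threads (no null check, immutable results), here is a sketch that uses the JDK's Collectors.toUnmodifiableSet() as a stand-in for Guava's toImmutableSet, so that it runs without extra dependencies. The Column class is a hypothetical placeholder for HiveColumnHandle, and the enclosing class name is made up for this example.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; Column stands in for HiveColumnHandle.
final class ColumnProperties
{
    private ColumnProperties() {}

    static final class Column
    {
        final String name;
        final String type;

        Column(String name, String type)
        {
            this.name = name;
            this.type = type;
        }
    }

    static Set<String> getColumnProperty(List<Column> readerColumns, Function<Column, String> mapper)
    {
        // readerColumns is assumed to be non-null, but it may be empty.
        if (readerColumns.isEmpty()) {
            return Set.of();
        }
        // JDK stand-in for Guava's toImmutableSet(): the result cannot be
        // modified, and duplicate mapped values collapse into one element.
        return readerColumns.stream()
                .map(mapper)
                .collect(Collectors.toUnmodifiableSet());
    }

    public static void main(String[] args)
    {
        List<Column> columns = List.of(
                new Column("quantity", "int"),
                new Column("author", "string"),
                new Column("article", "string"));
        System.out.println(getColumnProperty(columns, column -> column.type).size()); // 2
    }
}
```

Unlike Set.of, the collector tolerates duplicates, which matters for column types because several columns often share one type.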

arhimondr · 2022-08-08T21:16:35Z

...trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestS3SelectRecordCursorProvider.java

+    public void shouldReturnSelectRecordCursor()
+    {
+        List<HiveColumnHandle> columnHandleList = new ArrayList<>();
+        s3SelectPushdownEnabled = true;


This is not thread safe (multiple tests can run in parallel)

I made s3SelectPushdownEnabled a method-local variable for the tests.

arhimondr · 2022-08-08T21:17:49Z

...trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestS3SelectRecordCursorProvider.java

+
+public class TestS3SelectRecordCursorProvider
+{
+    private static final Configuration CONFIGURATION = ConfigurationInstantiator.newEmptyConfiguration();


I would recommend inlining all mutable fields for thread safety (CONFIGURATION, HDFS_ENVIRONMENT, S3_SELECT_RECORD_CURSOR_PROVIDER, SCHEMA)

I am not sure I understood this comment. Should I re-create these objects in each test method?

Yeah. Those objects should be cheap to create and it would make the tests thread safe allowing multiple tests to be executed in parallel.
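
One way to read this suggestion, sketched with plain JDK types: replace shared static fixtures with a small factory method that each test calls, so no mutable object is visible to two tests at once. The class and method names are hypothetical, and the property keys "columns" and "columns.types" are assumed here to be the values behind LIST_COLUMNS and LIST_COLUMN_TYPES.

```java
import java.util.Properties;

// Hypothetical sketch of per-test fixture creation; not the test class from this PR.
final class PerTestFixtures
{
    private PerTestFixtures() {}

    // Builds a fresh schema on every call, so a test that mutates its copy
    // cannot affect another test running in parallel.
    static Properties createTestingSchema(String columnNames, String columnTypes)
    {
        Properties schema = new Properties();
        schema.setProperty("columns", columnNames); // assumed key for LIST_COLUMNS
        schema.setProperty("columns.types", columnTypes); // assumed key for LIST_COLUMN_TYPES
        return schema;
    }

    public static void main(String[] args)
    {
        Properties first = createTestingSchema("quantity,author", "int,string");
        Properties second = createTestingSchema("quantity,author", "int,string");
        first.setProperty("columns", "article");
        System.out.println(second.getProperty("columns")); // quantity,author
    }
}
```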

arhimondr · 2022-08-08T21:18:27Z

...trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestS3SelectRecordCursorProvider.java

+
+    private static String buildPropertyFromColumns(List<HiveColumnHandle> columns, Function<HiveColumnHandle, String> mapper)
+    {
+        if (columns == null || columns.isEmpty()) {


can columns ever be null?

It shouldn't be, so I am dropping the null check. It can be empty, though.

arhimondr · 2022-08-08T21:18:53Z

...trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestS3SelectRecordCursorProvider.java

+    {
+        s3SelectPushdownEnabled = true;
+        TupleDomain<HiveColumnHandle> effectivePredicate = TupleDomain.all();
+        final List<HiveColumnHandle> readerColumns = ImmutableList.of(QUANTITY_COLUMN, AUTHOR_COLUMN, ARTICLE_COLUMN);


nit: drop final (here and in shouldNotReturnSelectRecordCursorWhenProjectionOrderIsDifferent)

Thanks. Fixing it in the second revision.

arhimondr

LGTM % remaining comments

arhimondr · 2022-08-10T19:55:25Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java

nit: we usually prefer the toImmutableList collector from Guava

Thanks, updating in the next commit.

arhimondr · 2022-08-10T19:56:57Z

...trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestS3SelectRecordCursorProvider.java

nit: I would probably simply inline it

Thanks, I have inlined the parameter.

arhimondr · 2022-08-10T19:57:33Z

...trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestS3SelectRecordCursorProvider.java

nit: I would recommend putting each argument on a separate line (here and in other places)

Done. I encapsulated the createRecordCursor call in a method, and there I put each argument on a separate line.

colebow · 2022-08-15T22:56:16Z

If this improves performance/reduces cost, it would be good to include it in the release notes. Could you please propose a potential release note?

cc @arhimondr

fuatbasik · 2022-08-17T14:12:40Z

Potential release note: Improve efficiency for queries over tables in CSV and JSON formats stored on S3 when no filtering or projection is needed by automatically disabling S3 Select pushdown.

findepi requested a review from arhimondr August 3, 2022 09:05

preethiratnam approved these changes Aug 3, 2022

fuatbasik requested a review from preethiratnam August 3, 2022 09:25

github-actions bot added the tests:hive label Aug 3, 2022

preethiratnam approved these changes Aug 3, 2022

dnanuti reviewed Aug 3, 2022

fuatbasik mentioned this pull request Aug 5, 2022

Enabled hive splits for uncompressed CSV files with S3 Select pushdown #13417

Merged

arhimondr reviewed Aug 8, 2022

cla-bot bot added the cla-signed label Aug 10, 2022

fuatbasik force-pushed the select-pushdown-optimisation branch 2 times, most recently from a1b0841 to 49aa6a4 Compare August 10, 2022 13:40

fuatbasik requested review from arhimondr, dnanuti and preethiratnam August 10, 2022 13:44

arhimondr reviewed Aug 10, 2022

fuatbasik force-pushed the select-pushdown-optimisation branch from 49aa6a4 to e11bfb2 Compare August 12, 2022 12:16

fuatbasik requested a review from arhimondr August 12, 2022 12:17

arhimondr approved these changes Aug 12, 2022

dnanuti requested a review from arhimondr August 12, 2022 13:40

fuatbasik removed request for dnanuti and preethiratnam August 12, 2022 13:44

dnanuti approved these changes Aug 12, 2022

fuatbasik force-pushed the select-pushdown-optimisation branch from e11bfb2 to bc8e2fc Compare August 15, 2022 15:45

arhimondr merged commit a4b3cd7 into trinodb:master Aug 15, 2022

github-actions bot added this to the 393 milestone Aug 15, 2022

colebow mentioned this pull request Aug 15, 2022

Add Trino 393 release notes #13519

Merged

dnanuti mentioned this pull request Nov 15, 2022

Hive Connector with Amazon S3 documentation updates #15035

Merged
