Hive Connector with Amazon S3 documentation updates #15035

Merged
arhimondr merged 1 commit into trinodb:master from dnanuti:master on Nov 22, 2022

Member

@dnanuti dnanuti commented Nov 15, 2022

Description

Documentation updates following up on changes to the Hive connector with Amazon S3: the fix for S3 Select pushdown for uncompressed files, the addition of JSON support to Amazon S3 Select, and the use of S3 Select scan range requests.
Relevant PRs: 12633, 13354, 13477, 13754, 14040

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Nov 15, 2022
@github-actions github-actions bot added the docs label Nov 15, 2022
@arhimondr
Contributor

@jhlodin Could you please take a look?

Contributor

@jhlodin jhlodin left a comment

Thanks for writing docs! I suggested edits for the paragraph and a recommended anchor link.

Are any changes needed to the "Is S3 Select a good fit for my workload" section above this content?

Comment on lines 406 to 410
Contributor

@jhlodin jhlodin Nov 15, 2022

Suggested change
- For uncompressed files, Scan Range feature of S3 Select is used.
- An Amazon S3 Select scan range request runs across the specified byte range.
- This range is aligned with the internal Hive splits for the query fragments
- that get pushed down to Select. Changes in the Hive connector performance
- tuning configuration properties would be reflected here as well.
+ For uncompressed files, S3 Select scans ranges of bytes in parallel. The scan range
+ requests run across the byte ranges of the internal Hive splits for the query fragments
+ pushed down to S3 Select. Changes to the Hive catalog's :ref:`performance tuning
+ configuration properties <hive-performance-tuning-configuration>` are reflected
+ here as well.

Contributor

To make the anchor link work, please add
.. _hive-performance-tuning-configuration:
above line 734 of the Hive connector page (/connector/hive.rst).

Member Author

Thanks a lot! Updated 👍

Member Author

Can't reply to the above comment. There are no changes needed for the "Is S3 Select a good fit for my workload" section.

Contributor

It looks like only the ref link was added. Can you apply the rest of the suggested edits?

Should look like the following:

For uncompressed files, S3 Select scans ranges of bytes in parallel. The scan range
requests run across the byte ranges of the internal Hive splits for the query fragments
pushed down to S3 Select. Changes to the Hive catalog's :ref:`performance tuning
configuration properties <hive-performance-tuning-configuration>` are reflected
here as well.

Member Author

@dnanuti dnanuti Nov 22, 2022

This was rephrased a bit on our side as well. From a technical perspective, I think we should say Hive connector, not Hive catalog, as this is related to how the connector works.

Does this work for you?

For uncompressed files, S3 Select scans ranges of bytes in parallel. The scan range
requests run across the byte ranges of the internal Hive splits for the query fragments
pushed down to S3 Select. Changes in the Hive connector :ref:`performance tuning
configuration properties <hive-performance-tuning-configuration>` are likely to impact
S3 Select pushdown performance.
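
To illustrate the paragraph above, here is a minimal Python sketch of how a file could be cut into byte ranges, with one S3 Select scan range request built per range. It is not taken from this PR or from the Trino source. The helper names `split_byte_ranges` and `scan_range_request`, the bucket, the key, the query, and the split size are all hypothetical. The real split boundaries come from the Hive connector and its performance tuning properties. The request dictionaries use the parameter names of the boto3 `select_object_content` call, and nothing is sent to S3 here.

```python
from typing import Any, Dict, Iterator, Tuple


def split_byte_ranges(file_size: int, split_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte offsets that cover the file in split_size chunks.

    Hypothetical stand-in for the connector's split generation.
    """
    if file_size < 0 or split_size <= 0:
        raise ValueError("file_size must be >= 0 and split_size must be > 0")
    start = 0
    while start < file_size:
        end = min(start + split_size, file_size)
        yield start, end
        start = end


def scan_range_request(
    bucket: str, key: str, expression: str, start: int, end: int
) -> Dict[str, Any]:
    """Build keyword arguments for one scan range request on an uncompressed CSV file.

    How S3 treats records that straddle a range boundary is defined by AWS
    and is not modeled in this sketch.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": expression,
        "ExpressionType": "SQL",
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "NONE"},
            "CompressionType": "NONE",  # scan ranges apply to uncompressed files
        },
        "OutputSerialization": {"CSV": {}},
        "ScanRange": {"Start": start, "End": end},
    }


# Example values only: a 250-byte file cut into 100-byte ranges.
requests = [
    scan_range_request(
        "example-bucket",
        "data/part-0.csv",
        "SELECT s._1 FROM S3Object s WHERE s._2 = 'x'",
        start,
        end,
    )
    for start, end in split_byte_ranges(250, 100)
]
```

Each dictionary could then be passed to a boto3 S3 client as `client.select_object_content(**request)`, assuming valid credentials and an existing object. Because the ranges are independent, the requests can be issued in parallel, and a smaller split size yields more, smaller requests, which is one way the tuning properties can affect pushdown performance.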

Contributor

@jhlodin jhlodin Nov 22, 2022

Yep, that makes sense to me! Once that change is in, LGTM

Contributor

@jhlodin jhlodin left a comment

LGTM % one last minor suggestion

Contributor

"they are retrieving" -> "they retrieve"

Fixed Select pushdown for uncompressed files, added JSON support
to Amazon S3 Select and started using S3 Select scan range requests.
Relevant PRs: 12633, 13354, 13477, 13754, 14040
@arhimondr arhimondr merged commit af813ca into trinodb:master Nov 22, 2022
@github-actions github-actions bot added this to the 404 milestone Nov 23, 2022