Skip to content

Add documentation for Hudi connector#13753

Merged
martint merged 1 commit intotrinodb:masterfrom
codope:hudi-docs
Oct 12, 2022
Merged

Add documentation for Hudi connector#13753
martint merged 1 commit intotrinodb:masterfrom
codope:hudi-docs

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Aug 19, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

Documentation for new connector.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hudi connector

How would you describe this change to a non-technical end user or system administrator?

Add documentation for Hudi connector.

Related issues, pull requests, and links

Fixes #14429

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Aug 19, 2022
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Aug 19, 2022

@findinpath @electrum Please also review the documentation.

@github-actions github-actions bot added the docs label Aug 19, 2022
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation is primarily intended to Trino users and not Trino contributors.

Please consider adding the design details of the connector in a README.md under trino-hudi module.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to add the details directly in trino-hudi readme and not to link an external doc because some of the design details may evolve in the months/years to come.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'll add it to readme.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These details are superfluos.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are speaking here about the metadata columns and not regular table columns. Please add a note what those metadata columns are. Do reconsider having metadata columns in the Hudi connector (same as in Hive - https://trino.io/docs/current/connector/hive.html#special-columns). Let's strive for consistency within Trino and not for consistency for keeping the status quo on Hudi's approach of exposing by default metadata columns as regular columns.

cc @martint @electrum

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. It's not a blocker for us. Let me think a bit more about this. I want to also discuss this internally with the team as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a link for the Hudi sync tool ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the hiding of non-Hudi being done?
This may create some unwanted performance drop while listing tables because the tables need to be checked one by one for format.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually not yet. i'll remove this part.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you care to mention any limitations (if known) ?

e.g. : Hive has a Hive 3 related limitations section https://trino.io/docs/current/connector/hive.html#hive-3-related-limitations

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets not talk about limitations .. we generally only document what works .. not all the stuff that doesnt work

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to add a more detailed description here.

Not everybody is familiar with the CoW, MoR concepts.

Also the Snapshot queries and Read optimized queries are very "Hudi" specific.

Copy link
Copy Markdown
Contributor Author

@codope codope Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I linked a Hudi concept doc. Let me know if you think adding a few sentences here would be more preferable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW , talk on the exposed virtual tables table_name_rt , table_name_ro and what they are.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a section related to friendliness of Hudi to Hive?

Specifically, document how the hive connector could read (if possible) a Hudi table.

@findinpath findinpath requested a review from mosabua August 19, 2022 19:12
@ebyhr ebyhr mentioned this pull request Sep 29, 2022
14 tasks
@ebyhr ebyhr self-requested a review October 1, 2022 01:35
Copy link
Copy Markdown
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start .. a couple of small things to update and then might be close to get live.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the sentence "As of..."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
example ``etc/catalog/hudi.properties``, that references the ``hudi``
example ``etc/catalog/example.properties``, that references the ``hudi``

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To create a catalog that uses the Hudi connector,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that are hidden

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
are accessed using index.Only applicable to Parquet file format.
are accessed using the index. Only applicable to Parquet file format.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reformat

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading Hudi tables with the Hive connector

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also be accessed with a catalog using the Hive connector.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the Hive connector

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont link to mvnrepository .. instead use the official maven repo or the official maven repo search

https://repo.maven.apache.org/

https://central.sonatype.dev/

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Oct 7, 2022

@codope please let us know if you need any help to proceed further with this. Ideally we would like to get documentation merged soon.. ideally before the Trino Community Broadcast episode about Hudi..

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Oct 10, 2022

@mosabua Thanks for reviewing. Sorry for the late response. I am on call which ends this Monday. I will address the comments and update the PR by Tuesday this week.

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Oct 11, 2022

@mosabua I have addressed your feedback and updated the PR.

Copy link
Copy Markdown
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more minor changes but essentially this a close to ready for a first merge.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tables synced to Hive metastore.
tables with relevant metadata in a Hive metastore.

or even

Suggested change
tables synced to Hive metastore.
tables.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To use Hudi connector, you need:
To use the Hudi connector, you need:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reword as in other connector - for example https://trino.io/docs/current/connector/postgresql.html#sql-support

and confirm that the linked commands work .. otherwise we have to create a list of individual statements.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Merge On Read Read Optimized Queries
Merge on read Read optimized queries

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets not talk about limitations .. we generally only document what works .. not all the stuff that doesnt work

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is a "COW" ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-on-write table

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what this sentence says

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have rephrased the sentence. I am referring to the table in Hudi Quickstart documentation as Hudi users are familiar with this table. For anyone landing here directly, without even trying out Hudi, then the quickstart link kinda redirects them to first create Hudi table, sync metadata to hive metastore and then run these queries.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop "please"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JAR file

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Oct 12, 2022

@mosabua I have updated the PR. Please take another pass.

Copy link
Copy Markdown
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to go now. Please just squash the two commits and then this should be good to go. Maybe @electrum or @findepi can merge.

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Oct 12, 2022

Thanks @mosabua ! I have rebased and squashed.

@martint martint merged commit e120678 into trinodb:master Oct 12, 2022
@martint
Copy link
Copy Markdown
Member

martint commented Oct 12, 2022

Tested locally

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Add Hudi connector documentation

4 participants