@manojpec
Contributor

What is the purpose of the pull request

Adding the RFC for metadata table based bloom index.

Brief change log

  • Added the RFC doc

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Member

@vinothchandar vinothchandar left a comment

So the process calls for first landing the change described here, so that you get your number fixed:
http://hudi.apache.org/contribute/rfc-process#proposing-the-rfc
The same goes for @nsivabalan's other RFC. Let's please read this carefully, follow it verbatim, and proceed. :)

@manojpec
Contributor Author

@vinothchandar Sure, here is the PR claiming the RFC number: #3995

Contributor

@nsivabalan nsivabalan left a comment

Left some feedback.

@manojpec manojpec force-pushed the rfc/rfc-37-metadata-based-bloom-index branch from 02f1d93 to 4fe7a88 Compare November 15, 2021 17:08
@manojpec manojpec requested a review from nsivabalan November 15, 2021 21:02
@vinothchandar vinothchandar added the rfc Request for comments label Nov 21, 2021
@manojpec manojpec force-pushed the rfc/rfc-37-metadata-based-bloom-index branch from 68cba72 to d9213ef Compare December 7, 2021 05:02
@manojpec
Contributor Author

manojpec commented Dec 7, 2021

CI passed on the re-run of the failed job.

2. For all the involved partitions, load their file lists
3. Level 1: Range pruning using the `column_stats` index:
   1. For each record key, generate the column stats index lookup key based on the tuple
      (_hoodie_record_key, partition name, file path)
Contributor

We need to keep in mind that the Column Stats index is likely to be inefficient when we don't maintain the invariant that rows are ordered by that column (in this case `_hoodie_record_key`). In the ideal scenario, base files do not overlap in their key ranges; the more overlap there is between them, the less efficient the index becomes.
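To make the overlap concern concrete, here is a minimal sketch of min/max range pruning. It is not Hudi code: `FileRange`, `candidate_files`, and the sample key ranges are hypothetical stand-ins for per-file `column_stats` entries, assumed purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRange:
    """Min/max record key of one base file (hypothetical stand-in for a column_stats entry)."""
    file_name: str
    min_key: str
    max_key: str


def candidate_files(key, ranges):
    """Return the files whose [min_key, max_key] range could contain the key."""
    return [r.file_name for r in ranges if r.min_key <= key <= r.max_key]


# Rows ordered by record key: the per-file ranges are disjoint.
sorted_layout = [
    FileRange("f1", "key_000", "key_099"),
    FileRange("f2", "key_100", "key_199"),
    FileRange("f3", "key_200", "key_299"),
]

# Rows not ordered by record key: every file spans most of the key space.
overlapping_layout = [
    FileRange("f1", "key_003", "key_297"),
    FileRange("f2", "key_000", "key_299"),
    FileRange("f3", "key_010", "key_290"),
]

print(candidate_files("key_150", sorted_layout))       # ['f2']
print(candidate_files("key_150", overlapping_layout))  # ['f1', 'f2', 'f3']
```

With disjoint ranges, the lookup narrows to a single file; with heavily overlapping ranges, no file is pruned and every one of them still has to be checked at the next level.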

@manojpec manojpec force-pushed the rfc/rfc-37-metadata-based-bloom-index branch from d9213ef to d23a140 Compare December 14, 2021 01:28
@manojpec manojpec force-pushed the rfc/rfc-37-metadata-based-bloom-index branch 2 times, most recently from b614f85 to 2686dc2 Compare December 20, 2021 18:25
@vinothchandar vinothchandar self-assigned this Dec 25, 2021
nsivabalan and others added 6 commits January 31, 2022 18:27
 - Adding the RFC-37 doc for Metadata table based bloom index
 - Adding a section on Schema evolution and its impact on the hash id based
   index keys.
 - Updating the schema for index payloads and addressing other review comments.
 - Updated the metadata schema used for the index
@manojpec manojpec force-pushed the rfc/rfc-37-metadata-based-bloom-index branch from 2686dc2 to c57ad0d Compare February 1, 2022 02:37
@hudi-bot
Collaborator

hudi-bot commented Feb 1, 2022

CI report:

@nsivabalan nsivabalan merged commit 72f7348 into apache:master Feb 1, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
