@nsivabalan
Contributor

What is the purpose of the pull request

Adds an RFC for a metadata-based bloom index.

@nsivabalan
Contributor Author

[Six screenshots attached, dated 2021-11-06, 12:29:48 AM through 12:30:47 AM]

@hudi-bot
Collaborator

hudi-bot commented Nov 6, 2021

CI report:

original file group location. And this index will leverage both the partitions to deduce the record key => file name mappings.

```
Input: JavaRdd<HoodieKey>
Contributor

Can the input/output be made generic so that it works with more engines?


<img src="metadata_index_col_stats.png" alt="Column Stats Partition" width="600"/>

We have to encode column names, filenames etc to IDs to save storage and to exploit compression. We will update the RFC
Member

Encoding column names and partition names may not be required, since HFile compresses blocks of key-value data, so repeated strings such as column names will compress well.

Requirements:<br>
Given a list of FileIDs, return their bloom filters
```
Key format: [PartitionId][FileId]
Member

@prashantwason commented Nov 13, 2021

Since fileId is UUID based, can we assume that fileIDs are unique within HUDI? If so, the partitionId is not required here.

But prefixing with the partitionID may lead to better performance, as all the fileIDs for a partition will sit together in the same block.
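
To illustrate the locality point, here is a minimal sketch that uses a `TreeMap` as a stand-in for a sorted key-value file. The class name, the `|` separator, and the helper methods are hypothetical and are not taken from the RFC, which only states that the key is `[PartitionId][FileId]`. The sketch assumes that IDs never contain the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class PartitionPrefixedKeys {
    // Hypothetical separator; the RFC does not specify how the two IDs are joined.
    static final char SEP = '|';

    // Builds a composite key with the partition ID as the prefix.
    static String key(String partitionId, String fileId) {
        return partitionId + SEP + fileId;
    }

    // Collects all file IDs of one partition with a single contiguous range scan.
    static List<String> fileIdsFor(SortedMap<String, byte[]> index, String partitionId) {
        String from = partitionId + SEP;            // inclusive lower bound
        String to = partitionId + (char) (SEP + 1); // exclusive bound just past the prefix
        List<String> fileIds = new ArrayList<>();
        for (String k : index.subMap(from, to).keySet()) {
            fileIds.add(k.substring(from.length()));
        }
        return fileIds;
    }

    public static void main(String[] args) {
        // Empty byte arrays stand in for serialized bloom filters.
        SortedMap<String, byte[]> index = new TreeMap<>();
        index.put(key("p2", "f-c"), new byte[0]);
        index.put(key("p1", "f-b"), new byte[0]);
        index.put(key("p1", "f-a"), new byte[0]);
        System.out.println(fileIdsFor(index, "p1"));
    }
}
```

Because the keys are sorted, every entry for a partition is adjacent, so a lookup for one partition touches one contiguous key range. Without the prefix, the UUID-based fileIDs of a partition would be scattered across the key space.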

Hudi maintains indices to locate/map incoming records to file groups during writes. The most commonly
used record index is the HoodieBloomIndex. For larger installations and for global index types, performance can suffer
from loading bloom filters from a large number of data files and from throttling by some cloud stores. We are proposing to
build a new metadata index (a metadata-table-based bloom index) to boost the performance of the existing bloom index.
Member

Why not use a record-level-index for this functionality? Is it because of storage requirements? Wouldn't a record level index be faster than using a bloom based index?

@manojpec
Contributor

Thanks @nsivabalan for the initial RFC doc. We had an offline discussion, and I am taking over the remaining work on this RFC and addressing the comments. The new PR is at #3989.

@manojpec
Contributor

@leesf @prashantwason thanks for the review; I will address the comments in #3989.
