@nsivabalan (Contributor) commented Oct 8, 2021

What is the purpose of the pull request

  • Added support to read HFile log blocks via inline FileSystem in metadata table.
  • Also added support for reading a list of keys (batch get) rather than doing a full scan in the metadata table.

Brief change log

  • Added two new configs to HoodieMetadataConfig: hoodie.metadata.enable.inline.reading.log.files and hoodie.metadata.enable.full.scan.log.files.
  • Since we are adding support for seek-based reads, renamed AbstractHoodieLogRecordScanner to AbstractHoodieLogRecordReader, and so HoodieMetadataMergedLogRecordReader has been renamed to match.
  • Added a new method to HoodieMetadataMergedLogRecordReader to support this purpose (i.e. reading records for a list of keys) without doing a full scan.
public List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keys) {

}
  • Added a new method to HoodieDataBlock for the new requirement. The base class does not provide an implementation; HoodieHFileDataBlock overrides it with a concrete implementation in which records are read via the inline FileSystem using a seek-based approach.
public List<IndexedRecord> getRecords(List<String> keys) throws IOException {
}
  • HoodieDataBlock also honors the enableInline config even when batch get is not used. In effect, three options are possible: (a) full scan without inline reading, (b) full scan with inline reading, (c) batch get (with inline reading).
  • Fixed BaseTableMetadata.getAllFilesInPartitions(List partitions) to use the batch get API rather than N single calls to getAllFilesInPartition(Path partitionPath).
  • Fixed the metadata reader (HoodieBackedTableMetadata) to use the new APIs based on the config values.
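
The three read options listed in the change log can be summarized as a small decision table. The following is only a sketch: the class, enum, and method names are hypothetical and are not taken from the Hudi code base, and it assumes (per the list above) that batch get always goes through inline reading. The two boolean parameters stand in for the two new configs.

```java
// Hypothetical sketch, not Hudi code: maps the two new config flags to a read mode.
class ReadModeSketch {
    enum ReadMode { FULL_SCAN, FULL_SCAN_INLINE, BATCH_GET_INLINE }

    // enableFullScan      ~ hoodie.metadata.enable.full.scan.log.files
    // enableInlineReading ~ hoodie.metadata.enable.inline.reading.log.files
    static ReadMode resolve(boolean enableFullScan, boolean enableInlineReading) {
        if (!enableFullScan) {
            // Assumption: batch get is always served via inline (seek-based) reads.
            return ReadMode.BATCH_GET_INLINE;
        }
        return enableInlineReading ? ReadMode.FULL_SCAN_INLINE : ReadMode.FULL_SCAN;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, false));
        System.out.println(resolve(true, true));
        System.out.println(resolve(false, true));
    }
}
```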

Verify this pull request

This change added tests and can be verified as follows:

  • Added tests to TestHoodieRealtimeRecordReader to verify the change.
  • Found some gaps in the testing of the HFile writer and reader, especially around seek-based reads, and added TestHoodieHFileReaderWriter to cover these cases.
  • Enabled inline and batch get reads for one test in TestHoodieBackedMetadata.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan force-pushed the inlineMetadataLogReader branch from 5fb7a2a to cb7e9ce Compare October 8, 2021 01:38
@hudi-bot (Collaborator) commented Oct 8, 2021

CI report:

  • 5fb7a2afa196fd75ada005d26a0fb9fce5472545 UNKNOWN
  • 2b369a6 Azure: SUCCESS
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan changed the title [HUDI-1294] Adding inline read and seekable read for hfile blocks in metadata table [HUDI-1294] Adding inline read and seekable read for hfile log blocks in metadata table Oct 8, 2021
}

public void scan(List<String> keys) {
currentInstantLogBlocks = new ArrayDeque<>();

@nsivabalan (Contributor Author), Oct 8, 2021

One thing to be cautious about with the seek-based approach versus a full scan: in a full scan, we scan everything once and build a hash map of records, so any number of lookups can then be served at no additional read cost.
But with the seek-based approach, if the user calls
scan(list of 3 keys)
scan(list of 5 keys)
we might have to read and parse the log blocks twice, since each call looks only for the keys of interest. So we should be cautious about using seek-based reads for the metadata table.
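
A rough way to picture this trade-off is the sketch below. It is not the actual reader: log blocks are modeled as plain in-memory maps, all names are hypothetical, and "parsing" a block is reduced to incrementing a counter. In a real seek-based read each visit would touch less data than a full parse, but the visits still repeat on every call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: counts how often blocks are visited under each strategy.
class ScanCostSketch {
    private final List<Map<String, String>> blocks; // stand-in for log blocks
    private Map<String, String> fullScanCache;      // built once by the full scan
    int blocksParsed = 0;

    ScanCostSketch(List<Map<String, String>> blocks) {
        this.blocks = blocks;
    }

    // Full scan: visit every block once, then serve all later lookups from the map.
    Map<String, String> lookupWithFullScan(List<String> keys) {
        if (fullScanCache == null) {
            fullScanCache = new HashMap<>();
            for (Map<String, String> block : blocks) {
                blocksParsed++;
                fullScanCache.putAll(block);
            }
        }
        Map<String, String> result = new HashMap<>();
        for (String key : keys) {
            if (fullScanCache.containsKey(key)) {
                result.put(key, fullScanCache.get(key));
            }
        }
        return result;
    }

    // Seek-based: every call revisits each block, extracting only the requested keys.
    Map<String, String> lookupWithSeek(List<String> keys) {
        Map<String, String> result = new HashMap<>();
        for (Map<String, String> block : blocks) {
            blocksParsed++;
            for (String key : keys) {
                String value = block.get(key);
                if (value != null) {
                    result.put(key, value);
                }
            }
        }
        return result;
    }

    // Runs two lookups with each strategy; returns {fullScanVisits, seekVisits}.
    static int[] compare() {
        Map<String, String> b1 = new HashMap<>();
        b1.put("k1", "v1");
        b1.put("k2", "v2");
        Map<String, String> b2 = new HashMap<>();
        b2.put("k3", "v3");
        List<Map<String, String>> blocks = new ArrayList<>(Arrays.asList(b1, b2));

        ScanCostSketch full = new ScanCostSketch(blocks);
        full.lookupWithFullScan(Arrays.asList("k1"));
        full.lookupWithFullScan(Arrays.asList("k2", "k3"));

        ScanCostSketch seek = new ScanCostSketch(blocks);
        seek.lookupWithSeek(Arrays.asList("k1"));
        seek.lookupWithSeek(Arrays.asList("k2", "k3"));

        return new int[] {full.blocksParsed, seek.blocksParsed};
    }
}
```

With two blocks and two lookup calls, the full-scan path visits each block once, while the seek-based path visits each block once per call.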

@prashantwason (Member), Oct 21, 2021

This seems like a good point to add as a code comment in this file.

@nsivabalan nsivabalan changed the title [HUDI-1294] Adding inline read and seekable read for hfile log blocks in metadata table [HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table Oct 8, 2021
@nsivabalan nsivabalan force-pushed the inlineMetadataLogReader branch from 1140119 to f27df7a Compare October 13, 2021 05:59
@nsivabalan (Contributor Author)

@vinothchandar: this is ready for review.

Contributor Author

@prashantwason @satishkotha: do you know why we fetched one key at a time here instead of doing a batch get? Was there any particular reason for it? I have changed it to use batch get in this patch.

Member

For simplicity of implementation, I suppose; performance was not taken into consideration. Also, given the number of keys being fetched, batch get would be slower, as it may need to read the entire HFile.

@umehrot2 Thoughts?

Contributor Author

From what I infer, HoodieMergedLogRecordScanner first reads all records from all log blocks and builds a hash map of records (record key to HoodieRecord). We did not do seek-based reads prior to this patch, so we read all log records from all log blocks anyway. That is why I was a bit curious.
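
The behavior described in this comment, where a full scan yields a key-to-record map and a batch get then answers a whole list of keys against it, might be sketched as follows. This is an illustration only: plain strings stand in for HoodieRecord, and Map.Entry and java.util.Optional stand in for Hudi's Pair and Option types. The class and its method are hypothetical, although the return shape mirrors the getRecordsByKeys signature in the PR description.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: one batch lookup over an already-materialized record map.
class BatchGetSketch {
    // Returns one (key, optional record) pair per requested key, in request order.
    // Keys that are absent from the map come back with an empty Optional.
    static List<Map.Entry<String, Optional<String>>> getRecordsByKeys(
            Map<String, String> records, List<String> keys) {
        List<Map.Entry<String, Optional<String>>> result = new ArrayList<>();
        for (String key : keys) {
            result.add(new AbstractMap.SimpleImmutableEntry<>(
                    key, Optional.ofNullable(records.get(key))));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> records = new HashMap<>();
        records.put("partition-a", "files-of-a");
        // "partition-b" is missing, so its entry carries an empty Optional.
        System.out.println(getRecordsByKeys(records, Arrays.asList("partition-a", "partition-b")));
    }
}
```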

@nsivabalan (Contributor Author)

@hudi-bot azure run

@nsivabalan nsivabalan force-pushed the inlineMetadataLogReader branch 2 times, most recently from 8756437 to ce6740e Compare October 14, 2021 06:54
@nsivabalan (Contributor Author)

@hudi-bot azure run

@vinothchandar (Member) left a comment

Left some comments.

.key(METADATA_PREFIX + ".enable.full.scan.log.files")
.defaultValue(true)
.sinceVersion("0.10.0")
.withDocumentation("Enable full scanning of log files while reading log records");

Member

Could you add a little more context here?

Contributor Author

Regarding your suggestion to move this out to a common config (instead of the metadata config): I don't really see a case where we would use this for a regular data table, so I would prefer to leave it in the metadata config. Let me know.

@nsivabalan (Contributor Author)

@hudi-bot azure.

@nsivabalan nsivabalan force-pushed the inlineMetadataLogReader branch from 48ed467 to f378067 Compare October 21, 2021 05:13
Member

interesting entries

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Oct 22, 2021
@nsivabalan (Contributor Author)

@prashantwason: Can you please review the patch when you get time?

@nsivabalan (Contributor Author)

@hudi-bot azure run

@nsivabalan nsivabalan force-pushed the inlineMetadataLogReader branch from f378067 to 2b369a6 Compare October 29, 2021 13:40
@nsivabalan nsivabalan merged commit 69ee790 into apache:master Oct 29, 2021
Labels

priority:blocker Production down; release blocker

4 participants