-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-1294] Adding inline read and seek based read(batch get) for hfile log blocks in metadata table #3762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5fb7a2a to
cb7e9ce
Compare
| } | ||
|
|
||
| public void scan(List<String> keys) { | ||
| currentInstantLogBlocks = new ArrayDeque<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to be cautious about seek based approach vs full scan. In full scan, we do one time full scan and prepare a hashmap of records. so, any number of look up can be done without any cost.
But with seek based approach, if users calls
scan(list of 3 keys)
scan(list of 5 keys)
we might have to read/parse through the log blocks twice, since everytime we are looking for only interested keys. so, we should be cautious in using the seek based read for metadata table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a good point to add as a code comment in this file.
1140119 to
f27df7a
Compare
|
@vinothchandar : this is good to be reviewed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prashantwason @satishkotha : do you guys know why we did not do batch get here and doing 1 key at a time? is there any particular reason for it. I have fixed it to fetch batch get in this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For simplicity of implementation I suppose - performance was not taken into consideration. Also, given the number of keys being fetched, batch would be slower as it may need to read the entire hfile.
@umehrot2 Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from what I infer, with HoodieMergedLogRecordScanner, we first read all records from all log blocks and prepare a hash map of records(record key to HoodieRecord). And we don't do seek based read prior to this patch and so we do read all log records from all log blocks. so was bit curious.
|
@hudi-bot azure run |
2 similar comments
|
@hudi-bot azure run |
|
@hudi-bot azure run |
8756437 to
ce6740e
Compare
|
@hudi-bot azure run |
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments.
...hudi-client-common/src/test/java/org/apache/hudi/io/storage/TestHoodieHFileReaderWriter.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
Outdated
Show resolved
Hide resolved
| .key(METADATA_PREFIX + ".enable.full.scan.log.files") | ||
| .defaultValue(true) | ||
| .sinceVersion("0.10.0") | ||
| .withDocumentation("Enable full scanning of log files while reading log records"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
little bit more context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrt your suggestion of moving this out to common config(instead of metadata config), I don't really see a need where we will use this for regular data table. so, I prefer we can leave it at metadata config itself. let me know.
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
Outdated
Show resolved
Hide resolved
|
@hudi-bot azure. |
48ed467 to
f378067
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting entries
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
Outdated
Show resolved
Hide resolved
|
@prashantwason : Can you review the patch please when you get time. |
|
@hudi-bot azure run |
f378067 to
2b369a6
Compare
What is the purpose of the pull request
Brief change log
hoodie.metadata.enable.inline.reading.log.filesandhoodie.metadata.enable.full.scan.log.files.Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.