[HUDI-3438] Avoid getSmallFiles if hoodie.parquet.small.file.limit is 0 #4823
@hudi-bot run azure
private Map<String, List<SmallFile>> getSmallFilesForPartitions(List<String> partitionPaths, HoodieEngineContext context) {
  Map<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();

  if (config.getParquetSmallFileLimit() == 0) {
How about using <= 0 to indicate always writing to a new file group? And maybe add a description of this behavior to the documentation of PARQUET_SMALL_FILE_LIMIT.
Can PARQUET_SMALL_FILE_LIMIT be set to a negative value? I'm not sure.
Yeah, I'll add the description to PARQUET_SMALL_FILE_LIMIT.
It is possible to set it to a negative value, which is meaningless of course. So I guess we could cover that as well?
done
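To make the outcome of the thread above concrete, here is a minimal, self-contained sketch of the guard being discussed: return an empty map before any per-partition lookup when the limit is <= 0. It is not the actual Hudi code. The class name SmallFileLookupSketch, the simplified SmallFile type, and the smallFileLister parameter (standing in for the expensive getSmallFiles call) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; names and types here are hypothetical, not Hudi's API.
final class SmallFileLookupSketch {

    // Hypothetical stand-in for Hudi's SmallFile, reduced to two fields.
    static final class SmallFile {
        final String fileId;
        final long sizeBytes;

        SmallFile(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    // smallFileLimitBytes plays the role of hoodie.parquet.small.file.limit.
    // smallFileLister is an assumed hook representing the costly per-partition lookup.
    static Map<String, List<SmallFile>> getSmallFilesForPartitions(
            List<String> partitionPaths,
            long smallFileLimitBytes,
            Function<String, List<SmallFile>> smallFileLister) {
        Map<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();

        // A limit of 0 (or a negative value) means no file counts as small,
        // so skip the lookup entirely and let every insert go to a new file group.
        if (smallFileLimitBytes <= 0) {
            return partitionSmallFilesMap;
        }

        for (String partitionPath : partitionPaths) {
            partitionSmallFilesMap.put(partitionPath, smallFileLister.apply(partitionPath));
        }
        return partitionSmallFilesMap;
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2022/02/15", "2022/02/16");
        Function<String, List<SmallFile>> lister =
                p -> Collections.singletonList(new SmallFile("file-" + p, 1024L));

        // With the limit disabled, the lister is never invoked.
        System.out.println(getSmallFilesForPartitions(partitions, 0L, lister).size());
        // With a positive limit, each partition is looked up once.
        System.out.println(getSmallFilesForPartitions(partitions, 104857600L, lister).size());
    }
}
```

Passing the lookup in as a function is only a device to keep the sketch runnable without Hudi on the classpath; in the real partitioner the check would sit in front of the existing getSmallFiles call.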
… 0 (apache#4823) Co-authored-by: Hui An <hui.an@shopee.com>
What is the purpose of the pull request
The method getSmallFiles can be very time-consuming if there are many small files in one partitionPath. Even when hoodie.parquet.small.file.limit is already set to 0, getSmallFiles is still called to do the comparison; this call can be avoided.
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.