lightning: a way to estimate parquet file size #46984
Signed-off-by: zeminzhou <[email protected]>
/cc lance6716
/cc @D3Hunter
Are there any test results for this update?
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #46984 +/- ##
================================================
- Coverage 72.9964% 72.6689% -0.3275%
================================================
Files 1338 1359 +21
Lines 399462 405795 +6333
================================================
+ Hits 291593 294887 +3294
- Misses 89018 92162 +3144
+ Partials 18851 18746 -105
Signed-off-by: zeminzhou <[email protected]>
I manually tested 12 parquet files, and the errors were all within 10%.
Can you post it in the description? Thank you~
Co-authored-by: D3Hunter <[email protected]>
Co-authored-by: D3Hunter <[email protected]>
Signed-off-by: zeminzhou <[email protected]>
Please provide the detailed steps of the manual test.
One problem: constructFileInfo will take a long time when importing parquet files from S3.
Rest LGTM.
/ok-to-test
Hi @zeminzhou, in fact Parquet already records the "uncompressed page size" in its metadata, so maybe we can simply add those values up. https://parquet.incubator.apache.org/docs/file-format/metadata/ Do you have time to take a look?
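To make the suggestion above concrete, here is a minimal sketch of "adding them up". It does not read a real file: `ColumnChunkMeta` and `RowGroupMeta` are hypothetical stand-ins for the footer metadata, modeled on the `total_uncompressed_size` and `total_compressed_size` fields that the Parquet format defines per column chunk. How Lightning would actually obtain these values is not shown in this thread and is left as an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ColumnChunkMeta:
    """Hypothetical stand-in for one column chunk's footer metadata."""
    total_uncompressed_size: int  # bytes of all pages before compression
    total_compressed_size: int    # bytes of all pages as stored on disk


@dataclass
class RowGroupMeta:
    """Hypothetical stand-in for one row group's footer metadata."""
    num_rows: int
    columns: List[ColumnChunkMeta]


def total_uncompressed_size(row_groups: Iterable[RowGroupMeta]) -> int:
    """Sum the uncompressed size of every column chunk in every row group."""
    return sum(
        col.total_uncompressed_size
        for rg in row_groups
        for col in rg.columns
    )
```

Note that the uncompressed size still describes encoded pages (for example dictionary or run-length encoded values), so it is not guaranteed to match the size of the data after conversion to row format.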
I did a test and found that
Signed-off-by: zeminzhou <[email protected]>
rest lgtm
Signed-off-by: zeminzhou <[email protected]>
rest lgtm
Signed-off-by: zeminzhou <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lance6716, okJiang The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest
Signed-off-by: zeminzhou <[email protected]>
/retest
/cherry-pick release-7.1
@okJiang: new pull request created to branch In response to this:
LGTM
What problem does this PR solve?
Issue Number: close #46980
Problem Summary:
What is changed and how it works?
Estimated by sampling: row data size / sampled row data size ≈ row count / sampled row count.

Manual testing:
data size after conversion to row: use Lightning to convert the entire Parquet file to row format
estimated data size: the data size estimated by row data size / sampled row data size ≈ row count / sampled row count
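The proportion above can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `estimate_row_data_size`, `relative_error`, and the `sample_limit` parameter are hypothetical names, and a row is modeled simply as a sequence of byte strings whose lengths add up to its row-format size.

```python
from typing import Iterable, Sequence


def estimate_row_data_size(
    rows: Iterable[Sequence[bytes]],
    total_row_count: int,
    sample_limit: int = 1000,  # hypothetical cap on how many rows to sample
) -> int:
    """Scale the size of a sampled prefix of rows up to the full row count."""
    sampled_rows = 0
    sampled_size = 0
    for row in rows:
        if sampled_rows >= sample_limit:
            break
        sampled_size += sum(len(field) for field in row)
        sampled_rows += 1
    if sampled_rows == 0:
        return 0  # nothing to sample, so nothing to extrapolate from
    # row data size ≈ sampled row data size * row count / sampled row count
    return sampled_size * total_row_count // sampled_rows


def relative_error(estimated: int, actual: int) -> float:
    """Relative difference between the estimate and the fully converted size."""
    return abs(estimated - actual) / actual
```

In the manual test described above, `actual` would be the data size after converting the whole file to row format, and `relative_error` is the quantity reported as staying within 10%. Sampling only a prefix assumes the sampled rows are representative of the rest of the file.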
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.