GH-3125: Add CLI for SizeStatistics #3126

Merged on Jan 21, 2025 (1 commit)

wgtmac (Member) commented Jan 18, 2025

Rationale for this change

parquet-cli currently has no command for printing size statistics.

What changes are included in this PR?

Add a new CLI command for SizeStatistics.

Are these changes tested?

Added a test case.

parquet-cli size-stats

File with both histogram and unencoded bytes.

Row group 0
--------------------------------------------------------------------------------
column         unencoded bytes rep level histogram                      def level histogram
[int32_field]  -               [10]                                     [10]
[int64_field]  -               [10]                                     [10]
[float_field]  -               [10]                                     [10]
[double_field] -               [10]                                     [10]
[binary_field] 46 B            [10]                                     [10]
[flba_field]   -               [10]                                     [10]
[date_field]   -               [10]                                     [10]

File with only unencoded bytes.

Row group 0
--------------------------------------------------------------------------------
column     unencoded bytes rep level histogram                      def level histogram
[id]       -               -                                        -
[type]     66 B            -                                        -
[sha1_git] -               -                                        -

File with no size stats.

Row group 0
--------------------------------------------------------------------------------
column           unencoded bytes rep level histogram                      def level histogram
[processed_dttm] -               -                                        -
[float8_1]       -               -                                        -
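
The three cases above differ only in which statistics are present. As a rough illustration of that rendering rule (not the actual parquet-cli code, which is Java), here is a minimal Python sketch. The helper names `format_bytes`, `format_histogram`, and `format_row` are hypothetical, and it assumes that a missing value or an empty histogram prints as `-`, that byte counts print as plain `N B`, and that multi-entry histograms are comma-separated.

```python
from typing import List, Optional, Sequence

MISSING = "-"  # placeholder shown when a statistic is absent


def format_bytes(n: Optional[int]) -> str:
    """Render an optional unencoded byte count (assumption: plain 'N B')."""
    return MISSING if n is None else f"{n} B"


def format_histogram(hist: Optional[Sequence[int]]) -> str:
    """Render an optional level histogram; None or empty counts as absent."""
    if not hist:
        return MISSING
    return "[" + ", ".join(str(count) for count in hist) + "]"


def format_row(
    column: str,
    unencoded_bytes: Optional[int],
    rep_hist: Optional[Sequence[int]],
    def_hist: Optional[Sequence[int]],
) -> List[str]:
    """Build the four cells of one table row for a column chunk."""
    return [
        f"[{column}]",
        format_bytes(unencoded_bytes),
        format_histogram(rep_hist),
        format_histogram(def_hist),
    ]


if __name__ == "__main__":
    # One sample row from each of the three files shown above.
    rows = [
        format_row("binary_field", 46, [10], [10]),  # histograms and bytes
        format_row("type", 66, None, None),          # unencoded bytes only
        format_row("float8_1", None, None, None),    # no size statistics
    ]
    for cells in rows:
        print("  ".join(cells))
```

In the sample output, only the binary columns (`binary_field`, `type`) show an unencoded byte count; the fixed-width columns print `-`.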

parquet-cli column-index

row-group 0:
column index for column int32_field:
Boundary order: ASCENDING
                      null count  min                                       max                                        rep level histogram   def level histogram
page-0                         0  32                                        41                                                        [10]                  [10]

offset index for column int32_field:
                          offset       compressed size       first row index       unencoded bytes
page-0                         4                    37                     0                     -

column index for column int64_field:
Boundary order: ASCENDING
                      null count  min                                       max                                        rep level histogram   def level histogram
page-0                         0  64                                        73                                                        [10]                  [10]

offset index for column int64_field:
                          offset       compressed size       first row index       unencoded bytes
page-0                        41                    37                     0                     -

column index for column float_field:
Boundary order: ASCENDING
                      null count  min                                       max                                        rep level histogram   def level histogram
page-0                         0  1.0                                       10.0                                                      [10]                  [10]

offset index for column float_field:
                          offset       compressed size       first row index       unencoded bytes
page-0                        78                    67                     0                     -

column index for column double_field:
Boundary order: ASCENDING
                      null count  min                                       max                                        rep level histogram   def level histogram
page-0                         0  2.0                                       11.0                                                      [10]                  [10]

offset index for column double_field:
                          offset       compressed size       first row index       unencoded bytes
page-0                       145                   109                     0                     -

column index for column binary_field:
Boundary order: ASCENDING
                      null count  min                                       max                                        rep level histogram   def level histogram
page-0                         0  0x424C5545                                0x59454C4C4F57                                            [10]                  [10]

offset index for column binary_field:
                          offset       compressed size       first row index       unencoded bytes
page-0                       316                    35                     0                    46

column index for column flba_field:
Boundary order: ASCENDING
                      null count  min                                       max                                        rep level histogram   def level histogram
page-0                         0  0x34AA0DAA4FAFC8479E8E637E                0xF340C9C6E9F1D3A50A79E820                                [10]                  [10]

offset index for column flba_field:
                          offset       compressed size       first row index       unencoded bytes
page-0                       492                    37                     0                     -

column index for column date_field:
Boundary order: ASCENDING
                      null count  min                                       max                                        rep level histogram   def level histogram
page-0                         0  1970-01-01                                1970-01-10                                                [10]                  [10]

offset index for column date_field:
                          offset       compressed size       first row index       unencoded bytes
page-0                       529                    37                     0                     -
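
The min and max values for `binary_field` and `flba_field` above are printed as hex strings. To read them while inspecting the output, the sketch below decodes a hex bound into text when the bytes are valid UTF-8 and returns `None` otherwise. The function name `decode_hex_bound` is hypothetical and not part of parquet-cli; it also assumes Python 3.9+ for `str.removeprefix`.

```python
from typing import Optional


def decode_hex_bound(text: str) -> Optional[str]:
    """Decode a '0x...' min/max value into UTF-8 text, or None if it is not text."""
    raw = bytes.fromhex(text.removeprefix("0x"))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Arbitrary binary data (e.g. the flba_field values) is not valid UTF-8.
        return None


if __name__ == "__main__":
    for bound in (
        "0x424C5545",                  # binary_field min
        "0x59454C4C4F57",              # binary_field max
        "0x34AA0DAA4FAFC8479E8E637E",  # flba_field min
    ):
        print(bound, "->", decode_hex_bound(bound))
```

The `binary_field` bounds decode to `BLUE` and `YELLOW`, while the `flba_field` bound is raw binary and has no text form.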

Are there any user-facing changes?

Yes. The README is updated to document the new command.

Closes #3125

wgtmac (Member, Author) commented Jan 19, 2025

@gszadovszky @Fokko Could you help review this? Thanks!

cc @etseidl

@wgtmac wgtmac merged commit be5ada2 into apache:master Jan 21, 2025
7 checks passed