@alamb alamb commented Aug 5, 2021

Which issue does this PR close?

Resolves #659

Rationale for this change

Boolean columns were being written with the wrong type of statistics (Int32 instead of boolean).

What changes are included in this PR?

Write boolean stats for boolean columns (not i32 stats)
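
To illustrate the idea, here is a minimal sketch of typed min/max statistics for a nullable boolean column. The `ColumnStats` enum and `bool_column_stats` function are hypothetical stand-ins written for this example; they are not the parquet crate's actual types or the code changed in this PR.

```rust
// Hypothetical, simplified stand-in for typed column statistics.
// These are NOT the parquet crate's types.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum ColumnStats {
    // What a boolean column should carry.
    Boolean { min: bool, max: bool, null_count: u64 },
    // What was effectively being written for boolean columns before the fix.
    Int32 { min: i32, max: i32, null_count: u64 },
}

/// Compute statistics for a nullable boolean column, keeping the
/// statistics variant consistent with the column's physical type.
/// Returns `None` when the column has no non-null values.
fn bool_column_stats(values: &[Option<bool>]) -> Option<ColumnStats> {
    // Nulls are counted but do not take part in min/max.
    let null_count = values.iter().filter(|v| v.is_none()).count() as u64;

    // Iterate over the non-null values only.
    let mut non_null = values.iter().flatten().copied();
    let first = non_null.next()?;

    // `bool` is ordered with false < true, so min/max are well defined.
    let (min, max) =
        non_null.fold((first, first), |(lo, hi), v| (lo.min(v), hi.max(v)));

    Some(ColumnStats::Boolean { min, max, null_count })
}
```

Called with the six values from the dump below (`true, true, false, null, false, true`), this sketch yields `min: false`, `max: true`, and a null count of 1, which agrees with the `ST:[min: false, max: true, num_nulls: 1]` line that parquet-tools prints.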

Are there any user-facing changes?

Statistics for boolean columns in parquet files are now boolean. I am not sure whether this is visible to users, though: when I used parquet-tools to dump a parquet file with a boolean column, the min/max statistics looked correct to me (that is, they were boolean):

alamb@MacBook-Pro rust_parquet % parquet-tools dump -n  /tmp/test_bool.parquet 
row group 0 
------------------------------------------------------------------------------------------------------------------
bool_col:  BOOLEAN UNCOMPRESSED DO:0 FPO:4 SZ:40/40/1.00 VC:6 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 1]

    bool_col TV=6 RL=0 DL=1
    --------------------------------------------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: false, max: true, num_nulls: 1] CRC:[none] SZ:7 VC:6

BOOLEAN bool_col 
------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 *** 
value 1: R:0 D:1 V:true
value 2: R:0 D:1 V:true
value 3: R:0 D:1 V:false
value 4: R:0 D:0 V:<null>
value 5: R:0 D:1 V:false
value 6: R:0 D:1 V:true
alamb@MacBook-Pro rust_parquet % 

@alamb alamb force-pushed the alamb/fix_bool_stats branch from 4256eed to 29e9969 Compare August 8, 2021 10:33
@alamb alamb marked this pull request as ready for review August 8, 2021 10:34
@alamb alamb requested a review from sunchao August 8, 2021 10:34

Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Parquet boolean columns write Int32 statistics
