Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not write ColumnIndex for null columns when not writing page statistics #6011

Merged
merged 4 commits into from
Jul 16, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jul 5, 2024

Which issue does this PR close?

Closes #6010.

Rationale for this change

The ColumnIndex for an all-nulls column will be written even when page statistics are disabled. This is not the case for columns that have non-null values.

What changes are included in this PR?

This PR calls to_invalid() on the constructed ColumnIndexBuilder when page statistics are not requested (statistics_enabled != EnabledStatistics::Page). This allows for not writing the null-column statistics in update_column_offset_index().

if null_page && self.column_index_builder.valid() {
self.column_index_builder.append(
null_page,
vec![0; 1],
vec![0; 1],
self.page_metrics.num_page_nulls as i64,
);
} else if self.column_index_builder.valid() {
.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 5, 2024
@alamb
Copy link
Contributor

alamb commented Jul 6, 2024

Thanks @etseidl - I will check this one out shortly

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @etseidl. I reviewed the code and logic carefully

I took the liberty of merging this PR up from main so the CI would pass.

@@ -260,6 +260,12 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
// Used for level information
encodings.insert(Encoding::RLE);

// Disable column_index_builder if not collecting page statistics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason this works currently for columns without null is that the ColumnIndex builder is marked as invalid if there are no page statistics at the end of the page. However, when the column has no data this code is not run

self.column_index_builder.to_invalid();

So TLDR is I think this change looks good to me

writer.write_batch(&data, Some(&def_levels), None).unwrap();
writer.flush_data_pages().unwrap();

let column_close_result = writer.close().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this test covers the code by removing the change and verifying the test fails like

assertion failed: column_close_result.column_index.is_none()
thread 'column::writer::tests::test_no_column_index_when_stats_disabled' panicked at parquet/src/column/writer/mod.rs:3044:9:
assertion failed: column_close_result.column_index.is_none()
stack backtrace:
   0: rust_begin_unwind
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
   2: core::panicking::panic
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:146:5
   3: parquet::column::writer::tests::test_no_column_index_when_stats_disabled
             at ./src/column/writer/mod.rs:3044:9
   4: parquet::column::writer::tests::test_no_column_index_when_stats_disabled::{{closure}}
             at ./src/column/writer/mod.rs:3024:50
   5: core::ops::function::FnOnce::call_once
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@alamb alamb changed the title Disable ColumnIndexBuilder when page statistics are not collected Do not write ColumnIndex for null columns when not writing page statistics Jul 11, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked and I believe this PR contains no breaking API changes so merging it directly to main (for inclusion in 52.0.0)

@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

Thanks again @etseidl

@alamb alamb merged commit 6ab853d into apache:master Jul 16, 2024
17 checks passed
@etseidl etseidl deleted the gh-6010-colidx branch July 16, 2024 22:46
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet ColumnIndex for null columns is written even when statistics are disabled
2 participants