@zeevm

@zeevm zeevm commented Jun 30, 2020

  1. Calculate page and column statistics.
  2. Use pre-calculated statistics, when available, to speed up writing data that comes from other formats such as ORC.
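
To make the first item concrete, here is a minimal sketch of how min, max and null count could be derived for one batch of values. It is not the code in this PR: `BatchStats`, `compute_batch_stats` and the `max_def_level` parameter are hypothetical names, and it assumes the usual Parquet convention that `values` holds only non-null entries and that a slot is null when its definition level is below the column's maximum definition level.

```rust
/// Hypothetical container for the statistics of one batch of values.
#[derive(Debug, Clone, PartialEq)]
pub struct BatchStats<T> {
    pub min: Option<T>,
    pub max: Option<T>,
    pub null_count: u64,
}

/// Computes min, max and null count for one batch (illustrative only).
///
/// Assumption: `values` contains only non-null values, and a slot counts as
/// null when its definition level is below `max_def_level`.
pub fn compute_batch_stats<T: PartialOrd + Copy>(
    values: &[T],
    def_levels: Option<&[i16]>,
    max_def_level: i16,
) -> BatchStats<T> {
    let mut min: Option<T> = None;
    let mut max: Option<T> = None;
    for &v in values {
        // Skip values that are not comparable to themselves (e.g. NaN),
        // so they cannot poison the running min/max.
        if v.partial_cmp(&v).is_none() {
            continue;
        }
        min = match min {
            Some(m) if m <= v => Some(m),
            _ => Some(v),
        };
        max = match max {
            Some(m) if m >= v => Some(m),
            _ => Some(v),
        };
    }
    // Without definition levels the column is treated as having no nulls.
    let null_count = def_levels
        .map(|levels| levels.iter().filter(|&&l| l < max_def_level).count() as u64)
        .unwrap_or(0);
    BatchStats { min, max, null_count }
}
```

An empty batch yields `None` for both bounds, which lets the caller omit min/max from the written statistics rather than invent a value.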

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

@nevi-me
Contributor

nevi-me commented Jun 30, 2020

Hi @zeevm, could you please rebase (to fix the Rust failures) and open a JIRA for this PR?

@zeevm zeevm changed the title Calculate page and column statistics ARROW-9280: [Rust] Calculate page and column statistics Jun 30, 2020
@zeevm zeevm changed the title ARROW-9280: [Rust] Calculate page and column statistics ARROW-9280: [Rust] [Parquet] Calculate page and column statistics Jun 30, 2020
@github-actions

Use pre-calculated statistics when available
@zeevm zeevm force-pushed the write_parquet_statistics branch from f6f96e6 to 45293d6 Compare June 30, 2020 15:06
@zeevm
Author

zeevm commented Jun 30, 2020 via email

@nevi-me
Contributor

nevi-me commented Jun 30, 2020

EDIT: It looks like the failures are from parquet. I saw 4 failures when skimming through and assumed they were the ones that I fixed, but that is not the case. I think the failures are related to your changes.

failures:
    arrow::arrow_reader::tests::test_fixed_length_binary_column_reader
    arrow::arrow_reader::tests::test_utf8_single_column_reader_test
    column::writer::tests::test_column_writer_default_encoding_support_byte_array
    column::writer::tests::test_column_writer_default_encoding_support_fixed_len_byte_array

Thanks @zeevm, the remaining failure can be fixed by running cargo +stable fmt from within the rust folder. We use the rustfmt from the stable toolchain, to avoid more frequent changes from a nightly one.

Member

@sunchao sunchao left a comment

Thanks @zeevm - left some comments.

values: &[T::T],
def_levels: Option<&[i16]>,
rep_levels: Option<&[i16]>,
min: &Option<T::T>,
Member

Is it possible that, among these 4 parameters, people provide only, say, nulls_count and leave the rest as None? Will this result in partial stats and cause issues when compute engines want to rely on them? If so, do we want to enforce that either all 4 are None or all 4 are Some?

Author

IIUC the format specifies that the various stats are optional, so it seems reasonable to allow the caller to specify only some of the values, doesn't it?
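
The two policies discussed here can be sketched side by side. This is an illustration only, not the signature used in this PR: `StatsHint`, `is_all_or_none` and `resolve_min_max` are hypothetical names. The permissive version trusts whatever the caller supplied and computes only the missing bounds; the strict check is what an all-or-nothing rule would test.

```rust
/// Hypothetical bundle of caller-supplied statistics; every field is optional.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct StatsHint<T> {
    pub min: Option<T>,
    pub max: Option<T>,
    pub null_count: Option<u64>,
    pub distinct_count: Option<u64>,
}

impl<T> StatsHint<T> {
    /// Strict policy: all four values are present, or all four are absent.
    pub fn is_all_or_none(&self) -> bool {
        let present = [
            self.min.is_some(),
            self.max.is_some(),
            self.null_count.is_some(),
            self.distinct_count.is_some(),
        ];
        present.iter().all(|&p| p) || present.iter().all(|&p| !p)
    }
}

/// Permissive policy: keep supplied bounds as-is and scan `values` only for
/// the bounds that are missing. Supplied bounds are trusted, not verified.
pub fn resolve_min_max<T: PartialOrd + Copy>(
    hint: &StatsHint<T>,
    values: &[T],
) -> (Option<T>, Option<T>) {
    let min = hint.min.or_else(|| {
        values.iter().copied().fold(None, |acc: Option<T>, v| match acc {
            Some(m) if m <= v => Some(m),
            _ => Some(v),
        })
    });
    let max = hint.max.or_else(|| {
        values.iter().copied().fold(None, |acc: Option<T>, v| match acc {
            Some(m) if m >= v => Some(m),
            _ => Some(v),
        })
    });
    (min, max)
}
```

With the permissive policy, a caller converting from a format that already stores only a minimum could pass just that and still get a computed maximum. The trade-off is that a wrong hint is written out unchecked.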

min_page_value: None,
max_page_value: None,
num_page_nulls: 0,
page_distinct_count: None,
Member

Seems this is not used at all?

Author

These are used to track page-level stats. When a page is written, they are written out as that page's stats and are also used to update the column-level stats.
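
Roughly, the flow looks like the sketch below. It is a simplified stand-in rather than the writer code in this PR: `StatsTracker`, `update_page`, `flush_page` and the `*_column_*` fields are hypothetical, and only the page-level field names are borrowed from the diff.

```rust
/// Simplified, hypothetical tracker for page-level and column-level stats.
#[derive(Debug)]
pub struct StatsTracker<T> {
    min_page_value: Option<T>,
    max_page_value: Option<T>,
    num_page_nulls: u64,
    min_column_value: Option<T>,
    max_column_value: Option<T>,
    num_column_nulls: u64,
}

fn lower<T: PartialOrd + Copy>(cur: Option<T>, v: T) -> T {
    match cur {
        Some(c) if c <= v => c,
        _ => v,
    }
}

fn upper<T: PartialOrd + Copy>(cur: Option<T>, v: T) -> T {
    match cur {
        Some(c) if c >= v => c,
        _ => v,
    }
}

impl<T: PartialOrd + Copy> StatsTracker<T> {
    pub fn new() -> Self {
        StatsTracker {
            min_page_value: None,
            max_page_value: None,
            num_page_nulls: 0,
            min_column_value: None,
            max_column_value: None,
            num_column_nulls: 0,
        }
    }

    /// Folds a batch of non-null values and a null count into the current page.
    pub fn update_page(&mut self, values: &[T], nulls: u64) {
        for &v in values {
            self.min_page_value = Some(lower(self.min_page_value, v));
            self.max_page_value = Some(upper(self.max_page_value, v));
        }
        self.num_page_nulls += nulls;
    }

    /// Returns the finished page's (min, max, nulls), merges them into the
    /// column totals, and resets the page-level state for the next page.
    pub fn flush_page(&mut self) -> (Option<T>, Option<T>, u64) {
        let page = (
            self.min_page_value.take(),
            self.max_page_value.take(),
            self.num_page_nulls,
        );
        if let Some(v) = page.0 {
            self.min_column_value = Some(lower(self.min_column_value, v));
        }
        if let Some(v) = page.1 {
            self.max_column_value = Some(upper(self.max_column_value, v));
        }
        self.num_column_nulls += page.2;
        self.num_page_nulls = 0;
        page
    }

    /// Column-level (min, max, nulls) accumulated over all flushed pages.
    pub fn column_stats(&self) -> (Option<T>, Option<T>, u64) {
        (self.min_column_value, self.max_column_value, self.num_column_nulls)
    }
}
```

Resetting the page fields on flush is what keeps each page's stats independent, while the column totals only ever widen.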

Member

Yes, but it is neither updated nor used at all. Can you double check?

Author

Are you specifically referring to 'page_distinct_count', or all 4 page level vars?

Author

All of them are used. page_distinct_count isn't being calculated in this PR, though; that will probably come in a follow-up PR.

Author

They're used here:

flush_data_pages()
make_typed_statistics()
update_page_min_max()

Member

Yes, I meant page_distinct_count. It is fine to do this in a follow-up.

Ze'ev Maor and others added 4 commits July 1, 2020 16:41
…y spec

https://issues.apache.org/jira/browse/ARROW-7285

Closes apache#7544 from liyafan82/fly_0619_ipc

Lead-authored-by: liyafan82 <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…7347)

This commit moves all Netty specific calls into a few classes.
This is the precursor to splitting the netty and unsafe allocators
out to their own modules
Member

@sunchao sunchao left a comment

LGTM

@zeevm zeevm closed this Jul 2, 2020
@zeevm zeevm reopened this Jul 2, 2020
@sunchao
Member

sunchao commented Jul 2, 2020

@zeevm once approved, a committer will help merge this. The PR seems to be a little messed up now; can you clean it up so I can merge it?
