Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parquet StatisticsConverter for arrow reader #6046

Merged
merged 11 commits into from
Jul 16, 2024

Conversation

efredine
Copy link
Contributor

@efredine efredine commented Jul 11, 2024

Which issue does this PR close?

Closes #4328.

Rationale for this change

Ports StatisticsConverter implementation and tests from Data Fusion.

What changes are included in this PR?

The StatisticsConverter and all tests. It is functionally unchanged from the DataFusion implementation.

Changes:

  • removed all log::debug statements as it seemed to me these aren't used in arrow crate?
  • converted all errors to use the arrow_err macro.

For the tests I only moved over the the code actually used by the statistics tests.

Are there any user-facing changes?

Yes, exposes the StatisticsConverter.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2024
@efredine efredine marked this pull request as draft July 11, 2024 22:09
@efredine
Copy link
Contributor Author

There is a pending PR in DataFusion that will need to be ported: https://github.com/apache/datafusion/pull/11289/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40

We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.

@efredine efredine marked this pull request as ready for review July 11, 2024 22:40
@alamb
Copy link
Contributor

alamb commented Jul 13, 2024

Amazing @efredine -- thank you. I am working through this but may not finish until tomorrow

@alamb
Copy link
Contributor

alamb commented Jul 13, 2024

We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.

I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )

My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.

@Lordworms
Copy link
Contributor

We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.

I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )

My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.

I agree, should I port the struct related function now?

@efredine
Copy link
Contributor Author

We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.

I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )
My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.

I agree, should I port the struct related function now?

@Lordworms I don't think it should be part of this PR which is already huge. I think the easiest thing to do is to wait until this PR is merged (which should happen soon) then open a new PR in this repository with the struct changes.

/// underlying statistics value (stored as a parquet value) into the
/// corresponding Arrow value. For example, Decimals are stored as binary in
/// parquet files.
///
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of the port, I changed the visibility of parquet_column from pub(crate) to pub because the pub(crate) caused a failure in documentation tests. But I'm not sure this was the right way to resolve that issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having parquet_column pub is a good change and it will be useful for others.

However, since it is more applicable than just statistics, think it should be moved to the main arrow.rs (I will do so shortly)

// in the parquet schema
return None;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is addressing the efficiency consideration here? For a table with many columns it would be a lot of linear searches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should file a follow on ticket to improve the situation. I think we have something functional and then we can always make it better as a follow on

@Lordworms
Copy link
Contributor

We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.

I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )
My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.

I agree, should I port the struct related function now?

@Lordworms I don't think it should be part of this PR which is already huge. I think the easiest thing to do is to wait until this PR is merged (which should happen soon) then open a new PR in this repository with the struct changes.

Got it, I'll wait for it to be merged, thanks for your work.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you so much @efredine -- this is epic (even though I know most of it was just moving code). It was quite easy to read and I found nothing in need of changes.

I know @tustvold has concerns about ignoring / not handling the ColumnOrder statistics correctly: apache/datafusion#10586 Once we sort out what the practical implementations (which is not at all clear to me know) I will make a PR to update the documentation

The one final thing I want to do before merging this PR is to make a draft PR in DataFusion to use it and verify that everything works. Doing so now

// in the parquet schema
return None;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should file a follow on ticket to improve the situation. I think we have something functional and then we can always make it better as a follow on

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

I merged this branch up from master and moved where parquet_column went.

I plan to draft a "use upstream arrow version" PR in DataFusion, file a follow on PR for improving the performance of parquet_column and then I think this PR will be good to go

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

FYI @marvinlanhenke

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

Oh darn - yes - I missed those. If you run out of time I’m happy to port them.

No worries, I already pushed it to your branch (i had the code checked out anyways)

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

Updates

@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

Ok, I pushed some commits to this branch:

  1. 5c8c1ba: adds API I found I needed in Update parquet page pruning code to use the StatisticsExtractor datafusion#11483
  2. f993b08: overly obsessive doc editing

@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

I think this PR is ready to go -- I am just going to try to make sure I can get apache/datafusion#11479 working with this as one final test

@alamb alamb changed the title Add parquet statistics converter for arrow reader Add parquet StatisticsConverter for arrow reader Jul 16, 2024
@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

Thanks again so much @efredine -- I plan to merge this later today unless anyone else would like time to review or comment

@alamb alamb merged commit 66390ff into apache:master Jul 16, 2024
18 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

🚀 -- thanks again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add function that converts from parquet statistics ParquetStatistics to arrow arrays ArrayRef
3 participants