Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ThriftMetadataWriter for writing Parquet metadata #6197

Merged
merged 27 commits into from
Aug 6, 2024

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Aug 5, 2024

BugenZhao and others added 26 commits July 16, 2024 18:05
)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <[email protected]>

* fix example tests

Signed-off-by: Bugen Zhao <[email protected]>

---------

Signed-off-by: Bugen Zhao <[email protected]>
…ally copies data (apache#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy
* improve dispaly for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.
* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays
* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <[email protected]>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <[email protected]>
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…MetaData` (apache#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData
* Update FlightSql.proto to version 17.0

Adds new message CommandStatementIngest and removes `experimental` from
other messages.

* Regenerate flight sql protocol

This upgrades the file to version 17.0 of the protobuf definition.
Add test for metadata equivalence
separate tests that require arrow into a separate module
Fix checks and merge with master
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 5, 2024
@adriangb
Copy link
Contributor Author

adriangb commented Aug 5, 2024

cc @alamb @etseidl

@etseidl
Copy link
Contributor

etseidl commented Aug 5, 2024

Thanks @adriangb. I'll do a review later this afternoon, but I think this is looking pretty much ready to merge.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great first step. Thanks for working with me through all the page index changes.

You might want update the description to mention the associated issues (#5988, #6002) so they'll link to this PR.

@alamb alamb changed the title Add encode metadata Add ThriftMetadataWriter for writing Parquet metadata Aug 6, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb and @etseidl

I took the liberty of merging up from main to resolve a conflict

I found a few things I think we can/should improve, most notably I don't think ThriftMetadataWriter is marked as pub and thus can't be used yet.

Since this PR has been hanging out for so long, however, what I think we should do is merge it in to master and then iterate there. I plan to do so once CI passes

Among other things I would like to move ThriftMetadataWriter to its own module and add some documentation examples. I will also update #6184 to mention this code.

Thanks again 🙏

@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

let's keep hacking on main. Next PR with some more docs, etc coming soon

@alamb alamb merged commit d5ed6b9 into apache:master Aug 6, 2024
17 checks passed
@adriangb
Copy link
Contributor Author

adriangb commented Aug 6, 2024

Amazing, thank you both for pushing this forward!

@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

Among other things I would like to move ThriftMetadataWriter to its own module and add some documentation examples. I will also update #6184 to mention this code.

I see after some more study that ThriftMetadataWriter is likely too low level to expose publically. I will focus on ParquetMetadataWriter at first

@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

I made #6202 with some small tweaks and improved documentation in case anyone has a chance to look at it

@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

BTW check out this code in action: #6081 🚀

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants