@ajantha-bhat ajantha-bhat commented Sep 4, 2023

  • Add interfaces to SnapshotProducer to write the partition stats (in Parquet format)

  • Store partition stats file location in snapshot summary

  • Use PartitionStatsUtil.java from the core module to synchronously
    write the Parquet partition stats for every snapshot produced.

  • Compute the partition stats based on the table property
    write.partition.statistics.
    Enabled by default during review, to surface any issues in the existing partitioned-table test cases.

  • Track these partition stats in TableMetadata and Snapshot
    by registering them during the write operation.
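
For reviewers, the property gating and summary bookkeeping described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the property name `write.partition.statistics` and its enabled-by-default behavior come from the description, while the class name, the method names, and the summary key `partition-stats-file-location` are hypothetical placeholders.

```java
import java.util.HashMap;
import java.util.Map;

class PartitionStatsGate {
  // Property name as given in the PR description.
  static final String WRITE_PARTITION_STATS = "write.partition.statistics";
  // The description says the feature is enabled by default during review.
  static final boolean WRITE_PARTITION_STATS_DEFAULT = true;
  // Hypothetical summary key; the PR text does not name the real one.
  static final String STATS_LOCATION_KEY = "partition-stats-file-location";

  // Returns true when the property is unset or set to "true".
  static boolean statsEnabled(Map<String, String> tableProperties) {
    String value = tableProperties.get(WRITE_PARTITION_STATS);
    return value == null ? WRITE_PARTITION_STATS_DEFAULT : Boolean.parseBoolean(value);
  }

  // Returns a copy of the summary, adding the stats file location only when enabled.
  static Map<String, String> withStatsLocation(
      Map<String, String> tableProperties, Map<String, String> summary, String statsLocation) {
    Map<String, String> result = new HashMap<>(summary);
    if (statsEnabled(tableProperties) && statsLocation != null) {
      result.put(STATS_LOCATION_KEY, statsLocation);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    Map<String, String> summary = new HashMap<>();
    summary.put("operation", "append");
    // The path below is a made-up example location.
    System.out.println(
        withStatsLocation(props, summary, "s3://bucket/table/metadata/partition-stats.parquet"));
  }
}
```

Returning a copy keeps the input summary untouched, so a caller that disables the property gets back exactly what it passed in.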

Depends on PR #7105, #8500, #8501, #8502, #8503

Fixes #8458

Address new comments

Remove trailing whitespace

- Since the core module needs to write stats in Parquet format, move all the files from the iceberg-parquet module to iceberg-core to avoid a circular dependency.
- `TestParquetReadProjection` used to duplicate the test code of the iceberg-api module's `TestReadProjection`.
Removed the duplicate class; it now directly extends the original class from the iceberg-api module.
- Update `TestParquetReadProjection` to skip the empty-struct test cases, as only the Avro readers support them.
The test cases are now common to both the Avro and Parquet readers.
@ajantha-bhat ajantha-bhat force-pushed the wip_pstats branch 3 times, most recently from c4c8c4b to 0695494 Compare September 5, 2023 15:16
@ajantha-bhat ajantha-bhat changed the title from "[WIP] Partition stats overall PRs" to "Core: Write partition stats during write operation" Sep 5, 2023
`PartitionsTable.Partition` will be shared by the Partitions metadata table
and the partition stats reader and writer.
Hence, move it to a separate class and make it implement Avro's
`IndexedRecord` (for partition stats writing).
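
As a rough illustration of why positional access helps here: Avro's `IndexedRecord` exposes fields by position through `get(int)` and `put(int, Object)`, so a generic writer can read values without knowing the concrete class. The sketch below uses a hypothetical stand-in interface instead of the real Avro type so that it compiles on its own, and the three fields are an illustrative subset, not the actual layout of `Partition`.

```java
// Hypothetical stand-in for the positional contract of Avro's IndexedRecord.
interface PositionalRecord {
  void put(int pos, Object value);

  Object get(int pos);
}

// Illustrative row type; field names and ordering are assumptions for this sketch.
class PartitionStatsRow implements PositionalRecord {
  private String partitionKey;
  private long dataRecordCount;
  private int dataFileCount;

  @Override
  public void put(int pos, Object value) {
    switch (pos) {
      case 0:
        partitionKey = (String) value;
        break;
      case 1:
        dataRecordCount = (Long) value;
        break;
      case 2:
        dataFileCount = (Integer) value;
        break;
      default:
        throw new UnsupportedOperationException("Unknown field ordinal: " + pos);
    }
  }

  @Override
  public Object get(int pos) {
    switch (pos) {
      case 0:
        return partitionKey;
      case 1:
        return dataRecordCount;
      case 2:
        return dataFileCount;
      default:
        throw new UnsupportedOperationException("Unknown field ordinal: " + pos);
    }
  }

  public static void main(String[] args) {
    PartitionStatsRow row = new PartitionStatsRow();
    row.put(0, "dt=2023-09-04");
    row.put(1, 10L);
    row.put(2, 2);
    // A generic writer would walk the positions like this.
    for (int i = 0; i < 3; i++) {
      System.out.println(i + " -> " + row.get(i));
    }
  }
}
```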
Track `PartitionStatisticsFile` in the same way that `StatisticsFile` is already tracked.
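
A minimal sketch of that kind of tracking, assuming at most one partition stats file is registered per snapshot and that re-registering for the same snapshot replaces the earlier entry (an assumption made for this example, not something stated in this commit message). All class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionStatsTracking {
  // Hypothetical minimal reference to a stats file: which snapshot it belongs to and where it lives.
  static final class StatsFileRef {
    final long snapshotId;
    final String path;

    StatsFileRef(long snapshotId, String path) {
      this.snapshotId = snapshotId;
      this.path = path;
    }
  }

  // Keyed by snapshot ID so that each snapshot has at most one entry.
  private final Map<Long, StatsFileRef> bySnapshot = new LinkedHashMap<>();

  // Registers a stats file, replacing any earlier file for the same snapshot.
  void register(StatsFileRef file) {
    bySnapshot.put(file.snapshotId, file);
  }

  // Drops the entry for a snapshot, for example when the snapshot expires.
  void remove(long snapshotId) {
    bySnapshot.remove(snapshotId);
  }

  // Returns a snapshot of the currently tracked files.
  List<StatsFileRef> files() {
    return new ArrayList<>(bySnapshot.values());
  }

  public static void main(String[] args) {
    PartitionStatsTracking tracking = new PartitionStatsTracking();
    // Paths are made-up examples.
    tracking.register(new StatsFileRef(1L, "metadata/partition-stats-1.parquet"));
    tracking.register(new StatsFileRef(2L, "metadata/partition-stats-2.parquet"));
    for (StatsFileRef ref : tracking.files()) {
      System.out.println(ref.snapshotId + " -> " + ref.path);
    }
  }
}
```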
- Add interfaces to SnapshotProducer to write the partition stats (in Parquet format)

- Store partition stats file location in snapshot summary

- Use `PartitionStatsUtil.java` from the core module to synchronously
write the Parquet partition stats for every snapshot produced.

- Compute the partition stats based on the table property
`write.partition.statistics`.
Enabled by default during review, to surface any issues in the existing partitioned-table test cases.

- Track these partition stats in TableMetadata and Snapshot
by registering them during the write operation.

Successfully merging this pull request may close these issues.

Implement Synchronous partition stats writing during write operation (controlled by table property).
