partitioned write support #353

Draft: wants to merge 18 commits into base: main from the partitioned-write branch

Conversation

@jqin61 (Contributor) commented Feb 2, 2024

Todo

  • support partitioned append()
    • support append with identity transform (a sketch follows this list)
    • fix the scenario where the Arrow table schema is not aligned with the Iceberg schema (finished by others)
    • add an integration test for null-column partitioning after issue #348 is closed
    • avoid sorting the input Arrow table when it is already sorted
  • support the partition field in the manifest file (see PartitionKey)
  • apply transforms when partitioning; analyze the efficiency of the partitioning algorithm when a transform is involved
  • support partitioned static overwrite()
    • overwrite the entire table
    • overwrite with an expression or filter string (a specified partition)
    • overwrite filter validation (validation for static overwrite with filter jqin61/iceberg-python#4): as discussed in the monthly meeting, overwrite will be supported by delete + append. We will support more general filters than Spark Iceberg and might rewrite files for overwriting rather than only using IsNull and EqualTo, so this item is not needed.
  • extend the snapshot summary with partition stats (Add partition stats in snapshot summary #521)
  • support partitioned dynamic overwrite()
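
For the identity-transform append item above, here is a minimal sketch of how the input Arrow table could be split into one slice per distinct partition tuple (one data file per slice). The helper name and the per-key filtering approach are illustrative assumptions only; the PR may instead sort the table and take contiguous slices, as the sort-order helper discussed further down suggests.

import pyarrow as pa
import pyarrow.compute as pc

def slice_by_identity_partition(table: pa.Table, partition_columns: list[str]) -> list[pa.Table]:
    # Hypothetical helper: split the table into one slice per distinct partition tuple.
    distinct_keys = table.select(partition_columns).group_by(partition_columns).aggregate([])
    slices = []
    for key in distinct_keys.to_pylist():
        mask = None
        for column, value in key.items():
            # A null partition value needs is_null(); equality against None is always null.
            cond = pc.is_null(table[column]) if value is None else pc.equal(table[column], value)
            mask = cond if mask is None else pc.and_(mask, cond)
        slices.append(table.filter(mask))
    return slices

# Example: two identity partitions on "event_date" yield two slices.
tbl = pa.table({"event_date": ["2024-03-01", "2024-03-01", "2024-03-02"], "value": [1, 2, 3]})
parts = slice_by_identity_partition(tbl, ["event_date"])
assert len(parts) == 2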

@jqin61 marked this pull request as draft on February 2, 2024 05:19
@jqin61 mentioned this pull request on Feb 2, 2024
arrow_table_partition: pa.Table


def get_partition_sort_order(partition_columns: list[str], reverse: bool = False) -> dict[str, Any]:
Collaborator commented:

Suggested change:
- def get_partition_sort_order(partition_columns: list[str], reverse: bool = False) -> dict[str, Any]:
+ def _get_partition_sort_order(partition_columns: list[str], reverse: bool = False) -> dict[str, Any]:

This is a nit: should we use _single_leading_underscore to indicate that these are internal functions according to PEP-8?

@jqin61 (Contributor, author) replied:

Yes, thanks for the reminder! More refactoring is on the way.
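
For readers outside the diff, a minimal sketch of what a helper with this signature might return and how it could be used, assuming it builds keyword arguments for pyarrow.compute.sort_indices; the actual body is not shown in this excerpt.

from typing import Any

import pyarrow as pa
import pyarrow.compute as pc

def _get_partition_sort_order(partition_columns: list[str], reverse: bool = False) -> dict[str, Any]:
    # Assumed body: keyword arguments accepted by pyarrow.compute.sort_indices.
    order = "descending" if reverse else "ascending"
    null_placement = "at_start" if reverse else "at_end"
    return {"sort_keys": [(name, order) for name in partition_columns], "null_placement": null_placement}

# Usage: sort the input table so the rows of each partition become contiguous slices.
def _sort_by_partition(table: pa.Table, partition_columns: list[str]) -> pa.Table:
    indices = pc.sort_indices(table, **_get_partition_sort_order(partition_columns))
    return table.take(indices)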

@@ -308,6 +308,7 @@ def data_file_with_partition(partition_type: StructType, format_version: Literal
field_id=field.field_id,
name=field.name,
field_type=partition_field_to_data_file_partition_field(field.field_type),
required=False
@jqin61 (Contributor, author) commented Feb 5, 2024:

The default value is True, which breaks the Avro writer when encoding a manifest entry whose partition value is null. (With required set to True on the nested field of the Avro schema, the writer infers a non-optional field writer, which cannot encode a None value.)

This is for encoding only, so set it to False for now.

TODO: set it properly according to the table schema.
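
A hedged illustration of the point above: with required=True (the default) the resolved Avro field writer is non-optional and cannot encode None, while required=False declares the partition field as nullable. The field id, name, and type below are placeholders, not values from this PR.

from pyiceberg.types import NestedField, StringType

# Example partition field only; the real field id, name, and type come from the table's partition spec.
partition_field = NestedField(
    field_id=1000,
    name="category",
    field_type=StringType(),
    required=False,  # optional, so a null partition value can still be encoded to Avro
)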

@Fokko added this to the PyIceberg 0.7.0 release milestone on Feb 7, 2024
@jqin61 force-pushed the partitioned-write branch 2 times, most recently from 805c19c to 7a33b23 on February 16, 2024 19:19
@kevinjqliu mentioned this pull request on Feb 17, 2024
@jqin61 (Contributor, author) commented Mar 28, 2024

As discussed in the monthly community sync, this will be broken down into 4 PRs (see the sketch after this list):

  1. Partitioned append with identity transform
  2. Dynamic overwrite using delete + append, 2 snapshots in one commit
  3. Hidden partitioning support (for slicing the Arrow table, the manifest file entry.partition, and the data file path)
  4. Static overwrite using delete + append, 2 snapshots in one commit
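
To make the split concrete, a hedged sketch of the user-facing calls this series works toward, assuming the Table.append / Table.overwrite API shape; dynamic_partition_overwrite is the name used in later PyIceberg releases and was not final at the time of this comment. The catalog and table identifiers are placeholders.

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # placeholder catalog name
table = catalog.load_table("db.events")    # placeholder table, partitioned by identity(event_date)

df = pa.table({"event_date": ["2024-03-01", "2024-03-02"], "value": [1, 2]})

# PR 1: partitioned append with identity transform.
table.append(df)

# PR 2: dynamic overwrite -- replace exactly the partitions present in df,
# committed as a delete snapshot followed by an append snapshot.
table.dynamic_partition_overwrite(df)

# PR 4: static overwrite -- delete rows matching the filter, then append df.
table.overwrite(df, overwrite_filter="event_date >= '2024-03-01'")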
