Partitioned Append on Identity Transform #555
Conversation
pyiceberg/table/__init__.py (Outdated)

```python
        table_metadata=table_metadata,
        tasks=iter([WriteTask(write_uuid, next(counter), batches) for batches in bin_pack_arrow_table(df, target_file_size)]),  # type: ignore
    )
if any(len(spec.fields) > 0 for spec in table_metadata.partition_specs):
```
It seems the old line was not checking whether the table is partitioned, but rather whether the partition spec had evolved?

```python
if len([spec for spec in table_metadata.partition_specs if spec.spec_id != 0]) > 0:
```
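For illustration, a minimal sketch (with a simplified stand-in class, not PyIceberg's actual `PartitionSpec`) of why the two checks differ for a table that was partitioned at creation time:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Spec:  # hypothetical stand-in for PartitionSpec
    spec_id: int
    fields: List[str] = field(default_factory=list)

# A table partitioned from day one has a single spec with spec_id == 0 and fields:
partition_specs = [Spec(spec_id=0, fields=["dt"])]

old_check = len([s for s in partition_specs if s.spec_id != 0]) > 0  # False: misses it
new_check = any(len(s.fields) > 0 for s in partition_specs)          # True: correct
```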
Great find!
@jqin61 - this PR is looking 🔥 🔥 It is super exciting to see it up and in such a great state already. I've left a few suggestions; please let me know if you want to discuss any of the suggested ideas in more detail.
pyiceberg/table/__init__.py (Outdated)
```python
target_file_size = PropertyUtil.property_as_int(
    properties=table_metadata.properties,
    property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
    default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
)
if target_file_size is None:
    raise ValueError(
        "Fail to get neither TableProperties.WRITE_TARGET_FILE_SIZE_BYTES nor WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT for writing target data file."
    )
```
I have mixed feelings about this exception check, because we are setting the default value of `target_file_size` to `TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT` right on the previous line; the check feels redundant. I understand why we are doing it, though: `PropertyUtil.property_as_int` returns `Optional[int]`, and bin-packing expects an `int`, so we need to narrow the type. If we run into more of these type-checking redundancies in the code base, in places where we use property values that are always expected to have a non-null default, maybe we should refactor `PropertyUtil` instead. Maybe we can have two methods: `property_as_int`, which returns an `Optional[int]`, and `property_as_int_with_default`, which returns an `int`?
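A minimal sketch of what that split could look like (assumed signatures; the actual `PropertyUtil` in pyiceberg may differ):

```python
from typing import Dict, Optional

class PropertyUtil:
    @staticmethod
    def property_as_int(properties: Dict[str, str], property_name: str) -> Optional[int]:
        # No default: an absent property surfaces as None, and callers must handle it.
        value = properties.get(property_name)
        return int(value) if value is not None else None

    @staticmethod
    def property_as_int_with_default(properties: Dict[str, str], property_name: str, default: int) -> int:
        # Non-null default: callers get a plain int and can skip the None check.
        value = properties.get(property_name)
        return int(value) if value is not None else default
```

With the second method, the `target_file_size is None` branch above would no longer be needed at the call site.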
`property_as_int_with_default` sounds better to me, because all the exceptions raised due to a missing default property could be centralized in that function? How do you feel about it?
I like that as well; the `ValueError` is misleading, and it is not directly obvious why we would raise it.
I just find that the default value itself could be `None`, e.g. `PARQUET_COMPRESSION_LEVEL_DEFAULT = None`, so this `None` check is not unnecessary? The original code for this `target_file_size` check just `# type: ignore`s it.
pyiceberg/table/__init__.py (Outdated)

```python
    """
    import pyarrow as pa

    partition_columns = get_partition_columns(iceberg_table_metadata, arrow_table)
```
How do you feel about this suggestion? Most of this function's responsibility seems to lie in making sure that the partition field is present in the arrow_table, but we already seem to be checking the schema in the write functions now.

```diff
-partition_columns = get_partition_columns(iceberg_table_metadata, arrow_table)
+partition_columns = [iceberg_table_metadata.schema().find_column_name(partition_field.source_id) for partition_field in iceberg_table_metadata.spec().fields]
```
It will be more useful when there are hidden partition columns. And the check is also there for the mypy check, because `find_column_name` returns `Optional[str]`.
@syun64 Please give another round of review, thank you!
Left some small comments, apart from that it looks good to me 👍
```python
@@ -289,10 +286,7 @@ def partition_field_to_data_file_partition_field(partition_field_type: IcebergType)

@partition_field_to_data_file_partition_field.register(LongType)
@partition_field_to_data_file_partition_field.register(DateType)
```
This single-dispatch seems to be there only for the `TimeType`. Probably we should also convert those into a native type.
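For context, a hypothetical sketch of the single-dispatch shape being discussed (the real function's arguments and conversions in pyiceberg may differ):

```python
from functools import singledispatch
from typing import Any

from pyiceberg.types import IcebergType, TimeType

@singledispatch
def partition_field_to_data_file_partition_field(partition_field_type: IcebergType, value: Any) -> Any:
    # Default case: most partition values can be written to the data file as-is.
    return value

@partition_field_to_data_file_partition_field.register(TimeType)
def _(partition_field_type: TimeType, value: Any) -> Any:
    # TimeType would be the one remaining registration that needs converting
    # to a native representation before being written to the data file.
    return value  # placeholder; the actual conversion lives in the PR
```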
Fixed in commit 82dd3ad.
Beautiful, thanks 👍
pyiceberg/table/__init__.py (Outdated)

```diff
@@ -1131,8 +1133,11 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
     if not isinstance(df, pa.Table):
         raise ValueError(f"Expected PyArrow table, got: {df}")

-    if len(self.spec().fields) > 0:
-        raise ValueError("Cannot write to partitioned tables")
+    supported = {IdentityTransform}
```
Nit:

```diff
-supported = {IdentityTransform}
+supported_transforms = {IdentityTransform}
```
pyiceberg/table/__init__.py (Outdated)

```python
# Later to be extended with partition information
def generate_data_file_partition_path(self) -> str:
```
Nit: This function looks redundant. The check is being done in `generate_data_file_path()` as well; I would merge those two.
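A rough sketch of the suggested merge (the attribute names and path layout here are hypothetical, not the PR's actual ones):

```python
def generate_data_file_path(self, extension: str) -> str:
    # Merged version: branch on the partition key here instead of delegating
    # to a separate generate_data_file_partition_path() helper.
    if self.partition_key:  # hypothetical attribute holding the task's partition
        partition_path = self.partition_key.to_path()  # e.g. "dt=2023-01-01"
        return f"{partition_path}/{self.write_uuid}-{self.task_id}.{extension}"
    return f"{self.write_uuid}-{self.task_id}.{extension}"
```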
pyiceberg/table/__init__.py (Outdated)

```python
    return table_partitions


def partition(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> Iterable[TablePartition]:
```
It would be good to have somewhat longer, more descriptive function names. I also think we should hide this from the outside user.

```diff
-def partition(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> Iterable[TablePartition]:
+def _determine_partitions(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> List[TablePartition]:
```

I think we can also return a list, so folks know that it is already materialized; see the sketch below.
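To illustrate the materialized-list idea with a single identity column (a simplified sketch, not the PR's actual implementation, which handles full partition specs):

```python
from typing import List

import pyarrow as pa
import pyarrow.compute as pc

def _determine_partitions_by_column(arrow_table: pa.Table, column: str) -> List[pa.Table]:
    # Slice the table into one sub-table per distinct value of a single
    # identity-partitioned column.
    values = arrow_table.column(column).unique().to_pylist()
    # Returning a list (not a generator) means callers can take len() and
    # iterate more than once: the slicing work has already happened.
    return [arrow_table.filter(pc.equal(arrow_table.column(column), v)) for v in values]
```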
```python
                schema=table_metadata.schema(),
            )
            for partition in partitions
            for batches in bin_pack_arrow_table(partition.arrow_table_partition, target_file_size)
```
This looks very nice!
```diff
@@ -2000,7 +2000,11 @@ def spark() -> "SparkSession":
     'float': [0.0, None, 0.9],
     'double': [0.0, None, 0.9],
     'timestamp': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)],
-    'timestamptz': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)],
+    'timestamptz': [
```
Nice one!
tests/conftest.py (Outdated)

Suggested change:

```diff
-import pyarrow as pa
-
-"""PyArrow table with all kinds of columns."""
+"""PyArrow table with all kinds of columns."""
+import pyarrow as pa
```
Thanks for reviewing, @Fokko! I applied your suggestions and am ready for another round of review. Thank you!
This looks great, @jqin61. Thanks again for working on this 👍
pyiceberg/manifest.py (Outdated)

```diff
-def data_file_with_partition(partition_type: StructType, format_version: TableVersion) -> StructType:
+def data_file_with_partition(partition_type: StructType, format_version: Literal[1, 2]) -> StructType:
```
Nit:

```diff
-def data_file_with_partition(partition_type: StructType, format_version: Literal[1, 2]) -> StructType:
+def data_file_with_partition(partition_type: StructType, format_version: TableVersion) -> StructType:
```
As discussed in the monthly meeting, this is the first of 4 PRs that break down #353:
1. Partitioned append with identity transform (this PR)

The other three:
2. Dynamic overwrite using delete + append, 2 snapshots in one commit
3. Hidden partitioning support (for slicing the arrow table, the manifest file entry.partition, and the data file path)
4. Static overwrite using delete + append, 2 snapshots in one commit