Partitioned Append on Identity Transform #555

Merged: 20 commits, Apr 5, 2024

Conversation

@jqin61 (Contributor) commented Mar 28, 2024

As discussed in the monthly meeting, this is the first PR to break #353 down into 4 PRs:
1. Partitioned append with identity transform

The other three:
2. Dynamic overwrite using delete + append, 2 snapshots in one commit
3. Hidden partitioning support (for slicing the arrow table, manifest file entry.partition, data file path)
4. Static overwrite using delete + append, 2 snapshots in one commit

@jqin61 changed the title from "partitioned append on identity transform" to "Partitioned Append on Identity Transform" on Mar 28, 2024
table_metadata=table_metadata,
tasks=iter([WriteTask(write_uuid, next(counter), batches) for batches in bin_pack_arrow_table(df, target_file_size)]), # type: ignore
)
if any(len(spec.fields) > 0 for spec in table_metadata.partition_specs):
Contributor Author (@jqin61):

It seems the old line was not checking whether the table is partitioned, but was checking for partition evolution? The old line was:
if len([spec for spec in table_metadata.partition_specs if spec.spec_id != 0]) > 0:
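For illustration, here is the distinction between the two checks (a sketch only, using table_metadata.partition_specs as in the snippet above):

# Old check: true only when a spec other than the default spec_id 0 exists,
# i.e. it detects partition evolution rather than whether the table is partitioned.
has_evolved_spec = len([spec for spec in table_metadata.partition_specs if spec.spec_id != 0]) > 0

# New check: true whenever any spec has partition fields,
# i.e. the table is actually partitioned, even if only the default spec exists.
is_partitioned = any(len(spec.fields) > 0 for spec in table_metadata.partition_specs)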

Collaborator:

Great find!

Collaborator @sungwy left a comment:

@jqin61 - this PR is looking 🔥 🔥 It is super exciting to see this PR up and in such a great state already. I've left a few suggestions; please let me know if you want to discuss any of the suggested ideas in more detail.

pyiceberg/table/__init__.py: outdated review thread (resolved)
target_file_size = PropertyUtil.property_as_int(
properties=table_metadata.properties,
property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
)
if target_file_size is None:
raise ValueError(
"Fail to get neither TableProperties.WRITE_TARGET_FILE_SIZE_BYTES nor WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT for writing target data file."
Collaborator @sungwy commented Mar 29, 2024:

I have mixed feelings about this exception check, because we are setting the default value of target_file_size to TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT right in the previous line. It feels redundant.

I understand why we are doing it, though:

PropertyUtil.property_as_int returns Optional[int], and bin packing expects an int, so we need to type-check it.

If we run into more of these type-checking redundancies in the code base, where we are using property values that are always expected to have a non-null default value, maybe we should refactor PropertyUtil instead. Maybe we can have two methods: property_as_int, which returns an Optional[int], and property_as_int_with_default, which returns an int?
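For illustration, a minimal sketch of that split (a hypothetical refactor; the names and signatures here are assumptions, not existing PyIceberg API):

from typing import Dict, Optional

class PropertyUtil:
    @staticmethod
    def property_as_int(properties: Dict[str, str], property_name: str) -> Optional[int]:
        # Returns the parsed property, or None when it is absent.
        value = properties.get(property_name)
        return int(value) if value is not None else None

    @staticmethod
    def property_as_int_with_default(properties: Dict[str, str], property_name: str, default: int) -> int:
        # Same lookup, but a required non-null default guarantees an int for callers
        # such as bin packing, so no extra None check is needed at the call site.
        value = properties.get(property_name)
        return int(value) if value is not None else default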

Contributor Author (@jqin61):

property_as_int_with_default sounds better to me, because all the exceptions raised due to a missing default property could be centralized in that function. How do you feel about it?

Contributor:

I like that as well; the ValueError is misleading, and it is not directly obvious why we would raise it.

Contributor Author @jqin61 commented Apr 2, 2024:

I just found that the default value itself could be None:
PARQUET_COMPRESSION_LEVEL_DEFAULT = None
so this None check is not unnecessary?

The original code for this target_file_size check just type: ignores it:

table_metadata=table_metadata,
tasks=iter([WriteTask(write_uuid, next(counter), batches) for batches in bin_pack_arrow_table(df, target_file_size)]), # type: ignore
)
if any(len(spec.fields) > 0 for spec in table_metadata.partition_specs):

pyiceberg/table/__init__.py: outdated review threads (resolved)
"""
import pyarrow as pa

partition_columns = get_partition_columns(iceberg_table_metadata, arrow_table)
Collaborator:

How do you feel about this suggestion? Most of this function's responsibility seems to lie in making sure that the partition field is provided in the arrow_table, but we seem to already be checking the schema in the write functions now.

Suggested change
partition_columns = get_partition_columns(iceberg_table_metadata, arrow_table)
partition_columns = [iceberg_table_metadata.schema().find_column_name(partition_field.source_id) for partition_field in iceberg_table_metadata.spec().fields]

Contributor Author (@jqin61):

It will be more useful when there are hidden partition columns. The check is also there for the mypy check, because find_column_name returns Optional[str].
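For illustration, a small sketch of that guard (variable names and the error message here are assumptions, not the PR's exact code):

from typing import List

partition_columns: List[str] = []
for partition_field in iceberg_table_metadata.spec().fields:
    column_name = iceberg_table_metadata.schema().find_column_name(partition_field.source_id)
    if column_name is None:
        # find_column_name returns Optional[str]; fail loudly instead of passing None along
        raise ValueError(f"Could not find column for partition source id: {partition_field.source_id}")
    partition_columns.append(column_name)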

pyiceberg/table/__init__.py: outdated review threads (resolved)
Contributor Author @jqin61 left a comment:

@syun64 Please give another round of review, thank you!

Contributor @Fokko left a comment:

Left some small comments; apart from that it looks good to me 👍

@@ -289,10 +286,7 @@ def partition_field_to_data_file_partition_field(partition_field_type: IcebergTy


@partition_field_to_data_file_partition_field.register(LongType)
@partition_field_to_data_file_partition_field.register(DateType)
Contributor:

This single-dispatch is there only for the TimeType, it seems. Probably we should also convert those into a native type.

Contributor Author (@jqin61):

Fixed in commit 82dd3ad.

Contributor:

Beautiful, thanks 👍

@@ -1131,8 +1133,11 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
if not isinstance(df, pa.Table):
raise ValueError(f"Expected PyArrow table, got: {df}")

if len(self.spec().fields) > 0:
raise ValueError("Cannot write to partitioned tables")
supported = {IdentityTransform}
Contributor:

Nit:

Suggested change
supported = {IdentityTransform}
supported_transforms = {IdentityTransform}
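For context, a rough sketch of how the identity-only guard in append might read with that name (a sketch only; the exact validation and error message in the PR may differ):

supported_transforms = {IdentityTransform}
unsupported_partitions = [
    field for field in self.spec().fields if type(field.transform) not in supported_transforms
]
if unsupported_partitions:
    raise ValueError(
        f"Not all partition transforms are supported for writes: {unsupported_partitions}."
    )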


# Later to be extended with partition information
def generate_data_file_partition_path(self) -> str:
Contributor:

Nit: This function looks redundant. The check is being done in generate_data_file_path() as well. I would merge those two.
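A rough sketch of what the merged helper might look like (partition_key.to_path and generate_data_file_filename are assumptions based on the surrounding code, not necessarily the PR's exact methods):

def generate_data_file_path(self, extension: str) -> str:
    if self.partition_key:
        # Render the partition directory (e.g. "category=books") and prepend it to the filename.
        partition_path = self.partition_key.to_path()
        return f"{partition_path}/{self.generate_data_file_filename(extension)}"
    return self.generate_data_file_filename(extension)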


return table_partitions


def partition(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> Iterable[TablePartition]:
Contributor:

It would be good to have a slightly more descriptive name here. I also think we should hide this from the outside user.

Suggested change
def partition(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> Iterable[TablePartition]:
def _determine_partitions(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> List[TablePartition]:

I think we can also return a list, so folks know that it is already materialized.
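A hypothetical call site under that signature, just to illustrate the materialized-list point (names other than arrow_table_partition are placeholders):

partitions = _determine_partitions(spec=table_metadata.spec(), schema=table_metadata.schema(), arrow_table=df)
# A list rather than a generator: it can be sized and iterated more than once.
print(f"writing {len(partitions)} partition slices")
for table_partition in partitions:
    write_partition(table_partition.arrow_table_partition)  # hypothetical consumer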

schema=table_metadata.schema(),
)
for partition in partitions
for batches in bin_pack_arrow_table(partition.arrow_table_partition, target_file_size)
Contributor:

This looks very nice!

pyiceberg/typedef.py: review thread (resolved)
@@ -2000,7 +2000,11 @@ def spark() -> "SparkSession":
'float': [0.0, None, 0.9],
'double': [0.0, None, 0.9],
'timestamp': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)],
'timestamptz': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)],
'timestamptz': [
Contributor:

Nice one!

Comment on lines 2056 to 2058
import pyarrow as pa

"""PyArrow table with all kinds of columns."""
Contributor:

Suggested change
import pyarrow as pa
"""PyArrow table with all kinds of columns."""
"""PyArrow table with all kinds of columns."""
import pyarrow as pa

Comment on lines 2064 to 2066
import pyarrow as pa

"""PyArrow table with all kinds of columns."""
Contributor:

Suggested change
import pyarrow as pa
"""PyArrow table with all kinds of columns."""
"""PyArrow table with all kinds of columns."""
import pyarrow as pa

Contributor Author @jqin61 left a comment:

Thanks for reviewing, @Fokko! I applied your suggestions and it is ready for another round of review. Thank you!

Contributor @Fokko left a comment:

This looks great, @jqin61. Thanks again for working on this 👍



def data_file_with_partition(partition_type: StructType, format_version: TableVersion) -> StructType:
def data_file_with_partition(partition_type: StructType, format_version: Literal[1, 2]) -> StructType:
Contributor:

Nit:

Suggested change
def data_file_with_partition(partition_type: StructType, format_version: Literal[1, 2]) -> StructType:
def data_file_with_partition(partition_type: StructType, format_version: TableVersion) -> StructType:


@Fokko merged commit 4148edb into apache:main on Apr 5, 2024
7 checks passed