Python: ManifestWriter and ManifestListWriter #8012

JonasJ-ap · 2023-07-08T05:22:21Z

Implements ManifestWriter and ManifestListWriter, which are part of the Iceberg commit phase.

Based on: #7873

This PR currently includes prototypes of both writers, which are still subject to changes and improvements. I would greatly appreciate receiving some initial review and suggestions to foster the discussion around the development of the overall commit phase. Your insights and feedback would be invaluable. Thank you in advance for your kind assistance!

Fokko

This is great @JonasJ-ap!

Fokko · 2023-07-10T09:05:19Z

python/pyiceberg/manifest.py

-    NestedField(field_id=140, name="sort_order_id", field_type=IntegerType(), required=False, doc="Sort order ID"),
-    NestedField(field_id=141, name="spec_id", field_type=IntegerType(), required=False, doc="Partition spec ID"),
-)
+    def add_extension(self, filename: str) -> str:


We probably also want to include the compression codec in the extension, e.g. .zstd.parquet. Do we want to use a Literal here?

Suggested change

-    def add_extension(self, filename: str) -> str:
+    def add_extension(self, format: Literal['parquet', 'orc', 'avro']) -> str:
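
To make the suggestion concrete, here is a minimal sketch of an extension helper that takes the format as a Literal and an optional compression codec. The type aliases, the free-function form, and the codec list are assumptions made for illustration; they are not taken from the PR or from pyiceberg.

```python
from typing import Literal, Optional

# Hypothetical aliases for illustration; not the pyiceberg API.
FileFormat = Literal["parquet", "orc", "avro"]
Compression = Literal["zstd", "gzip", "snappy"]


def add_extension(filename: str, file_format: FileFormat, compression: Optional[Compression] = None) -> str:
    """Append an optional compression suffix, then the format suffix."""
    if compression is not None:
        return f"{filename}.{compression}.{file_format}"
    return f"{filename}.{file_format}"
```

With a Literal, a type checker such as mypy can flag an unsupported format at the call site; nothing is enforced at runtime.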

Fokko · 2023-07-10T09:14:58Z

python/pyiceberg/manifest.py

+
+
+def write_manifest(
+    format_version: int, spec: PartitionSpec, schema: Schema, output_file: OutputFile, snapshot_id: int


Suggested change

-    format_version: int, spec: PartitionSpec, schema: Schema, output_file: OutputFile, snapshot_id: int
+    format_version: Literal[1, 2], spec: PartitionSpec, schema: Schema, output_file: OutputFile, snapshot_id: int

Fokko · 2023-07-10T09:15:20Z

python/pyiceberg/manifest.py

+    elif format_version == 2:
+        return ManifestWriterV2(spec, schema, output_file, snapshot_id)
+    else:
+        # TODO: replace it with UnsupportedOperationException


I think a ValueError is reasonable.
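
A rough sketch of how the factory could look with both suggestions applied (Literal[1, 2] for the version and a ValueError in the fallback branch). The writer classes below are simplified stand-ins that take only a snapshot ID, and the error message is an assumption; the real signature in the PR takes more arguments.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass
class ManifestWriterV1:
    """Stand-in for the PR's V1 writer; only keeps the snapshot ID."""

    snapshot_id: int


@dataclass
class ManifestWriterV2:
    """Stand-in for the PR's V2 writer; only keeps the snapshot ID."""

    snapshot_id: int


def write_manifest(format_version: Literal[1, 2], snapshot_id: int) -> Union[ManifestWriterV1, ManifestWriterV2]:
    if format_version == 1:
        return ManifestWriterV1(snapshot_id)
    elif format_version == 2:
        return ManifestWriterV2(snapshot_id)
    else:
        # Literal is only checked statically, so keep a runtime guard as well.
        raise ValueError(f"Cannot write manifest for table version: {format_version}")
```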

Fokko · 2023-07-10T09:25:23Z

python/pyiceberg/manifest.py

+            upper_bound=to_bytes(self._type, self._max) if self._max is not None else None,
+        )
+
+    def update(self, value: Any) -> PartitionFieldStats:


Do we need to return PartitionFieldStats? We don't use it below.

Fokko · 2023-07-10T09:26:01Z

python/pyiceberg/manifest.py

+                self._min = value
+                self._max = value
+            # TODO: may need to implement a custom comparator for incompatible types
+            elif value < self._min:


I would use Python's built-in min and max.
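
Something along these lines, as a sketch only. The class is a simplified stand-in that tracks just the minimum and maximum, and update returns None, in line with the comment above about the unused return value.

```python
from typing import Any, Optional


class PartitionFieldStats:
    """Simplified stand-in: tracks only the min and max of the values seen."""

    def __init__(self) -> None:
        self._min: Optional[Any] = None
        self._max: Optional[Any] = None

    def update(self, value: Any) -> None:
        if value is None:
            # Nulls do not contribute to the bounds.
            return
        self._min = value if self._min is None else min(self._min, value)
        self._max = value if self._max is None else max(self._max, value)
```

This still assumes the values of one field are mutually comparable, so the TODO about incompatible types applies here too.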

Fokko · 2023-07-10T09:41:41Z

python/pyiceberg/manifest.py

+        for i, field_type in enumerate(self._types):
+            assert isinstance(field_type, PrimitiveType), f"Expected a primitive type for the partition field, got {field_type}"
+            partition_key = partition_keys[i]
+            self._fields[i].update(conversions.partition_to_py(field_type, partition_key))


I'm not sure about the partition_to_py:

{ "name": "partition", "type": { "type": "record", "name": "r102", "fields": [{ "name": "tpep_pickup_datetime_day", "type": ["null", { "type": "int", "logicalType": "date" }], "default": null, "field-id": 1000 }] }, "field-id": 102 }

It looks like this is encoded as an int.
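
For context, Avro's date logical type annotates an int that holds the number of days since 1970-01-01. A small sketch of that mapping follows; the helper names are made up for illustration and are not pyiceberg functions.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_days(d: date) -> int:
    """Days since the Unix epoch, as stored by Avro's date logical type."""
    return (d - EPOCH).days


def days_to_date(days: int) -> date:
    """Inverse of date_to_days."""
    return EPOCH + timedelta(days=days)
```

So if the partition value already arrives as an int, an additional conversion step may not be needed before computing the bounds.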

Fokko · 2023-07-10T10:12:34Z

python/pyiceberg/manifest.py

+        return f"{filename}.{self.name.lower()}"
+
+
+def data_file_type(partition_type: StructType) -> StructType:


I'm not a fan of this one, but I see why it is necessary. For reading, we can override certain field IDs:

iceberg/python/pyiceberg/manifest.py

Lines 319 to 335 in e389e4d

def read_manifest_list(input_file: InputFile) -> Iterator[ManifestFile]:
    """
    Reads the manifests from the manifest list.

    Args:
        input_file: The input file where the stream can be read from.

    Returns:
        An iterator of ManifestFiles that are part of the list.
    """
    with AvroFile[ManifestFile](
        input_file,
        MANIFEST_FILE_SCHEMA,
        read_types={-1: ManifestFile, 508: PartitionFieldSummary},
        read_enums={517: ManifestContent},
    ) as reader:
        yield from reader

We could do the same when writing. We can override field-id 102 when constructing the writer. WDYT?

Fokko · 2023-07-10T10:25:34Z

python/pyiceberg/manifest.py

+    def summaries(self) -> List[PartitionFieldSummary]:
+        return [field.to_summary() for field in self._fields]
+
+    def update(self, partition_keys: Record) -> PartitionSummary:


More on a meta level: instead of this and the class above, I would probably write a function that converts the partition values into PartitionFieldSummary objects. I think that's more Pythonic (and for me also easier to follow, but that's super personal of course).
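
A minimal sketch of what such a function could look like. FieldSummary is a hypothetical, simplified stand-in for PartitionFieldSummary, partitions are plain tuples rather than Records, and the bounds are kept as Python values instead of being serialized to bytes.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Optional, Sequence


@dataclass
class FieldSummary:
    """Hypothetical, simplified stand-in for PartitionFieldSummary."""

    contains_null: bool
    lower_bound: Optional[Any]
    upper_bound: Optional[Any]


def partition_summaries(partitions: Iterable[Sequence[Any]], num_fields: int) -> List[FieldSummary]:
    """Compute one summary per partition field from the partition tuples."""
    # Collect the values column by column, one list per partition field.
    columns: List[List[Any]] = [[] for _ in range(num_fields)]
    for partition in partitions:
        for i in range(num_fields):
            columns[i].append(partition[i])

    summaries = []
    for values in columns:
        non_null = [v for v in values if v is not None]
        summaries.append(
            FieldSummary(
                contains_null=len(non_null) < len(values),
                lower_bound=min(non_null) if non_null else None,
                upper_bound=max(non_null) if non_null else None,
            )
        )
    return summaries
```

The trade-off is that this needs all partition values up front, whereas the class-based version can be updated incrementally as entries are written.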

Fokko · 2023-07-10T10:35:14Z

python/pyiceberg/manifest.py

    StructType,
 )

+# TODO: Double-check what's its purpose in java


I'm not exactly sure what you're referring to. When writing ManifestEntries, the sequence number is set to null because a commit can hit a conflict, in which case we retry. When retrying, we don't want to rewrite the manifest files just to update the sequence number, so it is left null when the entries are first written. This is called sequence number inheritance: https://iceberg.apache.org/spec/#sequence-number-inheritance
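
To illustrate the read side of this, here is a simplified sketch in which a null sequence number is replaced by the manifest's sequence number from the manifest list. The Entry class and the function name are hypothetical, and the sketch ignores the entry status that the spec also takes into account.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical, pared-down manifest entry."""

    file_path: str
    sequence_number: Optional[int]


def inherit_sequence_number(entry: Entry, manifest_sequence_number: int) -> Entry:
    """Fill in a null sequence number from the manifest that contains the entry."""
    if entry.sequence_number is None:
        return replace(entry, sequence_number=manifest_sequence_number)
    return entry
```

Because the value is only resolved at read time, the same manifest file can be reused across commit retries without being rewritten.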

Fokko · 2023-07-10T13:12:24Z

python/pyiceberg/manifest.py

+            schema,
+            output_file,
+            snapshot_id,
+            {


More of a style thing, but I would prefer named arguments here.

JonasJ-ap · 2023-09-23T08:18:04Z

Based on offline discussion, this PR will continue in #8622, so I'm closing this one. Thanks!

implement writers and add basic tests

d142824

github-actions bot added the python label Jul 8, 2023

Fokko reviewed Jul 10, 2023

View reviewed changes

corleyma mentioned this pull request Aug 15, 2023

Python write support #6564

Closed

4 tasks

Fokko mentioned this pull request Sep 20, 2023

Python: Compute parquet stats #7831

Merged

HonahX mentioned this pull request Sep 23, 2023

Python: ManifestWriter and ManifestListWriter #8622

Merged

JonasJ-ap closed this Sep 23, 2023

