Skip to content

Conversation

@JonasJ-ap
Copy link
Contributor

implements ManifestWriter and ManifestListWriter, which are part of the iceberg commit phase.

Based on: #7873

This PR currently includes prototypes of both writers, which are still subject to changes and improvements. I would greatly appreciate receiving some initial review and suggestions to foster the discussion around the development of the overall commit phase. Your insights and feedback would be invaluable. Thank you in advance for your kind assistance!

@github-actions github-actions bot added the python label Jul 8, 2023
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @JonasJ-ap!

NestedField(field_id=140, name="sort_order_id", field_type=IntegerType(), required=False, doc="Sort order ID"),
NestedField(field_id=141, name="spec_id", field_type=IntegerType(), required=False, doc="Partition spec ID"),
)
def add_extension(self, filename: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we want to add also the compression to it .zstd.parquet etc. Do we want to use a Literal here?

Suggested change
def add_extension(self, filename: str) -> str:
def add_extension(self, format: Literal['parquet', 'orc', 'avro']) -> str:



def write_manifest(
format_version: int, spec: PartitionSpec, schema: Schema, output_file: OutputFile, snapshot_id: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
format_version: int, spec: PartitionSpec, schema: Schema, output_file: OutputFile, snapshot_id: int
format_version: Literal[1, 2], spec: PartitionSpec, schema: Schema, output_file: OutputFile, snapshot_id: int

elif format_version == 2:
return ManifestWriterV2(spec, schema, output_file, snapshot_id)
else:
# TODO: replace it with UnsupportedOperationException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a ValueError is reasonable.

upper_bound=to_bytes(self._type, self._max) if self._max is not None else None,
)

def update(self, value: Any) -> PartitionFieldStats:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to return PartitionFieldStats? We don't use it below.

self._min = value
self._max = value
# TODO: may need to implement a custom comparator for incompatible types
elif value < self._min:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use Python's build in min and max

for i, field_type in enumerate(self._types):
assert isinstance(field_type, PrimitiveType), f"Expected a primitive type for the partition field, got {field_type}"
partition_key = partition_keys[i]
self._fields[i].update(conversions.partition_to_py(field_type, partition_key))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the partition_to_py:

{
	"name": "partition",
	"type": {
		"type": "record",
		"name": "r102",
		"fields": [{
			"name": "tpep_pickup_datetime_day",
			"type": ["null", {
				"type": "int",
				"logicalType": "date"
			}],
			"default": null,
			"field-id": 1000
		}]
	},
	"field-id": 102
}

It looks like this is encoded as an int.

return f"{filename}.{self.name.lower()}"


def data_file_type(partition_type: StructType) -> StructType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this one, but I see why it is necessary. For reading, we can override certain field IDs:

def read_manifest_list(input_file: InputFile) -> Iterator[ManifestFile]:
"""
Reads the manifests from the manifest list.
Args:
input_file: The input file where the stream can be read from.
Returns:
An iterator of ManifestFiles that are part of the list.
"""
with AvroFile[ManifestFile](
input_file,
MANIFEST_FILE_SCHEMA,
read_types={-1: ManifestFile, 508: PartitionFieldSummary},
read_enums={517: ManifestContent},
) as reader:
yield from reader

We could do the same when writing. We can override field-id 102 when constructing the writer. WDYT?

def summaries(self) -> List[PartitionFieldSummary]:
return [field.to_summary() for field in self._fields]

def update(self, partition_keys: Record) -> PartitionSummary:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More on a meta-level. Instead of this, and the class above, I would probably write a function to convert PartitionSpec's to PartitionSummaries. I think that's more Python (and for me also easier to follow, but that's super personal of course).

StructType,
)

# TODO: Double-check what's its purpose in java
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure what you're referring to. But when writing ManifestEntries, the sequence number is set to null because when we commit, there is a commit conflict, then we can retry. But when retrying we don't want to have to rewrite the Manifest files to update the sequence number. Therefore they are left null when written the first time. This is called Sequence number inheritance: https://iceberg.apache.org/spec/#sequence-number-inheritance

schema,
output_file,
snapshot_id,
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a style thing, but I would prefer named arguments here.

@JonasJ-ap
Copy link
Contributor Author

JonasJ-ap commented Sep 23, 2023

Based on offline discussion,this PR will continue in #8622. Thus closing this one. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants