Conversation

@maxdebayser

Hi @Fokko, I was trying to implement writing to Iceberg from Python without using PySpark when I found your PR that already implements some of the required steps. With the additions in this PR, I can build the Avro files, upload them to S3, and call the API to update the table metadata so that it points to the new metadata and data files.

This PR is by no means complete, but I hope to get some feedback on it, as I'm keen on having this feature in the pyiceberg library.

@github-actions github-actions bot added the python Pull requests that update Python code label Jun 4, 2023

```python
def _read_header(self) -> AvroFileHeader:
    return construct_reader(META_SCHEMA, {-1: AvroFileHeader}).read(self.decoder)
```

Author

With this class I can write Avro files like this:

```python
with avro.AvroOutputFile(PyArrowFileIO().new_output("writetest.avro"), MANIFEST_ENTRY_SCHEMA, "manifest_entry") as out:
    out.write_block([entry])
```

But perhaps we could also refactor the AvroFile and AvroOutputFile classes into a single RW class.

```python
if field.required:
    return field_result
else:
    return OptionWriter(field_result)
```
Author

Without this change the Avro file structure was identical to that of a required field, so the data couldn't be parsed as an optional field.
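For reference, the Avro spec encodes an optional field as a union `["null", T]`: the writer emits the zero-based branch index as an Avro long, followed by the payload only when the branch isn't null. A minimal self-contained sketch of that encoding (the `zigzag_varint` and `write_optional_int` helpers are illustrative, not pyiceberg's encoder API):

```python
import io

def zigzag_varint(n: int) -> bytes:
    # Avro int/long encoding: zigzag, then little-endian base-128 varint
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def write_optional_int(buf: io.BytesIO, value) -> None:
    # Optional field = union ["null", "int"]: branch index first,
    # then the value only for the non-null branch.
    if value is None:
        buf.write(zigzag_varint(0))      # branch 0 = null, no payload
    else:
        buf.write(zigzag_varint(1))      # branch 1 = int
        buf.write(zigzag_varint(value))
```

A required field, by contrast, has no union wrapper, so its bytes start directly with the value; that's why the two layouts can't be read interchangeably.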

```python
self.key_writer.write(encoder, k)
self.value_writer.write(encoder, v)
if len(val) > 0:
    encoder.write_int(0)
```
Author

From https://avro.apache.org/docs/1.11.1/specification/#schema-maps : "A block with count zero indicates the end of the map."
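A minimal sketch of that encoding (helper names are hypothetical, not the PR's encoder API): the map is written as a single block whose entry count doubles as the terminator when the map is empty, with a trailing zero-count block otherwise:

```python
import io

def zigzag_varint(n: int) -> bytes:
    # Avro long: zigzag, then little-endian base-128 varint
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def write_map(buf: io.BytesIO, val: dict, write_key, write_value) -> None:
    buf.write(zigzag_varint(len(val)))   # block count; 0 already ends an empty map
    for k, v in val.items():
        write_key(buf, k)
        write_value(buf, v)
    if len(val) > 0:
        buf.write(zigzag_varint(0))      # zero-count block terminates the map
```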

```python
for v in val:
    self.element_writer.write(encoder, v)
if len(val) > 0:
    encoder.write_int(0)
```
Author

From https://avro.apache.org/docs/1.11.1/specification/#arrays-1 : "A block with count zero indicates the end of the array."
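On the read side, this termination rule means a decoder loops over blocks until it sees a count of zero; the spec also allows a negative count, in which case the block's size in bytes follows and the absolute value is the item count. A self-contained sketch (helper names are illustrative, not pyiceberg's decoder API):

```python
import io

def read_long(buf: io.BytesIO) -> int:
    # Decode an Avro long: little-endian base-128 varint, then un-zigzag
    shift, acc = 0, 0
    while True:
        b = buf.read(1)[0]
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)

def read_array(buf: io.BytesIO, read_item) -> list:
    items = []
    while True:
        n = read_long(buf)
        if n == 0:
            return items          # zero-count block: end of the array
        if n < 0:
            read_long(buf)        # negative count: skip the byte size that follows
            n = -n
        for _ in range(n):
            items.append(read_item(buf))
```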

```diff
 name="column_sizes",
 field_type=MapType(key_id=117, key_type=IntegerType(), value_id=118, value_type=LongType()),
-required=True,
+required=False,
```
Author

See PR apache#7796

```diff
 name="value_counts",
 field_type=MapType(key_id=119, key_type=IntegerType(), value_id=120, value_type=LongType()),
-required=True,
+required=False,
```
Author

Same as above

```diff
 NestedField(3, "sequence_number", LongType(), required=False),
 NestedField(4, "file_sequence_number", LongType(), required=False),
-NestedField(2, "data_file", DATA_FILE_TYPE, required=False),
+NestedField(2, "data_file", DATA_FILE_TYPE, required=True),
```
Author

Same as above


```diff
 class AvroSchemaConversion:
-    def avro_to_iceberg(self, avro_schema: dict[str, Any]) -> Schema:
+    def avro_to_iceberg(self, avro_schema: Dict[str, Any]) -> Schema:
```
Author

On Python 3.8, the built-in `dict` isn't recognized as a generic type annotation, so `Dict` from `typing` is needed.
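A small illustration of the difference (the function below is a hypothetical stub, not the PR's code): `typing.Dict` works on 3.8, whereas a bare `dict[str, Any]` annotation only evaluates successfully on 3.9+, or on 3.8 if the module uses `from __future__ import annotations` to defer evaluation:

```python
from typing import Any, Dict

def avro_to_iceberg_stub(avro_schema: Dict[str, Any]) -> str:
    # typing.Dict is valid on Python 3.8; the built-in generic form,
    # dict[str, Any], raises "TypeError: 'type' object is not subscriptable"
    # there unless annotation evaluation is deferred.
    return avro_schema.get("type", "")
```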

```python
)

raise exception(response) from exc
```

Author

This function is basically a PoC for now. Since there are many table update operations, we should look at the Java API to decide whether this should be a single function with complex parameters or several simpler functions.
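For comparison, Java's Table API exposes per-operation builders (e.g. `table.newAppend()...commit()`) rather than one function with complex parameters. A hypothetical Python analogue of that shape (all names below are invented for illustration, not pyiceberg API):

```python
class _AppendOp:
    """Collects data files and commits them as one metadata update."""

    def __init__(self, table: "Table") -> None:
        self._table = table
        self._files: list = []

    def append_file(self, path: str) -> "_AppendOp":
        self._files.append(path)
        return self

    def commit(self) -> None:
        # A real implementation would build a new snapshot and call the
        # catalog's update-table endpoint exactly once here.
        self._table.committed.append(("append", list(self._files)))

class Table:
    def __init__(self) -> None:
        self.committed: list = []

    def new_append(self) -> _AppendOp:
        return _AppendOp(self)
```

One builder per operation keeps each call site simple while still batching all changes into a single commit.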

```python
return f"{self.__class__.__name__}[{', '.join(f'{key}={repr(value)}' for key, value in self.__dict__.items() if not key.startswith('_'))}]"

def fields(self) -> List[str]:
    return [self.__getattribute__(v) if hasattr(self, v) else None for v in self._position_to_field_name.values()]
```
Author

I couldn't find a more elegant way to return an array of fields that preserves declaration order and yields None for unset optional fields.
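One arguably tidier spelling of the same pattern is `getattr` with a default, which also preserves declaration order and yields None for fields that were never set (the class and field names below are invented for illustration):

```python
class Record:
    # Hypothetical stand-in for the manifest record classes in the PR
    _position_to_field_name = {0: "a", 1: "b", 2: "c"}

    def __init__(self, **kwargs) -> None:
        self.__dict__.update(kwargs)

    def fields(self) -> list:
        # getattr with a default collapses the hasattr check and keeps
        # the dict's declaration order (guaranteed since Python 3.7)
        return [getattr(self, name, None) for name in self._position_to_field_name.values()]
```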

@Fokko
Owner

Fokko commented Jun 18, 2023

@maxdebayser Can I make the following suggestion:

  • Cherry-pick my work on the Avro writer.
  • Apply the fixes.
  • Add a couple of tests where we read and write Avro files using our own reader/writer, and also files written by fastavro (just to be sure).

@maxdebayser
Author

Sure. Just to be clear: should I create a new branch off the apache/iceberg master, cherry-pick the work from this branch over there, and add the read/write tests?

@Fokko
Owner

Fokko commented Jun 18, 2023

I would suggest the following:

@Fokko Fokko closed this Aug 28, 2025