Conversation

@Fokko (Contributor) commented May 31, 2022

Reads an Avro file by first reading the header and extracting the schema. We then convert the Avro schema into an Iceberg schema and read the actual binary data using the schema visitor.

The binary decoder and codecs have been copied from apache/avro because I didn't want to depend on the library for just that. Also, apache/avro uses the general IO interface, while we have our own FileStream interface for reading files.
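For context, the wire primitives the vendored decoder deals with are small: Avro stores ints and longs as zigzag-encoded varints. A minimal sketch of such a read (illustrative only, not the code in this PR):

import io

def decode_zigzag_long(stream: io.BufferedIOBase) -> int:
    # Read a variable-length, zigzag-encoded long (the Avro wire format for ints/longs)
    b = stream.read(1)[0]
    n = b & 0x7F
    shift = 7
    while (b & 0x80) != 0:
        b = stream.read(1)[0]
        n |= (b & 0x7F) << shift
        shift += 7
    # Zigzag-decode: 0, 1, 2, 3, ... maps back to 0, -1, 1, -2, ...
    return (n >> 1) ^ -(n & 1)

# decode_zigzag_long(io.BytesIO(b"\x04")) == 2
# decode_zigzag_long(io.BytesIO(b"\x03")) == -2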

To keep this PR from getting too big, I'm working on these follow-up PRs:

  • Map the accessor onto an actual Python class
  • Support Manifest version 1 & 2
  • Support write/read schema

@Fokko Fokko force-pushed the fd-read-avro branch 3 times, most recently from 970872f to 730486a on June 1, 2022 13:42
@danielcweeks danielcweeks requested review from rdblue and samredai June 3, 2022 22:24
@samredai (Contributor) left a comment

This is awesome @Fokko! I left some comments. For some of the comments in src/iceberg/avro/, I know we're vendoring some of that so please feel free to ignore any nit/style comments there. I'm super excited that we'll have a read path that utilizes all standard lib stuff. 😄

assert "Unknown logical/physical type combination:" in str(exc_info.value)


def test_logical_map_with_invalid_fields():
Contributor:

I see a pattern here, so we may be able to consolidate these into a handful of parametrized tests, which would also make them easy to extend. How about a single test function per method being tested?

  • AvroSchemaConversion()._convert_logical_type
  • AvroSchemaConversion()._convert_logical_map_type
  • AvroSchemaConversion()._convert_schema
  • AvroSchemaConversion()._convert_field
  • AvroSchemaConversion()._convert_record_type
  • AvroSchemaConversion()._convert_array_type

As an example, for AvroSchemaConversion()._convert_logical_type:

@pytest.mark.parametrize(
    "avro_logical_type,expected",
    [
        ({"type": "int", "logicalType": "date"}, DateType()),
        (
            {"type": "bytes", "logicalType": "decimal", "precision": 19, "scale": 25},
            DecimalType(precision=19, scale=25),
        ),
        ...,
    ],
)
def test_schema_conversion_convert_logical_type(avro_logical_type, expected):
    assert AvroSchemaConversion()._convert_logical_type(avro_logical_type) == expected

Contributor Author (@Fokko):

Personally, I'm not a fan of parametrized tests:

  • I find them hard to read with all the (curly) braces.
  • If you have many tests and the last one fails, you have to rerun the earlier ones every time, which is annoying when you have breakpoints everywhere.
  • The parametrized arguments are evaluated every time, even if you run an unrelated test.

I don't mind a few additional tests.

return self._data[pos]


class _AvroReader(SchemaVisitor[Union[AvroStructProtocol, Any]]):
Contributor (@rdblue):

It looks like this calls a decoder when it visits a schema, but I was expecting an implementation that creates a reader tree that accepts a decoder, like the other one. Why did you decide to go with this approach over the other one?

Contributor Author (@Fokko):

Hey @rdblue. This made the most sense to me at the time. Which one is the other one? I don't have a strong opinion either way; you've probably done this more often than I have :)

Contributor (@rdblue):

I suggested the other approach for a few reasons. First, I think when it is reasonable to match the approach taken by the Java codebase, that's a good idea. That way we don't have completely different implementations to validate and maintain.

Second, this approach traverses the schema for each record read. That is inefficient compared to building a tree of readers that handle this. The reader tree approach allows us to create basically a state machine that is ready to read records.

Last, this is harder to update. The next step is to reconcile the differences between the read and write schemas. Doing that for every record is hard to write with this approach. Same thing with reusing structs, dicts, and lists.

Contributor Author (@Fokko):

I've refactored the code and added a reader tree. I love the approach because it gives a much nicer decoupling between the actual reading and the schema. This will make the reader/writer schema much easier to implement.

I kept it as simple as possible to not prematurely optimize the code and keep the PR a bit more concise. We can add things like reusing structs, dicts and lists later on.
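For readers following along, a minimal sketch of the reader-tree idea (class names are illustrative, not the PR's actual ones): a schema visitor builds the tree once, and each node pulls its value from a decoder at read time.

from abc import ABC, abstractmethod
from typing import Any, List


class Reader(ABC):
    # One node in the reader tree; built once per schema, reused for every record
    @abstractmethod
    def read(self, decoder: "BinaryDecoder") -> Any:
        ...


class IntegerReader(Reader):
    def read(self, decoder: "BinaryDecoder") -> Any:
        return decoder.read_int()


class StructReader(Reader):
    def __init__(self, field_readers: List[Reader]) -> None:
        self.field_readers = field_readers

    def read(self, decoder: "BinaryDecoder") -> Any:
        # Avro writes record fields back to back in schema order
        return [field_reader.read(decoder) for field_reader in self.field_readers]

Reconciling read and write schemas then becomes a matter of building a different tree, without touching the per-record read path.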


@staticmethod
def decompress(readers_decoder: BinaryDecoder) -> BinaryDecoder:
    _ = readers_decoder.read_long()
Contributor:

Compression stores the length even if it is uncompressed?

Contributor Author (@Fokko):

I had to check, and this is indeed the case. I've added a test with a snappy codec and a null codec.

Contributor:

This is actually an artifact of how the decoder is used. Since the block length is right before the block, the decoder does the right thing when you call read_bytes. But in this case there's no need to create a new decoder. It just needs to consume the length bytes and return the existing one. It makes sense, although I do think it is a bit strange to couple the compression API with decoders.

Contributor Author (@Fokko):

Ah, you're right, and this doesn't look good; it's actually a bug. I'll refactor this, including passing the bytes to the compression API instead of the decoder, which indeed doesn't make much sense.
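A rough sketch of that direction (names and layout are assumptions, not the PR's final API): the codec works on the raw block bytes, and the caller wraps the result in a fresh decoder.

import zlib
from abc import ABC, abstractmethod


class Codec(ABC):
    @staticmethod
    @abstractmethod
    def decompress(data: bytes) -> bytes:
        ...


class NullCodec(Codec):
    @staticmethod
    def decompress(data: bytes) -> bytes:
        # "null" codec: block bytes are stored as-is
        return data


class DeflateCodec(Codec):
    @staticmethod
    def decompress(data: bytes) -> bytes:
        # Avro's deflate codec stores raw deflate data without a zlib header,
        # hence the negative window bits
        return zlib.decompress(data, -15)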

@rdblue (Contributor) left a comment

@Fokko, this is great! I made a few comments but it's really close.

the unix epoch, 1 January 1970 (ISO calendar).
"""
days_since_epoch = self.read_int()
return date(1970, 1, 1) + timedelta(days_since_epoch)
Contributor:

Minor: I'd prefer to use days_to_date just like this uses micros_to_time and micros_to_timestamp.

Contributor Author (@Fokko):

I like it 👍🏻 Updated


def micros_to_timestamp(micros: int, tzinfo: timezone | None = None):
    dt = timedelta(microseconds=micros)
    unix_epoch_datetime = datetime(1970, 1, 1, 0, 0, 0, 0, tzinfo=tzinfo)
Contributor:

Can you use EPOCH_TIMESTAMP and EPOCH_TIMESTAMPTZ instead of creating a new datetime here?

def micros_to_timestamp(micros: int):
    return EPOCH_TIMESTAMP + timedelta(microseconds=micros)

def micros_to_timestamptz(micros: int):
    return EPOCH_TIMESTAMPTZ + timedelta(microseconds=micros)

@Fokko (Contributor, Author) commented Jun 14, 2022

Hey @rdblue, I've gone through the comments; the only one still open is #4920 (comment). Not sure if we want to break the protocol or make a helper class. Apart from that, I've added a lot of tests to do some extra checks and boost the coverage a bit. Let me know what you think!

from abc import ABC, abstractmethod


class Codec(ABC):
Contributor:

Minor: would it make sense to put this in the __init__.py file?

@Fokko (Contributor, Author) commented Jun 15, 2022

Some projects do this, and some don't :) For me, it makes sense to put base classes in the __init__.py, mostly because they need to be loaded anyway, and by adding them there they are read as soon as you access the module. Also, in the case of the codecs, it avoids having yet another file. WDYT?

Contributor:

+1 for adding it to __init__.py. Having fewer files cuts down on load time, and we will need to load it anyway.
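For illustration, the layout under discussion might look like this (the module path is an assumption, not necessarily the PR's):

# iceberg/avro/codecs/__init__.py
from abc import ABC, abstractmethod


class Codec(ABC):
    # Base class for the block-compression codecs used by the Avro reader
    @staticmethod
    @abstractmethod
    def decompress(data: bytes) -> bytes:
        ...


# A concrete codec module can then simply do:
#   from iceberg.avro.codecs import Codec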

the unix epoch, 1 January 1970 (ISO calendar).
"""
days_to_date = self.read_int()
return date(1970, 1, 1) + timedelta(days_to_date)
Contributor:

What about adding days_to_date to datetime.py? Then you could use EPOCH_DATE and this would be reusable.

Contributor Author (@Fokko):

Ouch, that one slipped through the cracks. I've just updated the code.
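A small sketch of the helper being discussed, assuming an EPOCH_DATE constant in datetime.py as suggested above:

from datetime import date, timedelta

EPOCH_DATE = date(1970, 1, 1)


def days_to_date(days: int) -> date:
    # Convert a number of days since the unix epoch into a date
    return EPOCH_DATE + timedelta(days=days)

# The decoder's read_date then reduces to: days_to_date(self.read_int())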

def __init__(self, input_file: InputFile) -> None:
    self.input_file = input_file

def __enter__(self):
Contributor:

As a follow-up, I think we should start a base class for our readers that handles __iter__, __enter__, and __exit__. It should probably use threading.local() to ensure that with is thread-safe if the file is shared, and I think we can make __iter__ return an iterator class. But these all work great for now.
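A minimal sketch of what such a base class could look like (names are assumptions; no thread-local handling yet):

from abc import ABC, abstractmethod
from typing import Any, Iterator


class BaseFileReader(ABC):
    # Hypothetical shared base: context management plus iteration over records
    def __init__(self, input_file: "InputFile") -> None:
        self.input_file = input_file
        self._stream = None

    def __enter__(self) -> "BaseFileReader":
        # Assumes InputFile.open() returns a closeable stream
        self._stream = self.input_file.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        if self._stream is not None:
            self._stream.close()

    def __iter__(self) -> Iterator[Any]:
        return self._records()

    @abstractmethod
    def _records(self) -> Iterator[Any]:
        ...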

@Fokko (Contributor, Author) commented Jun 15, 2022

I'm happy to do this, but I would suggest doing it in a separate PR. We could also extend InputFile and OutputFile so that __enter__ calls open() or create(). Having a separate iterator class isn't very pythonic.

With regard to threading, I think that's another can of worms. At least as a start, reading a file should not be considered thread-safe, and we should not share the file across threads. I would rather read the files in parallel using something like multiprocessing. We could also split out the blocks in the Avro file if we like, or make the read async, or even go fancy with an async iterator. But looking at the size of this PR, I think we should split that out. Let me know if you feel otherwise.

Contributor:

Yeah, definitely as a separate PR. I didn't mean to suggest doing it now.

@Fokko (Contributor, Author) commented Jun 15, 2022

@rdblue I've fixed the linting violation introduced by #5055

@Fokko Fokko requested a review from rdblue June 19, 2022 18:28
@rdblue rdblue merged commit da63c84 into apache:master Jun 20, 2022
@rdblue (Contributor) commented Jun 20, 2022

Thanks, @Fokko! Everything looks great in here. Thanks for updating the tests. Looking forward to the next steps!

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022