Python: Add Avro read path #4920
Merged
20 commits
2f286ce Python: Add Avro read path (Fokko)
c430cbd 100% test coverage on schema coverage (Fokko)
c0140b0 Merge branch 'master' of https://github.com/apache/iceberg into fd-re… (Fokko)
5f1db8e Comments (Fokko)
05b5557 Merge branch 'master' of https://github.com/apache/iceberg into fd-re… (Fokko)
0e546da Pull latest master and run pre-commit (Fokko)
06d270a Merge branch 'master' of https://github.com/apache/iceberg into fd-re… (Fokko)
a4dcfbb Construct ReaderTree (Fokko)
ff0f3ea Add NOTICE file for Avro (Fokko)
6a2d44c Make the tests happy (Fokko)
c1c064c Merge branch 'master' of https://github.com/apache/iceberg into fd-re… (Fokko)
aba1ec6 Refactor codecs (Fokko)
d1dd540 Merge branch 'master' of https://github.com/apache/iceberg into fd-re… (Fokko)
c86c342 Make the tests happy (Fokko)
dfabe85 Add tests and process comments (Fokko)
a04f9d9 Add days_to_date to datetime.py and update the decoder (Fokko)
7428c36 Merge branch 'master' into fd-read-avro (rdblue)
39061f7 Fix conflicts (Fokko)
e2e350d Merge branch 'master' of https://github.com/apache/iceberg into fd-re… (Fokko)
a244b63 Make flake8 happy (Fokko)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -54,4 +54,4 @@ repos: | |
| rev: '4.0.1' | ||
| hooks: | ||
| - id: flake8 | ||
| - args: [ "--ignore=E501,W503" ] | ||
| + args: [ "--ignore=E501,W503,E203" ] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -57,3 +57,9 @@ UnboundPredicate | |
| BoundPredicate | ||
| BooleanExpression | ||
| BooleanExpressionVisitor | ||
| zigzag | ||
| unix | ||
| zlib | ||
| Codecs | ||
| codecs | ||
| uri | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
|
||
| """ | ||
| Contains Codecs for Python Avro. | ||
|
|
||
| Note that the word "codecs" means "compression/decompression algorithms" in the | ||
| Avro world (https://avro.apache.org/docs/current/spec.html#Object+Container+Files), | ||
| so don't confuse it with the Python's "codecs", which is a package mainly for | ||
| converting character sets (https://docs.python.org/3/library/codecs.html). | ||
| """ | ||
| from __future__ import annotations | ||
|
|
||
| from iceberg.avro.codecs.bzip2 import BZip2Codec | ||
| from iceberg.avro.codecs.codec import Codec | ||
| from iceberg.avro.codecs.deflate import DeflateCodec | ||
| from iceberg.avro.codecs.snappy_codec import SnappyCodec | ||
| from iceberg.avro.codecs.zstandard_codec import ZStandardCodec | ||
|
|
||
| KNOWN_CODECS: dict[str, type[Codec] | None] = { | ||
| "null": None, | ||
| "bzip2": BZip2Codec, | ||
| "snappy": SnappyCodec, | ||
| "zstandard": ZStandardCodec, | ||
| "deflate": DeflateCodec, | ||
| } |
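A minimal sketch of how a reader might consume a mapping shaped like `KNOWN_CODECS`: look up the codec named by the `avro.codec` key in the file header metadata and treat `None` as "no compression". The `DECOMPRESSORS` registry and the `resolve_codec` and `decode_block` helpers are hypothetical and not part of this PR; only standard-library modules are used so the snippet runs on its own.

```python
import bz2
import zlib
from typing import Callable, Dict, Optional

# Stand-in registry (assumption): maps Avro codec names to a decompress
# function, mirroring the shape of KNOWN_CODECS where "null" maps to None.
DECOMPRESSORS: Dict[str, Optional[Callable[[bytes], bytes]]] = {
    "null": None,
    "bzip2": bz2.decompress,
    "deflate": lambda data: zlib.decompress(data, -15),
}


def resolve_codec(name: str) -> Optional[Callable[[bytes], bytes]]:
    """Return the decompress function for a codec name, or None for "null"."""
    if name not in DECOMPRESSORS:
        raise ValueError(f"Unsupported codec: {name}")
    return DECOMPRESSORS[name]


def decode_block(codec_name: str, block: bytes) -> bytes:
    """Decompress one data block, passing it through unchanged for "null"."""
    decompress = resolve_codec(codec_name)
    return block if decompress is None else decompress(block)
```

Failing fast on an unknown name keeps a typo in the header from being silently treated as uncompressed data.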
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
| from __future__ import annotations | ||
|
|
||
| from iceberg.avro.codecs.codec import Codec | ||
|
|
||
| try: | ||
| import bz2 | ||
|
|
||
| class BZip2Codec(Codec): | ||
| @staticmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| compressed_data = bz2.compress(data) | ||
| return compressed_data, len(compressed_data) | ||
|
|
||
| @staticmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| return bz2.decompress(data) | ||
|
|
||
| except ImportError: | ||
|
|
||
| class BZip2Codec(Codec): # type: ignore | ||
| @staticmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| raise ImportError("Python bzip2 support not installed, please install the extension") | ||
|
|
||
| @staticmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| raise ImportError("Python bzip2 support not installed, please install the extension") |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
| from __future__ import annotations | ||
|
|
||
| from abc import ABC, abstractmethod | ||
|
|
||
|
|
||
| class Codec(ABC): | ||
| """Abstract base class for all Avro codec classes.""" | ||
|
|
||
| @staticmethod | ||
| @abstractmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| ... | ||
|
|
||
| @staticmethod | ||
| @abstractmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| ... | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
| from __future__ import annotations | ||
|
|
||
| import zlib | ||
|
|
||
| from iceberg.avro.codecs.codec import Codec | ||
|
|
||
|
|
||
| class DeflateCodec(Codec): | ||
| @staticmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| # The first two bytes (header) and the last four bytes (Adler-32 | ||
| # checksum) are zlib wrappers around deflate data. | ||
| compressed_data = zlib.compress(data)[2:-4] | ||
| return compressed_data, len(compressed_data) | ||
|
|
||
| @staticmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| # -15 is the log of the window size; negative indicates | ||
| # "raw" (no zlib headers) decompression. See zlib.h. | ||
| return zlib.decompress(data, -15) |
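As an alternative to slicing the zlib wrapper off by hand, a raw DEFLATE stream can be requested directly with a negative `wbits` value. This is a sketch of that approach, not what the PR does; the `deflate_raw` and `inflate_raw` names are hypothetical.

```python
import zlib


def deflate_raw(data: bytes) -> bytes:
    """Compress to a raw DEFLATE stream (no zlib header or Adler-32 trailer)."""
    # A negative wbits value asks zlib for a raw stream; 15 is the
    # base-2 logarithm of the 32 KiB window size.
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()


def inflate_raw(data: bytes) -> bytes:
    """Decompress a raw DEFLATE stream."""
    return zlib.decompress(data, -15)


payload = b"iceberg" * 100
raw = deflate_raw(payload)
wrapped = zlib.compress(payload)  # 2-byte header + deflate data + 4-byte Adler-32
```

Here `inflate_raw(raw)` returns `payload`, and so does `inflate_raw(wrapped[2:-4])`, since the zlib container only adds the two header bytes and the four checksum bytes around the deflate data.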
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
| from __future__ import annotations | ||
|
|
||
| import binascii | ||
| import struct | ||
|
|
||
| from iceberg.avro.codecs.codec import Codec | ||
|
|
||
| STRUCT_CRC32 = struct.Struct(">I") # big-endian unsigned int | ||
|
|
||
| try: | ||
| import snappy | ||
|
|
||
| class SnappyCodec(Codec): | ||
| @staticmethod | ||
| def _check_crc32(bytes_: bytes, checksum: bytes) -> None: | ||
| """Incrementally compute CRC-32 from bytes and compare to a checksum | ||
|
|
||
| Args: | ||
| bytes_ (bytes): The bytes to check against `checksum` | ||
| checksum (bytes): Byte representation of a checksum | ||
|
|
||
| Raises: | ||
| ValueError: If the computed CRC-32 does not match the checksum | ||
| """ | ||
| if binascii.crc32(bytes_) & 0xFFFFFFFF != STRUCT_CRC32.unpack(checksum)[0]: | ||
| raise ValueError("Checksum failure") | ||
|
|
||
| @staticmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| compressed_data = snappy.compress(data) | ||
| # A 4-byte, big-endian CRC32 checksum | ||
| compressed_data += STRUCT_CRC32.pack(binascii.crc32(data) & 0xFFFFFFFF) | ||
| return compressed_data, len(compressed_data) | ||
|
|
||
| @staticmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| # Compressed data ends with a 4-byte CRC32 checksum; read it before stripping it | ||
| checksum = data[-4:] | ||
| data = data[0:-4] | ||
| uncompressed = snappy.decompress(data) | ||
| SnappyCodec._check_crc32(uncompressed, checksum) | ||
| return uncompressed | ||
|
|
||
| except ImportError: | ||
|
|
||
| class SnappyCodec(Codec): # type: ignore | ||
| @staticmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| raise ImportError("Snappy support not installed, please install using `pip install pyiceberg[snappy]`") | ||
|
|
||
| @staticmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| raise ImportError("Snappy support not installed, please install using `pip install pyiceberg[snappy]`") |
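The block layout used by the Snappy codec (compressed bytes followed by a 4-byte, big-endian CRC-32 of the uncompressed data) can be sketched without the optional dependency. In the sketch below `zlib` stands in for `snappy` purely as an assumption to keep it runnable with the standard library, and the `frame` and `unframe` names are hypothetical. The point it illustrates is that the checksum has to be taken from the original block before the trailer is sliced off.

```python
import binascii
import struct
import zlib

STRUCT_CRC32 = struct.Struct(">I")  # big-endian unsigned int, as in the diff


def frame(data: bytes) -> bytes:
    """Compress `data` and append the CRC-32 of the uncompressed bytes."""
    # zlib stands in for snappy here (assumption) so the sketch has no
    # third-party dependency; the checksum handling is the point.
    body = zlib.compress(data)
    return body + STRUCT_CRC32.pack(binascii.crc32(data) & 0xFFFFFFFF)


def unframe(block: bytes) -> bytes:
    """Split off the 4-byte trailer, decompress, and verify the checksum."""
    body, checksum = block[:-4], block[-4:]
    uncompressed = zlib.decompress(body)
    if binascii.crc32(uncompressed) & 0xFFFFFFFF != STRUCT_CRC32.unpack(checksum)[0]:
        raise ValueError("Checksum failure")
    return uncompressed
```

A round trip such as `unframe(frame(b"manifest"))` returns the input, while flipping any bit of the trailer raises `ValueError`.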
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
| from __future__ import annotations | ||
|
|
||
| from io import BytesIO | ||
|
|
||
| from iceberg.avro.codecs.codec import Codec | ||
|
|
||
| try: | ||
| from zstandard import ZstdCompressor, ZstdDecompressor | ||
|
|
||
| class ZStandardCodec(Codec): | ||
| @staticmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| compressed_data = ZstdCompressor().compress(data) | ||
| return compressed_data, len(compressed_data) | ||
|
|
||
| @staticmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| uncompressed = bytearray() | ||
| dctx = ZstdDecompressor() | ||
| with dctx.stream_reader(BytesIO(data)) as reader: | ||
| while True: | ||
| chunk = reader.read(16384) | ||
| if not chunk: | ||
| break | ||
| uncompressed.extend(chunk) | ||
| return bytes(uncompressed) | ||
|
|
||
| except ImportError: | ||
|
|
||
| class ZStandardCodec(Codec): # type: ignore | ||
| @staticmethod | ||
| def compress(data: bytes) -> tuple[bytes, int]: | ||
| raise ImportError("Zstandard support not installed, please install using `pip install pyiceberg[zstandard]`") | ||
|
|
||
| @staticmethod | ||
| def decompress(data: bytes) -> bytes: | ||
| raise ImportError("Zstandard support not installed, please install using `pip install pyiceberg[zstandard]`") |
Minor: would it make sense to put this in the __init__.py file?
Some projects do this, and some don't :) For me, it makes sense to add base classes in the init. Mostly because they need to be loaded anyway, and by adding them to the __init__.py they are read when you access the module. Also, for the case of the codecs, this avoids having yet another file. WDYT?
+1 for adding it to __init__.py. Fewer files cuts down on load time and we will need to load it anyway.