Python: Resolve write/read schemas #5116

Fokko · 2022-06-22T21:34:56Z

This PR allows you to supply a read schema when reading an Avro file. It will construct a read tree that only reads the actual fields that it needs, and skips over the ones that aren't part of the read schema.

Also combined read_long and read_int into one. They are still two separate readers, but just one method on the decoder. This is because they are binary compatible, and there is no long in Python.

python/src/iceberg/avro/reader.py

python/src/iceberg/avro/resolver.py

rdblue · 2022-06-27T22:19:21Z

python/src/iceberg/avro/resolver.py

+    # For now we only support the binary compatible ones
+    IntegerType: {LongType},
+    StringType: {BinaryType},
+    BinaryType: {StringType},


I don't think this is currently allowed by the spec, but I don't see a reason why it shouldn't be. Maybe we should add it to the spec? What do you think?

For background, one of the restrictions that we place on type promotion is that promotion is only allowed if hashing the value produces the same result. That's why hashInt(v) is implemented as hashLong(castToLong(v)). Otherwise, when a table column is promoted from int to long, any metadata values for bucket partitions would be suddenly incorrect.

It should be possible to promote from binary to string (or the opposite) because the hash value is the same.

I think we can also relax this constraint a bit, so if there is no bucket transform on a column, you can perform a type promotion that would not be allowed otherwise. For example, int to string promotion could only be allowed if there was no partition spec with a bucket function on the int column.

There are also odd cases with timestamps. If a long value is a timestamp in microseconds, then it could be promoted to timestamp or timestamptz because the hash function would match (no need to modify the value). But if the long was a timestamp in milliseconds, we could only promote to timestamp or timestamptz if there was no bucket transform applied to the value in a partition spec.

I got this from the Avro schema evolution:

And thanks for the elaborate background information. And I agree that we should add it to the spec indeed. For the hashing, for Python it seems to check out:

➜ python git:(fd-resolve-write-read-schemas) ✗ python3 Python 3.9.13 (main, May 24 2022, 21:13:51) [Clang 13.1.6 (clang-1316.0.21.2)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> hash("vo") 292714697147949111 >>> hash("vo".encode("utf8")) 292714697147949111

For Java, this does not seem to be the case:

➜ python git:(fd-resolve-write-read-schemas) ✗ scala Welcome to Scala 2.13.8 (OpenJDK 64-Bit Server VM, Java 18.0.1.1). Type in expressions for evaluation. Or try :help. scala> new java.lang.String("vo").hashCode val res1: Int = 3769 scala> new java.lang.String("vo").getBytes() val res2: Array[Byte] = Array(118, 111) scala> new java.lang.String("vo").getBytes().hashCode() val res3: Int = 1479286669

We should also make sure that the hashes are the same across languages. I think this is okay for now because we implement this using the MurmurHash library.

Hashing in this case is documented in the Iceberg spec: https://iceberg.apache.org/spec/#appendix-b-32-bit-hash-requirements

That's the one I'm talking about, and we have verified that it is the same between Java and Python.

I also want to note that the type promotions that we care about are the promotions that are allowed in Iceberg, not Avro. While we're using Avro files, Avro is more permissive in these cases than Iceberg allows.

I've created a PR here to add string/binary conversion to the Spec: #5151

I also want to note that the type promotions that we care about are the promotions that are allowed in Iceberg, not Avro. While we're using Avro files, Avro is more permissive in these cases than Iceberg allows.

Good point, I was focussing on the Avro ones here (since it is in the Avro package), but we can also make this pluggable by providing different promotion dictionaries for Avro and Iceberg.

rdblue · 2022-06-27T22:20:20Z

python/src/iceberg/avro/resolver.py

+    # These are all allowed according to the Avro spec
+    # IntegerType: {LongType, FloatType, DoubleType},
+    # LongType: {FloatType, DoubleType},
+    # FloatType: {DoubleType},


This is allowed by the Iceberg spec, as is promotion from DecimalType(P, S) to DecimalType(P2, S) where P2 > P.

Promoting float to double should be possible fairly easily. All you have to do is check whether it's allowed and then use the source type's reader. Since Python uses a double to represent both, it will automatically produce the correct result.

Same thing for DecimalType. As long as you're reading the correct fixed size you should be fine.

I left the float to double one out for now because of the binary incompatibility as mentioned a few lines above. It adds additional complexity because we still need to read the float (4 bytes) and convert it into a double. If we would just plug in the double reader, then we would consume 8 bytes which will lead to problems.

Yes, that's correct. We would need to still use the reader that corresponds to the write schema. It should be possible to do that in the PrimitiveType function.

Ah, you already mentioned that =) I've updated the code by converting the dict into a promote singledispatch which gives us a bit more flexibility. For the float/double case, it will just return the file type reader. For the decimal, it will check the precision, and for converting the int to a float, it will do the conversion. It comes with tests. Let me know what you think!

python/src/iceberg/avro/resolver.py

rdblue · 2022-06-27T22:32:57Z

python/tests/avro/test_resolver.py

+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import pytest


I'd recommend taking a look at the test cases in Java, which are pretty thorough and can find a lot of weird cases:

iceberg/data/src/test/java/org/apache/iceberg/data/TestReadProjection.java

Line 38 in bd94084

public abstract class TestReadProjection {

I've been looking into this, and am happy to replicate the tests on the Python side. However, it depends on the writeAndRead method that requires a write path. Maybe an idea to postpone this until we have the write path (which I see already glooming in the not-so-distant future).

We may be able to use fastavro for it, but we'd need to make sure there are field IDs to do projection right. (We could add name mapping for that.) So yeah, I'm happy adding this when we have a write path. I think the implementation is correct.

We should probably add a write path soon, like you said.

I was thinking of using FastAvro, but then we need to do everything in Avro schemas, which is a bit of a hassle. Otherwise we can just use the typed iceberg schemas 🚀

python/tests/avro/test_resolver.py

python/tests/avro/test_decoder.py

python/tests/avro/test_resolver.py

…solve-write-read-schemas

python/src/iceberg/avro/resolver.py

…solve-write-read-schemas

rdblue · 2022-06-29T14:50:23Z

python/src/iceberg/exceptions.py


 class ValidationError(Exception):
-    ...
+    """Raises when there is an issue with the schema"""


This is used in other places as well, but the comment is fine for now.

rdblue · 2022-06-29T14:52:16Z