
[Bug Fix] cast None current-snapshot-id as -1 for Backwards Compatibility #473

Merged
merged 11 commits
Mar 6, 2024

Conversation

sungwy
Collaborator

@sungwy sungwy commented Feb 26, 2024

The existing PyIceberg cleanup_snapshot_id validator creates tables that are not backward compatible.

On table creation, the existing behavior in Java is to create tables with current_snapshot_id = -1. Older versions of the Java implementation also seem to require that current_snapshot_id is a non-null long value, and throw an exception if it is missing:

Retrying task after failure: Cannot parse missing long: current-snapshot-id
java.lang.IllegalArgumentException: Cannot parse missing long: current-snapshot-id
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:220)
	at org.apache.iceberg.util.JsonUtil.getLong(JsonUtil.java:133)
	at org.apache.iceberg.TableMetadataParser.fromJson(TableMetadataParser.java:436)
	at org.apache.iceberg.TableMetadataParser.fromJson(TableMetadataParser.java:303)
	at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:273)
	at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:266)

In order to preserve backward compatibility, we should cast a None current_snapshot_id to -1, rather than casting a -1 current_snapshot_id to None.
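As an illustration of the proposed compatibility cast, a minimal sketch operating on a plain metadata dict (not the PR's actual serialization path; the helper name is made up):

```python
import json

def to_legacy_metadata(metadata: dict) -> dict:
    # Older Java readers require current-snapshot-id to be a long,
    # so a missing/None value is written as -1 instead of omitted.
    if metadata.get("current-snapshot-id") is None:
        metadata["current-snapshot-id"] = -1
    return metadata

print(json.dumps(to_legacy_metadata({"format-version": 1})))
```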

@HonahX
Contributor

HonahX commented Feb 27, 2024

Thanks for the great catch @syun64 ! My understanding is that we need to write current_snapshot_id to -1 when serializing the new metadata object to JSON. Would it be better to directly update the serializer to support this backwards compatibility?
ToOutputFile.table_metadata

Internally, either -1 or None can represent "no current snapshot id". But I think None is better as it aligns more with the spec, which states that current_snapshot_id is optional, and makes things more intuitive. WDYT?

@sungwy
Collaborator Author

sungwy commented Feb 27, 2024

> Thanks for the great catch @syun64 ! My understanding is that we need to write current_snapshot_id to -1 when serializing the new metadata object to JSON. Would it be better to directly update the serializer to support this backwards compatibility? ToOutputFile.table_metadata
>
> Internally, either -1 or None can represent "no current snapshot id". But I think None is better as it aligns more with the spec, which states that current_snapshot_id is optional, and makes things more intuitive. WDYT?

Great suggestion @HonahX. I tried adding a @field_serializer to the TableMetadataV1 class to test it out, but unfortunately the serialized output from model_dump_json doesn't seem to render the None value as -1 as we'd want. I'm still trying to figure out what about our pydantic class definition isn't allowing the custom serializer to take effect, for example:

from typing import Optional
from pydantic import field_serializer

class TableMetadataV1(TableMetadataCommonFields, IcebergBaseModel):
    ...
    @field_serializer('current_snapshot_id')
    def serialize_current_snapshot_id(self, current_snapshot_id: Optional[int]) -> int:
        # Write -1 instead of null so older Java readers can parse the metadata
        return current_snapshot_id if current_snapshot_id is not None else -1

@Fokko
Contributor

Fokko commented Feb 27, 2024

Thanks @syun64 for raising this. I agree with @HonahX that setting it to -1 is not a good idea, for mainly two reasons:

  • There would be two ways of saying that the current-snapshot-id should be ignored: both -1 and None.
  • -1 does not point to a valid snapshot.

Although Java is the reference implementation, we should not blindly follow everything that was done there. There are some things we can improve on, since we don't carry the full history that Java does. Can you elaborate on why you need this change?

@sungwy
Collaborator Author

sungwy commented Feb 27, 2024

> Thanks @syun64 for raising this. I agree with @HonahX that setting it to -1 is not a good idea, for mainly two reasons:
>
> • There would be two ways of saying that the current-snapshot-id should be ignored: both -1 and None.
> • -1 does not point to a valid snapshot.
>
> Although Java is the reference implementation, we should not blindly follow everything that was done there. There are some things we can improve on, since we don't carry the full history that Java does. Can you elaborate on why you need this change?

Hi @Fokko, thank you for the response! Tables created by PyIceberg currently do not persist a current_snapshot_id. This means that if no further commits are made after table creation, the table cannot be read by older versions of Java code (Trino, Spark, etc.) that require current_snapshot_id to be a valid long value, until a commit is made that writes a current_snapshot_id. Otherwise, those readers throw the above error when trying to parse the table metadata.

It looks like the PR to handle current_snapshot_id as optional was only introduced in the Java code a few months ago, meaning that all Java code running versions prior to that would throw the above error when parsing table metadata created by PyIceberg in its current state. Hence I thought it was a backward-incompatible change worth noting here.

I think as @HonahX suggested, serializing the output current_snapshot_id in TableMetadata as '-1' would be good for backwards compatibility.

@sungwy
Collaborator Author

sungwy commented Feb 27, 2024

So it looks like using a custom @field_serializer isn't working in the current IcebergBaseModel definition with None values, because we set exclude_none=True by default for model_dump_json. Unfortunately, the model's exclusion logic runs before the custom field_serializer, meaning that if we store current_snapshot_id internally as None, we won't be able to use a custom pydantic serializer to cast None to -1 only on output. There is a recent discussion on this specific issue in the Pydantic community, and people had to resort to other hacky workarounds to support field serialization of None values when exclude_none is set to True.
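The behavior described above can be reproduced with a minimal standalone pydantic v2 model (a sketch, not PyIceberg's actual class):

```python
from typing import Optional
from pydantic import BaseModel, field_serializer

class Meta(BaseModel):
    current_snapshot_id: Optional[int] = None

    @field_serializer('current_snapshot_id')
    def serialize_current_snapshot_id(self, v: Optional[int]) -> int:
        return v if v is not None else -1

m = Meta()
# The None field is excluded before the serializer ever runs:
print(m.model_dump_json(exclude_none=True))   # {}
print(m.model_dump_json(exclude_none=False))  # {"current_snapshot_id":-1}
```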

If we want to make Iceberg tables created by PyIceberg compatible with older versions of Java, I think we'll either have to store current_snapshot_id as -1 in the TableMetadata, or refactor the way we identify the fields we want to exclude when they are None in the IcebergBaseModel.

@Fokko
Contributor

Fokko commented Feb 28, 2024

Thanks for the full context. cc @rdblue @danielcweeks WDYT?

This has been fixed in Iceberg 1.4.0. We would have to break the spec if we want tables written by PyIceberg to be readable by earlier versions of Java Iceberg. We could also update the spec to make the field non-optional; otherwise, other implementations will follow.

@danielcweeks
Contributor

The spec calls out current_snapshot_id as optional, so omitting it "should" be valid. I think I agree with @Fokko that we shouldn't just blindly follow Java's implementation. I would suggest a compatibility flag that sets the value to -1, but defaults to off.

The reason I think this is safe is that the situation where someone is using an older version of Java but also using Python to create tables (no commits) is relatively narrow, and they would identify the problem quickly and turn the flag on for their use cases. I feel this is the best approach: move forward with the correct solution, but still accommodate backward compatibility.

@sungwy
Collaborator Author

sungwy commented Feb 28, 2024

> The spec calls out current_snapshot_id as optional, so omitting it "should" be valid. I think I agree with @Fokko that we shouldn't just blindly follow Java's implementation. I would suggest a compatibility flag that sets the value to -1, but defaults to off.
>
> The reason I think this is safe is that the situation where someone is using an older version of Java but also using Python to create tables (no commits) is relatively narrow, and they would identify the problem quickly and turn the flag on for their use cases. I feel this is the best approach: move forward with the correct solution, but still accommodate backward compatibility.

That makes sense @danielcweeks @Fokko. Thank you for sharing your opinions.

Would the compatibility flag be included as an environment variable? If we think supporting this change is a hassle and we want to honor the correctness of the implementation according to the spec, maybe we can make a conscious decision not to accommodate backward compatibility in this edge case.

One of my goals in putting up this PR was to at least document this issue and the discussion, so others running into it can refer to it.

@Fokko
Contributor

Fokko commented Feb 28, 2024

I prefer not to exclude certain groups (Java <1.3.0 in this case; I'm not sure which versions the proprietary implementations are on). I think a flag is an elegant way of enabling this. I fully agree with you on the documentation part 👍

> Would the compatibility flag be included as an environment variable?

The flag would be a configuration option on the catalog. This way you can set it through ~/.pyiceberg.yaml or through an environment variable if you like.

@sungwy
Collaborator Author

sungwy commented Feb 29, 2024

> I prefer not to exclude certain groups (Java <1.3.0 in this case; I'm not sure which versions the proprietary implementations are on). I think a flag is an elegant way of enabling this. I fully agree with you on the documentation part 👍
>
> > Would the compatibility flag be included as an environment variable?
>
> The flag would be a configuration option on the catalog. This way you can set it through ~/.pyiceberg.yaml or through an environment variable if you like.

That makes sense @Fokko. What are your thoughts on my findings about the pydantic base model? I don't think there's an easy way to use a custom serializer only on the output, because we use exclude_none=True on the model, so it ignores the current_snapshot_id attribute if we store it as None internally.

So do we want to:

  1. Store current_snapshot_id as -1 internally if the flag is set to True? Or
  2. model_dump to a dictionary, add current_snapshot_id = -1 if it is not present, then json.dumps the dictionary on output serialization if the flag is set to True?

@sungwy
Collaborator Author

sungwy commented Feb 29, 2024

I went forward with Option 2 and I think it looks pretty clean. Let me know what you think @Fokko

mkdocs/docs/configuration.md
Comment on lines 133 to 136
if Config().get_bool("legacy-current-snapshot-id") and metadata.current_snapshot_id is None:
model_dict = json.loads(model_dump)
model_dict[CURRENT_SNAPSHOT_ID] = -1
model_dump = json.dumps(model_dict)
Contributor

I don't think this is the best place to fix this, mostly because we have to deserialize and re-serialize the metadata, while the rest of the serialization logic is part of the Pydantic model. I think option 1 is much cleaner. We can set exclude_none to False:

    @staticmethod
    def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: bool = False) -> None:
        """Write a TableMetadata instance to an output file.

        Args:
            output_file (OutputFile): A custom implementation of the iceberg.io.file.OutputFile abstract base class.
            overwrite (bool): Whether to overwrite the file if it already exists. Defaults to `False`.
        """
        with output_file.create(overwrite=overwrite) as output_stream:
            json_bytes = metadata.model_dump_json(exclude_none=False).encode(UTF8)
            json_bytes = Compressor.get_compressor(output_file.location).bytes_compressor()(json_bytes)
            output_stream.write(json_bytes)

Collaborator Author

Oh, that's a good suggestion. I'll add the field_serializer and set exclude_none to False so we can print -1 in the output.

pyiceberg/utils/config.py
Contributor

@HonahX HonahX left a comment

Overall LGTM! Just have one comment. Thanks for working on this!

@@ -121,7 +122,7 @@ def check_sort_orders(table_metadata: TableMetadata) -> TableMetadata:

def construct_refs(table_metadata: TableMetadata) -> TableMetadata:
"""Set the main branch if missing."""
-    if table_metadata.current_snapshot_id is not None:
+    if table_metadata.current_snapshot_id is not None and table_metadata.current_snapshot_id != -1:
Contributor

Is this -1 check still necessary? construct_refs is an after-validator, so at this point cleanup_snapshot_id should already have turned current_snapshot_id=-1 into None.

Collaborator Author

You are right! I'll submit a fix, thank you!

pyiceberg/table/metadata.py
@sungwy sungwy requested review from HonahX and Fokko March 1, 2024 13:26
@@ -127,6 +127,6 @@ def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite:
overwrite (bool): Whether to overwrite the file if it already exists. Defaults to `False`.
"""
with output_file.create(overwrite=overwrite) as output_stream:
-        json_bytes = metadata.model_dump_json().encode(UTF8)
+        json_bytes = metadata.model_dump_json(exclude_none=False).encode(UTF8)
Contributor

Maybe not a big deal: Shall we set exclude_none=False only in the legacy mode?

Collaborator Author

I would prefer that. Let me put in this update

Contributor

@Fokko Fokko left a comment

This looks great, one minor suggestion:

Could you add the legacy-current-snapshot-id key to the write options table as well: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md#write-options

Comment on lines +267 to +271
@field_serializer('current_snapshot_id')
def serialize_current_snapshot_id(self, current_snapshot_id: Optional[int]) -> Optional[int]:
if current_snapshot_id is None and Config().get_bool("legacy-current-snapshot-id"):
return -1
return current_snapshot_id
Contributor

Beautiful!

@sungwy
Collaborator Author

sungwy commented Mar 5, 2024

> This looks great, one minor suggestion:
>
> Could you add the legacy-current-snapshot-id key to the write options table as well: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md#write-options

I now have this configuration documented under "Backward Compatibility" in the documentation. I feel a bit awkward adding it to the write options, because the current list of write options maps to pyarrow parquet properties.

Are you suggesting that we also add the property to the write options in the documentation? Like:

Write options

| Key | Options | Default | Description |
|-----|---------|---------|-------------|
| write.parquet.compression-codec | {uncompressed,zstd,gzip,snappy} | zstd | Sets the Parquet compression codec |
| write.parquet.compression-level | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg |
| write.parquet.page-size-bytes | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk |
| write.parquet.page-row-limit | Number of rows | 20000 | Set a target threshold for the approximate encoded size of data pages within a column chunk |
| write.parquet.dict-size-bytes | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| write.parquet.row-group-limit | Number of rows | 122880 | The Parquet row group limit |
| legacy-current-snapshot-id | {True,False} | False | Set current-snapshot-id in serialized TableMetadata to -1 instead of None |

Are we looking to move the documentation to Write options or have it in both places?

@sungwy
Collaborator Author

sungwy commented Mar 6, 2024

Shall we merge this in, if the place in the documentation (Write options vs. Backward Compatibility) is the only subject for debate? We can always update the docs. I would appreciate it if you could merge this in for me @Fokko @HonahX, and thank you both for the thorough review!

@HonahX
Contributor

HonahX commented Mar 6, 2024

Hi @Fokko @syun64

> I feel a bit awkward adding it to the write-options because the current list of write-options map to pyarrow parquet properties.

I have a similar feeling, mainly because the write-options table is for table properties, while legacy-current-snapshot-id is a global config in ~/.pyiceberg.yaml:

catalog:
  default:
    type: glue
 
legacy-current-snapshot-id: True

> Shall we merge this in if the place in the documentation..

Let's get this in first. Thanks for the great effort on it, and thanks everyone for reviewing!

@HonahX HonahX merged commit 14c021b into apache:main Mar 6, 2024
7 checks passed
@sungwy sungwy deleted the current-snapshot-bf branch March 6, 2024 01:57
@Fokko Fokko added this to the PyIceberg 0.6.1 milestone Mar 26, 2024
HonahX pushed a commit to HonahX/iceberg-python that referenced this pull request Mar 29, 2024
Fokko pushed a commit that referenced this pull request Mar 29, 2024
@djouallah

Is this available in 0.6.1? BigQuery can't read the table because of this:

xxxx.metadata.json missing current-snapshot-id

@Fokko
Contributor

Fokko commented May 11, 2024

@djouallah Thanks for jumping in here. Sad to hear that BigQuery also suffers from this. This is part of 0.6.1, so setting the configuration, or writing an empty dataframe, should fix the issue.
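For reference, the flag from this PR can be enabled globally; the environment-variable form below assumes PyIceberg's usual convention of a PYICEBERG_ prefix with dashes mapped to underscores:

```shell
# In ~/.pyiceberg.yaml:
#   legacy-current-snapshot-id: True
# Or as an environment variable (assumed naming convention):
export PYICEBERG_LEGACY_CURRENT_SNAPSHOT_ID=True
```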

@djouallah

@Fokko sorry for the silly question, but I could not find it in the documentation?

@Fokko
Contributor

Fokko commented May 11, 2024

@djouallah It is mentioned all the way at the bottom of the configuration page: https://py.iceberg.apache.org/configuration/#backward-compatibility

@djouallah

Thanks, it works!
