Python: Add updates, moves and deletes #8374
Conversation
# Strip the catalog name
if len(self._updates) > 0:
    response = self._table.catalog._commit_table(  # pylint: disable=W0212
    self._table._do_commit(  # pylint: disable=W0212
I'm fine with this being private for now, but we may want to actually expose it later.
exists = False
try:
    exists = self._schema.find_field(full_name, self._case_sensitive) is not None
except ValueError:
Why would find_field throw a ValueError? Shouldn't it return None instead?
def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField:
    """Find a field using a field name or field ID.

    Args:
        name_or_id (Union[str, int]): Either a field name or a field ID.
        case_sensitive (bool, optional): Whether to perform a case-sensitive lookup using a field name. Defaults to True.

    Raises:
        ValueError: When the value cannot be found.

    Returns:
        NestedField: The matched NestedField.
    """
It throws an error indeed.
This was in there before any PyIceberg release. It follows the Easier to Ask Forgiveness Than Permission (EAFP) style, which is quite popular in Python. I personally like to avoid this pattern and avoid the exceptions, even more so now that we have the walrus `:=` operator.
Agreed. I guess it's established already though.
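For illustration, a minimal self-contained sketch of the two styles in question; a plain dict stands in for the schema lookup, and all names are made up for the example:

```python
# Toy stand-in for Schema.find_field: a mapping from field name to field ID.
fields = {"name": 1, "age": 2}

# EAFP: attempt the lookup and handle the exception afterwards.
def exists_eafp(field_name: str) -> bool:
    try:
        return fields[field_name] is not None
    except KeyError:
        return False

# Walrus-style alternative: a None-returning lookup plus `:=` avoids the exception.
def describe(field_name: str) -> str:
    if (field_id := fields.get(field_name)) is not None:
        return f"{field_name} has field ID {field_id}"
    return f"{field_name} not found"

print(exists_eafp("name"), describe("age"))        # -> True age has field ID 2
print(exists_eafp("missing"), describe("missing"))  # -> False missing not found
```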
python/pyiceberg/table/__init__.py
Outdated
new_type = assign_fresh_schema_ids(field_type, self.assign_new_column_id)
field = NestedField(field_id=new_id, name=name, field_type=new_type, required=required, doc=doc)

self._adds[parent_id] = self._adds.get(parent_id, []) + [field]
Maybe it's me, but using + makes me wonder about the argument types. I find this more clear:
self._adds[parent_id] = tuple(*self._adds.get(parent_id, []), field)
I'm open to making it more readable, but that doesn't work on my machine:
➜ python git:(fd-followup) python3
Python 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> tuple(*[1, 2], 3)

What does work is:
self._adds[parent_id] = self._adds.get(parent_id, tuple()) + (field,)

But then we end up with the same thing as before, but as a tuple. I would suggest refactoring it as:
if parent_id in self._adds:
    self._adds[parent_id].append(field)
else:
    self._adds[parent_id] = [field]

Then we don't recreate the collections all the time (but this isn't the hottest path, probably).
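As a hedged aside (not what the PR ends up using), `dict.setdefault` gives the same in-place behaviour without either the list copy of `+ [field]` or the explicit membership check; strings stand in for `NestedField` values here:

```python
from typing import Dict, List

def add_field(adds: Dict[int, List[str]], parent_id: int, field: str) -> None:
    # setdefault returns the existing list (inserting an empty one if needed), so we append in place.
    adds.setdefault(parent_id, []).append(field)

adds: Dict[int, List[str]] = {}
add_field(adds, -1, "retries")
add_field(adds, -1, "comment")
assert adds == {-1: ["retries", "comment"]}
```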
if self._transaction is not None:
    self._transaction._append_updates(*updates)  # pylint: disable=W0212
    self._transaction._append_requirements(*requirements)  # pylint: disable=W0212
Not a problem, but note that if you run multiple schema updates in the same transaction, the transaction needs to ensure it sends only one `AssertCurrentSchemaId`. Right now that will fail, but in the future we'll want to suppress the additional ones so that only the first schema ID is validated.
Great observation. I checked and we only allow a single AddSchemaUpdate in a transaction. So I think we're okay for now. I've added a test to capture the behavior.
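To make the future behaviour concrete, a hedged sketch of how a transaction could suppress duplicate schema-ID assertions when collecting requirements; this is not part of the PR, and it assumes `AssertCurrentSchemaId` and `TableRequirement` are importable from `pyiceberg.table`:

```python
from typing import List

from pyiceberg.table import AssertCurrentSchemaId, TableRequirement  # assumed import location

def suppress_duplicate_schema_assertions(requirements: List[TableRequirement]) -> List[TableRequirement]:
    """Keep only the first AssertCurrentSchemaId so the initial schema ID is validated once."""
    seen_schema_assertion = False
    kept: List[TableRequirement] = []
    for requirement in requirements:
        if isinstance(requirement, AssertCurrentSchemaId):
            if seen_schema_assertion:
                continue  # drop schema-ID assertions added by later updates in the same transaction
            seen_schema_assertion = True
        kept.append(requirement)
    return kept
```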
raise ValueError(f"Cannot update a column that will be deleted: {full_name}")

if field_type is not None:
    if not self._allow_incompatible_changes and field.field_type != field_type:
The Java equivalent ensures that only primitive fields are updated because the only type that can be passed in is a PrimitiveType. I think we need to have an additional check here that this is not replacing a nested type.
It may be that promote already does this for us, but I want to highlight that we cannot allow calling this to update a struct, map, or list type.
Schema evolution catches this:
Cannot change column type: foo: struct<2: bar: optional string> -> string
I've added a check to make the error more specific:
Cannot change column type: struct<2: bar: optional string> is not a primitive
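A hedged sketch of the kind of guard discussed here (the helper name and message wording are assumptions, not the PR's exact code): only primitive-to-primitive changes are allowed, so struct, map, and list types are rejected up front.

```python
from pyiceberg.types import IcebergType, PrimitiveType

def check_primitive_update(full_name: str, current_type: IcebergType, new_type: IcebergType) -> None:
    # Reject updates where either side is a nested type (struct, map, or list).
    if not isinstance(current_type, PrimitiveType):
        raise ValueError(f"Cannot change column type: {current_type} is not a primitive")
    if not isinstance(new_type, PrimitiveType):
        raise ValueError(f"Cannot change column type of {full_name} to a non-primitive: {new_type}")
```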
@pytest.fixture()
def catalog() -> Catalog:
Why are all of these tests done in integration tests? In Java, we test the schema update API by calling apply to produce the schema that will be committed. Only the commit needs to be done in an integration test that way.
For me it adds another layer of testing, since the REST catalog will also reject certain things (for example, the requirements on the identifier fields); this has already caught a few bugs. I also like to follow the same path that the end user would take.
Okay, I get it now. Seems alright, although the REST catalog really doesn't check much. It would probably be good to move most of the schema evolution behavior tests to unit tests, but the important thing is that they're happening somewhere.
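For reference, a hedged sketch of what such a unit test could look like; the `table` fixture and the call to the private `_apply()` are assumptions for illustration, and nothing is committed to a catalog:

```python
from pyiceberg.types import LongType

def test_add_column_applies_without_commit(table) -> None:  # `table` is an assumed fixture
    update = table.update_schema()
    update.add_column("retries", LongType(), doc="number of retries")
    applied = update._apply()  # pylint: disable=W0212
    assert applied.find_field("retries").field_type == LongType()
```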
Looks really good to me! I think the only serious issues are:
- Schema ID is not being assigned in commit, which corrupts metadata
- Tests are run as integration tests, but most could be converted to unit tests that call `_apply`
- `update_column` always creates an entry in `changes` even if the change should be a noop
Awesome work, @Fokko! This is really close.
This was a great catch, updated this! The REST catalog actually handled this correctly, but now it is also fixed and tested outside of the REST catalog.
I ported some tests outside of the integration tests as well. As mentioned above, I really like testing against the REST catalog, since it uses the same public API that the end user uses (which is a good thing from an OOP perspective), but we're testing much more code than we actually need. I also ported some tests back to avoid having only integration tests, but the majority are still integration tests because I prefer them. Let me know if you want me to port the rest back as well.
Added and tested as well, also removed the
I accidentally tagged this PR in another PR that was closed.
update.move_after("bid", "ask")
# In a struct, only the new name field
update.move_before("details.exchange", "properties.created_by")
# This will move `confirmed_by` before `exchange`
Nit: created_by is moved, not confirmed_by.
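To spell out the semantics behind the nit, a hedged usage sketch (the field names and the `reorder_columns` helper are assumptions for illustration): the first argument to `move_before`/`move_after` is the field being moved, the second is the reference field.

```python
from pyiceberg.table import Table

def reorder_columns(table: Table) -> None:
    with table.update_schema() as update:
        # move_after(moved, reference): `bid` is placed right after `ask`
        update.move_after("bid", "ask")
        # move_before(moved, reference): `created_by` is moved before `exchange`
        update.move_before("details.created_by", "details.exchange")
```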
rdblue left a comment:
Looks good to me. I don't understand why so many schema tests are run as integration tests, but at least they run somewhere. We can get this in to unblock the release at least.