Python: First version of the rest catalog #5287
| partition_spec: PartitionSpec | None = None, | ||
| partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC, | ||
| sort_order: SortOrder = UNSORTED_SORT_ORDER, | ||
| properties: Properties | None = None, |
Should we similarly default properties to {}?
I would like that, but a mutable default argument is a well-known pitfall in Python (a quirk of the language). The default {} is evaluated once, when the function is defined, so every call shares a reference to the same dict: if you mutate it, the next call that falls back to the default sees the mutated object. More info here: https://florimond.dev/en/posts/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ So the recommended way is to set the default to None and then use properties or {} in the code. This {} always initializes a new empty dict.
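To make the pitfall concrete, here is a minimal sketch contrasting a mutable default with the None-plus-`or {}` idiom. The function names are hypothetical and not taken from the PR.

```python
from typing import Dict, Optional

Properties = Dict[str, str]


def create_table_bad(name: str, properties: Properties = {}) -> Properties:
    # The {} default is evaluated once, at function definition time,
    # so every call that omits `properties` shares the same dict object.
    properties["name"] = name
    return properties


def create_table_good(name: str, properties: Optional[Properties] = None) -> Properties:
    # `properties or {}` builds a fresh dict on each call when nothing is passed.
    properties = properties or {}
    properties["name"] = name
    return properties


first_bad = create_table_bad("a")
second_bad = create_table_bad("b")    # mutates the dict returned by the first call
first_good = create_table_good("a")
second_good = create_table_good("b")  # independent of the first call
```

With the shared default, `first_bad` and `second_bad` are the same object; with the None default, each call gets its own dict.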
There's no immutable dict?
Not in the standard lib. There is a package, or we can implement it ourselves.
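For what it's worth, the standard library does provide a read-only view, `types.MappingProxyType`. It is not a true immutable dict (it is not hashable, and it reflects changes to the dict it wraps), but it may be enough to make a shared default safe. A hedged sketch, with a hypothetical function and constant name:

```python
from types import MappingProxyType
from typing import Dict, Mapping

# Read-only view over an empty dict; safe to share as a default value.
EMPTY_PROPERTIES: Mapping[str, str] = MappingProxyType({})


def create_table(name: str, properties: Mapping[str, str] = EMPTY_PROPERTIES) -> Dict[str, str]:
    # Copy into a fresh dict before adding anything, so the default is never touched.
    result = dict(properties)
    result["name"] = name
    return result


created = create_table("a")

try:
    EMPTY_PROPERTIES["key"] = "value"  # type: ignore[index]
    mutated = True
except TypeError:
    # A mappingproxy does not support item assignment.
    mutated = False
```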
Maybe later, but this is fine for now.
| """ | ||
| identifier: str | Identifier | ||
| class Table(IcebergBaseModel): |
Does Table need to be serializable? I'm curious why Table extends IcebergBaseModel. It seems like it should be a regular class to me.
Yes, it needs to be serializable from/to json. It is the response of the REST API: https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L1481-L1503
The CreateTableResponse is an alias of the LoadTableResponse.
I don't think the LoadTableResponse should be the same class as Table. Table is part of the public API and will expose a lot of methods and operations to users. LoadTableResponse is specific to the REST catalog and includes fields that aren't part of the table (config).
This is defined by the Catalog:
iceberg/python/pyiceberg/catalog/base.py
Lines 54 to 77 in 8495141
| @abstractmethod | |
| def create_table( | |
| self, | |
| identifier: str | Identifier, | |
| schema: Schema, | |
| location: str | None = None, | |
| partition_spec: PartitionSpec | None = None, | |
| properties: Properties | None = None, | |
| ) -> Table: | |
| """Create a table | |
| Args: | |
| identifier: Table identifier. | |
| schema: Table's schema. | |
| location: Location for the table. Optional Argument. | |
| partition_spec: PartitionSpec for the table. Optional Argument. | |
| properties: Table properties that can be a string based dictionary. Optional Argument. | |
| Returns: | |
| Table: the created table instance | |
| Raises: | |
| AlreadyExistsError: If a table with the name already exists | |
| """ |
How about subclassing the table:
class RestTable(Table):
    config: Dict[str, str] = Field(default_factory=dict)
And moving the REST-specific items there.
I think we will want to evolve this in the future, but it is okay for now.
We will need to think about whether we want to use the same TableOperations abstraction that Java uses. That allows us to have a base table implementation that can work with any catalog. In that design, the table wraps a TableMetadata that is handled (refreshed, committed) by TableOperations.
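A rough sketch of what that design could look like in Python, loosely modeled on the Java interface. Every class, method, and field name below is illustrative and not part of this PR; the in-memory implementation only stands in for a real catalog.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class TableMetadata:
    """Stand-in for the real metadata model, with two illustrative fields."""

    location: str
    properties: Dict[str, str] = field(default_factory=dict)


class TableOperations(ABC):
    @abstractmethod
    def current(self) -> TableMetadata:
        """Return the metadata as last seen by this client."""

    @abstractmethod
    def refresh(self) -> TableMetadata:
        """Reload the metadata from the catalog and return it."""

    @abstractmethod
    def commit(self, base: TableMetadata, updated: TableMetadata) -> None:
        """Replace `base` with `updated`, failing if `base` is stale."""


class InMemoryTableOperations(TableOperations):
    """Toy implementation; a REST version would call the load and commit routes."""

    def __init__(self, metadata: TableMetadata) -> None:
        self._metadata = metadata

    def current(self) -> TableMetadata:
        return self._metadata

    def refresh(self) -> TableMetadata:
        # Nothing to reload in memory; a real implementation would re-fetch here.
        return self._metadata

    def commit(self, base: TableMetadata, updated: TableMetadata) -> None:
        if base != self._metadata:
            raise RuntimeError("Commit failed: base metadata is stale")
        self._metadata = updated


class Table:
    """Catalog-agnostic table that delegates metadata handling to its operations."""

    def __init__(self, identifier: Tuple[str, ...], ops: TableOperations) -> None:
        self.identifier = identifier
        self._ops = ops

    @property
    def metadata(self) -> TableMetadata:
        return self._ops.current()

    def refresh(self) -> "Table":
        self._ops.refresh()
        return self


ops = InMemoryTableOperations(TableMetadata(location="s3://bucket/db/tbl"))
table = Table(("db", "tbl"), ops)
ops.commit(table.metadata, TableMetadata(location="s3://bucket/db/tbl", properties={"owner": "fokko"}))
```

The point of the split is that `Table` never needs to know which catalog it came from; only the `TableOperations` implementation does.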
python/pyiceberg/catalog/rest.py
Outdated
| identifier = self.identifier_to_tuple(identifier) | ||
| return {"namespace": NAMESPACE_SEPARATOR.join(identifier[:-1]), "table": identifier[-1]} | ||
| def _split_identifier_for_json(self, identifier: Union[str, Identifier]) -> Dict[str, Union[Tuple[str, ...], str]]: |
Nit: The return type could be Dict[str, Union[str, Identifier]] instead of using Tuple[str, ...].
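A small sketch of the suggested annotation, assuming `Identifier` is an alias for `Tuple[str, ...]` and that identifiers given as strings are dot-separated. The standalone helpers and the `"name"` key are illustrative, not the exact code in the PR.

```python
from typing import Dict, Tuple, Union

# Assumption: Identifier is an alias for a tuple of name parts.
Identifier = Tuple[str, ...]


def identifier_to_tuple(identifier: Union[str, Identifier]) -> Identifier:
    # Assumption: string identifiers use "." as the separator.
    return tuple(identifier.split(".")) if isinstance(identifier, str) else identifier


def split_identifier_for_json(identifier: Union[str, Identifier]) -> Dict[str, Union[str, Identifier]]:
    parts = identifier_to_tuple(identifier)
    # The namespace stays a tuple (serialized as a JSON list); the last part is the table name.
    return {"namespace": parts[:-1], "name": parts[-1]}
```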
Nice one, thanks!
python/pyiceberg/catalog/rest.py
Outdated
| self._handle_non_200_response(exc, {404: NoSuchNamespaceError}) | ||
| response = ListTablesResponse(**response.json()) | ||
| return [(self.name, *entry.namespace, entry.name) for entry in response.identifiers] |
I don't think this should include self.name because we want to be able to pass any of the returned identifiers back into the catalog. The only place where we want to add the catalog name for context is when we create the Table.
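The round-trip property can be sketched with a toy catalog. `FakeCatalog` and its storage are hypothetical; the only point is that `list_tables` returns identifiers without the catalog name, so they can be fed straight back into `load_table`.

```python
from typing import Dict, List, Tuple

Identifier = Tuple[str, ...]


class FakeCatalog:
    """Toy catalog backed by a dict of identifier -> table location."""

    def __init__(self, name: str, tables: Dict[Identifier, str]) -> None:
        self.name = name
        self._tables = tables

    def list_tables(self, namespace: Identifier) -> List[Identifier]:
        # No catalog name here: the result must be usable as input to this catalog.
        return [ident for ident in self._tables if ident[:-1] == namespace]

    def load_table(self, identifier: Identifier) -> Tuple[Identifier, str]:
        # The catalog name is added for context only when the table object is built.
        return ((self.name, *identifier), self._tables[identifier])


catalog = FakeCatalog(
    "prod",
    {("db", "a"): "s3://bucket/a", ("db", "b"): "s3://bucket/b", ("other", "c"): "s3://bucket/c"},
)
identifiers = catalog.list_tables(("db",))
loaded = [catalog.load_table(identifier) for identifier in identifiers]
```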
Got it, I've reverted this change.
| class TableResponse(IcebergBaseModel): | ||
| metadata_location: Optional[str] = Field(alias="metadata-location", default=None) |
I see that this is optional in the spec, but I think we want to make it required in the spec. Otherwise, you wouldn't be able to tell when the metadata changes. The refresh call will call the load route again and update the metadata if it has changed.
Having no metadata-location doesn't make much sense to me:
This is from the spec:
The table metadata JSON is returned in the metadata field. The corresponding file location of table metadata should be returned in the metadata-location field, unless the metadata is not yet committed. For example, a create transaction may return metadata that is staged but not committed.
Also, the location is in the metadata itself, where it is a required field, so I don't see how this can be optional. I can create a PR to the spec if you like (but we could also omit it since it is part of the metadata).
python/pyiceberg/catalog/rest.py
Outdated
| try: | ||
| response.raise_for_status() | ||
| except HTTPError as exc: | ||
| self._handle_non_200_response(exc, {404: NoSuchNamespaceError}) |
I think this should be NoSuchTableException. If you try to drop a table and get a 404, then the table doesn't exist.
python/pyiceberg/catalog/rest.py
Outdated
| try: | ||
| response.raise_for_status() | ||
| except HTTPError as exc: | ||
| self._handle_non_200_response(exc, {404: NoSuchNamespaceError, 409: TableAlreadyExistsError}) |
This should also be NoSuchTableError: https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L752-L753
Looks like the spec has an incorrect example there, but the 404 for this route means there was no such table to rename.
I'm confused. Here you suggested using NoSuchNamespaceError: #5287 (comment). Let me go over the spec and see which one to raise.
Yeah, 404 is context-specific. My comment was for the list_tables method, where a 404 indicates that the namespace does not exist and can't be listed. That shouldn't raise NoSuchTableError because 404 doesn't mean the namespace was empty -- that would result in an empty list of identifiers -- 404 means that the namespace itself doesn't exist.
Sorry for the confusion, I know it's a bit odd to have the meaning of 404 change based on the route, but it makes a bit more sense when you think about routes as identifying resources. /namespace/ns/tables primarily identifies a namespace and is interacting with the table set of that namespace.
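One way to express the route-specific meaning of 404 is a per-route mapping from status code to exception. The sketch below takes a plain status code rather than an HTTPError to stay self-contained; the mapping constants and the `error_type_for` helper are hypothetical, and only the idea of passing a `{status: exception}` dict comes from the PR.

```python
from typing import Dict, Type


class RESTError(Exception):
    """Fallback for status codes without a route-specific meaning."""


class NoSuchNamespaceError(Exception):
    pass


class NoSuchTableError(Exception):
    pass


class TableAlreadyExistsError(Exception):
    pass


# The same status code maps to a different error depending on the resource the route identifies.
LIST_TABLES_ERRORS: Dict[int, Type[Exception]] = {404: NoSuchNamespaceError}
DROP_TABLE_ERRORS: Dict[int, Type[Exception]] = {404: NoSuchTableError}
RENAME_TABLE_ERRORS: Dict[int, Type[Exception]] = {404: NoSuchTableError, 409: TableAlreadyExistsError}


def error_type_for(status_code: int, error_handler: Dict[int, Type[Exception]]) -> Type[Exception]:
    # Unmapped status codes fall back to the generic REST error.
    return error_handler.get(status_code, RESTError)


def handle_non_200_response(status_code: int, message: str, error_handler: Dict[int, Type[Exception]]) -> None:
    raise error_type_for(status_code, error_handler)(f"{status_code}: {message}")
```

A 404 from listing tables raises `NoSuchNamespaceError`, while a 404 from dropping or renaming a table raises `NoSuchTableError`.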
| # under the License. | ||
| class NoSuchTableError(Exception): |
I don't think that we want to change these errors, since they are independent of the REST catalog. I'd leave the existing ones unchanged and just make the REST-specific errors extend RESTError
I had the same concern. Others could also be applicable outside of the REST context. I'll move NoSuchNamespaceError and NoSuchTableError back to inherit from Exception.
| token: str | ||
| config: Properties | ||
| host: str |
I still think this should be uri, since that better describes the behavior: this is the base URI, not a host that will be embedded in https://{host}:443/ to build URIs. That also matches the common options used by Java, which are documented as common catalog config.
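A minimal sketch of the distinction: the configured value is used as a base URI that endpoint paths are appended to, rather than a bare host. The `build_url` helper and the endpoint template are illustrative and not the PR's actual `url` method.

```python
def build_url(uri: str, endpoint: str, **path_params: str) -> str:
    # Treat `uri` as a complete base URI (scheme, host, optional port and path)
    # and only make sure exactly one "/" separates it from the endpoint.
    base = uri if uri.endswith("/") else uri + "/"
    return base + endpoint.format(**path_params)


# Illustrative endpoint template, filled in with a namespace.
tables_url = build_url("https://catalog.example.com/ws", "v1/namespaces/{namespace}/tables", namespace="db")
```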
Good one, I've renamed this one! 👍🏻
| token: The bearer token | ||
| """ | ||
| self.host = host | ||
| if client_id and client_secret: |
Can these be passed as a single string? I'm fine supporting both credential and client_id/client_secret. I'd just like to be consistent across implementations.
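Supporting both forms could look roughly like this. The sketch assumes the single-string credential has the form `<client_id>:<client_secret>`, which is an assumption on my part; the `resolve_credentials` helper is hypothetical.

```python
from typing import Optional, Tuple


def resolve_credentials(
    credential: Optional[str] = None,
    client_id: Optional[str] = None,
    client_secret: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    if credential:
        # Assumption: "<client_id>:<client_secret>". Split on the first colon only,
        # so a secret that itself contains ":" is kept intact.
        client_id, _, client_secret = credential.partition(":")
    if client_id and client_secret:
        return client_id, client_secret
    # Neither form was provided (or the credential was malformed).
    return None
```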
Sure thing 👍🏻
| elif token: | ||
| self.token = token | ||
| else: | ||
| raise ValueError("Either set the client_id and client_secret, or provide a valid token") |
I don't think we should fail if there isn't a valid token. Not all environments will require auth. Instead of failing, this should just fall back to not passing the Authorization header.
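The fallback could be as simple as building the headers conditionally. A sketch with a hypothetical `build_headers` helper; the Content-Type value is an assumption:

```python
from typing import Dict, Optional


def build_headers(token: Optional[str] = None) -> Dict[str, str]:
    headers = {"Content-Type": "application/json"}
    if token:
        # Only send the bearer token when one is configured.
        headers["Authorization"] = f"Bearer {token}"
    return headers


anonymous_headers = build_headers()
authenticated_headers = build_headers("some-token")
```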
I didn't think of that, updated the code, thanks! 👍🏻
| def load_table(self, identifier: Union[str, Identifier]) -> Table: | ||
| response = requests.get( | ||
| self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier)), headers=self.headers |
Nit: prefixed=True is the default.
Less is more 👍🏻
| json={ | ||
| "error": { | ||
| "message": "Table does not exist: fokko.fokko2 in warehouse 8bcb0838-50fc-472d-9ddb-8feb89ef5f1e", | ||
| "type": "NoSuchNamespaceErrorException", |
Nit: NoSuchTableException?
Not sure where that comes from 🤔
Awesome work, @Fokko! There are still a few minor comments, but there are no blockers so I merged this. Thanks!
Some small comments pending from: apache#5287
This does not implement table commits, but does implement the catalog portion of the REST API.