Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Jul 15, 2022

No description provided.

@Fokko Fokko force-pushed the fd-rest-catalog branch from d8688fe to f1a986e Compare July 15, 2022 17:00
partition_spec: PartitionSpec | None = None,
partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
sort_order: SortOrder = UNSORTED_SORT_ORDER,
properties: Properties | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we similarly default properties to {}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like that, but that's actually a code smell in Python (or a quirk in the language). The default {} will be a reference to a single object, if you mutate that one, the next time the default value is being assigned, it will give the reference to the same object. More info here: https://florimond.dev/en/posts/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ So the recommended way is just to set it to null, and then do properties or {} in the code. This {} will then always initialize an empty dict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no immutable dict?

Copy link
Contributor Author

@Fokko Fokko Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in the standard lib, there is a package, or we can implement it ourselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe later, but this is fine for now.

"""

identifier: str | Identifier
class Table(IcebergBaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Table need to be serializable? I'm curious why Table extends IcebergBaseModel. It seems like it should be a regular class to me.

Copy link
Contributor Author

@Fokko Fokko Jul 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it needs to be serializable from/to json. It is the response of the REST API: https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L1481-L1503

The CreateTableResponse is an alias of the LoadTableResponse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the LoadTableResponse should be the same class as Table. Table is part of the public API and will expose a lot of methods and operations to users. LoadTableResponse is specific to the REST catalog and includes fields that aren't part of the table (config).

Copy link
Contributor Author

@Fokko Fokko Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined by the Catalog:

@abstractmethod
def create_table(
self,
identifier: str | Identifier,
schema: Schema,
location: str | None = None,
partition_spec: PartitionSpec | None = None,
properties: Properties | None = None,
) -> Table:
"""Create a table
Args:
identifier: Table identifier.
schema: Table's schema.
location: Location for the table. Optional Argument.
partition_spec: PartitionSpec for the table. Optional Argument.
properties: Table properties that can be a string based dictionary. Optional Argument.
Returns:
Table: the created table instance
Raises:
AlreadyExistsError: If a table with the name already exists
"""

How about subclassing the table:

class RestTable(Table):
    config: Dict[str, str] = Field(default_factory=dict)

And moving the specific items to there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will want to evolve this in the future, but it is okay for now.

We will need to think about whether we want to use the same TableOperations abstraction that Java uses. That allows us to have a base table implementation that can work with any catalog. In that design, the table wraps a TableMetadata that is handled (refreshed, committed) by TableOperations.

@Fokko Fokko force-pushed the fd-rest-catalog branch from c913887 to 49acfb0 Compare July 17, 2022 21:49
@Fokko Fokko force-pushed the fd-rest-catalog branch from 70579bf to 4f373f2 Compare July 21, 2022 12:49
identifier = self.identifier_to_tuple(identifier)
return {"namespace": NAMESPACE_SEPARATOR.join(identifier[:-1]), "table": identifier[-1]}

def _split_identifier_for_json(self, identifier: Union[str, Identifier]) -> Dict[str, Union[Tuple[str, ...], str]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The return type could be Dict[str, Union[str, Identifier]] instead of using Tuple[str, ...].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one, thanks!

self._handle_non_200_response(exc, {404: NoSuchNamespaceError})
response = ListTablesResponse(**response.json())

return [(self.name, *entry.namespace, entry.name) for entry in response.identifiers]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should include self.name because we want to be able to pass any of the returned identifiers back into the catalog. The only place where we want to add the catalog name for context is when we create the Table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I've reverted this change.



class TableResponse(IcebergBaseModel):
metadata_location: Optional[str] = Field(alias="metadata-location", default=None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is optional in the spec, but I think we want to make it required in the spec. Otherwise, you wouldn't be able to tell when the metadata changes. The refresh call will call the load route again and update the metadata if it has changed.

Copy link
Contributor Author

@Fokko Fokko Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having no metadata-location doesn't make much sense to me:

This is from the spec:

The table metadata JSON is returned in the metadata field. The corresponding file location of table metadata should be returned in the metadata-location field, unless the metadata is not yet committed. For example, a create transaction may return metadata that is staged but not committed.

Also, the location is in the metadata itself, where it is a required field, so I don't see how this can be optional. I can create a PR to the spec if you like (but we could also omit it since it is part of the metadata).

try:
response.raise_for_status()
except HTTPError as exc:
self._handle_non_200_response(exc, {404: NoSuchNamespaceError})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be NoSuchTableException. If you try to drop a table and get a 404, then the table doesn't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try:
response.raise_for_status()
except HTTPError as exc:
self._handle_non_200_response(exc, {404: NoSuchNamespaceError, 409: TableAlreadyExistsError})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be NoSuchTableError: https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L752-L753

Looks like the spec has an incorrect example there, but the 404 for this route means there was no such table to rename.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. Here you suggested using NoSuchNamespaceError: #5287 (comment) Let me go over the spec and see which one to throw

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, 404 is context-specific. My comment was for the list_tables method, where a 404 indicates that the namespace does not exist and can't be listed. That shouldn't raise NoSuchTableError because 404 doesn't mean the namespace was empty -- that would result in an empty list of identifiers -- 404 means that the namespace itself doesn't exist.

Sorry for the confusion, I know it's a bit odd to have the meaning of 404 change based on the route, but it makes a bit more sense when you think about routes as identifying resources. /namespace/ns/tables primarily identifies a namespace and is interacting with the table set of that namespace.

@Fokko Fokko force-pushed the fd-rest-catalog branch from 501b949 to dc79b27 Compare July 28, 2022 07:38
# under the License.


class NoSuchTableError(Exception):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we want to change these errors, since they are independent of the REST catalog. I'd leave the existing ones unchanged and just make the REST-specific errors extend RESTError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same concern, also others could be applicable outside of the REST contest. I'll move NoSuchNamespaceError and NoSuchTableError back to inherit from Exception.

token: str
config: Properties

host: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this should be uri, since that better describes the behavior: this is the base URI, not a host that will be embedded in https://{host}:443/ to build URIs. That also matches the common options used by Java and are documented as common catalog config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one, I've renamed this one! 👍🏻

token: The bearer token
"""
self.host = host
if client_id and client_secret:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be passed as a single string? I'm fine supporting both credential and client_id/client_secret. I'd just like to be consistent across implementations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing 👍🏻

elif token:
self.token = token
else:
raise ValueError("Either set the client_id and client_secret, or provide a valid token")
Copy link
Contributor

@rdblue rdblue Jul 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should fail if there isn't a valid token. Not all environments will require auth. Instead of failing, this should just fall back to not passing the Authentication header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think of that, updated the code, thanks! 👍🏻


def load_table(self, identifier: Union[str, Identifier]) -> Table:
response = requests.get(
self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier)), headers=self.headers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefixed=True is the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less is more 👍🏻

json={
"error": {
"message": "Table does not exist: fokko.fokko2 in warehouse 8bcb0838-50fc-472d-9ddb-8feb89ef5f1e",
"type": "NoSuchNamespaceErrorException",
Copy link
Contributor

@rdblue rdblue Jul 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: NoSuchTableException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where that comes from 🤔

@rdblue rdblue merged commit 9927572 into apache:master Jul 30, 2022
@rdblue
Copy link
Contributor

rdblue commented Jul 30, 2022

Awesome work, @Fokko! There are still a few minor comments, but there are no blockers so I merged this. Thanks!

@Fokko Fokko deleted the fd-rest-catalog branch July 31, 2022 19:56
Fokko added a commit to Fokko/iceberg that referenced this pull request Jul 31, 2022
Some small comments pending from:
apache#5287
Fokko added a commit to Fokko/iceberg that referenced this pull request Jul 31, 2022
Some small comments pending from:
apache#5287
abmo-x pushed a commit to abmo-x/iceberg that referenced this pull request Aug 2, 2022
This does not implement table commits, but does implement the catalog portion of the REST API.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants