Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rest Catalog: catalog.name should not be part of namespace #742

Closed
c-thiel opened this issue May 14, 2024 · 5 comments · Fixed by #964 · May be fixed by #963
Closed

Rest Catalog: catalog.name should not be part of namespace #742

c-thiel opened this issue May 14, 2024 · 5 comments · Fixed by #964 · May be fixed by #963
Assignees

Comments

@c-thiel
Copy link
Contributor

c-thiel commented May 14, 2024

Apache Iceberg version

main (development)

Please describe the bug 🐞

This is a harder one:

I am currently unhappy with the way pyiceberg handles the RestCatalogs name property.
There is one probably undisputed bug here:

return {"namespace": NAMESPACE_SEPARATOR.join(identifier.namespace.root[1:]), "table": identifier.name}

In the previous line we cut the first segment of the namespace (presumably because this is the name) - even if the warehouse name is None. If there warehouse name is None, it isn't even added to the identifier namespace:

identifier=(self.name,) + identifier_tuple if self.name else identifier_tuple,

(notice the if self.name)

Thus, when name is None or "", the first bit of the namespace identifier is removed in L356 (first snippet) wrongly.
If the catalog has a name, we could argue that removing it is correct.

However there is a second Problem:
I would argue that the RestCatalogs name should NEVER be prepended to a Tables namespace identifier in the first place.
And pyiceberg has this opinion too for 50% ;) - unfortunately the namespace provided in the url and in some requests diverge.
With the script at the bottom of this issue I receive the following request to the endpoint POST http://localhost:8080/catalog/v1/2cb6e8f0-0b7e-11ef-a4fb-c3ca318d8520/namespaces/my_namespace/tables/my_table (so namespace "my_namespace"):

{
    "identifier": {
        "namespace": [
            "my_catalog_name",
            "my_namespace"
        ],
        "name": "my_table"
    },
    "requirements": [
        ...
    ],
    "updates": [
        ...
    ]
}

Notice how the namespace provided in the body is different from the namespace in the URL. The URL is not prefixed (due to the [:1] that the path magic function does as seen above), while the body contains a prefixed namespace.

My solution would be to get rid of the name as part of the namespace altogether, thus removing the name prefix here:

identifier=(self.name,) + identifier_tuple if self.name else identifier_tuple,

And consequently also the remove the [1:] here:

return {"namespace": NAMESPACE_SEPARATOR.join(identifier.namespace.root[1:]), "table": identifier.name}

Any thoughts welcome!
My baseline is that at least the URL and the body should match.

My code:

import pandas as pd
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog_name",
    uri="http://localhost:8080/catalog/",
    warehouse="test",
    # For now we need a dummy token even for unauthenticated catalogs.
    # This is a requirement of pyiceberg.
    token="dummy",
)

# Lets work with the catalog
namespace = ("my_namespace",)
if namespace not in catalog.list_namespaces():
    catalog.create_namespace(namespace)

df = pd.DataFrame(
    [[1, 1.2, "foo"], [2, 2.2, "bar"]], columns=["my_ints", "my_floats", "strings"]
)
tbl = pa.Table.from_pandas(df)
table = catalog.create_table((*namespace, "my_table"), schema=tbl.schema)
table.overwrite(tbl) # <-- This is the important part
@c-thiel
Copy link
Contributor Author

c-thiel commented Jun 3, 2024

@Fokko do you maybe have some thoughts on this? I am happy to prepare a PR, but would like to get some Feedback first.

@kevinjqliu
Copy link
Contributor

I would argue that the RestCatalogs name should NEVER be prepended to a Tables namespace identifier in the first place.

+1, none of the other catalog implementations include the "catalog name" of part of the namespace.

@sungwy
Copy link
Collaborator

sungwy commented Jul 26, 2024

@corleyma and @c-thiel - @HonahX put together a fix for this release: #964

Could we ask for your help in confirming if this resolves your issue?

@c-thiel
Copy link
Contributor Author

c-thiel commented Jul 26, 2024

@sungwy I am on it. Give me a few minutes. Thank you so much for thinking of us!

@c-thiel
Copy link
Contributor Author

c-thiel commented Jul 26, 2024

@sungwy, @HonahX it works - including S3 remote signing! Green light from my side. Thank you so much for squeezing it in.
Once 0.7.0 is released, I'll add a bunch of tests from our side so that we notice if something breaks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment