Merged
169 changes: 169 additions & 0 deletions site/content/in-dev/unreleased/generic-table.md
@@ -0,0 +1,169 @@
---
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
title: Generic Table (Beta)
type: docs
weight: 435
---

Generic Table support in Apache Polaris is designed to support non-Iceberg tables across different table formats, including Delta, CSV, etc. It currently provides the following capabilities:
- Create a generic table under a namespace
- Load a generic table
Contributor

nit: the term "load", given other conversations under this PR, is still a bit confusing, IMHO, because it resonates with Iceberg's loadTable, which provides the table's metadata... However, this call is more like "get properties". All in all, I think it's ok since subsequent doc sections provide more clarity.

Contributor

@eric-maynard eric-maynard Jun 18, 2025

The API itself is actually named loadGenericTable, so I don't think it's exactly misleading. It does load the generic table's metadata, which is whatever metadata was registered in the catalog during createGenericTable. This is very similar to the behavior of, say, the HMS's getTable. Actually cracking open the metadata.json and returning its contents in the IRC is the exception, not the rule.

Member

This is a confusing one but to match behavior of other systems with similar functionality I think load or get is probably correct

Contributor Author

Yeah, I was mainly trying to match the terms other systems use ("load" or "get"), and I don't think "load" has to refer specifically to Iceberg metadata.

- Drop a generic table
- List all generic tables under a namespace

**NOTE** Generic table support is currently in beta. Please use it with caution and report any issues you encounter.

## What is a Generic Table?

A generic table in Polaris is an entity that defines the following fields:

- **name** (required): A unique identifier for the table within a namespace
- **format** (required): The format for the generic table, e.g. "delta", "csv"
- **base-location** (optional): Table base location in URI format. For example: s3://<my-bucket>/path/to/table
  - The table base location is a location that includes all files for the table
  - A table with multiple disjoint locations (i.e. containing files that are outside the configured base location) is not compliant with the current generic table support in Polaris.
Comment on lines +39 to +41
Member

This prevents leveraging "object store friendly paths", no?

Contributor Author

Do you mean volume usage? I think how volumes are going to be supported in Polaris has not been discussed yet. Since this is a beta feature, if we decide to support use cases with multiple locations, we can evolve quickly to support this.

  - If no location is provided, clients or users are responsible for managing the location.
- **properties** (optional): Properties for the generic table passed on creation.
  - Currently, there is no reserved property key defined.
  - The definition and interpretation of properties are delegated to client or engine implementations.
- **doc** (optional): Comment or description for the table

## Generic Table API Vs. Iceberg Table API

The Generic Table API provides a separate set of endpoints that operate on generic table entities, while the Iceberg APIs operate on
Iceberg table entities.

| Operations   | **Iceberg Table API** | **Generic Table API** |
|--------------|-----------------------|-----------------------|
| Create Table | Create an Iceberg table | Create a generic table |
| Load Table   | Load an Iceberg table. If the table to load is a generic table, you need to call the Generic Table load API instead; otherwise a TableNotFoundException is thrown. | Load a generic table. Similarly, trying to load an Iceberg table through the Generic Table API throws a TableNotFoundException. |
| Drop Table   | Drop an Iceberg table. As with load, if the table to drop is a generic table, a TableNotFoundException is thrown. | Drop a generic table. Dropping an Iceberg table through the Generic Table endpoint throws a TableNotFoundException. |
| List Table   | List all Iceberg tables | List all generic tables |

Note that generic tables share the same namespace as Iceberg tables, so a table name has to be unique within a namespace. Furthermore, since
there is currently no support for updating a generic table, any change to an existing table requires dropping and re-creating it.

## Working with Generic Table

There are two ways to work with Polaris Generic Tables today:
1) Communicate directly with Polaris through REST API calls using tools such as `curl`. Details are described in the sections below.
2) Use the Spark client provided if you are working with Spark. Please refer to [Polaris Spark Client]({{% ref "polaris-spark-client" %}}) for detailed instructions.

### Create a Generic Table

To create a generic table, you need to provide the corresponding fields as described in [What is a Generic Table](#what-is-a-generic-table).

The REST API for creating a generic table is `POST /polaris/v1/{prefix}/namespaces/{namespace}/generic-tables`, and the
request body looks like the following:

```json
{
  "name": "<table_name>",
  "format": "<table_format>",
  "base-location": "<table_base_location>",
  "doc": "<comment or description for table>",
  "properties": {
    "<property-key>": "<property-value>"
  }
}
```

Here is an example that creates a generic table named `delta_table` with format `delta` under the namespace `delta_ns`
Contributor

Since there's no "update" API, what happens if there's a mistake in the initial create request? Will the client have to delete and re-create the table?

I do not mean to cause API changes at this point, just trying to clarify things for potential non-Polaris readers.

Contributor

@eric-maynard eric-maynard Jun 16, 2025

Currently, yes. I think adding update is a good idea but the intent here should be to document the existing behavior.

IIRC the motivation to not have update in v0 was due to a potential lack of clarity around what responsibilities the catalog takes on for updates (i.e. it's not the same as an Iceberg update where the catalog writes metadata).

Contributor Author

sorry for the late reply! as we don't have update capability today, rename will require a re-creation, I added some description at section Generic Table API Vs. Iceberg Table API

for catalog `delta_catalog` using curl:

```shell
curl -X POST http://localhost:8181/api/catalog/polaris/v1/delta_catalog/namespaces/delta_ns/generic-tables \
  -H "Content-Type: application/json" \
  -d '{
    "name": "delta_table",
    "format": "delta",
    "base-location": "s3://<my-bucket>/path/to/table",
    "doc": "delta table example",
    "properties": {
      "key1": "value1"
    }
  }'
```
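
Since `base-location`, `doc`, and `properties` are optional, a create request can carry only `name` and `format`. The following is a minimal sketch of that case. The helper `minimal_generic_table_body` and the table name `csv_table` are hypothetical, and the helper assumes its arguments need no JSON escaping. Authentication headers are omitted, as in the example above.

```shell
# Hypothetical helper: build a create-request body with only the required fields.
# $1 = table name, $2 = table format. Arguments are assumed to need no JSON escaping.
minimal_generic_table_body() {
  printf '{"name": "%s", "format": "%s"}' "$1" "$2"
}

# Example (not run here): create a CSV table without a base-location.
# curl -X POST http://localhost:8181/api/catalog/polaris/v1/delta_catalog/namespaces/delta_ns/generic-tables \
#   -H "Content-Type: application/json" \
#   -d "$(minimal_generic_table_body csv_table csv)"
```

With no `base-location` in the request, managing the table location is left to the client or user, as described in the field list.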

### Load a Generic Table
The REST endpoint for loading a generic table is `GET /polaris/v1/{prefix}/namespaces/{namespace}/generic-tables/{generic-table}`.

Here is an example to load the table `delta_table` using curl:
```shell
curl -X GET http://localhost:8181/api/catalog/polaris/v1/delta_catalog/namespaces/delta_ns/generic-tables/delta_table
```
And the response looks like the following:
```json
{
  "table": {
    "name": "delta_table",
    "format": "delta",
    "base-location": "s3://<my-bucket>/path/to/table",
    "doc": "delta table example",
    "properties": {
      "key1": "value1"
    }
  }
}
```
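
A client often needs only a few fields from this response, for instance `base-location` to hand to the engine. As a rough sketch, the hypothetical helper below pulls that value out of a response read from standard input using `sed`. It assumes the field appears at most once per line and contains no escaped quotes; a real JSON parser is the better choice outside of quick scripts.

```shell
# Hypothetical helper: print the "base-location" value found in a
# load-table JSON response supplied on standard input.
# Prints nothing if the field is absent.
base_location_of() {
  sed -n 's/.*"base-location"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (not run here):
# curl -s http://localhost:8181/api/catalog/polaris/v1/delta_catalog/namespaces/delta_ns/generic-tables/delta_table \
#   | base_location_of
```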

### List Generic Tables
The REST endpoint for listing the generic tables under a given
namespace is `GET /polaris/v1/{prefix}/namespaces/{namespace}/generic-tables/`.

The following curl command lists all tables under the namespace `delta_ns`:
```shell
curl -X GET http://localhost:8181/api/catalog/polaris/v1/delta_catalog/namespaces/delta_ns/generic-tables/
```
Example Response:
```json
{
  "identifiers": [
    {
      "namespace": ["delta_ns"],
      "name": "delta_table"
    }
  ],
  "next-page-token": null
}
```

### Drop a Generic Table
The REST endpoint for dropping a generic table is `DELETE /polaris/v1/{prefix}/namespaces/{namespace}/generic-tables/{generic-table}`.

The following curl call drops the table `delta_table`:
```shell
curl -X DELETE http://localhost:8181/api/catalog/polaris/v1/delta_catalog/namespaces/delta_ns/generic-tables/delta_table
```
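
Because there is no update endpoint, changing a table's definition means dropping the table and creating it again. A minimal sketch of that sequence follows. `POLARIS_URL`, `generic_tables_url`, and `recreate_generic_table` are hypothetical names, and authentication is omitted as in the examples above. The two calls are not atomic, so the table is briefly absent between them.

```shell
# Assumed base URL, taken from the examples above; override for your deployment.
POLARIS_URL="${POLARIS_URL:-http://localhost:8181/api/catalog}"

# Hypothetical helper: build the generic-tables collection URL.
# $1 = catalog (the {prefix} path segment), $2 = namespace.
generic_tables_url() {
  printf '%s/polaris/v1/%s/namespaces/%s/generic-tables' "$POLARIS_URL" "$1" "$2"
}

# Hypothetical helper: "update" a table by dropping and re-creating it.
# $1 = catalog, $2 = namespace, $3 = table name, $4 = JSON body of the new definition.
# The create call runs only if the drop succeeds (-f makes curl fail on HTTP errors).
recreate_generic_table() {
  url="$(generic_tables_url "$1" "$2")"
  curl -sf -X DELETE "$url/$3" &&
    curl -sf -X POST "$url" -H "Content-Type: application/json" -d "$4"
}

# Example (not run here): move delta_table to a new base location.
# recreate_generic_table delta_catalog delta_ns delta_table \
#   '{"name": "delta_table", "format": "delta", "base-location": "s3://<my-bucket>/new/path"}'
```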

### API Reference

For the complete and up-to-date API specification, see the [Catalog API Spec](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/polaris/refs/heads/main/spec/generated/bundled-polaris-catalog-service.yaml).

## Limitations

Current limitations of Generic Table support:
1) Limited spec information. Currently, there is no spec for information such as schema or partitioning.
2) No commit coordination or update capability is provided at the catalog service level.
Member

To me this is a very serious issue. No way to coordinate changes means there will be consistency issues.

Member

Not all formats even have a way to do transactional commits. The basic premise here is to behave like the Spark Catalog with HMS (or Unity) which have these same guarantees for any sources.

For example registering a Cassandra would work but there is nothing in the Polaris world that would (or could) manage commits for a C* source.

Another example would be Delta Lake, which only can optionally (in 4.0) use a Catalog based commit coordinator and usually does not even consider the catalog when making commits.

Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.

Member

> Not all formats even have a way to do transactional commits.

Delta has, no?

> Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.

How is a consistent view on metadata ensured?

Member

I mean, this blog post mentions: "A data catalog serves as the central registry for a table’s metadata. It manages transactions and table state, as well as access controls and read/write interoperability."

Member

@RussellSpitzer RussellSpitzer Jun 16, 2025

Because the metadata only exists in polaris. Only this set of properties.

> > Not all formats even have a way to do transactional commits.
>
> Delta has, no?

Delta does without using the catalog, and has an optional "commit coordinator" which uses another api which is not provided here. So like users of HMS for a delta table they would need to use a third party commit coordinator if they wanted to use the optional "commit coordinator"

> > Polaris is only guaranteeing a consistent view of the metadata about the entity, not any guarantees about the underlying data.
>
> How is a consistent view on metadata ensured?

I think you may want to check out the original design docs here. The "metadata" we are talking about here is referring to what the user puts in their Create statement that talks to Polaris. That is the only thing Polaris knows about the table and is the part that will not change. Again, this is similar to how the HMS works with Spark or how Iceberg originally worked with the HMS Catalog implementation. The Catalog is essentially just holding a bag of properties we will maintain. Changes to these properties (not yet allowed) would be atomic but they are essentially disconnected from the underlying format.

Contributor

@eric-maynard eric-maynard Jun 16, 2025

> So if basically only the table name is there, what's a user's benefit for it?

I'm confused -- are we cross-examining the feature here or documenting it?

The benefit is as is documented here; you can use Delta and other non-Iceberg tables in Spark using the Spark connector. The doc walks you through how that works. If that benefit is unclear in the doc, let's fix that.

Contributor Author

> So if basically only the table name is there, what's a user's benefit for it?

As @RussellSpitzer mentioned, the milestone Polaris accomplishes today is to enable Polaris as a centralized catalog service for the Spark Catalog. Furthermore, for Delta, the state is inside the Delta log; as long as the client is able to load the Delta log, "base-location" is the only information needed to enable access to the Delta table. I can try to make it clearer in the doc.

Member

What I really do not like is implying promises (A data catalog serves as the central registry for a table’s metadata. It manages transactions and table state, as well as access controls and read/write interoperability.) which just do not hold true: there is no way to manage transactions and table state, control access, etc.

WRT to what's in Polaris - it's incomplete (I suspect there's no doubt there).

Member

I don't actually think it's incomplete in that context? I wouldn't imagine we would ever support those things for this endpoint and I don't think it would be a surprise to any user of Spark who uses this?

Contributor

I agree with @snazy 's point that the blog post (link above) is not really aligned with the Generic Tables API as far as Catalog as a "central registry" for table's metadata is concerned. The Generic Table API actually diminishes the role of the catalog as metadata registry by delegating most of the metadata loading to the client (even location is optional in the API).

That said, as far as this PR is concerned, I believe it should be sufficient to describe actual behaviour of Polaris in this respect. There is certainly room for improvements in terms of clarity and precision in this doc, but I think the current state of this PR is probably acceptable for 1.0.


Therefore, the catalog itself is unaware of anything about the underlying table except some loosely defined metadata.
It is the responsibility of the engine (and plugins used by the engine) to determine exactly how loading or committing data
should work based on the metadata. For example, with the Delta support, Delta log serialization, deserialization,
and updates all happen on the client side.