GCP: Add Iceberg Catalog for GCP BigQuery Metastore #11039
Conversation
```gradle
implementation project(':iceberg-common')
implementation project(':iceberg-core')

implementation("com.google.apis:google-api-services-bigquery:v2-rev20240602-2.0.0")
```
Should this be in a different module? The GCP support is largely for storage right now and I'm not sure that we would want to add unexpected components. Maybe we should have an iceberg-bigquery module instead?
Hmm.. we could do that, but it would break the existing pattern: we have the project :iceberg-hive-metastore with everything Hive-related in there, both storage and metadata.
This project is also named :iceberg-gcp, and BigQuery is part of GCP. So we could have two projects, iceberg-bigquery and iceberg-gcs, if you want, but I honestly think it's fine as-is: this is just an API client library, and it belongs here since the BigQuery code uses a fair amount of these dependencies anyway.
I definitely prefer iceberg-bigquery. One of the mistakes we made with AWS was having a single giant module that ended up with many more dependencies than necessary for most people.
That also separates this out so that we don't inadvertently add all of the transitive dependencies from this new dependency into our runtime Jars without validating licenses.
Sure, done. I kept the iceberg-gcp organization as-is for now, but let me know if you want to split things differently or rename the existing iceberg-gcp project and its bundle.
.gitignore
```
derby.log

# BigQuery/metastore files
gcp/db_folder/
```
What is this? Can it be in a build or other temporary folder managed by gradle instead?
Yep, done. It's generated by unit tests, similar to spark-warehouse/ above.
force-pushed from 9d6aad5 to 3d88af9

Hi @hesham-medhat,
```java
  }
  return convertExceptionIfUnsuccessful(response).parseAs(Table.class);
} catch (IOException e) {
  throw new RuntimeIOException(e);
```
Why not add context here, like what the request was?
No need, since this method just does general error mapping to Iceberg exceptions.
Resource-specific mapping happens in the method bodies, where needed, before convertExceptionIfUnsuccessful is called. For example, getTable() catches a 404 and throws a NoSuchTableException, while the 404 in getDataset() is mapped differently, to a NoSuchNamespaceException.
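The pattern described here, where the same HTTP status is mapped to a different Iceberg exception depending on which call produced it, could be sketched like this (class and method names are invented for illustration and simplified to plain status codes; this is not the PR's actual code):

```java
// Sketch: context-specific 404 mapping happens in each method body,
// before any generic error handling runs.
class ErrorMapping {
  static class NoSuchTableException extends RuntimeException {
    NoSuchTableException(String msg) { super(msg); }
  }

  static class NoSuchNamespaceException extends RuntimeException {
    NoSuchNamespaceException(String msg) { super(msg); }
  }

  // A table lookup maps 404 to NoSuchTableException.
  static String getTable(int statusCode) {
    if (statusCode == 404) {
      throw new NoSuchTableException("table not found");
    }
    return "table";
  }

  // A dataset lookup maps the same 404 to NoSuchNamespaceException instead.
  static String getDataset(int statusCode) {
    if (statusCode == 404) {
      throw new NoSuchNamespaceException("namespace not found");
    }
    return "dataset";
  }
}
```

The generic handler then only needs to cover whatever statuses the resource-specific code did not already translate.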
```java
String.format(
    "%s\n%s",
    response.getStatusMessage(),
    response.getContent() != null
```
Why does this include the content in the error message? Is it relevant?
Yes, it shows the full error message. The statusMessage alone unfortunately gives only a short phrase like "Unauthorized" or "Not found", without saying, for example, what was not found.
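The composition being discussed, combining the short status phrase with the response body that carries the detailed description, could look roughly like this (a hedged sketch with invented names, not the PR's actual helper):

```java
// Sketch: build an error message from the HTTP status phrase plus the
// response body, falling back to the phrase alone when there is no body.
class ErrorMessages {
  static String errorMessage(String statusMessage, String content) {
    return content != null
        ? String.format("%s%n%s", statusMessage, content)
        : statusMessage;
  }
}
```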
```java
private HttpResponse convertExceptionIfUnsuccessful(HttpResponse response) throws IOException {
  if (response.isSuccessStatusCode()) {
    return response;
  }
```
Style: Iceberg adds newlines between control flow blocks and the following statements.
Done everywhere.
Hi @brunsgaard! No, this one is not about BigLake tables or their metastore, but rather the newly announced BigQuery Metastore. There was previously a PR for merging BigLake Metastore, but we steered away from that direction in favor of this new, larger-scoped project.
force-pushed from af99cc5 to 196975a
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@hesham-medhat @rdblue - could you please give an update on this PR? It seems it would massively simplify Iceberg table management in GCP for non-Spark use cases. Thanks!

@hesham-medhat @rdblue, like @z-kovacs I would also really appreciate an update on this if possible <3

This would be a great addition. Love to see the collaboration with GCP. @rdblue, can we get it across the finish line?

Thank you all for your enthusiasm! This is close; it's pending @rdblue's final pass and approval. A little while ago he told me he has been busy but will get to it as soon as he can, so he could be on it soon.
```java
 * resource-specific exceptions like NoSuchTableException, NoSuchNamespaceException, etc.
 */
@SuppressWarnings("FormatStringAnnotation")
private HttpResponse convertExceptionIfUnsuccessful(HttpResponse response) throws IOException {
```
This doesn't convert; it handles and throws the exception.
```java
    .get(datasetReference.getProjectId(), datasetReference.getDatasetId())
    .executeUnparsed();
if (response.getStatusCode() == HttpStatusCodes.STATUS_CODE_NOT_FOUND) {
  throw new NoSuchNamespaceException(response.getStatusMessage());
```
What is the status message? Is it the table name? What about the "full message" that is in the message content? Seems like handling the response should be the same across all places that throw exceptions.
```java
public BigQueryClientImpl() throws IOException, GeneralSecurityException {
  // Initialize client that will be used to send requests. This client only needs to be created
  // once, and can be reused for multiple requests
  HttpCredentialsAdapter httpCredentialsAdapter =
```
Should this use a client pool?
```java
httpCredentialsAdapter.initialize(httpRequest);
httpRequest.setThrowExceptionOnExecuteError(
    false); // Less catching of exceptions, more analysis of the same HttpResponse
// object, inspecting its status code
```
Looks like this comment was auto-wrapped. Can you fix it?
```java
 * A client of Google BigQuery Metastore functions over the BigQuery service. Uses the Google
 * BigQuery API.
 */
public interface BigQueryClient {
```
Why is there an interface for BigQueryClient? My assumption was that there would be a test implementation, but BigQueryClientImplTest uses a BigQueryClientImpl and injects a mock Bigquery client via the package-private constructor.
Looks like the catalog tests use a mock BigQueryClient, while the tests for BigQueryClient use a mock Bigquery. Is this necessary? It seems to me that this sets up an unnecessary test surface. For instance, to delete a table, the BigQueryClientImplTest validates that client.deleteTable calls a mocked Delete.executeUnparsed method once. The dropTable test then just tests for side-effects (directory is present, directory is not present).
I don't think that this structure is necessary because the drop table test could validate that the underlying Delete.executeUnparsed is called. This interface seems unnecessary.
After looking at the test structure here, I think the bigger problem is that these tests don't really ensure behavior and will break if the implementation changes. Using the dropTable test as an example again, it first loads the table and then calls the client to delete it. An arguably better implementation would simply call deleteTable and catch the NoSuchTableException and return false. If you made that change, the mock tests would break. Another problem is that deleteTable also loads the table by calling getTable, which is hidden by the structure of the tests.
I think that this should be tested like other implementations and needs to use CatalogTests. That will test the behavior. You could potentially do that using a Bigquery mock, but I think the way mocks are used in this PR is very specific to the implementation, so I would prefer just having a fake in-memory implementation of the Bigquery interface instead.
I would keep in mind that the goal of these tests isn't to test that dropTable is translated into getTable followed by deleteTable. If tests are validating a list of forwarded calls (at two different levels) then you aren't ensuring that the behavior of the actual catalog is correct.
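The fake-over-mock distinction the reviewer is drawing can be illustrated with a toy example (all names invented; this is not Iceberg or BigQuery code): a behavior test asserts on observable state and return values, so it keeps passing even if the implementation stops loading the table first.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a tiny in-memory "fake" that tests can exercise real behavior
// against, instead of asserting on a list of forwarded mock calls.
class InMemoryTableStore {
  private final Map<String, String> tables = new HashMap<>();

  void createTable(String name, String metadata) {
    tables.put(name, metadata);
  }

  // Behavior under test: dropping a missing table returns false, no throw.
  boolean dropTable(String name) {
    return tables.remove(name) != null;
  }

  boolean tableExists(String name) {
    return tables.containsKey(name);
  }
}
```

A test against this fake checks only that the table existed, was dropped, and that a second drop reports false; it never encodes which internal calls the implementation happened to make.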
```java
projectId,
// Sometimes extensions have the namespace contain the table name too, so we are forced to
// allow invalid namespace and just take the first part here like other catalog
// implementations do.
```
Can you point to other cases of this? It seems incorrect to me. If the namespace contains the table name after being passed in here, it is not correct.
```java
  this.conf = conf;
}

private String getDefaultStorageLocationUri(String dbId) {
```
Iceberg does not use get in method names. It is almost always better to replace it with a more specific verb (like fetch, load, generate, etc.) or omit it because it doesn't provide value.
Also, this isn't a URI?
```java
}

private static Namespace getNamespace(Datasets datasets) {
  return Namespace.of(datasets.getDatasetReference().getDatasetId());
```
Why is datasets plural?
```java
  return new DatasetReference().setProjectId(projectId).setDatasetId(namespace.level(0));
}

private TableReference toBqTableReference(TableIdentifier tableIdentifier) {
```
Is Bq needed? I think it is clear that TableReference is not a TableIdentifier.
```java
      .setTableId(tableIdentifier.name());
}

private static Map<String, String> getMetadata(Dataset dataset) {
```
toMetadata?
```java
  Preconditions.checkArgument(namespace.levels().length == 1, invalidNamespaceMessage(namespace));
}

private static String invalidNamespaceMessage(Namespace namespace) {
```
Please don't use separate methods to build error messages. This could easily be inlined.
```java
@Override
public boolean setProperties(Namespace namespace, Map<String, String> properties) {
  client.setDatasetParameters(toDatasetReference(namespace), properties);
  return true;
```
These should return true if the properties are changed. Does this detect when the operation is a noop?
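One way to detect a no-op is to compare the property state before and after applying the update. A minimal sketch under that assumption (all names invented, detached from the actual BigQuery dataset model):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: setProperties reports true only when the update changed something.
class PropertyStore {
  private final Map<String, String> properties = new HashMap<>();

  boolean setProperties(Map<String, String> updates) {
    Map<String, String> before = new HashMap<>(properties);
    properties.putAll(updates);
    // A no-op update leaves the map unchanged, so report false.
    return !properties.equals(before);
  }
}
```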
```java
 * behavior.
 * We can support database or catalog level config controlling file deletion in the future.
 */
return true;
```
This should return true if the namespace was dropped and false otherwise. It should not throw NoSuchNamespaceException.
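The contract the reviewer describes, returning false for a missing namespace instead of propagating the exception, could be satisfied like this (a sketch with invented names, not the PR's implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: dropNamespace swallows the "not found" case and reports false,
// returning true only when a namespace was actually removed.
class NamespaceStore {
  static class NoSuchNamespaceException extends RuntimeException {}

  private final Set<String> namespaces = new HashSet<>();

  void create(String name) {
    namespaces.add(name);
  }

  // Stand-in for the underlying delete call that throws when missing.
  private void deleteOrThrow(String name) {
    if (!namespaces.remove(name)) {
      throw new NoSuchNamespaceException();
    }
  }

  boolean dropNamespace(String name) {
    try {
      deleteOrThrow(name);
      return true;
    } catch (NoSuchNamespaceException e) {
      return false;
    }
  }
}
```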
```java
// BQMS does not support namespaces under database or tables, returns empty.
// It is called when dropping a namespace to make sure it's empty (listTables is called as
// well), returns empty to unblock deletion.
return ImmutableList.of();
```
Why does this not validate that the namespace name has length 1 for this case? Any other namespace should not exist.
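The depth check the reviewer is asking for could look roughly like this (a simplified, hypothetical sketch where a namespace is just its array of levels and datasets are a set of names):

```java
import java.util.Set;

// Sketch: only single-level namespaces can exist in this metastore,
// so anything deeper is rejected before consulting the dataset list.
class NamespaceValidation {
  static boolean namespaceExists(String[] levels, Set<String> datasets) {
    if (levels.length != 1) {
      return false;
    }
    return datasets.contains(levels[0]);
  }
}
```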
```java
}

// TODO(b/354981675): Enable once supported by the API.
throw new ServiceFailureException(
```
UnsupportedOperationException?
```gradle
implementation("com.google.apis:google-api-services-bigquery:v2-rev20240602-2.0.0")

compileOnly('org.apache.hive:hive-metastore:4.0.0') {
```
Is this used?
```gradle
testImplementation project(path: ':iceberg-core', configuration: 'testArtifacts')

testImplementation 'org.apache.hadoop:hadoop-common:3.4.0'
```
What is used from hadoop-common?
```gradle
  }
}
```
Unnecessary whitespace change.
Thanks for finding the time, Ryan. As it happens, I am now on a long medical leave, but Google is committed to this, and we will find the time to finalize this review soon.

Hi @hesham-medhat, I tested the GCP BigQuery Metastore integration with your PR, and the server side responded with the following exception when creating an Iceberg table. It shows the
Hi @hangc0276, I noticed that internally you got in touch with Google's support channels for that, so I trust you are in good hands troubleshooting your issue.
Thank you @hesham-medhat for this PR. I created a new PR to address the comments. Let's close this one and continue on #12808.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
In Cloud NEXT ‘24, Google Cloud announced BigQuery Metastore (https://youtu.be/LIMnhzJWmLQ?si=btAtXC7jNveswZfH), Google Cloud’s serverless Unified Metastore.
BigQuery Metastore provides a single, shared metastore for the lakehouse enabling data sharing across engines (ex: Spark and BigQuery). This eliminates the need to maintain separate metadata stores for your open source workloads.
Users increasingly want to run multiple analytics and AI use cases on a single copy of their data. However, the fragmented nature of today's analytics and AI systems makes this challenging. Data processed by multiple engines means that customers have to create multiple representations of it across engine-specific table interfaces. BigLake from Google Cloud helps unify data and data sharing across engines; however, the metadata remains fragmented. The current workaround is to copy and sync metadata using bespoke tools, which leads to sync latencies and an overall confusing user experience.
With BigQuery Metastore, you can store and manage the metadata of your open-source tables, databases and namespaces processed by multiple engines (Spark, BigQuery) in one place. This eliminates the need to maintain separate metadata stores thereby giving you engine interoperability, reduction in total cost of ownership and a unified experience.
This pull request merges the Iceberg catalog implementation for BigQuery Metastore, which users have already had the chance to try out in preview as a standalone Iceberg Catalog plugin.