
Conversation

@hesham-medhat

At Cloud NEXT '24, Google Cloud announced BigQuery Metastore (https://youtu.be/LIMnhzJWmLQ?si=btAtXC7jNveswZfH), Google Cloud's serverless Unified Metastore.
BigQuery Metastore provides a single, shared metastore for the lakehouse, enabling data sharing across engines (e.g., Spark and BigQuery). This eliminates the need to maintain separate metadata stores for your open source workloads.

Users increasingly want to run multiple analytics and AI use cases on a single copy of their data. However, the fragmented nature of today's analytics and AI systems makes this challenging: data processed by multiple engines forces customers to create multiple representations of it across engine-specific table interfaces. BigLake from Google Cloud helps unify the data and data sharing across engines; however, the metadata remains fragmented. The current workaround is to copy and sync metadata using bespoke tools, which leads to sync latencies and an overall confusing user experience.

With BigQuery Metastore, you can store and manage the metadata of your open-source tables, databases, and namespaces processed by multiple engines (Spark, BigQuery) in one place. This eliminates the need to maintain separate metadata stores, giving you engine interoperability, a lower total cost of ownership, and a unified experience.

This pull request contributes the Iceberg integration for BigQuery Metastore; users have already had the chance to try this implementation out in preview as a standalone Iceberg catalog plugin.

implementation project(':iceberg-common')
implementation project(':iceberg-core')

implementation("com.google.apis:google-api-services-bigquery:v2-rev20240602-2.0.0")
Contributor

Should this be in a different module? The GCP support is largely for storage right now and I'm not sure that we would want to add unexpected components. Maybe we should have an iceberg-bigquery module instead?

Author

Hmm, we could do that, but it would break the existing pattern: the :iceberg-hive-metastore project holds everything Hive-related, both storage and metadata.
This project is also named :iceberg-gcp, and BigQuery is part of GCP. We could have two projects, iceberg-bigquery and iceberg-gcs, if you want, but I think it's honestly fine as-is; this is just an API client library, and I think it belongs here, since the BigQuery code uses a fair amount of these dependencies anyway.

Contributor

I definitely prefer iceberg-bigquery. One of the mistakes we made with AWS was having a single giant module that ended up with many more dependencies than necessary for most people.

That also separates this out so that we don't inadvertently add all of the transitive dependencies from this new dependency into our runtime Jars without validating licenses.
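To make that concrete, here is a sketch of what the split might look like in the root build.gradle, following the existing per-module pattern (the module name comes from this discussion; the exact dependency set is an assumption):

project(':iceberg-bigquery') {
  dependencies {
    implementation project(':iceberg-common')
    implementation project(':iceberg-core')

    // Keep the BigQuery API client and its transitive dependencies scoped
    // to this module so they stay out of the other runtime jars.
    implementation("com.google.apis:google-api-services-bigquery:v2-rev20240602-2.0.0")
  }
}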

Author

Sure, done. I kept the iceberg-gcp organization as-is for now, but let me know if you want to split things differently or rename the existing iceberg-gcp project and its bundle.

.gitignore Outdated
derby.log

# BigQuery/metastore files
gcp/db_folder/
Contributor

What is this? Can it be in a build or other temporary folder managed by gradle instead?

Author

Yep, done. It's generated by unit tests, similar to the spark-warehouse/ entry above.

@brunsgaard

Hi @hesham-medhat ,
I assume this PR will also enable read/write support for Flink pipelines that use BigLake external tables with BigQuery?

}
return convertExceptionIfUnsuccessful(response).parseAs(Table.class);
} catch (IOException e) {
throw new RuntimeIOException(e);
Contributor

Why not add context here, like what the request was?

Author

No need, since this method just does general error mapping to Iceberg exceptions.
Resource-specific mapping is done in method bodies, where needed, before convertExceptionIfUnsuccessful gets called. For example, getTable() catches a 404 and throws NoSuchTableException, while the 404 in getDataset() is mapped differently, to NoSuchNamespaceException.
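To illustrate the two-level mapping, here is a minimal sketch assembled from the snippets quoted in this PR (the exact method body is assumed, not copied; bigquery is the underlying com.google.api.services.bigquery.Bigquery service):

public Table getTable(TableReference tableReference) {
  try {
    HttpResponse response =
        bigquery
            .tables()
            .get(
                tableReference.getProjectId(),
                tableReference.getDatasetId(),
                tableReference.getTableId())
            .executeUnparsed();
    if (response.getStatusCode() == HttpStatusCodes.STATUS_CODE_NOT_FOUND) {
      // Resource-specific mapping: a 404 on a table request means the table is missing.
      throw new NoSuchTableException("Table does not exist: %s", tableReference.getTableId());
    }

    // Everything else falls through to the generic status-code mapping.
    return convertExceptionIfUnsuccessful(response).parseAs(Table.class);
  } catch (IOException e) {
    throw new RuntimeIOException(e);
  }
}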

String.format(
"%s\n%s",
response.getStatusMessage(),
response.getContent() != null
Contributor

Why does this include the content in the error message? Is it relevant?

Author

Yes. It shows the full error message, unlike statusMessage, which unfortunately may give only a word or two, like "Unauthorized" or "Not found", without saying, for example, what was not found.
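For reference, a hedged completion of the truncated snippet above (the parseAsString() fallback is an assumption, not the PR's exact code):

// Inside a method that declares throws IOException:
String errorMessage =
    String.format(
        "%s\n%s",
        response.getStatusMessage(),
        response.getContent() != null ? response.parseAsString() : "");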

private HttpResponse convertExceptionIfUnsuccessful(HttpResponse response) throws IOException {
if (response.isSuccessStatusCode()) {
return response;
}
Contributor

Style: Iceberg adds newlines between control flow blocks and the following statements.

Author

Done everywhere.

@hesham-medhat
Author

Hi @hesham-medhat , I assume this PR will also enable read/write support for Flink pipelines that use BigLake external tables with BigQuery?

Hi @brunsgaard! No. This PR is not about BigLake tables or their metastore, but rather the newly announced BigQuery Metastore. There was previously a PR to merge BigLake Metastore, but we steered away from that direction in favor of this new, larger-scoped project.

@github-actions github-actions bot added the INFRA label Sep 12, 2024
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 12, 2024
@z-kovacs

@hesham-medhat @rdblue - could you please give an update on this PR? It seems it would massively simplify Iceberg table management in GCP for non-Spark use cases.

Thanks!

@github-actions github-actions bot removed the stale label Nov 14, 2024
@brunsgaard

brunsgaard commented Nov 14, 2024

@hesham-medhat @rdblue, like @z-kovacs I would also really appreciate an update on this if possible <3

@k-alkiek

This would be a great addition. Love to see the collaboration with GCP. @rdblue, can we get it across the finish line?

@hesham-medhat
Author

Thank you all for your enthusiasm! This is close; it's pending @rdblue's final pass and approval. A little while ago he told me he has been busy but will nevertheless get to it as soon as he can, so he may be on it already or will be soon.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

* resource-specific exceptions like NoSuchTableException, NoSuchNamespaceException, etc.
*/
@SuppressWarnings("FormatStringAnnotation")
private HttpResponse convertExceptionIfUnsuccessful(HttpResponse response) throws IOException {
Contributor

This doesn't convert, it handles/throws the exception.

.get(datasetReference.getProjectId(), datasetReference.getDatasetId())
.executeUnparsed();
if (response.getStatusCode() == HttpStatusCodes.STATUS_CODE_NOT_FOUND) {
throw new NoSuchNamespaceException(response.getStatusMessage());
Contributor

What is the status message? Is it the table name? What about the "full message" that is in the message content? Seems like handling the response should be the same across all places that throw exceptions.

public BigQueryClientImpl() throws IOException, GeneralSecurityException {
// Initialize client that will be used to send requests. This client only needs to be created
// once, and can be reused for multiple requests
HttpCredentialsAdapter httpCredentialsAdapter =
Contributor

Should this use a client pool?
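For illustration, here is one way the expensive transport and credential setup could be built once and shared rather than per catalog instance; field and method names are assumptions, and Iceberg's Hive catalog solves a similar problem with a ClientPoolImpl base class that could be extended instead:

private static volatile Bigquery sharedClient;

private static Bigquery client() throws IOException, GeneralSecurityException {
  if (sharedClient == null) {
    synchronized (BigQueryClientImpl.class) {
      if (sharedClient == null) {
        // Per the PR's own comment, the client only needs to be created
        // once and can be reused for multiple requests.
        sharedClient =
            new Bigquery.Builder(
                    GoogleNetHttpTransport.newTrustedTransport(),
                    GsonFactory.getDefaultInstance(),
                    new HttpCredentialsAdapter(GoogleCredentials.getApplicationDefault()))
                .setApplicationName("iceberg-bigquery")
                .build();
      }
    }
  }

  return sharedClient;
}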

httpCredentialsAdapter.initialize(httpRequest);
httpRequest.setThrowExceptionOnExecuteError(
false); // Less catching of exceptions, more analysis of the same HttpResponse
// object, inspecting its status code
Contributor

Looks like this comment was auto-wrapped. Can you fix it?

* A client of Google BigQuery Metastore functions over the BigQuery service. Uses the Google
* BigQuery API.
*/
public interface BigQueryClient {
Contributor

Why is there an interface for BigQueryClient? My assumption was that there would be a test implementation, but BigQueryClientImplTest uses a BigQueryClientImpl and injects a mock Bigquery client via the package-private constructor.

Contributor

Looks like the catalog tests use a mock BigQueryClient, while the tests for BigQueryClient use a mock Bigquery. Is this necessary? It seems to me that this sets up an unnecessary test surface. For instance, to delete a table, the BigQueryClientImplTest validates that client.deleteTable calls a mocked Delete.executeUnparsed method once. The dropTable test then just tests for side-effects (directory is present, directory is not present).

I don't think that this structure is necessary because the drop table test could validate that the underlying Delete.executeUnparsed is called. This interface seems unnecessary.

After looking at the test structure here, I think the bigger problem is that these tests don't really ensure behavior and will break if the implementation changes. Using the dropTable test as an example again, it first loads the table and then calls the client to delete it. An arguably better implementation would simply call deleteTable and catch the NoSuchTableException and return false. If you made that change, the mock tests would break. Another problem is that deleteTable also loads the table by calling getTable, which is hidden by the structure of the tests.

I think that this should be tested like other implementations and needs to use CatalogTests. That will test the behavior. You could potentially do that using a Bigquery mock, but I think the way mocks are used in this PR is very specific to the implementation, so I would prefer just having a fake in-memory implementation of the Bigquery interface instead.

I would keep in mind that the goal of these tests isn't to test that dropTable is translated into getTable followed by deleteTable. If tests are validating a list of forwarded calls (at two different levels) then you aren't ensuring that the behavior of the actual catalog is correct.
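To make the CatalogTests suggestion concrete, here is a hedged sketch; CatalogTests and its hooks come from iceberg-core's test artifacts, while the fake client and the injection constructor are hypothetical:

import org.apache.iceberg.catalog.CatalogTests;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
import org.junit.jupiter.api.BeforeEach;

public class TestBigQueryMetastoreCatalog extends CatalogTests<BigQueryMetastoreCatalog> {

  private BigQueryMetastoreCatalog catalog;

  @BeforeEach
  public void before() {
    // FakeBigQueryClient: a hypothetical in-memory implementation of the
    // Bigquery surface, instead of call-forwarding mocks, so the suite
    // exercises real catalog behavior end to end.
    catalog = new BigQueryMetastoreCatalog(new FakeBigQueryClient());
    catalog.initialize("bqms", ImmutableMap.of());
  }

  @Override
  protected BigQueryMetastoreCatalog catalog() {
    return catalog;
  }

  @Override
  protected boolean requiresNamespaceCreate() {
    // Assumption: BigQuery datasets must exist before tables can be created.
    return true;
  }
}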

projectId,
// Sometimes extensions have the namespace contain the table name too, so we are forced to
// allow invalid namespace and just take the first part here like other catalog
// implementations do.
Contributor

Can you point to other cases of this? It seems incorrect to me. If the namespace contains the table name after being passed in here, it is not correct.

this.conf = conf;
}

private String getDefaultStorageLocationUri(String dbId) {
Contributor

Iceberg does not use get in method names. It is almost always better to replace it with a more specific verb (like fetch, load, generate, etc.) or omit it because it doesn't provide value.

Also, this isn't a URI?

}

private static Namespace getNamespace(Datasets datasets) {
return Namespace.of(datasets.getDatasetReference().getDatasetId());
Contributor

Why is datasets plural?

return new DatasetReference().setProjectId(projectId).setDatasetId(namespace.level(0));
}

private TableReference toBqTableReference(TableIdentifier tableIdentifier) {
Contributor

Is Bq needed? I think it is clear that TableReference is not a TableIdentifier.

.setTableId(tableIdentifier.name());
}

private static Map<String, String> getMetadata(Dataset dataset) {
Contributor

toMetadata?

Preconditions.checkArgument(namespace.levels().length == 1, invalidNamespaceMessage(namespace));
}

private static String invalidNamespaceMessage(Namespace namespace) {
Contributor

Please don't use separate methods to build error messages. This could easily be inlined.
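For instance (message wording assumed):

Preconditions.checkArgument(
    namespace.levels().length == 1,
    "Invalid namespace (expected a single level): %s",
    namespace);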

@Override
public boolean setProperties(Namespace namespace, Map<String, String> properties) {
client.setDatasetParameters(toDatasetReference(namespace), properties);
return true;
Contributor

These should return true if the properties are changed. Does this detect when the operation is a noop?
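A hedged sketch of what no-op detection could look like, assuming a hypothetical getDatasetParameters read API on the client:

@Override
public boolean setProperties(Namespace namespace, Map<String, String> properties) {
  DatasetReference ref = toDatasetReference(namespace);

  // Hypothetical: read the current parameters first so that a no-op
  // update can return false instead of unconditionally returning true.
  Map<String, String> current = client.getDatasetParameters(ref);
  if (current.entrySet().containsAll(properties.entrySet())) {
    return false;
  }

  client.setDatasetParameters(ref, properties);
  return true;
}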

* behavior.
* We can support database or catalog level config controlling file deletion in the future.
*/
return true;
Contributor

This should return true if the namespace was dropped and false otherwise. It should not throw NoSuchNamespaceException.

// BQMS does not support namespaces under database or tables, returns empty.
// It is called when dropping a namespace to make sure it's empty (listTables is called as
// well), returns empty to unblock deletion.
return ImmutableList.of();
Contributor

Why does this not validate that the namespace name has length 1 for this case? Any other namespace should not exist.
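A sketch of the suggested validation (helper name and message wording assumed):

@Override
public List<Namespace> listNamespaces(Namespace namespace) {
  if (namespace.isEmpty()) {
    return listTopLevelNamespaces(); // hypothetical: one namespace per dataset
  }

  // A single-level namespace may exist but has no children; anything
  // deeper cannot exist in BigQuery Metastore, so reject it explicitly.
  if (namespace.levels().length > 1) {
    throw new NoSuchNamespaceException("Namespace does not exist: %s", namespace);
  }

  return ImmutableList.of();
}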

}

// TODO(b/354981675): Enable once supported by the API.
throw new ServiceFailureException(
Contributor

UnsupportedOperationException?


implementation("com.google.apis:google-api-services-bigquery:v2-rev20240602-2.0.0")

compileOnly('org.apache.hive:hive-metastore:4.0.0') {
Contributor

Is this used?


testImplementation project(path: ':iceberg-core', configuration: 'testArtifacts')

testImplementation 'org.apache.hadoop:hadoop-common:3.4.0'
Contributor

What is used from hadoop-common?

}
}

Contributor

Unnecessary whitespace change.

@github-actions

github-actions bot commented Mar 8, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 8, 2025
@hesham-medhat
Author

Thanks for finding the time, Ryan. Unironically, I am now on a long medical leave, but Google is committed to this, and we will find the time to finalize this review soon.

@github-actions github-actions bot removed the stale label Mar 9, 2025
@hangc0276
Contributor

Thanks for finding the time, Ryan. Unironically, I am now on a long medical leave, but Google is committed to this, and we will find the time to finalize this review soon.

Hi @hesham-medhat, I tested the GCP BigQuery Metastore integration with your PR, and the server side responds with the following exception when creating an Iceberg table.

org.apache.iceberg.exceptions.BadRequestException: Bad Request
{
  "error": {
    "code": 400,
    "message": "Error while reading table: test_v7, error message: Expected integer for current-snapshot-id. Found null instead File: bigstore/xxx/test_v7/metadata/00000-1599da9e-8dd7-4ad6-8eb1-68b461092d5f.metadata.json",
    "errors": [
      {
        "message": "Error while reading table: test_v7, error message: Expected integer for current-snapshot-id. Found null instead File: bigstore/xxx/test_v7/metadata/00000-1599da9e-8dd7-4ad6-8eb1-68b461092d5f.metadata.json",
        "domain": "global",
        "reason": "invalid"
      }
    ],
    "status": "INVALID_ARGUMENT"
  }
}

It shows that current-snapshot-id is null. Do you have any ideas?

@hesham-medhat
Author

Hi @hangc0276, I noticed that you got in touch with Google's support channels internally for this, so I trust you are in good hands with troubleshooting your issue.
I'd rather not turn this PR into a support channel, but for others as well: you do not have to build this on your own without instructions or documentation. Since this PR was delayed, we have this implementation prepackaged as a standalone plugin at https://cloud.google.com/bigquery/docs/bqms-use-dataproc#download-iceberg-plugin, and I suggest following that documentation for the time being for instructions on using this implementation and onboarding to BigQuery Metastore generally, while we finalize this PR.
Hope this helps anyone running across this.

@talatuyarer
Contributor

Thank you @hesham-medhat for this PR. I created a new PR to address the comments. Let's close this one and continue on #12808.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 17, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this May 25, 2025