GCP: Add Iceberg Catalog for GCP BigQuery Metastore #11039
Conversation
```gradle
implementation project(':iceberg-common')
implementation project(':iceberg-core')

implementation("com.google.apis:google-api-services-bigquery:v2-rev20240602-2.0.0")
```
Should this be in a different module? The GCP support is largely for storage right now and I'm not sure that we would want to add unexpected components. Maybe we should have an iceberg-bigquery module instead?
Hmm.. we could do that, but it would break the existing pattern: we have the project :iceberg-hive-metastore with everything Hive-related in there, both storage and metadata.
This project is also named :iceberg-gcp, and BigQuery is part of GCP. So we could have two projects, iceberg-bigquery and iceberg-gcs, if you want, but I honestly think it's fine as-is: this is just an API client library, and it belongs here since the BigQuery code uses a fair amount of these dependencies anyway.
I definitely prefer iceberg-bigquery. One of the mistakes we made with AWS was having a single giant module that ended up with many more dependencies than necessary for most people.
That also separates this out so that we don't inadvertently add all of the transitive dependencies from this new dependency into our runtime Jars without validating licenses.
Sure, done. I kept the iceberg-gcp organization as-is for now, but let me know if you want to split things differently or rename the existing iceberg-gcp project and its bundle.
.gitignore
```
derby.log

# BigQuery/metastore files
gcp/db_folder/
```
What is this? Can it be in a build or other temporary folder managed by gradle instead?
Yep, done. It's generated by unit tests, similar to spark-warehouse/ above.
force-pushed from 9d6aad5 to 3d88af9

Hi @hesham-medhat,
```java
  }
  return convertExceptionIfUnsuccessful(response).parseAs(Table.class);
} catch (IOException e) {
  throw new RuntimeIOException(e);
```
Why not add context here, like what the request was?
No need, since this method just does general error mapping to Iceberg exceptions.
Resource-specific mapping happens in the method bodies, where needed, before convertExceptionIfUnsuccessful is called. For example, getTable() catches a 404 and throws a NoSuchTableException, while the 404 in getDataset() is mapped differently, to a NoSuchNamespaceException.
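The pattern described here, where the same HTTP status is mapped to a different Iceberg exception depending on which call produced it, could be sketched like this (class and method names are invented for illustration and simplified to plain status codes; this is not the PR's actual code):

```java
// Sketch: context-specific 404 mapping happens in each method body,
// before any generic error handling runs.
class ErrorMapping {
  static class NoSuchTableException extends RuntimeException {
    NoSuchTableException(String msg) { super(msg); }
  }

  static class NoSuchNamespaceException extends RuntimeException {
    NoSuchNamespaceException(String msg) { super(msg); }
  }

  // A table lookup maps 404 to NoSuchTableException.
  static String getTable(int statusCode) {
    if (statusCode == 404) {
      throw new NoSuchTableException("table not found");
    }
    return "table";
  }

  // A dataset lookup maps the same 404 to NoSuchNamespaceException instead.
  static String getDataset(int statusCode) {
    if (statusCode == 404) {
      throw new NoSuchNamespaceException("namespace not found");
    }
    return "dataset";
  }
}
```

The generic handler then only needs to cover whatever statuses the resource-specific code did not already translate.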
```java
String.format(
    "%s\n%s",
    response.getStatusMessage(),
    response.getContent() != null
```
Why does this include the content in the error message? Is it relevant?
Yes, it shows the full error message. The statusMessage alone unfortunately gives only a short phrase like "Unauthorized" or "Not found", without saying, for example, what was not found.
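The composition being discussed, combining the short status phrase with the response body that carries the detailed description, could look roughly like this (a hedged sketch with invented names, not the PR's actual helper):

```java
// Sketch: build an error message from the HTTP status phrase plus the
// response body, falling back to the phrase alone when there is no body.
class ErrorMessages {
  static String errorMessage(String statusMessage, String content) {
    return content != null
        ? String.format("%s%n%s", statusMessage, content)
        : statusMessage;
  }
}
```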
```java
private HttpResponse convertExceptionIfUnsuccessful(HttpResponse response) throws IOException {
  if (response.isSuccessStatusCode()) {
    return response;
  }
```
Style: Iceberg adds newlines between control flow blocks and the following statements.
Done everywhere.
Hi @brunsgaard! No, this one is not about BigLake tables or their metastore, but rather the newly announced BigQuery Metastore. There was previously a PR for merging BigLake Metastore, but we steered away from that direction in favor of this new, larger-scoped project.
force-pushed from af99cc5 to 196975a
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@hesham-medhat @rdblue - could you please give an update on this PR? It seems it would massively simplify Iceberg table management in GCP for non-Spark use cases. Thanks!

@hesham-medhat @rdblue, like @z-kovacs I would also really appreciate an update on this if possible <3

This would be a great addition. Love to see the collaboration with GCP. @rdblue, can we get it across the finish line?

Thank you all for your enthusiasm! This is close; it's pending @rdblue's final pass and approval. A little while ago he told me he has been busy but will get to it as soon as he can, so he could be on it soon.
```java
 * resource-specific exceptions like NoSuchTableException, NoSuchNamespaceException, etc.
 */
@SuppressWarnings("FormatStringAnnotation")
private HttpResponse convertExceptionIfUnsuccessful(HttpResponse response) throws IOException {
```
This doesn't convert; it handles and throws the exception.
```java
    .get(datasetReference.getProjectId(), datasetReference.getDatasetId())
    .executeUnparsed();
if (response.getStatusCode() == HttpStatusCodes.STATUS_CODE_NOT_FOUND) {
  throw new NoSuchNamespaceException(response.getStatusMessage());
```
What is the status message? Is it the table name? What about the "full message" that is in the message content? Seems like handling the response should be the same across all places that throw exceptions.
```java
public BigQueryClientImpl() throws IOException, GeneralSecurityException {
  // Initialize client that will be used to send requests. This client only needs to be created
  // once, and can be reused for multiple requests
  HttpCredentialsAdapter httpCredentialsAdapter =
```
Should this use a client pool?
```java
httpCredentialsAdapter.initialize(httpRequest);
httpRequest.setThrowExceptionOnExecuteError(
    false); // Less catching of exceptions, more analysis of the same HttpResponse
// object, inspecting its status code
```
Looks like this comment was auto-wrapped. Can you fix it?
```java
 * A client of Google BigQuery Metastore functions over the BigQuery service. Uses the Google
 * BigQuery API.
 */
public interface BigQueryClient {
```
Why is there an interface for BigQueryClient? My assumption was that there would be a test implementation, but BigQueryClientImplTest uses a BigQueryClientImpl and injects a mock Bigquery client via the package-private constructor.
Looks like the catalog tests use a mock BigQueryClient, while the tests for BigQueryClient use a mock Bigquery. Is this necessary? It seems to me that this sets up an unnecessary test surface. For instance, to delete a table, the BigQueryClientImplTest validates that client.deleteTable calls a mocked Delete.executeUnparsed method once. The dropTable test then just tests for side-effects (directory is present, directory is not present).
I don't think that this structure is necessary because the drop table test could validate that the underlying Delete.executeUnparsed is called. This interface seems unnecessary.
After looking at the test structure here, I think the bigger problem is that these tests don't really ensure behavior and will break if the implementation changes. Using the dropTable test as an example again, it first loads the table and then calls the client to delete it. An arguably better implementation would simply call deleteTable and catch the NoSuchTableException and return false. If you made that change, the mock tests would break. Another problem is that deleteTable also loads the table by calling getTable, which is hidden by the structure of the tests.
I think that this should be tested like other implementations and needs to use CatalogTests. That will test the behavior. You could potentially do that using a Bigquery mock, but I think the way mocks are used in this PR is very specific to the implementation, so I would prefer just having a fake in-memory implementation of the Bigquery interface instead.
I would keep in mind that the goal of these tests isn't to test that dropTable is translated into getTable followed by deleteTable. If tests are validating a list of forwarded calls (at two different levels) then you aren't ensuring that the behavior of the actual catalog is correct.
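The fake-over-mock distinction the reviewer is drawing can be illustrated with a toy example (all names invented; this is not Iceberg or BigQuery code): a behavior test asserts on observable state and return values, so it keeps passing even if the implementation stops loading the table first.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a tiny in-memory "fake" that tests can exercise real behavior
// against, instead of asserting on a list of forwarded mock calls.
class InMemoryTableStore {
  private final Map<String, String> tables = new HashMap<>();

  void createTable(String name, String metadata) {
    tables.put(name, metadata);
  }

  // Behavior under test: dropping a missing table returns false, no throw.
  boolean dropTable(String name) {
    return tables.remove(name) != null;
  }

  boolean tableExists(String name) {
    return tables.containsKey(name);
  }
}
```

A test against this fake checks only that the table existed, was dropped, and that a second drop reports false; it never encodes which internal calls the implementation happened to make.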
```java
projectId,
// Sometimes extensions have the namespace contain the table name too, so we are forced to
// allow invalid namespace and just take the first part here like other catalog
// implementations do.
```
Can you point to other cases of this? It seems incorrect to me. If the namespace contains the table name after being passed in here, it is not correct.
```java
  this.conf = conf;
}

private String getDefaultStorageLocationUri(String dbId) {
```
Iceberg does not use get in method names. It is almost always better to replace it with a more specific verb (like fetch, load, generate, etc.) or omit it because it doesn't provide value.
Also, this isn't a URI?
```java
}

private static Namespace getNamespace(Datasets datasets) {
  return Namespace.of(datasets.getDatasetReference().getDatasetId());
```
Why is datasets plural?
```java
  return new DatasetReference().setProjectId(projectId).setDatasetId(namespace.level(0));
}

private TableReference toBqTableReference(TableIdentifier tableIdentifier) {
```
Is Bq needed? I think it is clear that TableReference is not a TableIdentifier.
```java
      .setTableId(tableIdentifier.name());
}

private static Map<String, String> getMetadata(Dataset dataset) {
```
toMetadata?
```java
  Preconditions.checkArgument(namespace.levels().length == 1, invalidNamespaceMessage(namespace));
}

private static String invalidNamespaceMessage(Namespace namespace) {
```
Please don't use separate methods to build error messages. This could easily be inlined.
```java
@Override
public boolean setProperties(Namespace namespace, Map<String, String> properties) {
  client.setDatasetParameters(toDatasetReference(namespace), properties);
  return true;
```
These should return true if the properties are changed. Does this detect when the operation is a noop?
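One way to detect a no-op is to compare the property state before and after applying the update. A minimal sketch under that assumption (all names invented, detached from the actual BigQuery dataset model):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: setProperties reports true only when the update changed something.
class PropertyStore {
  private final Map<String, String> properties = new HashMap<>();

  boolean setProperties(Map<String, String> updates) {
    Map<String, String> before = new HashMap<>(properties);
    properties.putAll(updates);
    // A no-op update leaves the map unchanged, so report false.
    return !properties.equals(before);
  }
}
```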
```java
 * behavior.
 * We can support database or catalog level config controlling file deletion in the future.
 */
return true;
```
This should return true if the namespace was dropped and false otherwise. It should not throw NoSuchNamespaceException.
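The contract the reviewer describes, returning false for a missing namespace instead of propagating the exception, could be satisfied like this (a sketch with invented names, not the PR's implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: dropNamespace swallows the "not found" case and reports false,
// returning true only when a namespace was actually removed.
class NamespaceStore {
  static class NoSuchNamespaceException extends RuntimeException {}

  private final Set<String> namespaces = new HashSet<>();

  void create(String name) {
    namespaces.add(name);
  }

  // Stand-in for the underlying delete call that throws when missing.
  private void deleteOrThrow(String name) {
    if (!namespaces.remove(name)) {
      throw new NoSuchNamespaceException();
    }
  }

  boolean dropNamespace(String name) {
    try {
      deleteOrThrow(name);
      return true;
    } catch (NoSuchNamespaceException e) {
      return false;
    }
  }
}
```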
```java
// BQMS does not support namespaces under database or tables, returns empty.
// It is called when dropping a namespace to make sure it's empty (listTables is called as
// well), returns empty to unblock deletion.
return ImmutableList.of();
```
Why does this not validate that the namespace name has length 1 for this case? Any other namespace should not exist.
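The depth check the reviewer is asking for could look roughly like this (a simplified, hypothetical sketch where a namespace is just its array of levels and datasets are a set of names):

```java
import java.util.Set;

// Sketch: only single-level namespaces can exist in this metastore,
// so anything deeper is rejected before consulting the dataset list.
class NamespaceValidation {
  static boolean namespaceExists(String[] levels, Set<String> datasets) {
    if (levels.length != 1) {
      return false;
    }
    return datasets.contains(levels[0]);
  }
}
```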
```java
}

// TODO(b/354981675): Enable once supported by the API.
throw new ServiceFailureException(
```
UnsupportedOperationException?
```gradle
implementation("com.google.apis:google-api-services-bigquery:v2-rev20240602-2.0.0")

compileOnly('org.apache.hive:hive-metastore:4.0.0') {
```
Is this used?
```gradle
testImplementation project(path: ':iceberg-core', configuration: 'testArtifacts')

testImplementation 'org.apache.hadoop:hadoop-common:3.4.0'
```
What is used from hadoop-common?
```gradle
  }
}
```
Unnecessary whitespace change.
Thanks for finding the time, Ryan. As it happens, I am now on a long medical leave, but Google is committed to this, and we will find the time to finalize this review soon.

Hi @hesham-medhat, I tested the GCP BigQuery Metastore integration with your PR, and the server side responded with the following exception when creating an Iceberg table. It shows the
Hi @hangc0276, I noticed that internally you got in touch with Google's support channels for that, so I trust you are in good hands troubleshooting your issue.
Thank you @hesham-medhat for this PR. I created a new PR to address the comments. Let's close this one and continue on #12808.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
In Cloud NEXT ‘24, Google Cloud announced BigQuery Metastore (https://youtu.be/LIMnhzJWmLQ?si=btAtXC7jNveswZfH), Google Cloud’s serverless Unified Metastore.
BigQuery Metastore provides a single, shared metastore for the lakehouse enabling data sharing across engines (ex: Spark and BigQuery). This eliminates the need to maintain separate metadata stores for your open source workloads.
Users increasingly want to run multiple analytics and AI use cases on a single copy of their data. However, the fragmented nature of today's analytics and AI systems makes this challenging. Data processed by multiple engines means that customers have to create multiple representations of it across engine-specific table interfaces. BigLake from Google Cloud helps unify data and data sharing across engines; however, the metadata remains fragmented. The current workaround is to copy and sync metadata using bespoke tools, which leads to sync latencies and an overall confusing user experience.
With BigQuery Metastore, you can store and manage the metadata of your open-source tables, databases and namespaces processed by multiple engines (Spark, BigQuery) in one place. This eliminates the need to maintain separate metadata stores thereby giving you engine interoperability, reduction in total cost of ownership and a unified experience.
This pull request merges the Iceberg catalog implementation for BigQuery Metastore, which users have already had the chance to try out in preview as a standalone Iceberg Catalog plugin.