Conversation

@gaborkaszab
Collaborator

The freshness-aware table loading requires some additional support for HTTP headers:

  • Response headers for get and post requests
  • Input headers for get request

Extended the RESTClient and its implementations to fill in the gaps where these headers weren't supported.
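
A minimal sketch of what the two kinds of header support make possible. The `HeaderAwareClient` interface, the `FakeClient` toy server and the header values are hypothetical and only illustrate the idea; they are not the actual `RESTClient` signatures. The caller passes request headers (here `If-None-Match`) into a GET and receives the response headers (here `ETag`) through a consumer:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical interface for illustration; not the real RESTClient API.
interface HeaderAwareClient {
  String get(
      String path,
      Map<String, String> requestHeaders,
      Consumer<Map<String, String>> responseHeaders);
}

// Toy in-memory "server" that omits the body when the caller's ETag matches.
class FakeClient implements HeaderAwareClient {
  static final String CURRENT_ETAG = "\"v1\"";

  @Override
  public String get(
      String path,
      Map<String, String> requestHeaders,
      Consumer<Map<String, String>> responseHeaders) {
    Map<String, String> out = new HashMap<>();
    out.put("ETag", CURRENT_ETAG);
    responseHeaders.accept(out);
    if (CURRENT_ETAG.equals(requestHeaders.get("If-None-Match"))) {
      return null; // stands in for "304 Not Modified": no body
    }
    return "table-metadata-for-" + path;
  }
}

class HeaderDemo {
  // First call captures the ETag, second call sends it back as If-None-Match.
  static boolean secondLoadIsNotModified() {
    HeaderAwareClient client = new FakeClient();
    Map<String, String> seen = new HashMap<>();
    String first = client.get("ns/tbl", new HashMap<>(), seen::putAll);

    Map<String, String> conditional = new HashMap<>();
    conditional.put("If-None-Match", seen.get("ETag"));
    String second = client.get("ns/tbl", conditional, headers -> {});

    return first != null && second == null;
  }
}
```

Without response headers on GET the caller could not learn the `ETag`, and without input headers on GET it could not send it back, which is why both gaps matter here.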

@gaborkaszab
Collaborator Author

Conflict with this refactor: #11992
Will take care of the rebase soon.

@nastra nastra self-requested a review February 11, 2025 16:16
@gaborkaszab gaborkaszab force-pushed the main_rest_response_headers branch from c214964 to 0ee6b38 Compare February 13, 2025 12:21
@nastra nastra changed the title REST: Extended header support for RESTClient implementations Core: Extended header support for RESTClient implementations Feb 19, 2025
@gaborkaszab gaborkaszab requested a review from nastra March 4, 2025 13:41
@nastra nastra requested a review from danielcweeks March 4, 2025 15:41
Contributor

@1raghavmahajan 1raghavmahajan left a comment

Wouldn't it be better to allow users to provide their own HTTPClient implementation here and leverage HTTP Caching in a transparent manner while also exposing cache control/size?
This simplifies the flow from Iceberg perspective as well. I think it's better to avoid explicit header handling when it could be done via a compliant HTTP client.

@1raghavmahajan
Contributor

1raghavmahajan commented Mar 5, 2025

To make things simpler from a user perspective we could also add a default CachedHttpClient to the builder in case someone wants something that just works out of the box.

@gaborkaszab
Collaborator Author

Thanks for taking a look, @1raghavmahajan!

This is the proposal doc for the freshness-aware table loading, just in case: https://docs.google.com/document/d/1rnVSP_iv2I47giwfAe-Z3DYhKkKwWCVvCkC9rEvtaLA

In general, the whole improvement could work by letting the HttpClient take care of the caching itself, as you proposed.
However, the idea is to do the caching on the RESTSessionCatalog side so that we can cache Table objects instead of HTTP messages. Additionally, if the caching is at the RESTSessionCatalog level, the cached objects can be tied to the lifecycle of the tables, e.g. dropping a table could also evict the table object from the cache (using weak references), while the CachedHttpClient would only see HTTP messages.

So I'd continue with this PR to add extended header support for Iceberg's REST clients.
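
To make the distinction concrete, here is a rough sketch of a catalog-side cache. `TableCache` and `CachedTable` are hypothetical names, the table is a plain `Object` placeholder, and eviction is explicit rather than via weak references as mentioned above, so this is an illustration of the idea and not the proposed implementation:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache entry: a loaded table together with the ETag it was served with.
class CachedTable {
  final String eTag;
  final Object table; // stands in for a loaded Table object

  CachedTable(String eTag, Object table) {
    this.eTag = eTag;
    this.table = table;
  }
}

// Hypothetical catalog-side cache keyed by table identifier.
class TableCache {
  private final Map<String, CachedTable> entries = new ConcurrentHashMap<>();

  void put(String identifier, String eTag, Object table) {
    entries.put(identifier, new CachedTable(eTag, table));
  }

  // ETag to send as If-None-Match on the next load, if the table is cached.
  Optional<String> eTagFor(String identifier) {
    return Optional.ofNullable(entries.get(identifier)).map(c -> c.eTag);
  }

  // The catalog sees the drop and can evict; an HTTP-level cache only sees messages.
  void onDropTable(String identifier) {
    entries.remove(identifier);
  }
}
```

The catalog needs the response `ETag` to fill such a cache and the request header to revalidate an entry, which is where the extended header support comes in.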

@gaborkaszab gaborkaszab force-pushed the main_rest_response_headers branch 2 times, most recently from 752eb00 to 7ad2f3f Compare March 7, 2025 12:53
}

@Override
public <T extends RESTResponse> T get(
Contributor

If we're adding this, would it be possible to push the other get implementation to the interface?

Collaborator Author

Sorry for the late response.

Do you mean we could move that other get function from this class into the interface? We could have a "responseHeader-less" version in the interface that calls the version with the responseHeader param using a non-null h -> {} consumer, but I think that would be a behaviour change within the interface. Calling the "responseHeader-less" version would result in an UnsupportedOperationException from the default implementation of the other get function in the interface.

With the current approach I tried to follow how this PR from @nastra introduced the same for the post methods.
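
A small sketch of the behaviour change described above, using a hypothetical, heavily simplified `DelegatingClient` interface (the real interface has more parameters and other methods). If the header-less `get` becomes a default method that delegates to the header-aware one, an implementation that overrides neither starts throwing on the header-less call:

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical simplified interface, for illustration only.
interface DelegatingClient {
  // Header-less variant delegating with a non-null no-op consumer.
  default String get(String path) {
    return get(path, headers -> {});
  }

  // Header-aware variant: unsupported unless an implementation overrides it.
  default String get(String path, Consumer<Map<String, String>> responseHeaders) {
    throw new UnsupportedOperationException("Response headers are not supported");
  }
}

// An implementation that predates the header-aware method and overrides nothing.
class LegacyClient implements DelegatingClient {}

// An implementation that supports response headers.
class HeaderAwareImpl implements DelegatingClient {
  @Override
  public String get(String path, Consumer<Map<String, String>> responseHeaders) {
    responseHeaders.accept(Collections.singletonMap("ETag", "\"v1\""));
    return "body:" + path;
  }
}

class DefaultMethodDemo {
  static boolean legacyHeaderLessCallThrows() {
    try {
      new LegacyClient().get("tbl");
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }
}
```

With `HeaderAwareImpl` the header-less call works through the delegation, so the change only affects implementations that do not override the header-aware variant.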

* additional resource allocation.
*/
- private HTTPClient(HTTPClient parent, AuthSession authSession) {
+ HTTPClient(HTTPClient parent, AuthSession authSession) {
Contributor

Please use the builder of this class instead of changing the visibility of the constructor.

Collaborator Author

In fact, I'll revert this part (the second commit) from this PR because it was an experiment to see how we could test the new interface in HTTPClient. Agreed with @nastra that this isn't the way we'd want to go.

@gaborkaszab gaborkaszab force-pushed the main_rest_response_headers branch from 7ad2f3f to df18911 Compare April 2, 2025 09:40
@gaborkaszab
Collaborator Author

Just sharing some updates for the record:
I previously discussed with @eduard that I should have a test that verifies that the headers are populated with this implementation. I investigated different ways to achieve that, but apparently this part of the code is not that easy to mock. The reason, in a nutshell, is that even if I provide a custom HTTPClient to the RESTCatalog in the tests, when the catalog is initialised the withAuthSession call returns a new HTTPClient instance, and hence we lose the overridden functions.
The only way I managed to make this part of the code testable required me changing the visibility of the HTTPClient constructor to public, but this is not something we'd want to do. So I reverted that part of the code from this PR.
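
The mocking problem can be sketched with hypothetical stand-in classes (`BaseClient` and `TestClient` are not Iceberg classes, and the real `withAuthSession` takes an `AuthSession`, not a string). The assumption, taken from the description above, is that `withAuthSession` constructs a fresh instance of the base type:

```java
// Hypothetical stand-in for a client whose withAuthSession builds a new base instance.
class BaseClient {
  String get(String path) {
    return "real:" + path;
  }

  BaseClient withAuthSession(String session) {
    // Always the base type, so any subclass overrides are dropped here.
    return new BaseClient();
  }
}

// Test double that overrides get to intercept calls.
class TestClient extends BaseClient {
  @Override
  String get(String path) {
    return "mocked:" + path;
  }
}

class OverrideLossDemo {
  static String callThroughSession() {
    BaseClient custom = new TestClient();
    // The catalog would call withAuthSession during initialisation.
    return custom.withAuthSession("session").get("tbl");
  }
}
```

Here `callThroughSession()` goes through the base `get`, not the override, which mirrors why the custom client's overrides never take effect in the test.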

I've had limited capacity to follow up on this recently, but hopefully that will change soon. I continued with the implementation of the freshness-aware loading, but I don't think there is an easy way to test this part of the code as a separate building block anyway. I wonder if we can merge this one as it is (with the additional test reverted) to keep the implementation granular. @nastra @danielcweeks

@gaborkaszab gaborkaszab force-pushed the main_rest_response_headers branch from df18911 to 744bb96 Compare April 2, 2025 11:15
@github-actions

github-actions bot commented May 3, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 3, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this May 11, 2025
@gaborkaszab
Collaborator Author

Hi @danielcweeks, @nastra,

I let this PR expire since I plan to come up with a wider code change that covers more parts of the freshness-aware loading. I just wanted to let you know that I recently had a shift in priorities and didn't have time to work on this, but in a few weeks I'll have capacity again to continue the implementation. I have some open questions about the design, so I might reach out to you if you don't mind.
Thanks!

@dramaticlly
Contributor

@gaborkaszab @nastra @danielcweeks I am interested in this change to configure/expose the ETag headers on GET requests and would love to push it forward in a separate PR. Please let me know if there's any WIP effort on this.

@danielcweeks
Contributor

@dramaticlly I don't think anyone is actively working on this, so if you want to pick it up and move forward, I think that would be great.

@gaborkaszab
Collaborator Author

gaborkaszab commented Jul 31, 2025

@dramaticlly @danielcweeks I admit I got a bit sidetracked from this, but it is still on my plate, so I'd be a bit uncomfortable giving it away. Can we coordinate on intentions here before we move on?

Anyway, this was my last message on this PR:

"I just wanted to let you know that I recently had a shift in priorities and didn't have time to work on this, but in a few weeks I'll have capacity again to continue the implementation."

@gaborkaszab gaborkaszab reopened this Jul 31, 2025
@gaborkaszab gaborkaszab force-pushed the main_rest_response_headers branch from 744bb96 to 4e7fe97 Compare July 31, 2025 08:54
@dramaticlly
Contributor

@gaborkaszab do you want to take a look at the suggestions from @nastra? Happy to help in any way.

…n ETags

The freshness-aware table loading requires some additional support for HTTP headers:
- Response headers for get and post requests
- Input headers for get request

Extended the RESTClient and its implementations to fill in the gaps where these
headers weren't supported.

With this patch, RESTCatalogAdapter populates the ETag HTTP response header.
@gaborkaszab gaborkaszab force-pushed the main_rest_response_headers branch from e834987 to 669b44e Compare August 19, 2025 13:20
Collaborator Author

@gaborkaszab gaborkaszab left a comment

Thanks for taking a look, @nastra and @dramaticlly!
I've taken care of your comments except the one about ETags for stageCreate. I think we should have some additional discussion there.

import org.apache.iceberg.rest.responses.LoadTableResponse;

/** Interface for creating the content of the ETag HTTP headers */
public interface ETagProvider {
Collaborator Author

removed the interface and kept the utility class with a static method

import org.apache.iceberg.relocated.com.google.common.hash.Hashing;
import org.apache.iceberg.rest.responses.LoadTableResponse;

public class DefaultETagProvider implements ETagProvider {
Collaborator Author

done

private static final HashFunction MURMUR3 = Hashing.murmur3_32_fixed();

@Override
public String of(LoadTableResponse resp) {
Collaborator Author

thanks @nastra @dramaticlly ! changed the parameter to metadata location
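
For readers following along, a rough sketch of a utility class with a static method that derives an ETag from the metadata location. `ETagSketch` is a hypothetical name, the null/empty check is an assumption, and `CRC32` from the JDK is swapped in for the Murmur3 hash used in the PR only to keep the example free of dependencies:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical utility; CRC32 is a stdlib stand-in for the Murmur3 hash used in the PR.
class ETagSketch {
  private ETagSketch() {}

  // Derives a deterministic tag from the table's metadata location.
  static String of(String metadataLocation) {
    if (metadataLocation == null || metadataLocation.isEmpty()) {
      // Assumption: a missing location is rejected instead of hashed.
      throw new IllegalArgumentException("Invalid metadata location: " + metadataLocation);
    }
    CRC32 crc = new CRC32();
    crc.update(metadataLocation.getBytes(StandardCharsets.UTF_8));
    return Long.toHexString(crc.getValue());
  }
}
```

Because the tag depends only on the metadata location, it changes whenever a commit produces a new metadata file and stays stable otherwise.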

private static final HashFunction MURMUR3 = Hashing.murmur3_32_fixed();

@Override
public String of(LoadTableResponse resp) {
Collaborator Author

done

return this;
}

public RESTClient withETagProvider(ETagProvider eTagProv) {
Collaborator Author

done


respHeaders.clear();

Table tbl = catalog.loadTable(TABLE);
Collaborator Author

done

TableIdentifier.of(TABLE.namespace(), "other_table"),
((BaseTable) tbl).operations().current().metadataFileLocation());

assertThat(respHeaders).containsKey(HttpHeaders.ETAG);
Collaborator Author

thx, done

assertThat(eTag).isEqualTo(respHeaders.get(HttpHeaders.ETAG));
}

private RESTCatalog setUpETagTest(Map<String, String> respHeaders) {
Collaborator Author

done


RESTCatalog catalog = catalog(adapter);

if (requiresNamespaceCreate()) {
Collaborator Author

done

request.validate();
if (request.stageCreate()) {
return castResponse(
responseType, CatalogHandlers.stageTableCreate(catalog, namespace, request));
Collaborator Author

I wasn't sure here. Isn't the metadata location null for stage create? I created a simple test and it was null there, so I don't think we can add an ETag here. Am I missing something?

@gaborkaszab gaborkaszab requested a review from nastra August 19, 2025 14:29
import org.apache.iceberg.relocated.com.google.common.hash.HashFunction;
import org.apache.iceberg.relocated.com.google.common.hash.Hashing;

class ETagProvider {
Contributor

nit: it might make sense to have a small test class for this where the metadata location is null/well-defined and where we compare against a precalculated etag value

Collaborator Author

By precalculated ETag value, do you mean a test where the expected output is hard-coded? Would that guard against someone changing the implementation of ETag creation? I added a test for that; let me know if this is what you meant.

Contributor

@dramaticlly dramaticlly left a comment

LGTM, Thanks @gaborkaszab!

@gaborkaszab gaborkaszab requested a review from nastra August 26, 2025 08:24
@dramaticlly
Contributor

@nastra @pvary do we need anything else before we can merge this change?

@gaborkaszab
Collaborator Author

According to Slack, @nastra seems to be offline until mid Sept. @amogh-jahagirdar, would you mind taking a look? I already got some approvals previously; I just addressed a nit on top by adding some extra tests.

@amogh-jahagirdar amogh-jahagirdar self-requested a review September 6, 2025 01:02
@pvary pvary merged commit 9ea3b13 into apache:main Sep 8, 2025
43 checks passed
@pvary
Contributor

pvary commented Sep 8, 2025

Merged to main.
Thanks @gaborkaszab for the PR and to all the reviewers!
