deduplicate storage requests when loading views.#3488
Conversation
| new AttributeMap() | ||
| .put(EventAttributes.CATALOG_NAME, catalogName) | ||
| .put(EventAttributes.TABLE_IDENTIFIER, tableIdentifier))); | ||
| // Create FileIO once outside the retry lambda to avoid redundant storage requests |
There was a problem hiding this comment.
How does this help with reducing storage requests? Creating a FileIO should not hit storage, AFAIK 🤔
This is just about the comment. Code change LGTM.
There was a problem hiding this comment.
Prior to the change, if refreshFromMetadataLocation was retried, it will created another FileIO, I can update the comment.
| } | ||
|
|
||
| /** | ||
| * Override to fix a bug in {@link BaseMetastoreViewCatalog#loadView} where {@link #newViewOps} is |
There was a problem hiding this comment.
tangential: Could you also double check for something like this in tables?
There was a problem hiding this comment.
@evindj : any insight here? it's not required for current PR, but since you're involved in this issue already, I thought it might be easy for you to quickly double check that 😉
There was a problem hiding this comment.
yeah, it is my understanding that this would not happen for LoadTable, because loadtable directly uses the operations from the table to construct the response. See https://github.com/apache/polaris/blob/main/runtime/service/src/main/java/org/apache/polaris/service/catalog/iceberg/CatalogHandlerUtils.java#L387
| ViewMetadata view = response.metadata(); | ||
|
|
||
| // Assert: newViewOps() should be reused (called exactly once) | ||
| verify(spyCatalog, times(1)).newViewOps(VIEW_ID); |
There was a problem hiding this comment.
Running a @QuarkusTest (which builds a full server) to do some mock checks looks like an overkill to me.
Could we leverage MeasuredFileIO instead and move this test to an existing Quarkus test (to reduce the number of test servers)?
There was a problem hiding this comment.
Also, if testing this particular bug turns out to be cumbersome, I think manual verification is sufficient.
Existing CI should be sufficient to ensure the changed code works well.
There was a problem hiding this comment.
Yeah it would be a bit challenging to test without turning up an environment because, the duplicate request happens while loading the view and while returning the response. I have verified manually so I am ok removing this expensive test.
| new AttributeMap() | ||
| .put(EventAttributes.CATALOG_NAME, catalogName) | ||
| .put(EventAttributes.VIEW_IDENTIFIER, identifier))); | ||
| // Create FileIO once outside the retry lambda to avoid redundant storage requests |
There was a problem hiding this comment.
Same concern here - not re-creating FileIO is good, but I do not think it relates to duplicate storage requests...
To be completely clear wrt #3061, WDYT about making the FileIO optimization change in a separate PR?
There was a problem hiding this comment.
Other than that this PR LGTM 👍 :)
There was a problem hiding this comment.
I can move the FileIO optimization to a separate PR. let me remove then here.
| // then we should use the actual current table properties for IO refresh here | ||
| // instead of the general tableDefaultProperties. | ||
| String latestLocationDir = | ||
| latestLocation.substring(0, latestLocation.lastIndexOf('/')); |
There was a problem hiding this comment.
one last thing, please - would you mind removing these changes now that there are not necessary for the purpose of this PR?
| new HashMap<>(tableDefaultProperties), | ||
| Set.of(PolarisStorageActions.READ, PolarisStorageActions.LIST)); | ||
|
|
||
There was a problem hiding this comment.
Please run .//gradlew spotlessApply or make spotless-apply to fix the lint then CI will move forward.
* dedup metadata fetch by reusing ops
* (doc) Outdated changelog (apache#3503) * deduplicate storage requests when loading views. (apache#3488) * dedup metadata fetch by reusing ops * Update Gradle to 9.3.0 (apache#3514) * Site: add tool to mark versioned-docs as "do not index" (apache#3485) Adds a shell script to add the front-matter tags `robots: noindex` (HTML META tag) and `exclude_search: true` (local site search) for versioned docs for a specific version. The rudimentary script handles only markdown files, not asciidoc files, because there are currently only .md files in the versioned-docs. * feat(metrics): Evolve PolarisMetricsReporter interface with timestamp parameter and comprehensive documentation (apache#3468) Enhance the `PolarisMetricsReporter` SPI interface by adding a timestamp parameter to the `reportMetric()` method, enabling accurate time-series metrics reporting to external systems. * Make python version configuration in Makefile (apache#3510) * Update max supported python version (apache#3509) * (doc) Add doc for couple new feature flags introduced recently (apache#3511) * Add doc for couple new feature flags * Add doc for couple new feature flags * (nit) Site: Fading anchor (apache#3522) * Fading anchor * Fading anchor * Update doc to use /Users/richardliu rather than quoted tilde (apache#3472) * Remove KMS policies when KMS is not configured and improved default KMS permission for RO/RW (apache#3493) * Update triggers in "Hugo Site" workflow (apache#3518) The `site.yml` workflow is currently triggered in the following scenarios: 1. A push to the `main` branch, using the state of the site and the workflow as on `main`. 2. A push to a `release/*`branch, using the state of the site and the workflow as on that release branch. Notice that workflows get the repo state (from the checkout actions) and the workflow state as its on that particular branch. Put in other words: if we'd have a change to some old `release/1.0.x` branch, the web site would be updated as it is defined on that `release/1.0.x` branch, which is wrong and not the intended behavior. This change updates the workflow triggers to only run for pushes to the `main` branch, plus PRs against the `main` branch and when called from another workflow. This is part of apache#3516 * Let `site/bin/checkout-releases.sh` pull the latest state (apache#3517) The script `site/bin/checkout-releases.sh` is a convenience to get the `versioned-docs` branch locally. It is missing a `git pull` to get the latest state of that branch though, which can be confusing. * Last merged commit 1bf72bc --------- Co-authored-by: Yong Zheng <yongzheng0809@gmail.com> Co-authored-by: Innocent Djiofack <djiofack007@gmail.com> Co-authored-by: Anand K Sankaran <lists@anands.net> Co-authored-by: Richard Liu <35082658+RichardLiu2001@users.noreply.github.com>
The PR addresses issue #3061 where loadview makes multiple requests for the same file.
This PR has two main changes to avoid making too many storage requests when loading tables and views.
The first change is aimed at reusing
fileIOin caserefreshFromMetadataLocationis retried. this is done for both tables and views.The second change is related to
loadViewonly it overridesloadViewinBaseMetastoreViewCatalogto reuse viewOps instead of creating them twice.Checklist
CHANGELOG.md(if needed)site/content/in-dev/unreleased(if needed)