Skip to content

Conversation

@joelanford
Copy link
Member

@joelanford joelanford commented Jun 23, 2022

Signed-off-by: Joe Lanford joe.lanford@gmail.com

Description of the change:
This PR modifies opm serve so that loading FBC is handled asynchronously in the background so that the grpc server can start immediately, thus enabling the health probes to return successfully faster.

All of the grpc endpoints that require the FBC to be loaded now block, waiting for the initialization to complete. Clients can now initiate a connection and the server will block the response until initialization completes. Clients that don't want to wait indefinitely can set timeouts on their requests.

Motivation for the change:
While we shipped a change in catalog-operator to make use of the startupProbe functionality, that fix will not solve problems caused by opm serve's long startup times on clusters without the startupProbe.

This fix solves that problem by ensuring the grpc server starts immediately (much like the sqlite-based grpc server).

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci openshift-ci bot requested review from anik120 and exdx June 23, 2022 15:56
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2022
@joelanford joelanford changed the title make 'opm serve' initialization asynchronous wip: make 'opm serve' initialization asynchronous Jun 23, 2022
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 23, 2022
@joelanford joelanford force-pushed the opm-serve-async-startup branch 5 times, most recently from da0db50 to 0a337c5 Compare June 23, 2022 17:52
@codecov
Copy link

codecov bot commented Jun 23, 2022

Codecov Report

Merging #977 (2c6e79b) into master (8250533) will increase coverage by 2.67%.
The diff coverage is 85.45%.

❗ Current head 2c6e79b differs from pull request most recent head 304c402. Consider uploading reports for the commit 304c402 to get more accurate results

@@            Coverage Diff             @@
##           master     #977      +/-   ##
==========================================
+ Coverage   52.60%   55.27%   +2.67%     
==========================================
  Files         104      130      +26     
  Lines        9422    10826    +1404     
==========================================
+ Hits         4956     5984    +1028     
- Misses       3536     3763     +227     
- Partials      930     1079     +149     
Impacted Files Coverage Δ
pkg/registry/query.go 67.50% <85.45%> (+6.56%) ⬆️
pkg/image/execregistry/registry.go 54.54% <0.00%> (ø)
pkg/lib/unstructured/unstructured.go 60.71% <0.00%> (ø)
pkg/api/grpc_health_v1/health_grpc.pb.go 0.00% <0.00%> (ø)
...inertools/containertoolsfakes/fake_label_reader.go 0.00% <0.00%> (ø)
pkg/sqlite/sqlitefakes/fake_rowscanner.go 47.78% <0.00%> (ø)
pkg/lib/log/writerhook.go 0.00% <0.00%> (ø)
pkg/lib/config/validate.go 0.00% <0.00%> (ø)
...s/containertoolsfakes/fake_dockerfile_generator.go 0.00% <0.00%> (ø)
pkg/image/containerdregistry/resolver.go 57.53% <0.00%> (ø)
... and 41 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8250533...304c402. Read the comment docs.

@joelanford
Copy link
Member Author

https://github.com/operator-framework/operator-registry/runs/7029116066?check_suite_focus=true#step:4:80

I think this is saying that var _ GRPCQuery = Querier{} would fail to compile. And I think this PR "breaks" that by making all of those Querier methods use pointer receivers. I don't think we ever intentionally tried to ensure Querier implemented GRPCQuery

However, we did intend for *Querier to implement GRPCQuery and that is still the case:

https://github.com/joelanford/operator-registry/blob/0a337c5535ae605d4b3fd4c86f9592d0180ce0ec/pkg/registry/query.go#L45

@joelanford joelanford force-pushed the opm-serve-async-startup branch from 0a337c5 to d101180 Compare June 23, 2022 19:32
…t of grpc server

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@joelanford joelanford force-pushed the opm-serve-async-startup branch 2 times, most recently from f860fd8 to f658013 Compare July 6, 2022 15:27
@joelanford joelanford changed the title wip: make 'opm serve' initialization asynchronous make 'opm serve' initialization asynchronous Jul 6, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 6, 2022
@joelanford joelanford force-pushed the opm-serve-async-startup branch from f658013 to 7db066c Compare July 7, 2022 12:23
Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@joelanford joelanford force-pushed the opm-serve-async-startup branch from 7db066c to eb0d664 Compare July 7, 2022 13:29
.PHONY: unit
unit:
$(GO) test -coverprofile=coverage.out $(SPECIFIC_UNIT_TEST) $(TAGS) $(TEST_RACE) -count=1 ./pkg/... ./alpha/...
$(GO) test -coverprofile=coverage.out -coverpkg=./... $(SPECIFIC_UNIT_TEST) $(TAGS) $(TEST_RACE) -count=1 ./pkg/... ./alpha/...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change means we now consider cross-package coverage. (i.e. the tests in pkg/server/server_test.go can contribute to coverage in pkg/registry/query.go)

This very likely means there's an artificial increase in the repo-wide coverage, but I did my best to add specific unit tests related to this PR in pkg/registry/query_test.go to make sure the new code paths are relatively well-covered.

@dinhxuanvu dinhxuanvu removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2022
Copy link
Member

@dinhxuanvu dinhxuanvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Just one question.


func (q Querier) ListBundles(ctx context.Context) ([]*api.Bundle, error) {
func (q *Querier) ListBundles(ctx context.Context) ([]*api.Bundle, error) {
if err := q.Wait(ctx); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, gRPC client on OLM side doesn't have a deadline setting so the rRPC calls can wait here for a long time. I wonder if we should set a deadline now so that if it goes for too long, it will fail out with DEADLINE_EXCEED error. Or perhaps "waiting until ready regardless how long it is" is a desire behavior? I would love this to be tested in a real cluster to see if things work well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait has three cases when it returns:

  1. The passed in context was canceled (not sure of all the ways that context is propagated from the GRPC server that calls this function, one way is if the server is shutdown). Perhaps this could be the avenue to set a timeout on a per request basis?
  2. There was an error loading the FBC.
  3. The FBC was successfully loaded.

So if Wait blocks, it means we're still churning through the initialization and haven't encountered any issues yet.

I'll take a look at the request deadline possibilities.

@joelanford
Copy link
Member Author

/hold

There are some implications on OLM that we need to consider:

  1. With the current request-blocking behavior, that might cause the OLM cache controller to block anytime it makes a request against a catalog source that is initializing. Blocking could manifest as higher latency reconciliations.
  2. If we add a server-enforced timeout, the OLM cache controller could get deadline errors during catalog startup and cause unexpected behavior (e.g. failed resolutions, incorrect retry logic)

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2022
Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 7, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 7, 2022

@joelanford: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jdockter
Copy link

@joelanford is this issue on permanent hold or do you see this getting resolved as is?

@joelanford
Copy link
Member Author

joelanford commented Jul 25, 2022

@jdockter I don't think it's permanently blocked. We just need to prioritize checking the implications of this change on clients of the GRPC server (catalog-operator and packageserver).

@joelanford
Copy link
Member Author

Closing in favor of #1005

@joelanford joelanford closed this Aug 24, 2022
@joelanford joelanford deleted the opm-serve-async-startup branch August 24, 2022 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants