Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sdk/db: do not hold the lock on Close #29097

Merged
merged 5 commits into from
Jan 9, 2025
Merged

Conversation

fairclothjm
Copy link
Contributor

@fairclothjm fairclothjm commented Dec 4, 2024

Description

This fixes a bug between the Vault grpc client and the plugin grpc server. This is related to slow database session End calls which could be slow network or database limits. When we see a slow Database session close, this is enough for goroutines to be blocked and prevent a new database config from being created due to an internal timeout on the plugin in the Type() call. This produces the following error in Vault:

error creating database object: invalid database version: 2 errors occurred: 
  * error getting plugin type: unable to get database plugin type: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  * Incompatible API version with plugin. Plugin version: 5, Client versions: [3 4]

TODO only if you're a HashiCorp employee

  • Backport Labels: If this fix needs to be backported, use the appropriate backport/ label that matches the desired release branch. Note that in the CE repo, the latest release branch will look like backport/x.x.x, but older release branches will be backport/ent/x.x.x+ent.
    • LTS: If this fixes a critical security vulnerability or severity 1 bug, it will also need to be backported to the current LTS versions of Vault. To ensure this, use all available enterprise labels.
  • ENT Breakage: If this PR either 1) removes a public function OR 2) changes the signature
    of a public function, even if that change is in a CE file, double check that
    applying the patch for this PR to the ENT repo and running tests doesn't
    break any tests. Sometimes ENT only tests rely on public functions in CE
    files.
  • Jira: If this change has an associated Jira, it's referenced either
    in the PR description, commit message, or branch name.
  • RFC: If this change has an associated RFC, please link it in the description.
  • ENT PR: If this change has an associated ENT PR, please link it in the
    description. Also, make sure the changelog is in this PR, not in your ENT PR.

@fairclothjm fairclothjm requested a review from a team as a code owner December 4, 2024 21:48
@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Dec 4, 2024
Copy link

github-actions bot commented Dec 4, 2024

CI Results:
All Go tests succeeded! ✅

Copy link

github-actions bot commented Dec 4, 2024

Build Results:
All builds succeeded! ✅

Copy link
Contributor

@mpalmi mpalmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fairclothjm I may totally be missing something here, but this change feels unsafe...

What happens if there are two concurrent requests to Close the same database? If we drop the lock for the call to Close, we run the risk of having the g.instances reference changed underneath us and potentially panic, which doesn't seem like something we'd want.

Sorry for the drive-by review, but this seemed a little fishy so I figured I'd throw in my 2c. Hopefully I'm wrong and this is fine to do!

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fairclothjm thanks for putting this up! I think there are a couple of tweaks to get the locking logic right here but it shouldn't take too much.

@mpalmi good spot on the missing unlocks on return.

I wonder if there is a way we can still get the safety of defer by default 🤔. I'll think about this more but just fixing the two early returns for now should be fine.

What happens if there are two concurrent requests to Close the same database?

Great thought. I think that's safe (assuming we clean up the return unlocks) because:

  1. First one to get the lock will remove the DB implementation from the map before it unlocks
  2. second one will not find it and so return an error fmt.Errorf("no database instance found")
  3. either: the first call succeeds and the DB is closed and removed
  4. or: the first call fails the Close call, re-grabs the lock and puts the implementation back. This might not be necessary but it preserves the old behaviour in case anything relied on being able to retry Close if it fails due to a transient error instead of just leaking the resources so it seems lower risk.

I think the one thing in that that could be a problem is that we should probably not blindly put back the broken one on error just in case a new plugin isntance has been started in the interim with the same ID and we end up replacing it with the old broken instance.

sdk/database/dbplugin/v5/grpc_server.go Show resolved Hide resolved
sdk/database/dbplugin/v5/grpc_server.go Show resolved Hide resolved
sdk/database/dbplugin/v5/grpc_server.go Outdated Show resolved Hide resolved
@fairclothjm
Copy link
Contributor Author

@mpalmi @banks Thanks both for the thorough review! I will look closer at this one.

@fairclothjm fairclothjm removed the request for review from divyapola5 December 5, 2024 21:49
@fairclothjm
Copy link
Contributor Author

@divyapola5 Hi, not sure how I accidentally added you as a reviewer here. Sorry about that!

@fairclothjm fairclothjm requested a review from banks January 9, 2025 17:07
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice JM. I think this addresses all the feedback so far around lock safety and the edge case around overriding a newer instance when we put it back on a failure.

@fairclothjm fairclothjm enabled auto-merge (squash) January 9, 2025 17:33
@fairclothjm fairclothjm merged commit 36d7e0c into main Jan 9, 2025
91 of 92 checks passed
@fairclothjm fairclothjm deleted the sdk/db/close-non-locking branch January 9, 2025 17:33
@fairclothjm fairclothjm added backport/ent/1.16.x+ent Changes are backported to 1.16.x+ent backport/1.18.x backport/ent/1.17.x+ent Changes are backported to 1.17.x+ent labels Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.16.x+ent Changes are backported to 1.16.x+ent backport/ent/1.17.x+ent Changes are backported to 1.17.x+ent backport/1.18.x hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed pr/no-milestone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants