Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logical.ReadRawWithDataWithContext cancels response body #18658

Closed
terinjokes opened this issue Jan 10, 2023 · 10 comments · Fixed by #18708
Closed

Logical.ReadRawWithDataWithContext cancels response body #18658

terinjokes opened this issue Jan 10, 2023 · 10 comments · Fixed by #18708
Labels
bug Used to indicate a potential bug core/api devex Developer Experience

Comments

@terinjokes
Copy link

terinjokes commented Jan 10, 2023

Describe the bug

To fetch the public certificates of certificate authorities from the PKI plugin, I was using the Client's RawRequestWithContext function, which has since been deprecated. I've migrated to the Logical client's ReadRawWithDataWithContext function, but found it to be inconsistently failing.

To support client HTTP timeouts, both implementations call context.WithTimeout. However, RawRequestWithContext never called the cancel function.

vault/api/client.go

Lines 1268 to 1277 in c782b66

// Deprecated: This method should not be used directly. Use higher level
// methods instead.
func (c *Client) RawRequestWithContext(ctx context.Context, r *Request) (*Response, error) {
// Note: we purposefully do not call cancel manually. The reason is
// when canceled, the request.Body will EOF when reading due to the way
// it streams data in. Cancel will still be run when the timeout is
// hit, so this doesn't really harm anything.
ctx, _ = c.withConfiguredTimeout(ctx)
return c.rawRequestWithContext(ctx, r)
}

The replacement ReadRawWithDataWithContext does call cancel, which occasionally starts causing failures reading the response body.

vault/api/logical.go

Lines 101 to 106 in c782b66

func (c *Logical) ReadRawWithDataWithContext(ctx context.Context, path string, data map[string][]string) (*Response, error) {
ctx, cancelFunc := c.c.withConfiguredTimeout(ctx)
defer cancelFunc()
return c.readRawWithDataWithContext(ctx, path, data)
}

(Note that this code block eventually calls the above deprecated function via readRawWithDataWithContext)

To Reproduce
Since the failure seems to depend on the precise scheduling of the Go runtime, and behavior of select cases, I've been unable to develop an example that consistently fails. However, as an partial example:

func GetCertifcateForAuthority(ctx context.Context, client *vault.Client, caName string) ([]byte, error) {
	pemPath := path.Join(caName, "ca", "pem")
	resp, err := client.Logical().ReadRawWithDataWithContext(ctx, pemPath, nil)
	if err != nil {
		return nil, fmt.Errorf("fetching certificate from Vault returned error: %w", err)
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNoContent:
		// Vault returns 204 when CA exists, but key material hasn't been submitted or generated
		return []byte{}, nil
	case http.StatusOK:
		p, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, fmt.Errorf("reading body from Vault: %w", err)
		}

		block, _ := pem.Decode(p)
		if block == nil {
			return nil, errors.New("no PEM block returned by Vault")
		} else if block.Type != "CERTIFICATE" {
			return nil, fmt.Errorf("unknown PEM %q block returned by Vault", block.Type)
		}

		return pem.EncodeToMemory(block), nil
	default:
		return nil, fmt.Errorf("unexpected status code returned by Vault: %d", resp.StatusCode)
	}
}

With using a Vault client generated with vault.DefaultConfig(), this function will eventually unexpectedly fail with the following error: reading body from Vault: context canceled. After disabling the HTTP timeout in the Vault client (eg, config.Timeout = 0) I've been unable to reproduce this error.

Expected behavior
I can't see the use case for closing the body of a ReadRaw* request before the caller has a chance to inspect is the intended result of these APIs.

Should I open a PR to duplicate the note and behavior from Client? Or was this timeout behavior only intended on the non-raw paths?

@heatherezell heatherezell added bug Used to indicate a potential bug core/api devex Developer Experience labels Jan 11, 2023
@averche
Copy link
Contributor

averche commented Jan 11, 2023

Hi @terinjokes,

Thanks for reporting this!

The ReadRawWithDataWithContext function you are calling seems to have reintroduced an old bug where the context gets cancelled before the body is closed.

Should I open a PR to duplicate the note and behavior from Client? Or was this timeout behavior only intended on the non-raw paths?

I believe it's the later. We had to deprecate RawRequestWithContext precisely because of the timeout cancellation. I'm going to put in a PR to remove the configured timeout behavior and add a comment to reflect that.

@averche
Copy link
Contributor

averche commented Jan 11, 2023

Actually, it appears that @cipherboy has already addressed this issue in #17750 PR from November

@terinjokes
Copy link
Author

It looks like that PR takes the former approach of duplicating the behavior from client, which seems to be the reason it was deprecated in the first place?

averche added a commit that referenced this issue Jan 17, 2023
#18708)

Removing the timeout logic from raw-response functions and adding documentation comments. The following functions are affected:

- `ReadRaw`
- `ReadRawWithContext` (newly added)
- `ReadRawWithData`
- `ReadRawWithDataWithContext`

The previous logic of using `ctx, _ = c.c.withConfiguredTimeout(ctx)` could cause a potential [context leak](https://pkg.go.dev/context):

> Failing to call the CancelFunc leaks the child and its children until the parent is canceled or the timer fires. The go vet tool checks that CancelFuncs are used on all control-flow paths.

Cancelling the context would have caused more issues since the context would be cancelled before the request body is closed.

Resolves: #18658
@averche
Copy link
Contributor

averche commented Jan 19, 2023

Hi @terinjokes, I've merged the PR to fix this issue (removing timeout logic altogether from ReadRaw* requests) and tagged it as api/v1.8.3. Please update to this version to pick up the latest changes.

Since ReadRaw* methods no longer handle timeouts, you will probably want to do it manually. In your example above, it would be something like:

func GetCertifcateForAuthority(ctx context.Context, client *vault.Client, caName string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), client.ClientTimeout())
	defer cancel()
	
	// the rest of the code is unchanged
}

@terinjokes
Copy link
Author

terinjokes commented Jan 20, 2023

Yes, I also did that with the workaround of disabling the cfg.Timeout. I'll upgrade and test the new release over the next few days.

@terinjokes
Copy link
Author

@averche Do you have an estimate of when the github.com/hashicorp/vault module will be updated to use SDK v0.7.0? Updating api to 1.8.3 also upgrades the SDK, which causes my project to fail building:

/home/terin/go/pkg/mod/github.com/hashicorp/[email protected]/vault/logical_system.go:4284:9: not enough arguments in call to framework.NewOASDocument
	have ()
	want (string)

@averche
Copy link
Contributor

averche commented Jan 20, 2023

@terinjokes It appears that you are importing vault/go.mod directly. It is not meant to be imported as a dependency in your project. Please refer to this section for more details.

I've updated vault's go.mod to SDK v0.7.0 and API v1.8.3 (#18765, #18773) but if you are building against [email protected], it will likely not be backwards compatible.

@terinjokes
Copy link
Author

Yes, it looks like I started using the testrig in my E2E tests before some of those warnings went in?

@averche
Copy link
Contributor

averche commented Jan 23, 2023

Adding the following in go.mod worked for me:

require (
	github.com/hashicorp/vault v1.2.1-0.20230117213031-11f6aad2b293
	github.com/hashicorp/vault/api v1.8.3
)

Hope this helps!

In general, as I mentioned above, removing the dependency on github.com/hashicorp/vault would be a much better long term solution.

@terinjokes
Copy link
Author

I plan on just pulling on the binary and wiring it up to the test runner. Fortunately, I've already isolated the E2Es from the rest of the project.

Thanks, we can probably call it done here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug core/api devex Developer Experience
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants