prow: sidecar: allow configuring ignore interrupts #21644

Merged: 1 commit, Apr 26, 2021

viveksyngh
Contributor

@viveksyngh viveksyngh commented Apr 3, 2021

Add a field upload_ignores_interrupts to the decoration config spec to configure whether the sidecar ignores
interrupts while performing the upload. When upload_ignores_interrupts is set to false and an interrupt is
received, the sidecar also attempts a best-effort upload. If the main upload becomes ready while the best-effort
upload is in progress, the sidecar interrupts the best-effort upload and switches to the main upload.

Fixes: #21167

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2021
@k8s-ci-robot
Contributor

Welcome @viveksyngh!

It looks like this is your first PR to kubernetes/test-infra 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/test-infra has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @viveksyngh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 3, 2021
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/prow Issues or PRs related to prow area/prow/pod-utilities Issues or PRs related to prow's pod-utilities component sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 3, 2021
@viveksyngh viveksyngh changed the title Add ignore interrupt to decorator config Add ignore_interrupts field to decorator config Apr 3, 2021
@viveksyngh viveksyngh changed the title Add ignore_interrupts field to decorator config Add ignore_interrupts field to decoration config Apr 3, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 3, 2021
@viveksyngh viveksyngh changed the title Add ignore_interrupts field to decoration config prow: sidecar: allow configuring ignore interrupts Apr 3, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 5, 2021
@viveksyngh viveksyngh marked this pull request as ready for review April 5, 2021 04:56
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2021
@k8s-ci-robot k8s-ci-robot added the area/prow/sidecar Issues or PRs related to prow's sidecar component label Apr 7, 2021
@alvaroaleman
Member

@viveksyngh can you explain the motivation/usecase for this new knob?

@viveksyngh
Contributor Author

viveksyngh commented Apr 7, 2021

@alvaroaleman Sorry, I didn't link the issue earlier. I created this PR to fix issue #21167.

I have linked the issue now.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2021
Member

@cjwagner cjwagner left a comment

Thanks for working on this. Exposing the ignore_interrupts option is nice, but the fix to the best effort upload logic is even better. With that fix the option hopefully won't be needed at all.


// IgnoreInterrupts causes sidecar to ignore interrupts and
// hope that the test process exits cleanly before starting an upload.
IgnoreInterrupts *bool `json:"ignore_interrupts,omitempty"`
Member

I think we should rename this to something that indicates that this applies to the job result upload only and that interrupts will still be forwarded to the test process, e.g. upload_ignores_interrupts

Contributor Author

Updated it to upload_ignores_interrupts
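
For readers following the rename, here is a minimal sketch of how the renamed field might look and how the JSON key maps onto it. The struct is a trimmed-down, hypothetical stand-in for the decoration config; the Go field name `UploadIgnoresInterrupts` and the helper `parseDecorationConfig` are assumptions for illustration, and only the JSON key `upload_ignores_interrupts` comes from this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DecorationConfig is a hypothetical, trimmed-down stand-in for the
// decoration config; only the field discussed in this thread is shown.
type DecorationConfig struct {
	// UploadIgnoresInterrupts (assumed Go name) applies to the job result
	// upload only; interrupts are still forwarded to the test process.
	UploadIgnoresInterrupts *bool `json:"upload_ignores_interrupts,omitempty"`
}

// parseDecorationConfig decodes a JSON document into a DecorationConfig.
func parseDecorationConfig(raw string) (DecorationConfig, error) {
	var dc DecorationConfig
	err := json.Unmarshal([]byte(raw), &dc)
	return dc, err
}

func main() {
	for _, raw := range []string{`{}`, `{"upload_ignores_interrupts": true}`} {
		dc, err := parseDecorationConfig(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if dc.UploadIgnoresInterrupts == nil {
			fmt.Println("unset")
		} else {
			fmt.Println("set to", *dc.UploadIgnoresInterrupts)
		}
	}
}
```

Using a pointer keeps "not configured" distinguishable from an explicit `false`, which is why the field can be omitted from the config entirely.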

Comment on lines 745 to 750
var ignoreInterrupts bool
if pj.Spec.DecorationConfig.IgnoreInterrupts == nil { // if nil, set it to false
	ignoreInterrupts = false
} else {
	ignoreInterrupts = *pj.Spec.DecorationConfig.IgnoreInterrupts
}
Member

nit: This is a bit simpler

Suggested change
- var ignoreInterrupts bool
- if pj.Spec.DecorationConfig.IgnoreInterrupts == nil { // if nil, set it to false
- 	ignoreInterrupts = false
- } else {
- 	ignoreInterrupts = *pj.Spec.DecorationConfig.IgnoreInterrupts
- }
+ ignoreInterrupts := pj.Spec.DecorationConfig.IgnoreInterrupts != nil && *pj.Spec.DecorationConfig.IgnoreInterrupts

Contributor Author

Updated it
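
The one-line form suggested above relies on `&&` short-circuiting, so the pointer is never dereferenced when it is nil. A small standalone sketch of the same idiom follows; the helper name `boolOrFalse` is hypothetical and not part of the PR.

```go
package main

import "fmt"

// boolOrFalse returns the pointed-to value, or false when p is nil.
// The && short-circuits, so *p is never evaluated for a nil pointer.
func boolOrFalse(p *bool) bool {
	return p != nil && *p
}

func main() {
	yes, no := true, false
	fmt.Println(boolOrFalse(nil), boolOrFalse(&yes), boolOrFalse(&no))
}
```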

@@ -108,6 +108,10 @@ func (o Options) Run(ctx context.Context) (int, error) {
		return 0, fmt.Errorf("could not resolve job spec: %v", err)
	}

	entries := o.entries()
	buildLogs := logReaders(entries)
	metadata := combineMetadata(entries)
Member

logReaders and combineMetadata cannot be called this early. They need to be called immediately before doUpload so that we don't open and read the files before they've been written.

Contributor Author

Moved these calls to immediately before the doUpload and bestEffortUpload calls.
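
To illustrate why the timing matters, here is a minimal sketch, assuming a single artifact file that a test process writes at some point. `readArtifact` is a hypothetical stand-in for logReaders/combineMetadata: it captures whatever the file holds at the moment it is called, so a call made too early sees nothing.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// readArtifact is a hypothetical stand-in for logReaders/combineMetadata:
// it snapshots the file contents at the moment it is called.
func readArtifact(path string) string {
	data, err := os.ReadFile(path)
	if err != nil {
		return "" // the file has not been written yet
	}
	return string(data)
}

// demo reads the artifact once before and once after a simulated test
// process has written it, and returns both snapshots.
func demo() (early, late string, err error) {
	dir, err := os.MkdirTemp("", "sidecar-sketch")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "build-log.txt")
	early = readArtifact(path) // too early: nothing has been written
	if err = os.WriteFile(path, []byte("test output\n"), 0o644); err != nil {
		return "", "", err
	}
	late = readArtifact(path) // immediately before the upload
	return early, late, nil
}

func main() {
	early, late, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("early=%q late=%q\n", early, late)
}
```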

@@ -125,7 +129,12 @@ func (o Options) Run(ctx context.Context) (int, error) {
// second upload but we can tolerate this as we'd rather get SOME
// data into GCS than attempt to cancel these uploads and get none.
logrus.Errorf("Received an interrupt: %s, cancelling...", s)
cancel()

err = o.doUpload(spec, false, true, metadata, buildLogs)
Member

This case skips some of the steps before doUpload is called, in particular checking for the deprecated wrapper options and censoring the data before upload. Please put all this common logic in a function that can be called in both places.

Contributor Author

Added a new method, preUpload, which performs the steps required before an upload, and called it before both uploads.
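
The idea of sharing the pre-upload steps can be sketched as follows. This is not the PR's actual preUpload: the signature, the `artifacts` type, and the toy secret masking are all assumptions made for illustration. The point is only that both upload paths go through one function, so neither can skip the censoring step.

```go
package main

import (
	"fmt"
	"strings"
)

// artifacts is a hypothetical bundle of what gets uploaded.
type artifacts struct {
	buildLog string
}

// preUpload (hypothetical signature) censors the log before upload.
// The masking below is a toy replacement for the real censoring logic.
func preUpload(rawLog string, secrets []string) artifacts {
	for _, s := range secrets {
		rawLog = strings.ReplaceAll(rawLog, s, strings.Repeat("*", len(s)))
	}
	return artifacts{buildLog: rawLog}
}

// bestEffortUpload simulates the upload triggered by an interrupt.
func bestEffortUpload(rawLog string, secrets []string) string {
	return "best-effort: " + preUpload(rawLog, secrets).buildLog
}

// mainUpload simulates the upload performed once the test has finished.
func mainUpload(rawLog string, secrets []string) string {
	return "main: " + preUpload(rawLog, secrets).buildLog
}

func main() {
	secrets := []string{"abc"}
	fmt.Println(bestEffortUpload("token=abc", secrets))
	fmt.Println(mainUpload("token=abc", secrets))
}
```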

@@ -125,7 +129,12 @@ func (o Options) Run(ctx context.Context) (int, error) {
// second upload but we can tolerate this as we'd rather get SOME
// data into GCS than attempt to cancel these uploads and get none.
logrus.Errorf("Received an interrupt: %s, cancelling...", s)
cancel()

err = o.doUpload(spec, false, true, metadata, buildLogs)
Member

We need this preemptive upload to be interrupted if we see that the marker files are written and we're able to begin the main upload.
To achieve this we should instrument the call stack of doUpload with a context that can be used to cancel the upload early and then pass the ctx context here. That way if the wait function returns on line 148 we'll cancel the preemptive upload on line 150 and begin the main upload.

Contributor Author

Handled this by adding a new method, doBestEffortUpload, which uses doUpload but also switches context when the main upload is ready.
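
The flow requested in this thread can be sketched roughly as below, under stated assumptions: `upload` simulates an upload that honors its context, `mainReady` stands in for "the marker files have been written", and `uploadAfterInterrupt` is a hypothetical name. None of these are the PR's real functions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// upload simulates uploading a number of items and stops early when ctx
// is cancelled.
func upload(ctx context.Context, items int, perItem time.Duration) error {
	for i := 0; i < items; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(perItem):
		}
	}
	return nil
}

// uploadAfterInterrupt starts a best-effort upload and cancels it as soon
// as mainReady is closed, then performs the main upload. It reports which
// upload ran to completion.
func uploadAfterInterrupt(mainReady <-chan struct{}) string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Cancel the best-effort context once the main upload can begin.
	go func() {
		select {
		case <-mainReady:
			cancel()
		case <-ctx.Done():
		}
	}()

	err := upload(ctx, 20, 5*time.Millisecond)
	if err == nil {
		return "best-effort"
	}
	if !errors.Is(err, context.Canceled) {
		return "failed"
	}
	// The main upload does not need to be cancellable.
	if err := upload(context.Background(), 3, time.Millisecond); err != nil {
		return "failed"
	}
	return "main"
}

func main() {
	ready := make(chan struct{})
	close(ready) // markers already written: switch to the main upload
	fmt.Println(uploadAfterInterrupt(ready))
	fmt.Println(uploadAfterInterrupt(make(chan struct{}))) // never ready
}
```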

prow/sidecar/run.go
Member

@cjwagner cjwagner left a comment

Sorry for the delayed review.

select {
case <-ctx.Done():
	logrus.Infof("Interrupting best-effort upload as main upload is ready")
	return
Member

Returning here does not actually interrupt the goroutine created above. o.doUpload() will still run to completion unless we pass ctx as a parameter down its call stack to be used to interrupt the upload. Specifically o.doUpload() above, then

if err := o.GcsOptions.Run(spec, uploadTargets); err != nil {

then
if err := gcs.Upload(o.Bucket, o.StorageClientOptions.GCSCredentialsFile, o.StorageClientOptions.S3CredentialsFile, uploadTargets); err != nil {

to replace the background context here
ctx := context.Background()

After this change it won't be necessary to call doUpload() from a separate goroutine above.

Since we only need to be able to cancel the best effort doUpload() call, the main doUpload() call can use context.Background() rather than a cancelable context.

Contributor Author

Updated the functions to pass the context down the call stack and removed the separate goroutine.
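
A minimal sketch of threading a context through the call stack, so that the upload itself stops when the context is cancelled and no extra goroutine is needed. The three layers loosely mirror the chain described above, but every name and signature here is hypothetical, and the `put` callback is a stand-in for storing one upload target.

```go
package main

import (
	"context"
	"fmt"
)

// upload is the innermost layer. It checks ctx before each target, so a
// cancellation takes effect between targets.
func upload(ctx context.Context, targets []string, put func(string)) (int, error) {
	for i, target := range targets {
		if err := ctx.Err(); err != nil {
			return i, err
		}
		put(target)
	}
	return len(targets), nil
}

// run only forwards ctx instead of creating its own background context.
func run(ctx context.Context, targets []string, put func(string)) (int, error) {
	return upload(ctx, targets, put)
}

// doUpload is the outermost layer and also just forwards ctx.
func doUpload(ctx context.Context, targets []string, put func(string)) (int, error) {
	return run(ctx, targets, put)
}

// cancelAfter runs doUpload on the calling goroutine and cancels the
// context once limit targets have been stored.
func cancelAfter(limit int, targets []string) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	stored := 0
	return doUpload(ctx, targets, func(string) {
		stored++
		if stored == limit {
			cancel()
		}
	})
}

func main() {
	n, err := cancelAfter(2, []string{"a", "b", "c", "d"})
	fmt.Println(n, err)
}
```

Because cancellation is observed inside the innermost loop, the caller can invoke doUpload synchronously and still have it return early.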

@cjwagner
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 20, 2021
@k8s-ci-robot k8s-ci-robot added area/prow/gcsupload Issues or PRs related to prow's gcsupload component area/prow/initupload Issues or PRs related to prow's initupload component labels Apr 23, 2021
@alvaroaleman
Member

alvaroaleman commented Apr 23, 2021

@cjwagner is this change really addressing the issue in question? From your original comment:

> I think the comment describing the race is misleading and the logic could be improved here. In particular we don't actually try to perform the upload twice if an interrupt is received, we just immediately begin the upload then terminate. I would expect sidecar to immediately begin the upload, but then check/wait for the marker files to be written, and reupload when the markers are present

Member

@cjwagner cjwagner left a comment

Looks good to me except for one thread safety issue.

buildLogs := logReaders(entries)
metadata := combineMetadata(entries)
return failures, o.doUpload(spec, passed, aborted, metadata, buildLogs)
ctx = context.Background()
Member

I don't think changing the value of ctx here is threadsafe. Please use context.Background() directly or define a new variable.

Contributor Author

Fixed this.
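
The concern can be illustrated with a small sketch, assuming some goroutine still reads the `ctx` variable while the main path wants a fresh context. The names `demo` and `mainCtx` are hypothetical. Assigning to `ctx` would write a variable that another goroutine reads without synchronization; declaring a new variable leaves `ctx` untouched.

```go
package main

import (
	"context"
	"fmt"
)

// demo starts a goroutine that reads ctx, then obtains a separate context
// for the main upload without reassigning ctx.
func demo() (watcherStopped bool, mainErr error) {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done() // this goroutine reads the shared ctx variable
	}()

	// Unsafe alternative (not done here): ctx = context.Background()
	// would write the variable that the goroutine above reads.
	mainCtx := context.Background() // a new variable leaves ctx untouched

	cancel()
	<-done
	return ctx.Err() != nil, mainCtx.Err()
}

func main() {
	stopped, err := demo()
	fmt.Println(stopped, err)
}
```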

@cjwagner
Member

> @cjwagner is this change really addressing the issue in question?

Yes, I believe so. It isn't very clear from the PR title/description, but the main change in this PR is the fix to the best-effort upload so that it actually does what the comment described. The PR also makes the interrupt-ignoring behavior configurable, which may not be useful if the main fix solves the problem, but it shouldn't hurt.

Add a field `upload_ignores_interrupts` to decoration config spec to configure ignore interrupts while performing
upload. It also tries to perform best-effort upload when `upload_ignores_interrupts` is set to `false` and an interrupt
is received. While performing best-effort upload, if main upload is ready it interrupts best-effort upload and switches
to main upload.

Signed-off-by: Vivek Singh <[email protected]>
Member

@cjwagner cjwagner left a comment

Thank you for the contribution, Vivek!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 26, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cjwagner, viveksyngh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 26, 2021
@k8s-ci-robot k8s-ci-robot merged commit 1982976 into kubernetes:master Apr 26, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Apr 26, 2021
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/prow/gcsupload: Issues or PRs related to prow's gcsupload component
  • area/prow/initupload: Issues or PRs related to prow's initupload component
  • area/prow/pod-utilities: Issues or PRs related to prow's pod-utilities component
  • area/prow/sidecar: Issues or PRs related to prow's sidecar component
  • area/prow: Issues or PRs related to prow
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Sidecar doesn't properly handle best effort upload and races with entrypoint.
4 participants