Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sidecar doesn't properly handle best effort upload and races with entrypoint. #21167

Closed
cjwagner opened this issue Mar 4, 2021 · 9 comments · Fixed by #21644
Closed

Sidecar doesn't properly handle best effort upload and races with entrypoint. #21167

cjwagner opened this issue Mar 4, 2021 · 9 comments · Fixed by #21644
Assignees
Labels
area/prow/pod-utilities Issues or PRs related to prow's pod-utilities component area/prow Issues or PRs related to prow help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Milestone

Comments

@cjwagner
Copy link
Member

cjwagner commented Mar 4, 2021

We've recently had multiple presubmits report failure in spyglass and finished.json even though they were actually aborted due to more recent commits being pushed. The ProwJob resource disagrees with finished.json and properly indicates the aborted state.

I think we are encountering this race:

// If we are being asked to terminate by the kubelet but we have
// NOT seen the test process exit cleanly, we need a to start
// uploading artifacts to GCS immediately. If we notice the process
// exit while doing this best-effort upload, we can race with the
// second upload but we can tolerate this as we'd rather get SOME
// data into GCS than attempt to cancel these uploads and get none.

This option is designed to help mitigate when used in conjunction with an appropriate graceperiod timeout:

// IgnoreInterrupts instructs the waiting process to ignore interrupt
// signals. An interrupt signal is sent to this process when the kubelet
// decides to delete the test Pod. This may be as a result of:
// - the ProwJob exceeding the `default_job_timeout` as configured for Prow
// - the ProwJob exceeding the `timeout` as configured for the job itself
// - the Pod exceeding the `pod_running_timeout` as configured for Prow
// - cluster instability causing the Pod to be evicted
//
// When this happens, the `entrypoint` process also gets the signal, and
// forwards it to the process under test. `entrypoint` will wait for the
// test process to exit, either configured with:
// - `grace_period` in the default decoration configurations for Prow
// - `grace_period` in the job's specific configuration
// After the grace period, `entrypoint` will forcefully terminate the test
// process and signal to `sidecar` that the process has exited.
//
// In parallel, the kubelet will be waiting on the Pod's `terminationGracePeriod`
// after sending the interrupt signal, at which point the kubelet will forcefully
// terminate all containers in the Pod.
//
// If `ignore_interrupts` is set, `sidecar` will do nothing upon receipt of
// the interrupt signal; this implicitly means that upload of logs and artifacts
// will begin when the test process exits, which may be as long as the grace
// period if the test process does not gracefully handle interrupts. This will
// require that the user configures the Pod's termination grace period to be
// longer than the `entrypoint` grace period for the test process and the time
// taken by `sidecar` to upload all relevant artifacts.
IgnoreInterrupts bool `json:"ignore_interrupts,omitempty"`

However, the option is always set to false, we don't have a way to configure it:

sidecar, err := Sidecar(pj.Spec.DecorationConfig, blobStorageOptions, blobStorageMounts, logMount, outputMount, encodedJobSpec, !RequirePassingEntries, !IgnoreInterrupts, wrappers...)


I think the comment describing the race is misleading and the logic could be improved here. In particular we don't actually try to perform the upload twice if an interrupt is received, we just immediately begin the upload then terminate. I would expect sidecar to immediately begin the upload, but then check/wait for the marker files to be written, and reupload when the markers are present.
Based on the comment this sounds like what was actually intended.

/help

@cjwagner cjwagner added the kind/bug Categorizes issue or PR as related to a bug. label Mar 4, 2021
@k8s-ci-robot
Copy link
Contributor

@cjwagner:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

We've recently had multiple presubmits report failure in spyglass and finished.json even though they were actually aborted due to more recent commits being pushed. The ProwJob resource disagrees with finished.json and properly indicates the aborted state.

I think we are encountering this race:

// If we are being asked to terminate by the kubelet but we have
// NOT seen the test process exit cleanly, we need a to start
// uploading artifacts to GCS immediately. If we notice the process
// exit while doing this best-effort upload, we can race with the
// second upload but we can tolerate this as we'd rather get SOME
// data into GCS than attempt to cancel these uploads and get none.

This option is designed to help mitigate when used in conjunction with an appropriate graceperiod timeout:

// IgnoreInterrupts instructs the waiting process to ignore interrupt
// signals. An interrupt signal is sent to this process when the kubelet
// decides to delete the test Pod. This may be as a result of:
// - the ProwJob exceeding the `default_job_timeout` as configured for Prow
// - the ProwJob exceeding the `timeout` as configured for the job itself
// - the Pod exceeding the `pod_running_timeout` as configured for Prow
// - cluster instability causing the Pod to be evicted
//
// When this happens, the `entrypoint` process also gets the signal, and
// forwards it to the process under test. `entrypoint` will wait for the
// test process to exit, either configured with:
// - `grace_period` in the default decoration configurations for Prow
// - `grace_period` in the job's specific configuration
// After the grace period, `entrypoint` will forcefully terminate the test
// process and signal to `sidecar` that the process has exited.
//
// In parallel, the kubelet will be waiting on the Pod's `terminationGracePeriod`
// after sending the interrupt signal, at which point the kubelet will forcefully
// terminate all containers in the Pod.
//
// If `ignore_interrupts` is set, `sidecar` will do nothing upon receipt of
// the interrupt signal; this implicitly means that upload of logs and artifacts
// will begin when the test process exits, which may be as long as the grace
// period if the test process does not gracefully handle interrupts. This will
// require that the user configures the Pod's termination grace period to be
// longer than the `entrypoint` grace period for the test process and the time
// taken by `sidecar` to upload all relevant artifacts.
IgnoreInterrupts bool `json:"ignore_interrupts,omitempty"`

However, the option is always set to false, we don't have a way to configure it:

sidecar, err := Sidecar(pj.Spec.DecorationConfig, blobStorageOptions, blobStorageMounts, logMount, outputMount, encodedJobSpec, !RequirePassingEntries, !IgnoreInterrupts, wrappers...)


I think the comment describing the race is misleading and the logic could be improved here. In particular we don't actually try to perform the upload twice if an interrupt is received, we just immediately begin the upload then terminate. I would expect sidecar to immediately begin the upload, but then check/wait for the marker files to be written, and reupload when the markers are present.
Based on the comment this sounds like what was actually intended.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 4, 2021
@BenTheElder BenTheElder added area/prow Issues or PRs related to prow area/prow/pod-utilities Issues or PRs related to prow's pod-utilities component sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 5, 2021
@viveksyngh
Copy link
Contributor

I would like to work on this issue

@stevekuznetsov
Copy link
Contributor

I think that's what the behavior did when I originally wrote it @cjwagner lol. Perhaps that's been broken since. Unfortunately hard to e2e test this component.

@stevekuznetsov
Copy link
Contributor

@viveksyngh sure thing - shoot me an /assign and I'll review when you are ready

@stevekuznetsov
Copy link
Contributor

Actually @cjwagner as of #20923 we should intelligently set the Pod's terminationGracePeriodSeconds to be larger than the entrypoint's grace period, so perhaps it would be safe to default to ignoring interrupts if we think that the default buffer we add is enough to do the upload.

@cjwagner
Copy link
Member Author

so perhaps it would be safe to default to ignoring interrupts if we think that the default buffer we add is enough to do the upload.

That would be one way to mitigate this problem and it would probably be sufficient most of the time, but I don't think it is as robust as doing a best effort upload then continuing to wait and try the real upload. The double upload seems strictly better since it ensures that we upload something even if the main upload takes too long to complete.

@viveksyngh
Copy link
Contributor

/assign

@viveksyngh
Copy link
Contributor

I have opened a draft MR to #21644 to address this, where I am adding a field in DecorationConfig to allow configuring IgnoreInterrupts for sidecar. Please let me know if I am moving in right direction here.

@spiffxp
Copy link
Member

spiffxp commented Jul 13, 2021

/milestone v1.22

@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Jul 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/prow/pod-utilities Issues or PRs related to prow's pod-utilities component area/prow Issues or PRs related to prow help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants