Sidecar doesn't properly handle best effort upload and races with entrypoint. #21167

cjwagner · 2021-03-04T23:28:46Z

We've recently had multiple presubmits report failure in spyglass and finished.json even though they were actually aborted due to more recent commits being pushed. The ProwJob resource disagrees with finished.json and properly indicates the aborted state.

I think we are encountering this race:

test-infra/prow/sidecar/run.go

Lines 90 to 95 in 2a5710e

    
    // If we are being asked to terminate by the kubelet but we have
    // NOT seen the test process exit cleanly, we need a to start
    // uploading artifacts to GCS immediately. If we notice the process
    // exit while doing this best-effort upload, we can race with the
    // second upload but we can tolerate this as we'd rather get SOME
    // data into GCS than attempt to cancel these uploads and get none.

This option is designed to help mitigate the race when used in conjunction with an appropriate grace period timeout:

test-infra/prow/sidecar/options.go

Lines 50 to 77 in 2a5710e

    
    // IgnoreInterrupts instructs the waiting process to ignore interrupt
    // signals. An interrupt signal is sent to this process when the kubelet
    // decides to delete the test Pod. This may be as a result of:
    //  - the ProwJob exceeding the `default_job_timeout` as configured for Prow
    //  - the ProwJob exceeding the `timeout` as configured for the job itself
    //  - the Pod exceeding the `pod_running_timeout` as configured for Prow
    //  - cluster instability causing the Pod to be evicted
    //
    // When this happens, the `entrypoint` process also gets the signal, and
    // forwards it to the process under test. `entrypoint` will wait for the
    // test process to exit, either configured with:
    //  - `grace_period` in the default decoration configurations for Prow
    //  - `grace_period` in the job's specific configuration
    // After the grace period, `entrypoint` will forcefully terminate the test
    // process and signal to `sidecar` that the process has exited.
    //
    // In parallel, the kubelet will be waiting on the Pod's `terminationGracePeriod`
    // after sending the interrupt signal, at which point the kubelet will forcefully
    // terminate all containers in the Pod.
    //
    // If `ignore_interrupts` is set, `sidecar` will do nothing upon receipt of
    // the interrupt signal; this implicitly means that upload of logs and artifacts
    // will begin when the test process exits, which may be as long as the grace
    // period if the test process does not gracefully handle interrupts. This will
    // require that the user configures the Pod's termination grace period to be
    // longer than the `entrypoint` grace period for the test process and the time
    // taken by `sidecar` to upload all relevant artifacts.
    IgnoreInterrupts bool `json:"ignore_interrupts,omitempty"`

However, the option is always set to false and we don't have a way to configure it:

test-infra/prow/pod-utils/decorate/podspec.go

Line 723 in 276f55e

    
    sidecar, err := Sidecar(pj.Spec.DecorationConfig, blobStorageOptions, blobStorageMounts, logMount, outputMount, encodedJobSpec, !RequirePassingEntries, !IgnoreInterrupts, wrappers...)

I think the comment describing the race is misleading and the logic could be improved here. In particular, we don't actually try to perform the upload twice if an interrupt is received; we just immediately begin the upload and then terminate. I would expect sidecar to immediately begin the upload, but then check/wait for the marker files to be written, and re-upload when the markers are present.
Based on the comment, this sounds like what was actually intended.

/help

k8s-ci-robot · 2021-03-04T23:28:47Z

@cjwagner:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

viveksyngh · 2021-03-25T17:16:51Z

I would like to work on this issue

stevekuznetsov · 2021-03-25T17:20:58Z

I think that's what the behavior did when I originally wrote it @cjwagner lol. Perhaps that's been broken since. Unfortunately hard to e2e test this component.

stevekuznetsov · 2021-03-25T17:21:17Z

@viveksyngh sure thing - shoot me an /assign and I'll review when you are ready

stevekuznetsov · 2021-03-25T17:23:27Z

Actually @cjwagner as of #20923 we should intelligently set the Pod's terminationGracePeriodSeconds to be larger than the entrypoint's grace period, so perhaps it would be safe to default to ignoring interrupts if we think that the default buffer we add is enough to do the upload.

cjwagner · 2021-03-25T18:10:02Z

> so perhaps it would be safe to default to ignoring interrupts if we think that the default buffer we add is enough to do the upload.

That would be one way to mitigate this problem and it would probably be sufficient most of the time, but I don't think it is as robust as doing a best effort upload then continuing to wait and try the real upload. The double upload seems strictly better since it ensures that we upload something even if the main upload takes too long to complete.

viveksyngh · 2021-03-25T19:00:52Z

/assign

viveksyngh · 2021-04-04T06:02:50Z

I have opened a draft PR, #21644, to address this, in which I am adding a field to DecorationConfig to allow configuring IgnoreInterrupts for sidecar. Please let me know if I am moving in the right direction here.
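
For readers unfamiliar with how such an option is usually plumbed through, here is a minimal sketch of an optional boolean that a job-level config can override and that falls back to a global default. The `decorationConfig` type below is a hypothetical, trimmed-down stand-in, not Prow's real DecorationConfig, and reusing the `ignore_interrupts` key from the sidecar options is an assumption; the field added in #21644 may be named differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decorationConfig is a hypothetical stand-in with a single optional field.
// A nil pointer means "not set", which lets a default apply.
type decorationConfig struct {
	IgnoreInterrupts *bool `json:"ignore_interrupts,omitempty"`
}

// parseDecorationConfig decodes a JSON document into a decorationConfig.
func parseDecorationConfig(raw string) (*decorationConfig, error) {
	var d decorationConfig
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		return nil, fmt.Errorf("parse decoration config: %w", err)
	}
	return &d, nil
}

// resolveIgnoreInterrupts prefers the job-level value, then the default,
// and falls back to false (the behavior described in this issue).
func resolveIgnoreInterrupts(job, def *decorationConfig) bool {
	if job != nil && job.IgnoreInterrupts != nil {
		return *job.IgnoreInterrupts
	}
	if def != nil && def.IgnoreInterrupts != nil {
		return *def.IgnoreInterrupts
	}
	return false
}

func main() {
	def, err := parseDecorationConfig(`{"ignore_interrupts": true}`)
	if err != nil {
		fmt.Println(err)
		return
	}
	job, err := parseDecorationConfig(`{}`)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("ignore interrupts:", resolveIgnoreInterrupts(job, def))
}
```

Using a pointer rather than a plain bool is what distinguishes "explicitly false" from "unset", so a job can opt out even when the default opts in.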

spiffxp · 2021-07-13T16:32:51Z

/milestone v1.22

cjwagner added the kind/bug label Mar 4, 2021

k8s-ci-robot added the help wanted label Mar 4, 2021

BenTheElder added the area/prow, area/prow/pod-utilities, and sig/testing labels Mar 5, 2021

k8s-ci-robot assigned viveksyngh Mar 25, 2021

viveksyngh mentioned this issue Apr 7, 2021

prow: sidecar: allow configuring ignore interrupts #21644

Merged

k8s-ci-robot closed this as completed in #21644 Apr 26, 2021

k8s-ci-robot added this to the v1.22 milestone Jul 13, 2021
