[supervisor] set pod failure reason when supervisor is reaped #20318

Merged: 4 commits, Oct 25, 2024

mustard-mh (Contributor) commented Oct 24, 2024

Description

  • Revert debug commit 1b6bde8 /hold

Root problem of CLC-877:

There can be a race between reaper.Reap() and runCommand.Wait(). Our logic handles the result of runCommand.Wait() well, but once the process has been reaped by reaper.Reap() we know nothing about how it exited (for example, its exitCode). The pod then exits without a message or a proper exitCode and reaches this line, which sets the pod failure reason:

return fmt.Sprintf("container %s completed; containers of a workspace pod are not supposed to do that", cs.Name), nil

We created a fork, gitpod-io/go-reaper#1, and temporarily added a function ourselves. Once upstream releases https://github.com/ramr/go-reaper/tree/notifier, we need a follow-up PR to switch back.

Related Issue(s)

Fixes CLC-877

How to test

Note

Please use https://github.com/gitpod-io/empty to open workspaces; this saves space in the preview env

  • With debug commit applied 1b6bde8
  • Start X workspaces with any JetBrains IDE and wait 2 more minutes. The workspace failure reason should always be "xxx timed out to start after 2 minutes". We don't know exactly when the race will happen, but it is very frequent in this case, so ten attempts should be enough (you can start 4 workspaces at the same time)
  • Regular workspace stop should work like before

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

/hold

@mustard-mh mustard-mh changed the title [supervisor] set pod failed reason when supervisor been reap [supervisor] set pod failure reason when supervisor is reaped Oct 24, 2024

socket-security bot commented Oct 24, 2024

New and removed dependencies detected. Learn more about Socket for GitHub ↗︎

Package New capabilities Transitives Size Publisher
golang/github.com/gitpod-io/[email protected] None 0 9.19 kB

🚮 Removed packages: golang/github.com/ramr/[email protected]

Member

@filiptronicek filiptronicek left a comment

Code looks good, a brief suite of tests also didn't turn up anything unexpected. Left one relevant comment.

Nice work!

select {
case <-ctx.Done(): // timeout
case exitCode := <-handledByReaper:
	handleSupervisorExit(exitCode)
Member

Sweet!

components/supervisor/cmd/init.go
@mustard-mh
Contributor Author

mustard-mh commented Oct 25, 2024

@filiptronicek Thanks for your review! I will revert debug commit later today and 🚢

			}
			log.WithError(err).Error("supervisor run error")
			return
		}
	}()
	// start the reaper to clean up zombie processes
	reaper.Reap()
	reaper.Start(reaper.Config{
		Pid: -1,
Member

What does this config mean? Would it make sense to add a comment for that?

Contributor Author

Member

@geropl geropl left a comment

Awesome find @mustard-mh ! 🚀 ✨

Test worked 10/10 ✔️

This reverts commit 1b6bde8.
@mustard-mh
Contributor Author

Debug commit reverted; smoke-tested that a JetBrains IDE still starts after the revert. /unhold

@roboquat roboquat merged commit 9e8da3b into main Oct 25, 2024
16 checks passed
@roboquat roboquat deleted the hw/CLC-877 branch October 25, 2024 13:04
@mustard-mh mustard-mh restored the hw/CLC-877 branch October 28, 2024 06:41
kylos101 added a commit that referenced this pull request Oct 28, 2024
roboquat pushed a commit that referenced this pull request Oct 29, 2024
… run error with unexpected exit code` (#20325)

* Revert "[supervisor] switch lib back to use upstream `ramr/go-reaper` (#20322)"

This reverts commit 9442b52.

* Revert "[supervisor] set pod failure reason when supervisor is reaped (#20318)"

This reverts commit 9e8da3b.