test(controller): race condition in handleSigterm()#5821

Open
vflaux wants to merge 1 commit intokubernetes-sigs:masterfrom
vflaux:fix_sigterm_handler
Conversation

@vflaux
Contributor

@vflaux vflaux commented Sep 8, 2025

What does it do?

Fix a race condition between the signal handler setup and the signal send in the test for handleSigterm().

In the test, in rare cases, the SIGTERM signal could be sent before handleSigterm() calls signal.Notify().

In this case, the cancel function is never called.

The test always passes because the select case sig := <-sigChan silently masks this.

I think we need to test both that the signal received is a SIGTERM and that the cancel func has been called, not just one of the two conditions.

Motivation

This primarily concerns testing and ensuring correctness. There is most likely no issue when the controller runs.

I could reproduce this with go test -timeout 30s -count 1000 -run ^TestHandleSigterm$ sigs.k8s.io/external-dns/controller after removing the sig := <-sigChan select case in the current code.
(edit: test was removed in #5816)

Details
func TestHandleSigterm(t *testing.T) {
	cancelCalled := make(chan bool, 1)
	cancel := func() {
		cancelCalled <- true
	}

	var logOutput bytes.Buffer
	log.SetOutput(&logOutput)
	defer log.SetOutput(os.Stderr)

	go handleSigterm(cancel)

	// Simulate sending a SIGTERM signal
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGTERM)
	err := syscall.Kill(syscall.Getpid(), syscall.SIGTERM)
	assert.NoError(t, err)

	// Wait for the cancel function to be called
	select {
	case <-cancelCalled:
		assert.Contains(t, logOutput.String(), "Received SIGTERM. Terminating...")
	// case sig := <-sigChan:
	// 	assert.Equal(t, syscall.SIGTERM, sig)
	case <-time.After(1 * time.Second):
		t.Fatal("cancel function was not called")
	}
}
$ go test -timeout 30s -count 1000 -run ^TestHandleSigterm$ sigs.k8s.io/external-dns/controller
--- FAIL: TestHandleSigterm (1.00s)
    execute_test.go:350: cancel function was not called
--- FAIL: TestHandleSigterm (1.00s)
    execute_test.go:350: cancel function was not called
--- FAIL: TestHandleSigterm (1.00s)
    execute_test.go:350: cancel function was not called
--- FAIL: TestHandleSigterm (1.00s)
    execute_test.go:350: cancel function was not called
time="2026-03-28T04:55:43+01:00" level=info msg="Received SIGTERM. Terminating..."
--- FAIL: TestHandleSigterm (1.00s)
    execute_test.go:350: cancel function was not called
--- FAIL: TestHandleSigterm (1.00s)
    execute_test.go:350: cancel function was not called
time="2026-03-28T04:55:45+01:00" level=info msg="Received SIGTERM. Terminating..."
FAIL
FAIL    sigs.k8s.io/external-dns/controller     6.352s
FAIL

After this patch:

$ go test -timeout 30s -count 1000 -run ^TestHandleSigterm$ sigs.k8s.io/external-dns/controller
ok      sigs.k8s.io/external-dns/controller     0.032s [no tests to run]

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 8, 2025
@k8s-ci-robot k8s-ci-robot added the controller Issues or PRs related to the controller label Sep 8, 2025
@k8s-ci-robot k8s-ci-robot requested a review from szuecs September 8, 2025 13:12
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 8, 2025
@k8s-ci-robot
Contributor

Hi @vflaux. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 8, 2025
@vflaux vflaux force-pushed the fix_sigterm_handler branch from 09a2b7c to ad9a3b1 Compare September 8, 2025 13:36
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 8, 2025
@vflaux vflaux force-pushed the fix_sigterm_handler branch from ad9a3b1 to b1c4916 Compare September 8, 2025 13:39
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 8, 2025
@mloiseleur mloiseleur changed the title fix(controller): race condition in handleSigterm() test test(controller): race condition in handleSigterm() Sep 8, 2025
@vflaux vflaux force-pushed the fix_sigterm_handler branch from b1c4916 to 04b6ada Compare September 19, 2025 15:23
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 19, 2025
@vflaux vflaux force-pushed the fix_sigterm_handler branch from 04b6ada to 5f58b71 Compare December 16, 2025 14:39
@coveralls

coveralls commented Dec 16, 2025

Pull Request Test Coverage Report for Build 23676867886

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 10 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.007%) to 78.215%

Files with Coverage Reduction New Missed Lines %
execute.go 10 76.64%
Totals Coverage Status
Change from base Build 23660098847: 0.007%
Covered Lines: 16408
Relevant Lines: 20978

💛 - Coveralls

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 16, 2026
@vflaux vflaux force-pushed the fix_sigterm_handler branch from 5f58b71 to 781df55 Compare March 28, 2026 03:41
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 28, 2026
Fix a race condition between the signal handler setup and the signal
send in the test for handleSigterm()
@vflaux vflaux force-pushed the fix_sigterm_handler branch from 781df55 to 60ab953 Compare March 28, 2026 03:51
@vflaux vflaux marked this pull request as ready for review March 28, 2026 04:03
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 28, 2026
@ivankatliarchuk
Member

Need to better understand the benefits.

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 30, 2026
Member

@ivankatliarchuk ivankatliarchuk left a comment

It might be worth executing the binary and terminating it to validate the behaviour. As for sending SIGTERM this way in the test, I'm not sure.

case sig := <-sigChan:
assert.Equal(t, syscall.SIGTERM, sig)
case <-time.After(5 * time.Second):
t.Fatal("signal was not recieved")
Member

Suggested change
t.Fatal("signal was not recieved")
t.Fatal("signal was not received")

Comment on lines +142 to +147
select {
case sig := <-sigChan:
assert.Equal(t, syscall.SIGTERM, sig)
case <-time.After(5 * time.Second):
t.Fatal("signal was not recieved")
}
Member

case sig := <-sigChan:
      assert.Equal(t, syscall.SIGTERM, sig)  // Always true if sigChan fires

Comment on lines +132 to +133
defer signal.Reset(syscall.SIGTERM)
setupSigtermHandler(cancel)
Member

Goroutine leak in tests: The goroutine spawned by setupSigtermHandler is blocked on <-signals. After defer signal.Reset(syscall.SIGTERM) runs, nothing will ever send on that channel, so the goroutine leaks for the duration of the test binary.
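
One possible way to avoid the leak, sketched under the assumption that the handler may change its signature: return a stop function that unblocks the goroutine. The name setupStoppableHandler and the returned exited channel are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// setupStoppableHandler is a hypothetical variant of the handler.
// It returns a stop function that releases the goroutine without a signal,
// and a channel that is closed once the goroutine has exited.
func setupStoppableHandler(cancel func()) (stop func(), exited <-chan struct{}) {
	signals := make(chan os.Signal, 1)
	done := make(chan struct{})
	finished := make(chan struct{})
	signal.Notify(signals, syscall.SIGTERM)

	go func() {
		defer close(finished)
		defer signal.Stop(signals) // always deregister before exiting
		select {
		case <-signals:
			cancel()
		case <-done:
			// Stopped by the caller; no signal was received.
		}
	}()

	var once sync.Once
	stop = func() { once.Do(func() { close(done) }) }
	return stop, finished
}

func main() {
	called := false
	stop, exited := setupStoppableHandler(func() { called = true })

	stop()   // e.g. via defer in a test
	<-exited // the goroutine is gone, nothing leaks
	fmt.Println("goroutine exited, cancel called:", called)
}
```

In a test, `defer stop()` would take the place of `defer signal.Reset(syscall.SIGTERM)`, because the goroutine deregisters its own channel on the way out.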

}
}

func TestSetupSigtermHandler(t *testing.T) {
Member

maybe similar, hard to say

func TestSetupSigtermHandler(t *testing.T) {
      cancelCalled := make(chan bool, 1)
      cancel := func() { cancelCalled <- true }

      hook := logtest.LogsUnderTestWithLogLevel(log.InfoLevel, t)

      defer signal.Reset(syscall.SIGTERM)
      setupSigtermHandler(cancel)

      err := syscall.Kill(syscall.Getpid(), syscall.SIGTERM)
      assert.NoError(t, err)

      select {
      case <-cancelCalled:
          logtest.TestHelperLogContainsWithLogLevel("Received SIGTERM. Terminating...", log.InfoLevel, hook, t)
      case <-time.After(5 * time.Second):
          t.Fatal("cancel function was not called")
      }
}

<-signals
log.Info("Received SIGTERM. Terminating...")
cancel()
go func() {
Member

@ivankatliarchuk ivankatliarchuk Mar 30, 2026

The goroutine leaks a registration with the signal package. After receiving, we should deregister:

  go func() {
      <-signals
      signal.Stop(signals)
      log.Info("Received SIGTERM. Terminating...")
      cancel()
  }()

// setupSigtermHandler start a routine that listens for a SIGTERM signal and triggers the provided cancel function
// to gracefully terminate the application. It logs a message when the signal is received.
func handleSigterm(cancel func()) {
func setupSigtermHandler(cancel func()) {
Member

@ivankatliarchuk ivankatliarchuk Mar 30, 2026

Can we not simply replace the whole logic with signal.NotifyContext?

in Execute()

// Before
  ctx, cancel := context.WithCancel(context.Background())
  setupSigtermHandler(cancel)

// After
 ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
 defer stop()

Member

@ivankatliarchuk ivankatliarchuk Mar 30, 2026

+

context.AfterFunc(ctx, func() {
      log.Info("Received SIGTERM. Terminating...")
  })

context.AfterFunc runs the callback in its own goroutine once the context is cancelled (for any reason — signal or otherwise), so the log line is preserved without a custom handler.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ivankatliarchuk. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. controller Issues or PRs related to the controller lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants