Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/agent/TestStartStopRace deadlocks #2601

Closed
yurishkuro opened this issue Oct 29, 2020 · 4 comments · Fixed by #2608
Closed

cmd/agent/TestStartStopRace deadlocks #2601

yurishkuro opened this issue Oct 29, 2020 · 4 comments · Fixed by #2608
Labels
bug help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@yurishkuro
Copy link
Member

Describe the bug

The test deadlocks under travis-ci.com https://travis-ci.com/github/jaegertracing/jaeger/jobs/417602049#L486.

To Reproduce
Only seems to happen in Travis. The call to agent.Stop() deadlocks:

agent.Stop()

More specifically, processor.Stop() deadlocks:

func (s *ThriftProcessor) Stop() {
stopwatch := metrics.StartStopwatch(s.metrics.ProcessorCloseTimer)
s.server.Stop()
s.processing.Wait()
stopwatch.Stop()
}

It appears that the processor does not get a change to start in the goroutine before processor.Stop() is called from the test.

Panic dump:

=== RUN   TestStartStopRace
    logger.go:130: 2020-10-29T04:13:48.871Z	INFO	shutting down agent's HTTP server	{"addr": "[::]:5778"}
    logger.go:130: 2020-10-29T04:13:48.872Z	INFO	Starting jaeger-agent HTTP server	{"http-port": 5778}
    logger.go:130: 2020-10-29T04:13:48.873Z	INFO	agent's http server exiting
coverage: 59.1% of statements
panic: test timed out after 5m0s
goroutine 767 [running]:
testing.(*M).startAlarm.func1()
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1618 +0x11f
created by time.goFunc
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/time/sleep.go:167 +0x52
goroutine 1 [chan receive, 4 minutes]:
testing.(*T).Run(0xc000162300, 0x12376ad, 0x11, 0x126dee0, 0x1)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1169 +0x5f4
testing.runTests.func1(0xc000162300)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1439 +0xa7
testing.tRunner(0xc000162300, 0xc00010fc78)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1123 +0x203
testing.runTests(0xc000164ee0, 0x18a7700, 0xc, 0xc, 0xbfdeae88f2495920, 0x45da9bd743, 0x18b4c80, 0xc00017bba8)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1437 +0x613
testing.(*M).Run(0xc0000fcc80, 0x0)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1345 +0x3b4
main.main()
	_testmain.go:119 +0x357
goroutine 99 [semacquire, 4 minutes]:
sync.runtime_Semacquire(0xc000207ca8)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc000207ca0)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/sync/waitgroup.go:130 +0xe5
github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).Stop(0xc000207c70)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:109 +0x125
github.com/jaegertracing/jaeger/cmd/agent/app.(*Agent).Stop(0xc000201400)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent.go:98 +0x37f
github.com/jaegertracing/jaeger/cmd/agent/app.TestStartStopRace(0xc0001b1b00)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent_test.go:184 +0x6b7
testing.tRunner(0xc0001b1b00, 0x126dee0)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1123 +0x203
created by testing.(*T).Run
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/testing/testing.go:1168 +0x5bc
goroutine 90 [chan receive, 4 minutes]:
github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer(0xc000207c70)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:116 +0x126
github.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2(0xc000207c70)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87 +0x39
created by github.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:86 +0x32f
goroutine 94 [runnable]:
github.com/jaegertracing/jaeger/cmd/agent/app/servers.NewTBufferedServer.func1(0xc0001d7b00, 0x0)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/servers/tbuffered_server.go:73 +0x76
sync.(*Pool).Get(0xc0001d7b00, 0x1, 0xfde8)
	/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/sync/pool.go:148 +0xe2
github.com/jaegertracing/jaeger/cmd/agent/app/servers.(*TBufferedServer).Serve(0xc0001a3050)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/servers/tbuffered_server.go:91 +0x8d
github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).Serve(0xc000207c70)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:96 +0x51
created by github.com/jaegertracing/jaeger/cmd/agent/app.(*Agent).Run
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/agent.go:76 +0x271
FAIL	github.com/jaegertracing/jaeger/cmd/agent/app	300.044s
@yurishkuro yurishkuro added the bug label Oct 29, 2020
@yurishkuro yurishkuro added help wanted Features that maintainers are willing to accept but do not have cycles to implement and removed needs-triage labels Oct 29, 2020
@yurishkuro
Copy link
Member Author

@chlunde
Copy link
Contributor

chlunde commented Oct 30, 2020

Timing wise, could this have been introduced by #2583 ? 🤔

chlunde added a commit to chlunde/jaeger that referenced this issue Oct 30, 2020
Make a tiny state machine to detect the transition init -> stopped (not via serving)

Fixes jaegertracing#2601
Related jaegertracing#103

Signed-off-by: Carl Henrik Lunde <[email protected]>
@chlunde
Copy link
Contributor

chlunde commented Oct 30, 2020

Slightly related, the tests in app does not support -test.count=100 (to verify that flaky tests are gone), due to port reuse/missing .Stop.

A workaround is to use random ports, but I had hoped stop would work for this test now:

diff --git a/cmd/agent/app/builder_test.go b/cmd/agent/app/builder_test.go
index 49c2e9d..bb279c5 100644
--- a/cmd/agent/app/builder_test.go
+++ b/cmd/agent/app/builder_test.go
@@ -288,8 +288,10 @@ func TestPublishOpts(t *testing.T) {
        command.PersistentFlags().AddGoFlagSet(flags)
        v.BindPFlags(command.PersistentFlags())
        err := command.ParseFlags([]string{
-               "--http-server.host-port=:8080",
-               "--processor.jaeger-binary.server-host-port=:1111",
+               "--http-server.host-port=:0",
+               "--processor.zipkin-compact.server-host-port=:0",
+               "--processor.jaeger-binary.server-host-port=:0",
+               "--processor.jaeger-compact.server-host-port=:0",
                "--processor.jaeger-binary.server-max-packet-size=4242",
                "--processor.jaeger-binary.server-queue-size=24",
                "--processor.jaeger-binary.workers=42",
@@ -303,6 +305,7 @@ func TestPublishOpts(t *testing.T) {
        agent, err := cfg.CreateAgent(fakeCollectorProxy{}, zap.NewNop(), metricsFactory)
        assert.NoError(t, err)
        assert.NotNil(t, agent)
+       //defer agent.Stop()
 
        forkFactory.AssertGaugeMetrics(t, metricstest.ExpectedMetric{
                Name:  "internal.processor.jaeger-binary.server-max-packet-size",

@yurishkuro
Copy link
Member Author

Timing wise, could this have been introduced by #2583 ?

Could be, but the deadlock I observed was with processors not starting in time, not the server.

Slightly related, the tests in app does not support -test.count=100 (to verify that flaky tests are gone), due to port reuse/missing .Stop.

Just that specific TestPublishOpts? I guess you could do

$ go test -test.count=100 -run TestStartStopRace  ./cmd/agent/app
panic: Log in goroutine after TestStartStopRace has completed

(still fails, but not because of port conflict)

yurishkuro pushed a commit that referenced this issue Oct 30, 2020
Make a tiny state machine to detect the transition init -> stopped (not via serving)

Fixes #2601
Related #103

Signed-off-by: Carl Henrik Lunde <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug help wanted Features that maintainers are willing to accept but do not have cycles to implement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants