
[Bug]: Race condition in TestSpanCollectorHTTPS #4497

Closed
yurishkuro opened this issue Jun 4, 2023 · 8 comments · Fixed by #4546
Labels
bug · good first issue (Good for beginners) · help wanted (Features that maintainers are willing to accept but do not have cycles to implement)

Comments

@yurishkuro
Member

yurishkuro commented Jun 4, 2023

What happened?

Unit tests failed with a data race.

Steps to reproduce

Unknown; the failure appears to be spurious (flaky).

Expected behavior

The test should succeed.

Relevant log output

=== RUN   TestSpanCollectorHTTPS/should_fail_with_TLS_client_to_untrusted_TLS_server
    logger.go:130: 2023-06-04T03:17:47.537Z	INFO	Starting jaeger-collector HTTP server	{"http host-port": ":14268"}
    logger.go:130: 2023-06-04T03:17:47.558Z	ERROR	http: TLS handshake error from [::1]:41006: remote error: tls: bad certificate
==================
WARNING: DATA RACE
Read at 0x00c000196863 by goroutine 88:
  testing.(*common).logDepth()
      /opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:992 +0xc4
  testing.(*common).log()
      /opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:985 +0xa4
  testing.(*common).Logf()
      /opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1036 +0x6a
  testing.(*T).Logf()
      <autogenerated>:1 +0x75
  go.uber.org/zap/zaptest.testingWriter.Write()
      /home/runner/go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0x12c
  go.uber.org/zap/zaptest.(*testingWriter).Write()
      <autogenerated>:1 +0x7e
  go.uber.org/zap/zapcore.(*ioCore).Write()
      /home/runner/go/pkg/mod/go.uber.org/[email protected]/zapcore/core.go:99 +0x199
  go.uber.org/zap/zapcore.(*CheckedEntry).Write()
      /home/runner/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:255 +0x2ce
  go.uber.org/zap.(*Logger).Error()
      /home/runner/go/pkg/mod/go.uber.org/[email protected]/logger.go:236 +0x6d
  go.uber.org/zap.(*Logger).Error-fm()
=== RUN   TestSpanCollectorHTTPS/should_pass_with_TLS_client_with_cert_to_trusted_TLS_server_requiring_cert
    logger.go:130: 2023-06-04T03:17:47.709Z	INFO	Starting jaeger-collector HTTP server	{"http host-port": ":14268"}
=== RUN   TestSpanCollectorHTTPS/should_fail_with_TLS_client_without_cert_to_trusted_TLS_server_requiring_cert_from_a_different_CA
    logger.go:130: 2023-06-04T03:17:47.788Z	INFO	Starting jaeger-collector HTTP server	{"http host-port": ":14268"}
    logger.go:130: 2023-06-04T03:17:47.832Z	ERROR	http: TLS handshake error from [::1]:41102: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Jaeger")
    logger.go:130: 2023-06-04T03:17:47.859Z	ERROR	http: TLS handshake error from [::1]:41114: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Jaeger")
=== NAME  TestSpanCollectorHTTPS
    testing.go:1446: race detected during execution of test
--- FAIL: TestSpanCollectorHTTPS (0.32s)

Screenshot

No response

Additional context

This looks similar to another issue that happened a few months ago. The race happens inside the standard testing library, related to deferred behavior. My suspicion is that we are shutting down a server running in a different goroutine (or even leaking it), and it is writing a log via zap/zaptest into testing.T after the test has finished. We may either be missing a `defer server.Close()` in the test, or `Close()` actually returns before the server goroutine has stopped.

Jaeger backend version

No response

SDK

No response

Pipeline

No response

Storage backend

No response

Operating system

No response

Deployment model

No response

Deployment configs

No response

yurishkuro added the bug, good first issue (Good for beginners), and help wanted (Features that maintainers are willing to accept but do not have cycles to implement) labels on Jun 4, 2023
@nitishchauhan0022

Hey @yurishkuro, could the race condition be due to the use of `zaptest.NewLogger(t)` combined with concurrent execution of tests from different packages that use zaptest? Apologies if this question seems trivial, but I value your perspective greatly.

@yurishkuro
Member Author

We don't run tests concurrently, and zaptest is well designed to work in a concurrent setting even if we did. I already posted my hypothesis for the race condition in the ticket: logging happens after the test has finished.

@sbdtu5498

@yurishkuro I wasn't able to replicate it on my system; the test passed fine, though I had some other tests breaking.
(screenshot attached, dated 2023-06-06)

@Player256

I also ran the test and got the same result as @sbdtu5498: all of the subtests in TestSpanCollectorHTTPS pass under the race detector.

@yurishkuro
Member Author

@Player256 this is a spurious issue; it's difficult to reproduce just by running the tests a few times. The stack dump in the issue has enough information to investigate.

@mvrilo

mvrilo commented Jun 16, 2023

Adding a -count N flag to the test run seems to trigger the issue more easily:

go test -count 50 -race ./cmd/collector/app/server/... -run TestSpanCollectorHTTPS/should_fail_with_TLS_client_to_untrusted_TLS_server
==================
WARNING: DATA RACE
Read at 0x00c000660ba3 by goroutine 972:
  testing.(*common).logDepth()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:992 +0x7c
  testing.(*common).log()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:985 +0x80
  testing.(*common).Logf()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:1036 +0x58
  testing.(*T).Logf()
      <autogenerated>:1 +0x5c
  go.uber.org/zap/zaptest.testingWriter.Write()
      /Users/mvrilo/.go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0xf4
  go.uber.org/zap/zaptest.(*testingWriter).Write()
      <autogenerated>:1 +0x64
  go.uber.org/zap/zapcore.(*ioCore).Write()
      /Users/mvrilo/.go/pkg/mod/go.uber.org/[email protected]/zapcore/core.go:99 +0x110
  go.uber.org/zap/zapcore.(*CheckedEntry).Write()
      /Users/mvrilo/.go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:255 +0x1e8
  go.uber.org/zap.(*Logger).Error()
      /Users/mvrilo/.go/pkg/mod/go.uber.org/[email protected]/logger.go:236 +0x5c
  go.uber.org/zap.(*Logger).Error-fm()
      <autogenerated>:1 +0x5c
  go.uber.org/zap.(*loggerWriter).Write()
      /Users/mvrilo/.go/pkg/mod/go.uber.org/[email protected]/global.go:167 +0x90
  log.(*Logger).Output()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/log/log.go:194 +0x344
  log.(*Logger).Printf()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/log/log.go:204 +0x84
  net/http.(*Server).logf()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/net/http/server.go:3215 +0x7c
  net/http.(*conn).serve()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/net/http/server.go:1886 +0x12b4
  net/http.(*Server).Serve.func3()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/net/http/server.go:3089 +0x4c

Previous write at 0x00c000660ba3 by goroutine 946:
  testing.tRunner.func1()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:1563 +0x570
  runtime.deferreturn()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/runtime/panic.go:476 +0x30
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:1629 +0x40

Goroutine 972 (running) created at:
  net/http.(*Server).Serve()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/net/http/server.go:3089 +0x620
  net/http.(*Server).ServeTLS()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/net/http/server.go:3129 +0x534
  github.com/jaegertracing/jaeger/cmd/collector/app/server.serveHTTP.func1()
      /Users/mvrilo/Development/jaeger/cmd/collector/app/server/http.go:105 +0x7c

Goroutine 946 (running) created at:
  testing.(*T).Run()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:1629 +0x5e4
  github.com/jaegertracing/jaeger/cmd/collector/app/server.TestSpanCollectorHTTPS()
      /Users/mvrilo/Development/jaeger/cmd/collector/app/server/http_test.go:204 +0x628
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:1576 +0x188
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/testing.go:1629 +0x40
==================
--- FAIL: TestSpanCollectorHTTPS (0.05s)
    logger.go:130: 2023-06-16T11:02:35.753-0300 ERROR   http: TLS handshake error from [::1]:61636: read tcp [::1]:14268->[::1]:61636: use of closed network connection
    testing.go:1446: race detected during execution of test
FAIL
FAIL    github.com/jaegertracing/jaeger/cmd/collector/app/server        3.350s
FAIL

@michaelnadar

Hi, I'm new here. Can anyone help me start contributing to this project? I'm good at backend.

yurishkuro added a commit that referenced this issue Jun 22, 2023
## Which problem is this PR solving?
- Resolves #4497

## Short description of the changes
- Do not use `zaptest.NewLogger(t)` because it causes the race condition
shown in the ticket when the server goroutine tries to log something
that is forwarded to `testing.T` while the test is being shut down
due to a panic.
- I was not able to get to the root cause of why this happens, since the
test properly shuts down the server. This may indicate an issue in the
testing package itself in how it handles panics.

Signed-off-by: Yuri Shkuro <[email protected]>
@yurishkuro
Member Author

@mvrilo thanks for the analysis. I put in a bandaid fix, #4546, that will prevent the issue going forward, but I did not get to the root cause. Something odd is happening in testing when it handles a panic.

kjschnei001 pushed a commit to kjschnei001/jaeger that referenced this issue Jun 29, 2023
afzal442 pushed a commit to Cloud-Hacks/jaeger that referenced this issue Jul 10, 2023