Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race in v0.48.0 #4895

Closed
ash2k opened this issue Feb 7, 2024 · 14 comments · Fixed by #5080
Closed

Data race in v0.48.0 #4895

ash2k opened this issue Feb 7, 2024 · 14 comments · Fixed by #5080
Labels
area: instrumentation Related to an instrumentation package bug Something isn't working instrumentation: otelhttp

Comments

@ash2k
Copy link
Contributor

ash2k commented Feb 7, 2024

Description

WARNING: DATA RACE
Write at 0x00c000544258 by goroutine 603:
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*bodyWrapper).Read()
      external/io_opentelemetry_go_contrib_instrumentation_net_http_otelhttp/wrap.go:40 +0xa4
  io.discard.ReadFrom()
      GOROOT/src/io/io.go:658 +0x91
  io.(*discard).ReadFrom()
      <autogenerated>:1 +0x4c
  io.copyBuffer()
      GOROOT/src/io/io.go:416 +0x1be
  io.Copy()
      GOROOT/src/io/io.go:389 +0x5c
  net/http.(*transferWriter).doBodyCopy()
      GOROOT/src/net/http/transfer.go:412 +0x38
  net/http.(*transferWriter).writeBody()
      GOROOT/src/net/http/transfer.go:375 +0x82f
  net/http.(*Request).write()
      GOROOT/src/net/http/request.go:738 +0x1393
  net/http.(*persistConn).writeLoop()
      GOROOT/src/net/http/transport.go:2424 +0x372
  net/http.(*Transport).dialConn.func6()
      GOROOT/src/net/http/transport.go:1777 +0x33
Previous read at 0x00c000544258 by goroutine 598:
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Transport).RoundTrip()
      external/io_opentelemetry_go_contrib_instrumentation_net_http_otelhttp/transport.go:185 +0xec8
  net/http.send()
      GOROOT/src/net/http/client.go:260 +0x9c6
  net/http.(*Client).send()
      GOROOT/src/net/http/client.go:181 +0x14e
  net/http.(*Client).do()
      GOROOT/src/net/http/client.go:724 +0x11a8
  net/http.(*Client).Do()
      GOROOT/src/net/http/client.go:590 +0xa65
  github.com/hashicorp/go-retryablehttp.(*Client).Do()
      external/com_github_hashicorp_go_retryablehttp/client.go:653 +0xa11
...
Goroutine 603 (running) created at:
  net/http.(*Transport).dialConn()
      GOROOT/src/net/http/transport.go:1777 +0x266a
  net/http.(*Transport).dialConnFor()
      GOROOT/src/net/http/transport.go:1467 +0x129
  net/http.(*Transport).queueForDial.func3()
      GOROOT/src/net/http/transport.go:1448 +0x44
Goroutine 598 (running) created at:
  net/http.(*Server).Serve()
      GOROOT/src/net/http/server.go:3086 +0x86c
...

Looks like roundtripper can return before finishing sending the request. Perhaps when server has responded faster than the request was fully sent? I think in HTTP/2 mode stdlib HTTP client handles send/receive independently and in HTTP/1 it only starts reading the response once the request has been fully sent.

https://pkg.go.dev/net/http#RoundTripper says:

Callers should not mutate or reuse the request until the Response's Body has been closed.

Which is the case in the new code - it reads a field from the request body wrapper without waiting for the Close to be called.

Environment

  • OS: N/A
  • Architecture: N/A
  • Go Version: 1.21.7.
  • otelhttp version: v0.48.0

Steps To Reproduce

Run TestProxy_UserAccessHappyPath with -race. I don't have a small reproducer, sorry.

Expected behavior

No data race.

@ash2k ash2k added area: instrumentation Related to an instrumentation package bug Something isn't working instrumentation: otelhttp labels Feb 7, 2024
@stan-stately
Copy link

stan-stately commented Feb 7, 2024

+1 i've got it too in 0.48.0

@dmathieu
Copy link
Member

dmathieu commented Feb 7, 2024

There does seem to be a race issue with the counter of bytes read from the body that we increment and read with no regards for parallelism.

I'm looking into this and will open a PR soon.

@dmathieu
Copy link
Member

dmathieu commented Feb 7, 2024

So, no PR yet, as I can't create a reproduction test.
I also cloned the repository mentioned above, and couldn't reproduce the mentioned failure.

server/› 🔥 git:(master) 2M git diff
diff --git a/go.mod b/go.mod
index e8184ecc..996f5b22 100644
--- a/go.mod
+++ b/go.mod
@@ -25,7 +25,7 @@ require (
        github.com/stretchr/testify v1.8.4
        gitlab.com/gitlab-org/security-products/analyzers/report/v3 v3.7.1
        go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.48.0
-       go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.47.0
+       go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.48.0
        go.opentelemetry.io/otel v1.23.0
        go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.23.0
        go.opentelemetry.io/otel/exporters/prometheus v0.45.1
diff --git a/go.sum b/go.sum
index 8bfd3acf..88eb7265 100644
--- a/go.sum
+++ b/go.sum
@@ -340,8 +340,8 @@ go.opencensus.io v0.24.0 h1:y73uSU6J157QMP2kn2r30vwW1A2W2WFwSCGnAVxeaD0=
 go.opencensus.io v0.24.0/go.mod h1:vNK8G9p7aAivkbmorf4v+7Hgx+Zs0yY+0fOtgBfjQKo=
 go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.48.0 h1:P+/g8GpuJGYbOp2tAdKrIPUX9JO02q8Q0YNlHolpibA=
 go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.48.0/go.mod h1:tIKj3DbO8N9Y2xo52og3irLsPI4GW02DSMtrVgNMgxg=
-go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.47.0 h1:sv9kVfal0MK0wBMCOGr+HeJm9v803BkJxGrk2au7j08=
-go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.47.0/go.mod h1:SK2UL73Zy1quvRPonmOmRDiWk1KBV3LyIeeIxcEApWw=
+go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.48.0 h1:doUP+ExOpH3spVTLS0FcWGLnQrPct/hD/bCPbDRUEAU=
+go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.48.0/go.mod h1:rdENBZMT2OE6Ne/KLwpiXudnAsbdrdBaqBvTN8M8BgA=
 go.opentelemetry.io/otel v1.23.0 h1:Df0pqjqExIywbMCMTxkAwzjLZtRf+bBKLbUcpxO2C9E=
 go.opentelemetry.io/otel v1.23.0/go.mod h1:YCycw9ZeKhcJFrb34iVSkyT0iczq/zYDtZYFufObyB0=
 go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.23.0 h1:D/cXD+03/UOphyyT87NX6h+DlU+BnplN6/P6KJwsgGc=
server/› 🔥 git:(master) 2M go test -v -race ./... -run TestProxy_UserAccessHappyPath
=== RUN   TestProxy_UserAccessHappyPath
=== RUN   TestProxy_UserAccessHappyPath/auth_in_header-impersonate_agent
=== RUN   TestProxy_UserAccessHappyPath/auth_in_header-impersonate_user
=== RUN   TestProxy_UserAccessHappyPath/auth_in_params-impersonate_agent
=== RUN   TestProxy_UserAccessHappyPath/auth_in_params-impersonate_user
--- PASS: TestProxy_UserAccessHappyPath (0.03s)
    --- PASS: TestProxy_UserAccessHappyPath/auth_in_header-impersonate_agent (0.01s)
    --- PASS: TestProxy_UserAccessHappyPath/auth_in_header-impersonate_user (0.01s)
    --- PASS: TestProxy_UserAccessHappyPath/auth_in_params-impersonate_agent (0.01s)
    --- PASS: TestProxy_UserAccessHappyPath/auth_in_params-impersonate_user (0.01s)
PASS
ok      gitlab.com/gitlab-org/cluster-integration/gitlab-agent/v16/internal/module/kubernetes_api/server        (cached)

It seems the race is indeed on accessing read in bodyWrapper, and replacing the int64 with an atomic.Int64 should resolve the race.
I'll come back to this. In the mean time, if someone has a better reproduction scenario, feel free to link it.

@ash2k
Copy link
Contributor Author

ash2k commented Feb 7, 2024

Using atomic.Int64 will fix the data race but the logical race will still be there i.e. the number of read bytes will be incorrect because the reader goroutine might have not finished reading. My understanding is the value can only be read after Close() has been called on the body. And in that case a plain int64 should be fine.

@ash2k
Copy link
Contributor Author

ash2k commented Feb 7, 2024

Hm, I also cannot reproduce it. It happened only once in CI. Running it with -test.count=1000 to run it 1000 times doesn't help, hm.

@crobert-1
Copy link
Member

@crobert-1
Copy link
Member

@crobert-1
Copy link
Member

crobert-1 commented Feb 7, 2024

The OpenTelemetry Collector contrib-distro upgraded to v0.48.0 16 hours ago, and this data race has shown up in tests twice (that I've found).

@rnishtala-sumo
Copy link

@drshriveer
Copy link

Advanced apologies for any perceived cynicism, however I think it's wroth calling out: This sort of issue really degrades confidence in otel in general. There really isn't a defense for not having caught this in Otel's own tests -- it was certainly caught others' tests. I hope part of the solution here involves an improved release pipeline with tests that run with the -race flag enabled.

To elaborate a bit more, regardless of the race itself causing any actual production issues, this sort of thing does cause developer pain and delays for repositories which do adhere to strict testing standards. This is especially true in light of go modules' tendency to update packages on go mod tidy without permission. Which is, in short, precisely what happened to me. I'm not sure we will choose to continue keeping a dependency which does not hold itself to a similarly high bar with regards to its testing and release cycles.

@dmathieu
Copy link
Member

dmathieu commented Feb 8, 2024

hope part of the solution here involves an improved release pipeline with tests that run with the -race flag enabled.

Tests do run with the -race flag enabled, and we do adhere to strong testing principles. But tests, as extensive as they can be, are not a sure way to prevent any bugs.

This is indeed a regrettable regression, but it wasn't caught by any other tests. You are more than welcome to help us review further PRs so this doesn't happen again.
We also don't have a fix yet. So you're more than welcome to work on one as well.

@corani
Copy link

corani commented Feb 20, 2024

Any chance of a v0.48.1 release with the fix? I'm seeing this issue regularly in our automation tests, so have had to t.Skip() the corresponding tests for the moment.

@MrAlias
Copy link
Contributor

MrAlias commented Feb 20, 2024

Any chance of a v0.48.1 release with the fix? I'm seeing this issue regularly in our automation tests, so have had to t.Skip() the corresponding tests for the moment.

Likely by the end of this week or beginning of next.

rfay added a commit to deviantintegral/ddev that referenced this issue Mar 15, 2024
rfay added a commit to deviantintegral/ddev that referenced this issue Mar 16, 2024
mdelapenya added a commit to testcontainers/testcontainers-go that referenced this issue Apr 4, 2024
Fix a data race on container run by updating otelhttp:
open-telemetry/opentelemetry-go-contrib#4895

Co-authored-by: Manuel de la Peña <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: instrumentation Related to an instrumentation package bug Something isn't working instrumentation: otelhttp
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants