Intermittently seeing "sync: WaitGroup is reused before previous Wait has returned" in transport.go #103
Comments
Hi @mmacheerpuppy! Thanks for reporting the issue! As a quick reference, this is the code reported to be faulty: transport.go, lines 199 to 217 at commit 54d5222.
Indeed, that usage of `sync.WaitGroup` is not safe. @mmacheerpuppy I imagine your code might be calling `Flush` concurrently, either from multiple goroutines or while events are still being sent. It's a bit late here in my time zone, so I'll defer writing a test to expose the problem and a fix until later. Thanks again for reporting the issue!
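To make the failure mode concrete, here is a minimal, self-contained sketch, not the SDK's code, of the pattern being described: one goroutine keeps calling wg.Add and wg.Done (roughly what sending each buffered event does), while another calls wg.Wait (roughly what Flush does). If Add runs while a previous Wait is still returning, the runtime may panic with the reported message; the failure is intermittent and scheduling-dependent.

```go
package main

import "sync"

func main() {
	var wg sync.WaitGroup
	done := make(chan struct{})

	// Producer: counts work in and out, roughly what sending each event does.
	go func() {
		defer close(done)
		for i := 0; i < 1000000; i++ {
			wg.Add(1)
			wg.Done()
		}
	}()

	// Consumer: repeatedly waits, roughly what concurrent Flush calls do.
	for {
		select {
		case <-done:
			return
		default:
			// Races with Add/Done above; may intermittently panic with
			// "sync: WaitGroup is reused before previous Wait has returned".
			wg.Wait()
		}
	}
}
```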
@rhcarvalho Oh thanks, that explains a lot. I have used Flush liberally as a de facto way to ensure all errors have been sent from the application (I admit I originally took it as an await statement). I suspected something was a bit fishy with the usage of sync.WaitGroup, but honestly I'm a little new to it myself (coming from a Scala/Elixir background), so I'm not aware of all the conditions that actually trigger a panic. I appreciate you responding so late, and I really appreciate your time; I'm also in EU time. Thank you so much ♥
Thank you @rhcarvalho for the explanation. I had the same issue; I was calling Flush after each report.
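As a side note for anyone landing here, a minimal sketch of the alternative usage, assuming the public sentry-go API (sentry.Init, sentry.CaptureMessage, sentry.Flush) and a placeholder DSN, is to flush once before the program exits rather than after every report:

```go
package main

import (
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	// Placeholder DSN for illustration only.
	if err := sentry.Init(sentry.ClientOptions{
		Dsn: "https://publicKey@example.ingest.sentry.io/1",
	}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// Flush once before exit, instead of after every captured event.
	defer sentry.Flush(2 * time.Second)

	sentry.CaptureMessage("report 1")
	sentry.CaptureMessage("report 2")
}
```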
This test can be used to reproduce the issue (it belongs in package sentry, alongside transport.go):

package sentry

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

type testWriter testing.T

func (t *testWriter) Write(p []byte) (int, error) {
	t.Logf("%s", p)
	return len(p), nil
}

func TestHTTPTransportFlush(t *testing.T) {
	var counter uint64
	ts := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		t.Logf("[SERVER] received event: #%d", atomic.AddUint64(&counter, 1))
	}))
	defer ts.Close()

	Logger.SetOutput((*testWriter)(t))

	tr := NewHTTPTransport()
	tr.Configure(ClientOptions{
		Dsn:        fmt.Sprintf("https://user@%s/42", ts.Listener.Addr()),
		HTTPClient: ts.Client(),
	})

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		i := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 2; j++ {
				t.Logf("tr.SendEvent #%d from goroutine #%d", j, i)
				tr.SendEvent(NewEvent())
				ok := tr.Flush(100 * time.Millisecond)
				if !ok {
					t.Error("Flush() timed out")
				}
			}
		}()
	}
	wg.Wait()
}

Example run:
Without instrumenting anything, the failure is intermittent. Posting it here for transparency and future reference.
Running the test above with the race detector (`go test -race -run TestHTTPTransportFlush`) also reports a data race.
A small fix to prevent the panic "sync: WaitGroup is reused before previous Wait has returned" is to synchronize calls to `Flush`:

diff --git a/transport.go b/transport.go
index d0f39d9..90e7ec3 100644
--- a/transport.go
+++ b/transport.go
@@ -102,10 +102,12 @@ type HTTPTransport struct {
 	client    *http.Client
 	transport *http.Transport
 
-	buffer        chan *http.Request
+	buffer         chan *http.Request
+	wg             sync.WaitGroup // counter of buffered requests
+	flushSemaphore chan struct{}  // limit concurrent calls to Flush
+
 	disabledUntil time.Time
 
-	wg    sync.WaitGroup
 	start sync.Once
 
 	// Size of the transport buffer. Defaults to 30.
@@ -130,9 +132,10 @@ func (t *HTTPTransport) Configure(options ClientOptions) {
 		Logger.Printf("%v\n", err)
 		return
 	}
-
 	t.dsn = dsn
+
 	t.buffer = make(chan *http.Request, t.BufferSize)
+	t.flushSemaphore = make(chan struct{}, 1)
 
 	if options.HTTPTransport != nil {
 		t.transport = options.HTTPTransport
@@ -202,8 +205,10 @@ func (t *HTTPTransport) Flush(timeout time.Duration) bool {
 	c := make(chan struct{})
 
 	go func() {
+		t.flushSemaphore <- struct{}{}
 		t.wg.Wait()
 		close(c)
+		<-t.flushSemaphore
 	}()
 
 	select {

However, that doesn't fix the race between wg.Add (in SendEvent) and wg.Wait (in Flush). This type of race was discussed here: https://groups.google.com/d/msg/golang-nuts/W5fol0e4qt8/_XEPBGMQkNwJ.
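To illustrate the remaining problem, here is a small sketch with illustrative names, not the SDK's code: if wg.Add can run concurrently with wg.Wait while the counter is at zero, the WaitGroup contract that such an Add must happen before Wait is violated, so whether Wait observes the new work is undefined, and the race detector can flag the interleaving.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// Stand-in for SendEvent: registers new in-flight work.
	go func() {
		wg.Add(1) // may run before or after the Wait below; order is undefined
		defer wg.Done()
		fmt.Println("sending event")
	}()

	// Stand-in for Flush: may return before the event above is even counted.
	wg.Wait()
	fmt.Println("flushed")
}
```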
* test: HTTPTransport

  The ConcurrentSendAndFlush test reveals a data race in the old HTTPTransport implementation.

* fix: HTTPTransport.Flush panic and data race

  Rewrite HTTPTransport internals to remove a data race and occasional panics. Prior to this, HTTPTransport used a `sync.WaitGroup` to track how many in-flight requests existed, and `Flush` waited until the observed number of in-flight requests reached zero. Unsynchronized access to the WaitGroup led to panics (reuse before wg.Wait returns) and data races (undefined order of wg.Add and wg.Wait calls).

  The new implementation changes the `Flush` behavior to wait until the current in-flight requests are processed, in line with other SDKs and the Unified API.

Fixes #103.
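For readers who want the gist of that behavior change without reading the full patch, here is a hedged sketch, not the SDK's actual implementation, of one way to make Flush wait only for items already queued: a single worker drains a buffered channel, and Flush enqueues a barrier and waits for the worker to reach it. Because only the worker goroutine tracks in-flight work, there is no shared WaitGroup left to misuse.

```go
package main

import (
	"fmt"
	"time"
)

// item is either an event payload to send or a flush barrier.
type item struct {
	payload string        // event to "send" (stand-in for an HTTP request)
	barrier chan struct{} // non-nil marks a flush barrier
}

// transport is an illustrative queue-based transport, not sentry-go's type.
type transport struct {
	queue chan item
}

func newTransport() *transport {
	t := &transport{queue: make(chan item, 30)}
	go t.worker()
	return t
}

// worker processes queued items in order; closing a barrier signals that
// everything enqueued before it has been handled.
func (t *transport) worker() {
	for it := range t.queue {
		if it.barrier != nil {
			close(it.barrier)
			continue
		}
		fmt.Println("sending:", it.payload)
	}
}

// SendEvent enqueues an event; no WaitGroup is involved.
func (t *transport) SendEvent(payload string) {
	t.queue <- item{payload: payload}
}

// Flush blocks until events queued before the call are processed, or until
// the timeout expires. Concurrent calls are safe: each gets its own barrier.
func (t *transport) Flush(timeout time.Duration) bool {
	done := make(chan struct{})
	t.queue <- item{barrier: done}
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	tr := newTransport()
	tr.SendEvent("event 1")
	tr.SendEvent("event 2")
	fmt.Println("flushed:", tr.Flush(time.Second))
}
```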
I've seen the panic
"sync: WaitGroup is reused before the previous Wait has returned"
in a goroutine at github.com/getsentry/sentry-go/transport.go:201. I'm not sure how to debug this or how to explore the conditions, and I can't find anything in the documentation that helps me explore the conditions post-panic. The stack trace leads me to... Any advice is appreciated on tools I can use to better break this open when it does occur!