Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable parts of batch_span_processor test as flakes #743

Merged
merged 12 commits into from
May 19, 2020

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented May 19, 2020

This:

  1. enqueue() uses the a defer/recover to avoid panicking on a closed queue
  2. processQueue() is factored apart from drainQueue() and exportSpans() (avoids named loop)
  3. The test now does not validate the outcome, only that the code completes. TODO: fix.

Part of #741.

@jmacd
Copy link
Contributor Author

jmacd commented May 19, 2020

@vmihailenco It turns out we didn't fix the flakes last time. I did two things here, one was to eliminate the leaked goroutine to throw-away spans (using the wait group), two was to wait for the drain to finish to ensure the test doesn't race with draining the queue.

@vmihailenco
Copy link
Contributor

vmihailenco commented May 19, 2020

I believe it has the same problem you've indicated previously - bsp.stopWait.Wait() in Shutdown and bsp.stopWait.Add(1) in enqueue are executed concurrently and therefore can panic. E,g, in this code

		close(bsp.stopCh)
		// Here processQueue can receive on stopCh and execute bsp.stopWait.Done()
		// Then bsp.stopWait can be 0 at this point and Wait can panic.
		bsp.stopWait.Wait()

What is worse

  • there is no reason for WaitGroup to panic in these situations - it just tries to ensure that people write "correct" code (I believe there is nothing wrong with how WaitGroup is used here)
  • there is also no reason for channel to panic on send when it is closed - again Go tries to make sure people write "good" code

As a result we spend hours trying to write "good" code and still have this monster. And I believe it is still not 100% correct. TBH I would just add defer + recover and move forward. It is correct, simple and fast as long as there are no panics.

@jmacd
Copy link
Contributor Author

jmacd commented May 19, 2020

The defer/recover addresses the safety issue, I think, but the test is forcing this "monster" as you call it. I'm ready to disable this test.

@jmacd jmacd closed this May 19, 2020
@jmacd jmacd changed the title Fix flaky batch_span_processor test Disable parts of batch_span_processor test as flakes May 19, 2020
@jmacd jmacd reopened this May 19, 2020
@jmacd
Copy link
Contributor Author

jmacd commented May 19, 2020

@vmihailenco Thanks for your help. I think this is an improvement and that we can postpone a proper test.

@jmacd jmacd merged commit 055e9c5 into open-telemetry:master May 19, 2020
@jmacd jmacd deleted the jmacd/batch_flake branch May 19, 2020 16:41
@pellared pellared added this to the untracked milestone Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants