modelindexer: fix flush-on-close timing issues#7352
Merged
Conversation
119d4f4 to
cac05bf
Compare
Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests.
If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths.
cac05bf to
dc3ae44
Compare
💔 Build Failed
Expand to view the summary
Build stats
Steps errors
Expand to view the steps failures
|
Member
Author
|
Member
Author
|
/test |
marclop
reviewed
Feb 21, 2022
Contributor
|
This pull request is now in conflicts. Could you fix it @axw? 🙏 |
mergify Bot
pushed a commit
that referenced
this pull request
Feb 21, 2022
* model/modelindexer: speed up tests Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests. * model/modelindexer: fix flush-on-close If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths. * model/modelindexer: fix goroutine leak (cherry picked from commit 2161ab2)
mergify Bot
pushed a commit
that referenced
this pull request
Feb 21, 2022
* model/modelindexer: speed up tests Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests. * model/modelindexer: fix flush-on-close If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths. * model/modelindexer: fix goroutine leak (cherry picked from commit 2161ab2) # Conflicts: # changelogs/head.asciidoc
mergify Bot
pushed a commit
that referenced
this pull request
Feb 21, 2022
* model/modelindexer: speed up tests Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests. * model/modelindexer: fix flush-on-close If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths. * model/modelindexer: fix goroutine leak (cherry picked from commit 2161ab2) # Conflicts: # changelogs/head.asciidoc # model/modelindexer/indexer_test.go
axw
added a commit
that referenced
this pull request
Feb 22, 2022
…#7363) * modelindexer: fix flush-on-close timing issues (#7352) * model/modelindexer: speed up tests Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests. * model/modelindexer: fix flush-on-close If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths. * model/modelindexer: fix goroutine leak (cherry picked from commit 2161ab2) # Conflicts: # changelogs/head.asciidoc * Delete head.asciidoc Co-authored-by: Andrew Wilkins <axw@elastic.co>
axw
added a commit
that referenced
this pull request
Feb 22, 2022
* model/modelindexer: speed up tests Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests. * model/modelindexer: fix flush-on-close If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths. * model/modelindexer: fix goroutine leak (cherry picked from commit 2161ab2) Co-authored-by: Andrew Wilkins <axw@elastic.co>
axw
added a commit
that referenced
this pull request
Feb 22, 2022
#7364) * modelindexer: fix flush-on-close timing issues (#7352) * model/modelindexer: speed up tests Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests. * model/modelindexer: fix flush-on-close If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths. * model/modelindexer: fix goroutine leak (cherry picked from commit 2161ab2) # Conflicts: # changelogs/head.asciidoc # model/modelindexer/indexer_test.go * Fix merge conflicts Co-authored-by: Andrew Wilkins <axw@elastic.co>
marclop
added a commit
to marclop/apm-server
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before elastic#7352 was introduced. These are the results: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
1 task
marclop
added a commit
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before #7352 was introduced. The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
mergify Bot
pushed a commit
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before #7352 was introduced. The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com> (cherry picked from commit b951097)
mergify Bot
pushed a commit
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before #7352 was introduced. The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com> (cherry picked from commit b951097) # Conflicts: # changelogs/head.asciidoc
mergify Bot
pushed a commit
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before #7352 was introduced. The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com> (cherry picked from commit b951097) # Conflicts: # changelogs/head.asciidoc
marclop
added a commit
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before #7352 was introduced. The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com> (cherry picked from commit b951097) Co-authored-by: Marc Lopez Rubio <marc5.12@outlook.com>
marclop
added a commit
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before #7352 was introduced. The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com> (cherry picked from commit b951097) # Conflicts: # changelogs/head.asciidoc Co-authored-by: Marc Lopez Rubio <marc5.12@outlook.com>
marclop
added a commit
that referenced
this pull request
Mar 24, 2022
This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This severely minimizes the lock contention and achives similar or higher indexing throughput when comparing the benchmarks before #7352 was introduced. The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change: ```console $ go test -bench ... # Current main goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 4653790 2527 ns/op BenchmarkModelIndexer/BestSpeed-8 2909288 4051 ns/op BenchmarkModelIndexer/DefaultCompression-8 1691677 6674 ns/op BenchmarkModelIndexer/BestCompression-8 1234953 8334 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 70.585s ``` ```console $ go test -bench ... # This patch goos: darwin goarch: arm64 pkg: github.com/elastic/apm-server/model/modelindexer BenchmarkModelIndexer/NoCompression-8 8702388 1344 ns/op BenchmarkModelIndexer/BestSpeed-8 5097385 2238 ns/op BenchmarkModelIndexer/DefaultCompression-8 2639126 4821 ns/op BenchmarkModelIndexer/BestCompression-8 1586126 7350 ns/op PASS ok github.com/elastic/apm-server/model/modelindexer 64.933s ``` The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com> (cherry picked from commit b951097) # Conflicts: # changelogs/head.asciidoc Co-authored-by: Marc Lopez Rubio <marc5.12@outlook.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation/summary
If the background timer goroutine was started, but had not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed.
We fix this by ensuring the errgroup.Go call is made before ProcessEvents returned. If the timer is stopped (e.g. because Close is called), then we signal to the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths.
This should fix the
TestTransactionAggregationShutdownfailures we occasionally see in CI.Checklist
- [ ] Update package changelog.yml (only if changes toapmpackagehave been made)- [ ] Documentation has been updatedHow to test these changes
This is not likely to happen in a real setup. Run
TestTransactionAggregationShutdowna bunch of times?Related issues
None