modelindexer: fix flush-on-close timing issues by axw · Pull Request #7352 · elastic/apm-server

axw · 2022-02-21T06:09:56Z

Motivation/summary

If the background timer goroutine had been started but not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed.

We fix this by ensuring the errgroup.Go call is made before ProcessEvents returns. If the timer is stopped (e.g. because Close is called), we signal the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths.

This should fix the TestTransactionAggregationShutdown failures we occasionally see in CI.
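
The ordering described above can be illustrated with a minimal sketch. This is not the code from the PR: `sketchIndexer`, its fields, and `runTimer` are hypothetical names, and a standard-library `sync.WaitGroup` stands in for the errgroup so that the example runs without external modules. The point it shows is that the flush goroutine is registered before ProcessEvents returns, and that Close only signals that goroutine and waits for it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sketchIndexer is a hypothetical, simplified stand-in for the model indexer.
type sketchIndexer struct {
	mu       sync.Mutex
	buf      []string       // events waiting to be flushed
	flushed  []string       // stands in for documents sent to Elasticsearch
	timerOn  bool           // true while a flush goroutine is registered
	wg       sync.WaitGroup // assumption: replaces the errgroup used in the PR
	flushNow chan struct{}  // closed by Close to request an immediate flush
	interval time.Duration
}

func newSketchIndexer(interval time.Duration) *sketchIndexer {
	return &sketchIndexer{flushNow: make(chan struct{}), interval: interval}
}

// ProcessEvents enqueues events. The flush goroutine is registered with the
// WaitGroup before this method returns, so a later Close cannot miss it.
func (i *sketchIndexer) ProcessEvents(events ...string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.buf = append(i.buf, events...)
	if !i.timerOn {
		i.timerOn = true
		i.wg.Add(1) // registered synchronously, not from inside the goroutine
		go i.runTimer()
	}
}

// runTimer is the only code path that flushes: it waits for either the
// flush interval or an explicit signal from Close.
func (i *sketchIndexer) runTimer() {
	defer i.wg.Done()
	timer := time.NewTimer(i.interval)
	defer timer.Stop()
	select {
	case <-timer.C:
	case <-i.flushNow:
	}
	i.mu.Lock()
	defer i.mu.Unlock()
	i.flushed = append(i.flushed, i.buf...)
	i.buf = nil
	i.timerOn = false
}

// Close asks the flush goroutine to flush immediately and waits for it.
// In this sketch it must be called once, after ProcessEvents has returned.
func (i *sketchIndexer) Close() {
	close(i.flushNow)
	i.wg.Wait()
}

func main() {
	idx := newSketchIndexer(time.Hour) // long interval: only Close triggers the flush
	idx.ProcessEvents("a", "b")
	idx.Close()
	fmt.Println("flushed events:", len(idx.flushed))
}
```

If the `wg.Add(1)` call were moved inside the goroutine, Close could observe an empty WaitGroup and return before the flush ran, which is the kind of timing issue this PR describes.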

Checklist

Update CHANGELOG.asciidoc
~~- [ ] Update package changelog.yml (only if changes to apmpackage have been made)~~
~~- [ ] Documentation has been updated~~

How to test these changes

This is not likely to happen in a real setup. Run TestTransactionAggregationShutdown a bunch of times?

Related issues

None


ghost · 2022-02-21T06:18:36Z

💚 Build Succeeded


Build stats

Start Time: 2022-02-21T12:48:05.625+0000
Duration: 66 min 30 sec

Test stats 🧪

Test	Results
Failed	0
Passed	5652
Skipped	19
Total	5671

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/hey-apm : Run the hey-apm benchmark.
/package : Generate and publish the docker images.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

ghost · 2022-02-21T06:18:38Z

💔 Build Failed


Build stats

Start Time: 2022-02-21T06:10:10.828+0000
Duration: 8 min 4 sec

Steps errors


`Check out from version control`

Took 0 min 15 sec.
Description: [2022-02-21T06:16:52.959Z] The recommended git tool is: git [2022-02-21T06:17:00.370Z] using creden

`Check out from version control`

Took 0 min 7 sec.
Description: [2022-02-21T06:17:28.114Z] The recommended git tool is: git [2022-02-21T06:17:28.128Z] using creden

`Check out from version control`

Took 0 min 7 sec.
Description: [2022-02-21T06:18:05.050Z] The recommended git tool is: git [2022-02-21T06:18:05.062Z] using creden


axw · 2022-02-21T06:45:12Z

[2022-02-21T06:41:43.797Z] Get "https://docker.elastic.co/v2/": dial tcp 34.68.230.202:443: connect: connection refused

axw · 2022-02-21T06:45:15Z

/test

mergify · 2022-02-21T12:28:49Z

This pull request now has conflicts. Could you fix it @axw? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b modelindexer-close-flush upstream/modelindexer-close-flush
git merge upstream/main
git push upstream modelindexer-close-flush

marclop

Looks great!

* model/modelindexer: speed up tests

  Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests.

* model/modelindexer: fix flush-on-close

  If the background timer goroutine had been started but not yet added to the errgroup, then Close could return before events enqueued by ProcessEvents were flushed. We fix this by ensuring the errgroup.Go call is made before ProcessEvents returns. If the timer is stopped (e.g. because Close is called), we signal the timer goroutine to flush immediately, instead of calling flushActive(Locked) from multiple code paths.

* model/modelindexer: fix goroutine leak

(cherry picked from commit 2161ab2)


This patch unlocks the `i.activeMu` mutex in flushActive as soon as the `i.active` reference has been set to `nil`. This greatly reduces lock contention and achieves similar or higher indexing throughput compared with the benchmarks from before #7352 was introduced.

The microbenchmark results seem to indicate that we're back to the previous indexing performance with this change:

```console
$ go test -bench ... # Current main
goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/model/modelindexer
BenchmarkModelIndexer/NoCompression-8         4653790    2527 ns/op
BenchmarkModelIndexer/BestSpeed-8             2909288    4051 ns/op
BenchmarkModelIndexer/DefaultCompression-8    1691677    6674 ns/op
BenchmarkModelIndexer/BestCompression-8       1234953    8334 ns/op
PASS
ok  github.com/elastic/apm-server/model/modelindexer  70.585s
```

```console
$ go test -bench ... # This patch
goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/model/modelindexer
BenchmarkModelIndexer/NoCompression-8         8702388    1344 ns/op
BenchmarkModelIndexer/BestSpeed-8             5097385    2238 ns/op
BenchmarkModelIndexer/DefaultCompression-8    2639126    4821 ns/op
BenchmarkModelIndexer/BestCompression-8       1586126    7350 ns/op
PASS
ok  github.com/elastic/apm-server/model/modelindexer  64.933s
```

The contention is much worse when the APM Server is actually running and indexing against an Elasticsearch cluster, since the lock is held for the entire `bulkIndexer.Flush` operation, which includes network latency.

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
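
The locking change described in that commit message can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: `activeIndexer`, `bulk`, and `add` are hypothetical names, and `Flush` here only counts documents where the real operation would perform a network call. The sketch detaches the active bulk indexer while holding the mutex and runs the slow flush only after the mutex has been released.

```go
package main

import (
	"fmt"
	"sync"
)

// bulk is a hypothetical stand-in for a bulk indexer holding buffered documents.
type bulk struct{ docs []string }

// Flush stands in for the slow, network-bound flush; here it only reports
// how many documents would have been sent.
func (b *bulk) Flush() int { return len(b.docs) }

// activeIndexer is a hypothetical, simplified indexer with one active bulk.
type activeIndexer struct {
	activeMu sync.Mutex
	active   *bulk
}

// add buffers a document in the active bulk, creating one if needed.
func (i *activeIndexer) add(doc string) {
	i.activeMu.Lock()
	defer i.activeMu.Unlock()
	if i.active == nil {
		i.active = &bulk{}
	}
	i.active.docs = append(i.active.docs, doc)
}

// flushActive detaches the active bulk under the lock, then flushes it
// without holding the lock, so concurrent add calls are not blocked for
// the duration of the flush.
func (i *activeIndexer) flushActive() int {
	i.activeMu.Lock()
	b := i.active
	i.active = nil
	i.activeMu.Unlock() // released before the slow flush begins

	if b == nil {
		return 0
	}
	return b.Flush()
}

func main() {
	idx := &activeIndexer{}
	idx.add("doc-1")
	idx.add("doc-2")
	fmt.Println("first flush:", idx.flushActive())
	fmt.Println("second flush:", idx.flushActive())
}
```

Because the detached bulk is no longer reachable through `i.active`, only the flushing goroutine touches it after the unlock, which is what makes releasing the mutex early safe in this sketch.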

axw added the bug, v8.0.0, v8.1.0, backport-8.0, v8.2.0, backport-8.1, v7.17.0, and backport-7.17 labels Feb 21, 2022

axw force-pushed the modelindexer-close-flush branch from 119d4f4 to cac05bf Compare February 21, 2022 06:11

axw added 2 commits February 21, 2022 14:11

model/modelindexer: speed up tests

d1923cc

Some tests may intentionally cause an error, which will lead to Elasticsearch client backoff. Set a fast backoff to not slow the tests.

axw force-pushed the modelindexer-close-flush branch from cac05bf to dc3ae44 Compare February 21, 2022 06:11

axw marked this pull request as ready for review February 21, 2022 06:24

axw requested a review from a team February 21, 2022 06:24

marclop reviewed Feb 21, 2022


Comment thread model/modelindexer/indexer.go

model/modelindexer: fix goroutine leak

fe91243

Merge branch 'main' into modelindexer-close-flush

8707f59

axw requested a review from marclop February 21, 2022 12:47

marclop approved these changes Feb 21, 2022


axw enabled auto-merge (squash) February 21, 2022 13:06

axw merged commit 2161ab2 into elastic:main Feb 21, 2022

mergify Bot mentioned this pull request Feb 21, 2022

[8.1] modelindexer: fix flush-on-close timing issues (backport #7352) #7362

Merged

mergify Bot mentioned this pull request Feb 21, 2022

[8.0] modelindexer: fix flush-on-close timing issues (backport #7352) #7363

Merged

mergify Bot mentioned this pull request Feb 21, 2022

[7.17] modelindexer: fix flush-on-close timing issues (backport #7352) #7364

Merged

axw deleted the modelindexer-close-flush branch February 22, 2022 02:27

marclop mentioned this pull request Mar 24, 2022

modelindexer: Reduce locking on flushActive #7649

Merged


marclop added the test-plan-skip label Mar 30, 2022

modelindexer: fix flush-on-close timing issues #7352
axw merged 4 commits into elastic:main from axw:modelindexer-close-flush
