Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix https_batch deadlock due to golang timer changes #648

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nicklas-dohrn
Copy link
Contributor

@nicklas-dohrn nicklas-dohrn commented Dec 13, 2024

What is changed:

This changes the way the syslog batches are triggered.
The new implementation no longer uses Timers that need to be reset and checked, it just ticks once every second per http_batch drain to send a batch at least once a second.
The new logic is as follows:

  • If there is a log present in the queue, send it at last one second later with every log from the same second.
  • If there are more logs in one second than the 'batchSize' allows, send the batch immediately

The problem fixed by this change:

The old implementation was able to deadlock itself due to the way the time channel was reset. There are even more changes to this behaviour to be expected by the golang 1.23 update. (https://tip.golang.org/doc/go1.23#timer-changes)
-> Changed logic necessary to prevent further issues on further golang upgrades

Impact:

The new implementation will tick more often, but the overhead will be pretty minimal, due to not doing anything for empty drains.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Testing performed?

  • Unit tests
  • Tested in dev cluster

Checklist:

  • This PR is being made against the main branch, or relevant version branch
  • I have made corresponding changes to the documentation
  • I have added testing for my changes

@nicklas-dohrn nicklas-dohrn requested a review from a team as a code owner December 13, 2024 13:08
@nicklas-dohrn nicklas-dohrn force-pushed the main branch 3 times, most recently from 5e8242f to b73bec2 Compare December 13, 2024 13:27
@juergen-walter
Copy link

@ctlong can you check please

Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklas-dohrn Thanks for the fix. Please take a look at the linting errors. There is no error checking after the sendHttpRequest call.

Copy link
Member

@ctlong ctlong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write a failing test for the old logic that works for the new logic?

@nicklas-dohrn
Copy link
Contributor Author

I am currently in the process to rewrite the tests anyway, but I will make sure to have a test that proves the point, that the old implementation deadlocks/hangs in certain scenarios.

@nicklas-dohrn nicklas-dohrn force-pushed the main branch 2 times, most recently from 6a6a850 to f0d7e6c Compare December 29, 2024 11:19
@nicklas-dohrn nicklas-dohrn requested a review from ctlong December 29, 2024 11:20
@nicklas-dohrn
Copy link
Contributor Author

This change now contains a test case, that will not work with the old timer based implementation.
I added a test configurable defaultTime for the ticker, to keep test cases shorter than I would need with 1 second intervalls.

Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, but we'll have to do some integration tests as well before approving the PR.

chombium
chombium previously approved these changes Jan 23, 2025
Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nicklas-dohrn, thanks for the changes and the fix. It looks good to me and if @ctlong doesn't have any objections we can merge this.

@@ -13,6 +13,8 @@ import (

const BATCHSIZE = 256 * 1024

var DefaultSendInterval = 1 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to expose config parameters which can be set with envvars on Syslog Agent start.

I don't have anything against, that we merge this as is and we open up the config parameters later.

Any thoughts on this @ctlong?

@nicklas-dohrn
Copy link
Contributor Author

Unfortunately, I cannot see the failures due to the concourse build check.
It would be nice to know, what I should fix here.

@juergen-walter
Copy link

Hi @ameowlia @ctlong not having access nor any information about failing concourse-ci/tests is discouraging and a huge delay factor for open source contributors such as @nicklas-dohrn. Any ideas on how to improve? Next PR is already in the pipeline #617

ctlong
ctlong previously approved these changes Feb 18, 2025
@ctlong
Copy link
Member

ctlong commented Feb 18, 2025

Hi @ameowlia @ctlong not having access nor any information about failing concourse-ci/tests is discouraging and a huge delay factor for open source contributors such as @nicklas-dohrn. Any ideas on how to improve? Next PR is already in the pipeline #617

Sorry that is blocking progress here. I'm aware that the lack of visibility with these PR tests are not ideal, but there's no quick fixes to expose this pipeline or move it to an open source concourse at the moment.

I'm happy to discuss in Slack the path I'm attempting to take to open source the pipelines. I'll take any help that I can get to implement the plan 😄

@ctlong
Copy link
Member

ctlong commented Feb 18, 2025

Here's the failure:

$ go run github.com/onsi/ginkgo/v2/ginkgo -r --randomize-all --randomize-suites --fail-on-pending --keep-going --race --trace
...
------------------------------
• [FAILED] [0.146 seconds]
HTTPS_batch [It] test dispatching for batches before timewindow is finished
/tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:100

  Timeline >>
  2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
  2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
  2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
  2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
  [FAILED] in [It] - /tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:105 @ 02/18/25 01:05:36.995
  << Timeline

  [FAILED] Expected
      <int>: 257
  to equal
      <int>: 256
  In [It] at: /tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:105 @ 02/18/25 01:05:36.995

  Full Stack Trace
    code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress/syslog_test.init.func3.4()
    	/tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:105 +0x4fe

Concourse container is an x86_64 Linux OS with 16 cores.

@chombium
Copy link
Contributor

I've run the test now few times on my machine with Ubuntu 22.04 and from four runs I've got three errors and one successful run.. It seems to me that as the code plays with time ranges and the randomness that we run the test suite with, makes problems. The tests should be more robust, I guess... @nicklas-dohrn wdyt?

@chombium
Copy link
Contributor

chombium commented Feb 18, 2025

I'm happy to discuss in Slack the path I'm attempting to take to open source the pipelines. I'll take any help that I can get to implement the plan 😄

@ctlong I would gladly help you with this ;)

@ctlong
Copy link
Member

ctlong commented Feb 18, 2025

It seems like the "test dispatching for batches before timewindow is finished" test assumes that the following lines can be executed within one second:

env1 := buildLogEnvelope("APP", "1", "string to get log to 1024 characters:"+string_to_1024_chars, loggregator_v2.Log_OUT)
for i := 0; i < 300; i++ {
    Expect(writer.Write(env1)).To(Succeed())
}
Expect(drain.getMessagesSize()).Should(Equal(256))

If they take longer than one second, then more envelopes may be present in the drain than just 256.

Why not make the DefaultSendInterval very large just for this test and remove the sleep and final expectation?

@nicklas-dohrn
Copy link
Contributor Author

nicklas-dohrn commented Feb 18, 2025

I took your inspiration for the testcase:
"test dispatching for batches before timewindow is finished"

but did it kind of the other way around:
if reducing the overall size of the batch, the timings should be less tight, so the fast execution time should work.
I also changed the bigger sized logs to not hit the buffer size limit directly, to account for inconsistencies in object to string materialisation (not sure if this can happen, but now it is also really clear, when the buffer limit is breached)

I hope that with the other changes I introduced will make this completely consistently succeed

@chombium
Copy link
Contributor

This looks good. I'll do few more unit test runs tomorrow, so that we are sure that the problem is fixed.

Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nicklas-dohrn, thank you very much for the quick reponse.

I have few more small objections and than I will test the changes once again. Could you please also squash your commits?

@@ -11,7 +11,8 @@ import (
"code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress"
)

const BATCHSIZE = 256 * 1024
var DefaultBatchSize = 256 * 1024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please make these two as constants?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They cannot be constants, as they need to be set in the Tests.
There seems to be no easy way around doing it like this, so can we not keep it this way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cases like this we're using the so called Functional options pattern. An object is created with default values and then options are applied.Sorry for not mentioning it earlier...
Here is an example definition and [usage](https://github.com/cloudfoundry/loggregator-agent-release/blob/main/src/cmd/loggregator-agent/app/app_v2.go#L173] from this project.

You can find more info about this pattern in the following blog posts:
https://dev.to/kittipat1413/understanding-the-options-pattern-in-go-390c
https://golang.cafe/blog/golang-functional-options-pattern.html

This way, you will be able to have the default values configured and can be overwritten if needed.

@ctlong ctlong assigned chombium and unassigned ctlong Feb 20, 2025
@nicklas-dohrn
Copy link
Contributor Author

@ctlong if this one runs through, you can merge it now.
should be stable now.

Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nicklas-dohrn, I have a suggestion how to restructure the code for setting he default values and I was able to make the test break again.

I've run the Ginkgo test 10 times one after another:

 for i in {1..10}; do go run github.com/onsi/ginkgo/v2/ginkgo -r --randomize-all --randomize-suites --fail-on-pending --keep-going --race --trace >> /tmp/nicklas.log; done

The broken tests are:

[1740406427] Syslog Suite - 84/84 specs ••••••••••••
------------------------------
• [FAILED] [0.165 seconds]
HTTPS_batch [It] testing simple appending of one log
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:62

  Timeline >>
  2025/02/24 15:13:56 Dropped 10000 dropping logs for aggregate drain with url dropping:
  2025/02/24 15:13:56 Dropped 10000 dropping logs for test-source-id's app drain with url dropping://my-drain:8080/path
  [FAILED] in [It] - /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69 @ 02/24/25 15:13:56.394
  << Timeline

  [FAILED] Expected
      <int>: 0
  to equal
      <int>: 2
  In [It] at: /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69 @ 02/24/25 15:13:56.394

  Full Stack Trace
    code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress/syslog_test.init.func3.2()
    	/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69 +0x87e
------------------------------
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
------------------------------
• [FAILED] [0.193 seconds]
HTTPS_batch [It] test batch dispatching with all logs in a given timeframe
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:96

  Timeline >>
  2025/02/24 15:13:59 Dropped 10000 dropping logs for aggregate drain with url dropping:
  2025/02/24 15:13:59 Dropped 10000 dropping logs for test-source-id's app drain with url dropping://my-drain:8080/path
  2025/02/24 15:13:59 Dropped 10000 dropping logs for app-id's app drain with url dropping:
  2025/02/24 15:13:59 Dropped 10000 dropping logs for aggregate drain with url dropping:
  2025/02/24 15:13:59 Dropped 10000 dropping logs for test-source-id's app drain with url dropping://my-drain:8080/path
  2025/02/24 15:13:59 Dropped 10000 dropping logs for app-id's app drain with url dropping:
  [FAILED] in [It] - /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102 @ 02/24/25 15:13:59.6
  << Timeline

  [FAILED] Timed out after 0.189s.
  Expected
      <int>: 0
  to equal
      <int>: 10
  In [It] at: /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102 @ 02/24/25 15:13:59.6

  Full Stack Trace
    code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress/syslog_test.init.func3.3()
    	/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102 +0x668
------------------------------
••••

Summarizing 2 Failures:
  [FAIL] HTTPS_batch [It] testing simple appending of one log
  /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69
  [FAIL] HTTPS_batch [It] test batch dispatching with all logs in a given timeframe
  /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102

Ran 84 of 84 Specs in 3.763 seconds
FAIL! -- 82 Passed | 2 Failed | 0 Pending | 0 Skipped

These two tests fail randomly.

Please squash your commits as well.

@@ -11,7 +11,8 @@ import (
"code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress"
)

const BATCHSIZE = 256 * 1024
var DefaultBatchSize = 256 * 1024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cases like this we're using the so called Functional options pattern. An object is created with default values and then options are applied.Sorry for not mentioning it earlier...
Here is an example definition and [usage](https://github.com/cloudfoundry/loggregator-agent-release/blob/main/src/cmd/loggregator-agent/app/app_v2.go#L173] from this project.

You can find more info about this pattern in the following blog posts:
https://dev.to/kittipat1413/understanding-the-options-pattern-in-go-390c
https://golang.cafe/blog/golang-functional-options-pattern.html

This way, you will be able to have the default values configured and can be overwritten if needed.

@nicklas-dohrn nicklas-dohrn force-pushed the main branch 3 times, most recently from e7e9fba to 5d67c23 Compare February 26, 2025 17:01
It makes tests hopefully more robust

It also replaces most sleeps with Consistently and Eventually.
It makes the timings more forgiving.

This should make it reliable on weak hardware.

Add functional options pattern to allow test configuration
Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @nicklas-dohrn thanks for your contribution and the cooperation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Pending Merge | Prioritized
Development

Successfully merging this pull request may close these issues.

4 participants