Fix https_batch deadlock due to golang timer changes #648
Branch updated: 5e8242f to b73bec2
@ctlong, can you check please?
@nicklas-dohrn Thanks for the fix. Please take a look at the linting errors. There is no error checking after the sendHttpRequest call.
Can you write a failing test for the old logic that works for the new logic?
I am currently in the process of rewriting the tests anyway, but I will make sure to include a test that proves that the old implementation deadlocks or hangs in certain scenarios.
Branch updated: 6a6a850 to f0d7e6c
This change now contains a test case that will not work with the old timer-based implementation.
The code looks good, but we'll have to do some integration tests as well before approving the PR.
Hi @nicklas-dohrn, thanks for the changes and the fix. It looks good to me and if @ctlong doesn't have any objections we can merge this.
src/pkg/egress/syslog/https_batch.go (outdated)
@@ -13,6 +13,8 @@ import (

 const BATCHSIZE = 256 * 1024

+var DefaultSendInterval = 1 * time.Second
I wonder if we need to expose config parameters that can be set with environment variables on Syslog Agent start.
I have nothing against merging this as is and opening up the config parameters later.
Any thoughts on this, @ctlong?
Unfortunately, I cannot see the failures reported by the Concourse build check.
Hi @ameowlia @ctlong, not having access to, or any information about, the failing concourse-ci/tests is discouraging and a major source of delay for open source contributors such as @nicklas-dohrn. Any ideas on how to improve this? The next PR, #617, is already in the pipeline.
Sorry that this is blocking progress here. I'm aware that the lack of visibility into these PR tests is not ideal, but there are no quick fixes to expose this pipeline or move it to an open source Concourse at the moment. I'm happy to discuss in Slack the path I'm attempting to take to open source the pipelines. I'll take any help that I can get to implement the plan 😄
Here's the failure:
$ go run github.com/onsi/ginkgo/v2/ginkgo -r --randomize-all --randomize-suites --fail-on-pending --keep-going --race --trace
...
------------------------------
• [FAILED] [0.146 seconds]
HTTPS_batch [It] test dispatching for batches before timewindow is finished
/tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:100
Timeline >>
2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
2025/02/18 01:05:36 Dropped 10000 dropping logs for app-id's app drain with url dropping:
[FAILED] in [It] - /tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:105 @ 02/18/25 01:05:36.995
<< Timeline
[FAILED] Expected
<int>: 257
to equal
<int>: 256
In [It] at: /tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:105 @ 02/18/25 01:05:36.995
Full Stack Trace
code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress/syslog_test.init.func3.4()
/tmp/build/f541ec31/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:105 +0x4fe
The Concourse container is an x86_64 Linux OS with 16 cores.
I've now run the test a few times on my machine with Ubuntu 22.04: out of four runs, three failed and one succeeded. It seems to me that the combination of code that depends on time windows and the randomization we run the test suite with causes the problems. The tests should be more robust, I guess... @nicklas-dohrn wdyt?
@ctlong I would gladly help you with this ;)
It seems like the "test dispatching for batches before timewindow is finished" test assumes that the following lines can be executed within one second:
    env1 := buildLogEnvelope("APP", "1", "string to get log to 1024 characters:"+string_to_1024_chars, loggregator_v2.Log_OUT)
    for i := 0; i < 300; i++ {
        Expect(writer.Write(env1)).To(Succeed())
    }
    Expect(drain.getMessagesSize()).Should(Equal(256))
If they take longer than one second, then more envelopes may be present in the drain than just 256. Why not make the
I took your suggestion as inspiration for the test case, but did it kind of the other way around. I hope that, together with the other changes I introduced, this will make the test succeed consistently.
This looks good. I'll do a few more unit test runs tomorrow so that we can be sure the problem is fixed.
Hi @nicklas-dohrn, thank you very much for the quick response.
I have a few more small objections, and then I will test the changes once again. Could you please also squash your commits?
src/pkg/egress/syslog/https_batch.go (outdated)
@@ -11,7 +11,8 @@ import (
 	"code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress"
 )

-const BATCHSIZE = 256 * 1024
+var DefaultBatchSize = 256 * 1024
Could you please make these two constants?
They cannot be constants, as they need to be set in the tests.
There seems to be no easy way around this, so can we keep it this way?
For cases like this we use the so-called functional options pattern. An object is created with default values and then options are applied. Sorry for not mentioning it earlier...
Here is an example definition and [usage](https://github.com/cloudfoundry/loggregator-agent-release/blob/main/src/cmd/loggregator-agent/app/app_v2.go#L173) from this project.
You can find more info about this pattern in the following blog posts:
https://dev.to/kittipat1413/understanding-the-options-pattern-in-go-390c
https://golang.cafe/blog/golang-functional-options-pattern.html
This way, the default values are configured once and can be overridden if needed.
@ctlong if this one runs through, you can merge it now.
Hi @nicklas-dohrn, I have a suggestion for how to restructure the code that sets the default values. I was also able to make the tests break again.
I've run the Ginkgo test 10 times one after another:
for i in {1..10}; do go run github.com/onsi/ginkgo/v2/ginkgo -r --randomize-all --randomize-suites --fail-on-pending --keep-going --race --trace >> /tmp/nicklas.log; done
The broken tests are:
[1740406427] Syslog Suite - 84/84 specs ••••••••••••
------------------------------
• [FAILED] [0.165 seconds]
HTTPS_batch [It] testing simple appending of one log
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:62
Timeline >>
2025/02/24 15:13:56 Dropped 10000 dropping logs for aggregate drain with url dropping:
2025/02/24 15:13:56 Dropped 10000 dropping logs for test-source-id's app drain with url dropping://my-drain:8080/path
[FAILED] in [It] - /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69 @ 02/24/25 15:13:56.394
<< Timeline
[FAILED] Expected
<int>: 0
to equal
<int>: 2
In [It] at: /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69 @ 02/24/25 15:13:56.394
Full Stack Trace
code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress/syslog_test.init.func3.2()
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69 +0x87e
------------------------------
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
------------------------------
• [FAILED] [0.193 seconds]
HTTPS_batch [It] test batch dispatching with all logs in a given timeframe
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:96
Timeline >>
2025/02/24 15:13:59 Dropped 10000 dropping logs for aggregate drain with url dropping:
2025/02/24 15:13:59 Dropped 10000 dropping logs for test-source-id's app drain with url dropping://my-drain:8080/path
2025/02/24 15:13:59 Dropped 10000 dropping logs for app-id's app drain with url dropping:
2025/02/24 15:13:59 Dropped 10000 dropping logs for aggregate drain with url dropping:
2025/02/24 15:13:59 Dropped 10000 dropping logs for test-source-id's app drain with url dropping://my-drain:8080/path
2025/02/24 15:13:59 Dropped 10000 dropping logs for app-id's app drain with url dropping:
[FAILED] in [It] - /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102 @ 02/24/25 15:13:59.6
<< Timeline
[FAILED] Timed out after 0.189s.
Expected
<int>: 0
to equal
<int>: 10
In [It] at: /tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102 @ 02/24/25 15:13:59.6
Full Stack Trace
code.cloudfoundry.org/loggregator-agent-release/src/pkg/egress/syslog_test.init.func3.3()
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102 +0x668
------------------------------
••••
Summarizing 2 Failures:
[FAIL] HTTPS_batch [It] testing simple appending of one log
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:69
[FAIL] HTTPS_batch [It] test batch dispatching with all logs in a given timeframe
/tmp/loggregator-agent-release/src/pkg/egress/syslog/https_batch_test.go:102
Ran 84 of 84 Specs in 3.763 seconds
FAIL! -- 82 Passed | 2 Failed | 0 Pending | 0 Skipped
These two tests fail randomly.
Please squash your commits as well.
Branch updated: e7e9fba to 5d67c23
It makes tests hopefully more robust.
It also replaces most sleeps with Consistently and Eventually.
It makes the timings more forgiving. This should make it reliable on weak hardware.
Add functional options pattern to allow test configuration.
LGTM! @nicklas-dohrn thanks for your contribution and the cooperation.
What is changed:
This changes the way the syslog batches are triggered.
The new implementation no longer uses Timers that need to be reset and checked. Instead, it ticks once every second per http_batch drain, so a batch is sent at least once a second.
The new logic is as follows:
The problem fixed by this change:
The old implementation was able to deadlock itself due to the way the timer channel was reset. Further changes to this behaviour are to be expected with the Go 1.23 update (https://tip.golang.org/doc/go1.23#timer-changes).
-> The changed logic is necessary to prevent further issues on future Go upgrades.
Impact:
The new implementation will tick more often, but the overhead will be minimal, because nothing is done for empty drains.