@mycrEEpy

What does this PR do?

Add two parameters MetricsChannelCapacity and DropMetricsAtCapacity.

This allows the AddMetric function to be truly non-blocking, at the risk of losing some metrics.

The defaults remain unchanged: a channel capacity of 2000, with AddMetric blocking until capacity is available.
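For readers unfamiliar with the pattern, here is a minimal sketch of how a drop-at-capacity send can work in Go. The `processor` type, its fields, and `addMetric` are hypothetical stand-ins for illustration, not the code in this PR; only the `select` with a `default` case is the point.

```go
package main

import "fmt"

// processor is a hypothetical stand-in for a metrics processor.
// Its field names are assumptions, not the library's actual fields.
type processor struct {
	metricsChan    chan string
	dropAtCapacity bool
}

// addMetric queues a metric and reports whether it was accepted.
// With dropAtCapacity set, a full channel causes the metric to be
// dropped instead of blocking the caller.
func (p *processor) addMetric(m string) bool {
	if !p.dropAtCapacity {
		p.metricsChan <- m // blocks while the channel is full
		return true
	}
	select {
	case p.metricsChan <- m:
		return true
	default:
		// Channel is at capacity: drop rather than wait.
		return false
	}
}

func main() {
	// Capacity 2 keeps the demo short; nothing drains the channel.
	p := &processor{metricsChan: make(chan string, 2), dropAtCapacity: true}
	for i := 0; i < 3; i++ {
		fmt.Println(p.addMetric(fmt.Sprintf("metric-%d", i)))
	}
	// Prints: true, true, false
}
```

The trade-off is the one stated above: the caller never waits, but any metric that arrives while the channel is full is lost.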

Another try after #130 got closed due to stale activity.

Motivation

Following the Datadog outage of 2023.03.08 in EU central, some of our Lambdas blocked because the AddMetric function blocks when the metricsChan is at capacity.

Testing Guidelines

Added test case TestProcessorBatchesDropMetricsAtCapacity

Additional Notes

Types of changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Checklist

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog

mycrEEpy and others added 2 commits March 9, 2023 19:12
This will allow the AddMetric function to be truly non-blocking even if
we risk loosing some metrics.

Signed-off-by: Tobias Germer <[email protected]>
@mycrEEpy mycrEEpy requested a review from a team as a code owner March 16, 2024 13:14
@purple4reina
Contributor

Hey @mycrEEpy, thanks for the details and explanation of why this change is important. I'm taking a look now and will get back to you with questions/comments.

@purple4reina
Contributor

Hi @mycrEEpy,

The code here all looks great. I'm even thinking we should enable this feature by default.

One thing though, I am having a really tough time trying to reproduce the issue based on the information you reported. I am testing by adding a time.Sleep(time.Hour) or setting err = errors.New("oops") after the line of code where we are sending the metrics.

Using your branch, I would assume based on your description that my test function would hang. However, it does not. I tested this on v1.9.0, which was the most recent version available when you opened your original PR #130.

I believe it's a high priority for us to ensure that you do not have another incident like the one it sounds like we caused for you last year. However, I'm just not yet convinced this PR will solve the problem.

Do you have any ideas on how we can recreate your issue?

~ Rey

@mycrEEpy
Author

I'll see if I can reproduce it, but the sleep should do the trick since the Do method was timing out during the outage.

@mycrEEpy
Author

As a reference, we were producing about 200 metric points per second, so the channel was full after 8-10 seconds that day.
