Fix TestAlertmanagerSharding integration test. #4083

stevesg · 2021-04-15T17:38:42Z

In rare cases the test can fail when creating the silence with the
message:

creating the silence failed with status 404 and error the Alertmanager is not configured

The checking of the metric later on also fails:

unable to find metrics [cortex_alertmanager_silences] ...

This appears to be because the alertmanager has not fully configured the
user yet. The tenants_owned metric is polled before beginning the
test, but this metric gets updated before the call to setConfig(),
where the Alertmanager instance for the user is created.

To try and make the test more reliable, we can intead check the
last_reload_successful metric. This metric is per-user, so does not
exist for until setConfig() has completed, when it is set to 1.

Additionally, the failure is very hard to reproduce due to the test
trying multiple alertmanager instances to create the silence (because
the user may not exist in the first instance). This "retrying" will hide
the fact the alertmanager has not been configured yet, and often cover
up the failure. Therefore, the change makes the issue easier to
reproduce by attempting to create the silence in all three instances,
and expecting exactly one of them to return an error. This has the
side-effect that the silence is created twice.

Signed-off-by: Steve Simpson [email protected]

In rare cases the test can fail when creating the silence with the message: > `creating the silence failed with status 404 and error the Alertmanager is not configured` The checking of the metric later on also fails: > ```unable to find metrics [cortex_alertmanager_silences] ...``` This appears to be because the alertmanager has not fully configured the user yet. The `tenants_owned` metric is polled before beginning the test, but this metric gets updated _before_ the call to `setConfig()`, where the `Alertmanager` instance for the user is created. To try and make the test more reliable, we can intead check the `last_reload_successful` metric. This metric is per-user, so does not exist for until `setConfig()` has completed, when it is set to `1`. Additionally, the failure is very hard to reproduce due to the test trying multiple alertmanager instances to create the silence (because the user may not exist in the first instance). This "retrying" will hide the fact the alertmanager has not been configured yet, and often cover up the failure. Therefore, the change makes the issue easier to reproduce by attempting to create the silence in all three instances, and expecting exactly one of them to return an error. This has the side-effect that the silence is created twice. Signed-off-by: Steve Simpson <[email protected]>

pracucci

Good job spotting and fixing it! 👏

pull-request-size bot added the size/S label Apr 15, 2021

stevesg mentioned this pull request Apr 16, 2021

Add minimal support for missing routes in alertmanager distributor. #4084

Merged

3 tasks

pstibrany approved these changes Apr 16, 2021

View reviewed changes

pracucci approved these changes Apr 16, 2021

View reviewed changes

pracucci merged commit 588e47a into cortexproject:master Apr 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix TestAlertmanagerSharding integration test. #4083

Fix TestAlertmanagerSharding integration test. #4083

Uh oh!

stevesg commented Apr 15, 2021

Uh oh!

pracucci left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix TestAlertmanagerSharding integration test. #4083

Fix TestAlertmanagerSharding integration test. #4083

Uh oh!

Conversation

stevesg commented Apr 15, 2021

Uh oh!

pracucci left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants