Skip to content

Conversation

stevesg
Copy link
Contributor

@stevesg stevesg commented Apr 15, 2021

In rare cases the test can fail when creating the silence with the
message:

creating the silence failed with status 404 and error the Alertmanager is not configured

The checking of the metric later on also fails:

unable to find metrics [cortex_alertmanager_silences] ...

This appears to be because the alertmanager has not fully configured the
user yet. The tenants_owned metric is polled before beginning the
test, but this metric gets updated before the call to setConfig(),
where the Alertmanager instance for the user is created.

To try and make the test more reliable, we can intead check the
last_reload_successful metric. This metric is per-user, so does not
exist for until setConfig() has completed, when it is set to 1.

Additionally, the failure is very hard to reproduce due to the test
trying multiple alertmanager instances to create the silence (because
the user may not exist in the first instance). This "retrying" will hide
the fact the alertmanager has not been configured yet, and often cover
up the failure. Therefore, the change makes the issue easier to
reproduce by attempting to create the silence in all three instances,
and expecting exactly one of them to return an error. This has the
side-effect that the silence is created twice.

Signed-off-by: Steve Simpson [email protected]

In rare cases the test can fail when creating the silence with the
message:

> `creating the silence failed with status 404 and error the Alertmanager is not configured`

The checking of the metric later on also fails:

> ```unable to find metrics [cortex_alertmanager_silences] ...```

This appears to be because the alertmanager has not fully configured the
user yet. The `tenants_owned` metric is polled before beginning the
test, but this metric gets updated _before_ the call to `setConfig()`,
where the `Alertmanager` instance for the user is created.

To try and make the test more reliable, we can intead check the
`last_reload_successful` metric. This metric is per-user, so does not
exist for until `setConfig()` has completed, when it is set to `1`.

Additionally, the failure is very hard to reproduce due to the test
trying multiple alertmanager instances to create the silence (because
the user may not exist in the first instance). This "retrying" will hide
the fact the alertmanager has not been configured yet, and often cover
up the failure. Therefore, the change makes the issue easier to
reproduce by attempting to create the silence in all three instances,
and expecting exactly one of them to return an error. This has the
side-effect that the silence is created twice.

Signed-off-by: Steve Simpson <[email protected]>
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job spotting and fixing it! 👏

@pracucci pracucci merged commit 588e47a into cortexproject:master Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants