This document contains a complete reference of all alerts in Sourcegraph's monitoring, and next steps for when you find alerts that are firing. If your alert isn't mentioned here, or if the next steps don't help, contact us for assistance.
To learn more about Sourcegraph's alerting and how to set up alerts, see our alerting guide.
99th percentile successful search request duration over 5m
Descriptions
- warning frontend: 20s+ 99th percentile successful search request duration over 5m
Next steps
- Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with "slow search request" for additional details.
- Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it is regularly hitting max CPU utilization.
- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if it is regularly hitting max CPU utilization.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_99th_percentile_search_request_duration"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.99, sum by (le) (rate(src_search_streaming_latency_seconds_bucket{source="browser"}[5m])))) >= 20)
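As a rough illustration of what the histogram_quantile part of this query estimates, the sketch below interpolates a percentile from cumulative latency buckets and compares it to the 20s threshold. It is a simplified model of the Prometheus behaviour, not the production code path: the bucket bounds and counts are hypothetical, and the real query feeds per-second rates rather than raw counts into the same calculation.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, with at least one finite bucket and a final +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # estimate is capped at that bucket's upper bound.
                return buckets[index - 1][0]
            # Interpolate linearly within the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

# Hypothetical bucket counts for one 5m window (not real Sourcegraph data).
buckets = [(1, 900), (5, 960), (10, 980), (30, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(round(p99, 2), p99 >= 20)
```

Because the estimate is interpolated inside a bucket, its precision depends on how finely the buckets are spaced around the threshold.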
90th percentile successful search request duration over 5m
Descriptions
- warning frontend: 15s+ 90th percentile successful search request duration over 5m
Next steps
- Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 15, in the site configuration and looking for frontend warning logs prefixed with "slow search request" for additional details.
- Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it is regularly hitting max CPU utilization.
- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if it is regularly hitting max CPU utilization.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_90th_percentile_search_request_duration"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_search_streaming_latency_seconds_bucket{source="browser"}[5m])))) >= 15)
hard timeout search responses every 5m
Descriptions
- warning frontend: 2%+ hard timeout search responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_hard_timeout_search_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max(((sum(increase(src_graphql_search_response{request_name!="CodeIntelSearch",source="browser",status="timeout"}[5m])) + sum(increase(src_graphql_search_response{alert_type="timed_out",request_name!="CodeIntelSearch",source="browser",status="alert"}[5m]))) / sum(increase(src_graphql_search_response{request_name!="CodeIntelSearch",source="browser"}[5m])) * 100) >= 2)
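The arithmetic behind this query can be sketched as follows: responses with status "timeout" and alert responses of type "timed_out" are added together, divided by all search responses in the window, and scaled to a percentage. The function name and the sample counts are hypothetical, and the sketch ignores the requirement that the condition hold for 15m.

```python
def hard_timeout_percentage(timeouts, timed_out_alerts, total):
    """Percentage of search responses that hard-timed out in one 5m window."""
    if total == 0:
        # Assumption for this sketch: report 0% for an idle window.
        # In PromQL, 0/0 yields NaN, which never satisfies ">= 2".
        return 0.0
    return (timeouts + timed_out_alerts) / total * 100

# Hypothetical counts for a single window.
pct = hard_timeout_percentage(timeouts=12, timed_out_alerts=6, total=800)
fires = pct >= 2  # threshold taken from the warning alert above
print(round(pct, 2), fires)
```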
hard error search responses every 5m
Descriptions
- warning frontend: 2%+ hard error search responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_hard_error_search_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{request_name!="CodeIntelSearch",source="browser",status=~"error"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{request_name!="CodeIntelSearch",source="browser"}[5m])) * 100) >= 2)
partial timeout search responses every 5m
Descriptions
- warning frontend: 5%+ partial timeout search responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_partial_timeout_search_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{request_name!="CodeIntelSearch",source="browser",status="partial_timeout"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{request_name!="CodeIntelSearch",source="browser"}[5m])) * 100) >= 5)
search alert user suggestions shown every 5m
Descriptions
- warning frontend: 5%+ search alert user suggestions shown every 5m for 15m0s
Next steps
- This indicates your users are making syntax errors or similar user errors.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_search_alert_user_suggestions"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (alert_type) (increase(src_graphql_search_response{alert_type!~"timed_out|no_results__suggest_quotes",request_name!="CodeIntelSearch",source="browser",status="alert"}[5m])) / ignoring (alert_type) group_left () sum(increase(src_graphql_search_response{request_name!="CodeIntelSearch",source="browser"}[5m])) * 100) >= 5)
90th percentile page load latency over all routes over 10m
Descriptions
- warning frontend: 2s+ 90th percentile page load latency over all routes over 10m
Next steps
- Confirm that the Sourcegraph frontend has enough CPU/memory using the provisioning panels.
- Investigate potential sources of latency by selecting Explore and modifying the sum by(le) section to include additional labels: for example, sum by(le, job) or sum by (le, instance).
- Trace a request to see what the slowest part is: https://docs.sourcegraph.com/admin/observability/tracing
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_page_load_latency"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_http_request_duration_seconds_bucket{route!="blob",route!="raw",route!~"graphql.*"}[10m])))) >= 2)
90th percentile blob load latency over 10m
Descriptions
- critical frontend: 5s+ 90th percentile blob load latency over 10m
Next steps
- Confirm that the Sourcegraph frontend has enough CPU/memory using the provisioning panels.
- Trace a request to see what the slowest part is: https://docs.sourcegraph.com/admin/observability/tracing
- Check that gitserver containers have enough CPU/memory and are not getting throttled.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_frontend_blob_load_latency"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: max((histogram_quantile(0.9, sum by (le) (rate(src_http_request_duration_seconds_bucket{route="blob"}[10m])))) >= 5)
99th percentile code-intel successful search request duration over 5m
Descriptions
- warning frontend: 20s+ 99th percentile code-intel successful search request duration over 5m
Next steps
- Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with "slow search request" for additional details.
- Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it is regularly hitting max CPU utilization.
- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if it is regularly hitting max CPU utilization.
- This alert may indicate that your instance is struggling to process symbols queries on a monorepo; learn more here.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_99th_percentile_search_codeintel_request_duration"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.99, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",request_name="CodeIntelSearch",source="browser",type="Search"}[5m])))) >= 20)
90th percentile code-intel successful search request duration over 5m
Descriptions
- warning frontend: 15s+ 90th percentile code-intel successful search request duration over 5m
Next steps
- Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 15, in the site configuration and looking for frontend warning logs prefixed with "slow search request" for additional details.
- Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it is regularly hitting max CPU utilization.
- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if it is regularly hitting max CPU utilization.
- This alert may indicate that your instance is struggling to process symbols queries on a monorepo; learn more here.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_90th_percentile_search_codeintel_request_duration"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",request_name="CodeIntelSearch",source="browser",type="Search"}[5m])))) >= 15)
hard timeout search code-intel responses every 5m
Descriptions
- warning frontend: 2%+ hard timeout search code-intel responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_hard_timeout_search_codeintel_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max(((sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status="timeout"}[5m])) + sum(increase(src_graphql_search_response{alert_type="timed_out",request_name="CodeIntelSearch",source="browser",status="alert"}[5m]))) / sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser"}[5m])) * 100) >= 2)
hard error search code-intel responses every 5m
Descriptions
- warning frontend: 2%+ hard error search code-intel responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_hard_error_search_codeintel_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status=~"error"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser"}[5m])) * 100) >= 2)
partial timeout search code-intel responses every 5m
Descriptions
- warning frontend: 5%+ partial timeout search code-intel responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_partial_timeout_search_codeintel_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status="partial_timeout"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser",status="partial_timeout"}[5m])) * 100) >= 5)
search code-intel alert user suggestions shown every 5m
Descriptions
- warning frontend: 5%+ search code-intel alert user suggestions shown every 5m for 15m0s
Next steps
- This indicates a bug in Sourcegraph, please open an issue.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_search_codeintel_alert_user_suggestions"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (alert_type) (increase(src_graphql_search_response{alert_type!~"timed_out",request_name="CodeIntelSearch",source="browser",status="alert"}[5m])) / ignoring (alert_type) group_left () sum(increase(src_graphql_search_response{request_name="CodeIntelSearch",source="browser"}[5m])) * 100) >= 5)
99th percentile successful search API request duration over 5m
Descriptions
- warning frontend: 50s+ 99th percentile successful search API request duration over 5m
Next steps
- Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with "slow search request" for additional details.
- Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it is regularly hitting max CPU utilization.
- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if it is regularly hitting max CPU utilization.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_99th_percentile_search_api_request_duration"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.99, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",source="other",type="Search"}[5m])))) >= 50)
90th percentile successful search API request duration over 5m
Descriptions
- warning frontend: 40s+ 90th percentile successful search API request duration over 5m
Next steps
- Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 15, in the site configuration and looking for frontend warning logs prefixed with "slow search request" for additional details.
- Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, and consider increasing CPU limits in indexed-search.Deployment.yaml if it is regularly hitting max CPU utilization.
- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, and consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if it is regularly hitting max CPU utilization.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_90th_percentile_search_api_request_duration"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (rate(src_graphql_field_seconds_bucket{error="false",field="results",source="other",type="Search"}[5m])))) >= 40)
hard error search API responses every 5m
Descriptions
- warning frontend: 2%+ hard error search API responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_hard_error_search_api_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (status) (increase(src_graphql_search_response{source="other",status=~"error"}[5m])) / ignoring (status) group_left () sum(increase(src_graphql_search_response{source="other"}[5m]))) >= 2)
partial timeout search API responses every 5m
Descriptions
- warning frontend: 5%+ partial timeout search API responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_partial_timeout_search_api_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum(increase(src_graphql_search_response{source="other",status="partial_timeout"}[5m])) / sum(increase(src_graphql_search_response{source="other"}[5m]))) >= 5)
search API alert user suggestions shown every 5m
Descriptions
- warning frontend: 5%+ search API alert user suggestions shown every 5m
Next steps
- This indicates your users' search API requests have syntax errors or similar user errors. Check the responses the API sends back for an explanation.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_search_api_alert_user_suggestions"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (alert_type) (increase(src_graphql_search_response{alert_type!~"timed_out|no_results__suggest_quotes",source="other",status="alert"}[5m])) / ignoring (alert_type) group_left () sum(increase(src_graphql_search_response{source="other",status="alert"}[5m]))) >= 5)
internal indexed search error responses every 5m
Descriptions
- warning frontend: 5%+ internal indexed search error responses every 5m for 15m0s
Next steps
- Check the Zoekt Web Server dashboard for indications it might be unhealthy.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_internal_indexed_search_error_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (code) (increase(src_zoekt_request_duration_seconds_count{code!~"2.."}[5m])) / ignoring (code) group_left () sum(increase(src_zoekt_request_duration_seconds_count[5m])) * 100) >= 5)
internal unindexed search error responses every 5m
Descriptions
- warning frontend: 5%+ internal unindexed search error responses every 5m for 15m0s
Next steps
- Check the Searcher dashboard for indications it might be unhealthy.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_internal_unindexed_search_error_responses"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum by (code) (increase(searcher_service_request_total{code!~"2.."}[5m])) / ignoring (code) group_left () sum(increase(searcher_service_request_total[5m])) * 100) >= 5)
internal API error responses every 5m by route
Descriptions
- warning frontend: 5%+ internal API error responses every 5m by route for 15m0s
Next steps
- May not be a substantial issue; check the frontend logs for potential causes.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_internalapi_error_responses"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2.."}[5m])) / ignoring (code) group_left () sum(increase(src_frontend_internal_request_duration_seconds_count[5m])) * 100) >= 5)
99th percentile successful gitserver query duration over 5m
Descriptions
- warning frontend: 20s+ 99th percentile successful gitserver query duration over 5m
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_99th_percentile_gitserver_duration"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.99, sum by (le, category) (rate(src_gitserver_request_duration_seconds_bucket{job=~"(sourcegraph-)?frontend"}[5m])))) >= 20)
gitserver error responses every 5m
Descriptions
- warning frontend: 5%+ gitserver error responses every 5m for 15m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_gitserver_error_responses"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_gitserver_request_duration_seconds_count{code!~"2..",job=~"(sourcegraph-)?frontend"}[5m])) / ignoring (code) group_left () sum by (category) (increase(src_gitserver_request_duration_seconds_count{job=~"(sourcegraph-)?frontend"}[5m])) * 100) >= 5)
warning test alert metric
Descriptions
- warning frontend: 1+ warning test alert metric
Next steps
- This alert is triggered via the triggerObservabilityTestAlert GraphQL endpoint, and will automatically resolve itself.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_observability_test_alert_warning"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (owner) (observability_test_metric_warning)) >= 1)
critical test alert metric
Descriptions
- critical frontend: 1+ critical test alert metric
Next steps
- This alert is triggered via the triggerObservabilityTestAlert GraphQL endpoint, and will automatically resolve itself.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_frontend_observability_test_alert_critical"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: max((max by (owner) (observability_test_metric_critical)) >= 1)
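For reference, a request body for the triggerObservabilityTestAlert mutation might be built as in the sketch below. Only the mutation name comes from this page; the level argument, the alwaysNil return field, and the helper name build_test_alert_body are assumptions that should be checked against your instance's GraphQL schema before use. The sketch only builds the JSON body and does not send it.

```python
import json

def build_test_alert_body(level):
    """Build a GraphQL request body that asks for a test alert at the given level."""
    if level not in ("warning", "critical"):
        raise ValueError("level must be 'warning' or 'critical'")
    # Assumed mutation shape: verify the argument and field names in the schema.
    query = 'mutation { triggerObservabilityTestAlert(level: "%s") { alwaysNil } }' % level
    return json.dumps({"query": query})

body = build_test_alert_body("critical")
print(body)
```

The resulting body would be sent as an authenticated POST to the instance's GraphQL API by a site admin.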
cryptographic requests to Cloud KMS every 1m
Descriptions
- warning frontend: 15000+ cryptographic requests to Cloud KMS every 1m for 5m0s
- critical frontend: 30000+ cryptographic requests to Cloud KMS every 1m for 5m0s
Next steps
- Revert recent commits that cause extensive listing from "external_services" and/or "user_external_accounts" tables.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_cloudkms_cryptographic_requests",
"critical_frontend_cloudkms_cryptographic_requests"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((sum(increase(src_cloudkms_cryptographic_total[1m]))) >= 15000)
Generated query for critical alert: max((sum(increase(src_cloudkms_cryptographic_total[1m]))) >= 30000)
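The identifiers used in "observability.silenceAlerts" throughout this page appear to follow the pattern level, service, alert name, joined by underscores. Assuming that pattern (inferred from the examples here, not from a specification), the sketch below builds the fragment for an alert that has both a warning and a critical level. The helper name silence_alerts_fragment is hypothetical.

```python
import json

def silence_alerts_fragment(service, alert_name, levels):
    """Return a site-configuration fragment that silences the given alert levels."""
    # Inferred naming pattern: <level>_<service>_<alert_name>
    identifiers = [f"{level}_{service}_{alert_name}" for level in levels]
    return {"observability.silenceAlerts": identifiers}

fragment = silence_alerts_fragment(
    "frontend", "cloudkms_cryptographic_requests", ["warning", "critical"]
)
print(json.dumps(fragment, indent=2))
```

Silencing only the warning level while keeping the critical level active is a matter of passing a single-element levels list.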
mean blocked seconds per conn request
Descriptions
- warning frontend: 0.05s+ mean blocked seconds per conn request for 10m0s
- critical frontend: 0.1s+ mean blocked seconds per conn request for 15m0s
Next steps
- Increase SRC_PGSQL_MAX_OPEN, giving the database more memory if needed.
- Scale up Postgres memory/CPUs. See our scaling guide.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_mean_blocked_seconds_per_conn_request",
"critical_frontend_mean_blocked_seconds_per_conn_request"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.05)
Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.1)
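Both queries divide the increase in blocked seconds by the increase in the number of connection requests that had to wait, and differ only in the threshold. A minimal sketch of that calculation, with hypothetical sample values and function names, and ignoring the 10m and 15m durations the alerts also require:

```python
def mean_blocked_seconds(blocked_seconds_increase, waited_for_increase):
    """Average time a connection request spent blocked during the window."""
    if waited_for_increase == 0:
        # Assumption for this sketch: no waiting requests means nothing was blocked.
        return 0.0
    return blocked_seconds_increase / waited_for_increase

def severity(mean_blocked):
    """Map the mean to the alert level using the thresholds listed above."""
    if mean_blocked >= 0.1:
        return "critical"
    if mean_blocked >= 0.05:
        return "warning"
    return "ok"

# Hypothetical window: 9 seconds of blocking spread over 120 waiting requests.
level = severity(mean_blocked_seconds(9.0, 120))
print(level)
```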
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning frontend: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_container_cpu_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^(frontend|sourcegraph-frontend).*"}) >= 99)
container memory usage by instance
Descriptions
- warning frontend: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_container_memory_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^(frontend|sourcegraph-frontend).*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning frontend: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the (frontend|sourcegraph-frontend) service.
- Docker Compose: Consider increasing cpus: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^(frontend|sourcegraph-frontend).*"}[1d])) >= 80)
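As a rough illustration of what the generated query above measures, the following sketch computes a 90th percentile over a window of CPU usage samples and compares it with the 80% threshold. It assumes linear interpolation between ranks; the function name and the sample values are hypothetical, and the sketch does not model the 336h period for which the condition must hold.

```python
import math


def quantile_over_time(q, samples):
    """Return the q-quantile of the samples in one window.

    Assumption: values are interpolated linearly between neighbouring ranks.
    """
    if not samples:
        return math.nan
    values = sorted(samples)
    rank = q * (len(values) - 1)
    lower = math.floor(rank)
    upper = min(len(values) - 1, lower + 1)
    weight = rank - lower
    return values[lower] * (1 - weight) + values[upper] * weight


# Hypothetical CPU usage percentages sampled over one day.
cpu_samples = [40, 50, 60, 70, 85, 90, 95, 55, 65, 75, 45]
p90 = quantile_over_time(0.9, cpu_samples)
print(f"p90={p90:.1f} exceeds_threshold={p90 >= 80}")
```

A single spike does not move the 90th percentile much, which is why this long-term alert tracks sustained load rather than short bursts.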
container memory usage (1d maximum) by instance
Descriptions
- warning frontend: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the (frontend|sourcegraph-frontend) service.
- Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(frontend|sourcegraph-frontend).*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning frontend: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^(frontend|sourcegraph-frontend).*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning frontend: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(frontend|sourcegraph-frontend).*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning frontend: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the (frontend|sourcegraph-frontend) container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_container_oomkill_events_total"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^(frontend|sourcegraph-frontend).*"})) >= 1)
maximum active goroutines
Descriptions
- warning frontend: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_go_goroutines"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*(frontend|sourcegraph-frontend)"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning frontend: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_go_gc_duration_seconds"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*(frontend|sourcegraph-frontend)"})) >= 2)
percentage pods available
Descriptions
- critical frontend: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod (frontend|sourcegraph-frontend) (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p (frontend|sourcegraph-frontend).
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_frontend_pods_available_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*(frontend|sourcegraph-frontend)"}) / count by (app) (up{app=~".*(frontend|sourcegraph-frontend)"}) * 100) <= 90)
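The generated query above divides the number of targets reporting up by the number of targets, per app. A minimal sketch of that arithmetic follows, assuming each pod contributes one sample of 0 or 1; the function names and sample values are hypothetical, and the 10m duration is not modelled.

```python
def pods_available_percentage(up_samples):
    """Percentage of pods whose `up` sample is 1.

    up_samples: one 0/1 value per pod of the same app (hypothetical input).
    """
    if not up_samples:
        raise ValueError("at least one sample is required")
    return sum(up_samples) / len(up_samples) * 100


def is_critical(percentage, threshold=90):
    # The generated query matches when the percentage is at or below 90.
    return percentage <= threshold


samples = [1, 1, 1, 0]  # hypothetical: three of four pods are up
pct = pods_available_percentage(samples)
print(pct, is_critical(pct))
```

With few replicas, a single unavailable pod is enough to drop below the threshold, so the follow-up checks in the next steps apply to each unavailable pod individually.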
email delivery failures every 30 minutes
Descriptions
- warning frontend: 1+ email delivery failures every 30 minutes
- critical frontend: 2+ email delivery failures every 30 minutes
Next steps
- Check your SMTP configuration in site configuration.
- Check frontend logs for more detailed error messages.
- Check your SMTP provider for more detailed error messages.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_email_delivery_failures",
"critical_frontend_email_delivery_failures"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum(increase(src_email_send{success="false"}[30m]))) >= 1)
Generated query for critical alert: max((sum(increase(src_email_send{success="false"}[30m]))) >= 2)
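The silence entries listed in this reference appear to follow a level_service_name pattern. The sketch below builds the site configuration fragment from those three parts. The helper name is hypothetical and the pattern is inferred from the entries shown here, so copy the alert name from this reference rather than deriving it from the alert title.

```python
import json


def silence_alerts_config(service, alert_name, levels=("warning", "critical")):
    """Build an observability.silenceAlerts fragment for one alert.

    Assumption: entries are formatted as <level>_<service>_<alert_name>,
    as inferred from the entries listed in this reference.
    """
    entries = [f"{level}_{service}_{alert_name}" for level in levels]
    return {"observability.silenceAlerts": entries}


fragment = silence_alerts_config("frontend", "email_delivery_failures")
print(json.dumps(fragment, indent=2))
```

Pass a single level, for example levels=("warning",), for alerts that define only one severity.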
mean successful sentinel search duration over 2h
Descriptions
- warning frontend: 5s+ mean successful sentinel search duration over 2h for 15m0s
- critical frontend: 8s+ mean successful sentinel search duration over 2h for 30m0s
Next steps
- Look at the breakdown by query to determine if a specific query type is being affected
- Check for high CPU usage on zoekt-webserver
- Check Honeycomb for unusual activity
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_mean_successful_sentinel_duration_over_2h",
"critical_frontend_mean_successful_sentinel_duration_over_2h"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum(rate(src_search_response_latency_seconds_sum{source=~"searchblitz.*",status="success"}[2h])) / sum(rate(src_search_response_latency_seconds_count{source=~"searchblitz.*",status="success"}[2h]))) >= 5)
Generated query for critical alert: max((sum(rate(src_search_response_latency_seconds_sum{source=~"searchblitz.*",status="success"}[2h])) / sum(rate(src_search_response_latency_seconds_count{source=~"searchblitz.*",status="success"}[2h]))) >= 8)
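Both generated queries above compute a mean by dividing the rate of the latency sum by the rate of the request count. The following sketch shows that arithmetic with the 5s and 8s thresholds from the descriptions. The function names and input values are hypothetical, and the 15m and 30m durations are not modelled.

```python
def mean_latency(sum_rate, count_rate):
    """Mean request duration in seconds: rate of _sum over rate of _count."""
    if count_rate == 0:
        # No successful sentinel searches in the window, so no mean exists.
        return None
    return sum_rate / count_rate


def severity(mean_seconds, warning=5.0, critical=8.0):
    """Map a mean duration to the alert level it would match."""
    if mean_seconds is None:
        return "ok"
    if mean_seconds >= critical:
        return "critical"
    if mean_seconds >= warning:
        return "warning"
    return "ok"


# Hypothetical rates: 12 seconds of latency per second across 2 requests per second.
print(severity(mean_latency(12.0, 2.0)))
```

Because a mean is sensitive to outliers, a few very slow sentinel queries can trigger this alert even when most queries are fast, which is why the first next step is to look at the breakdown by query.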
mean successful sentinel stream latency over 2h
Descriptions
- warning frontend: 2s+ mean successful sentinel stream latency over 2h for 15m0s
- critical frontend: 3s+ mean successful sentinel stream latency over 2h for 30m0s
Next steps
- Look at the breakdown by query to determine if a specific query type is being affected
- Check for high CPU usage on zoekt-webserver
- Check Honeycomb for unusual activity
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_mean_sentinel_stream_latency_over_2h",
"critical_frontend_mean_sentinel_stream_latency_over_2h"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((sum(rate(src_search_streaming_latency_seconds_sum{source=~"searchblitz.*"}[2h])) / sum(rate(src_search_streaming_latency_seconds_count{source=~"searchblitz.*"}[2h]))) >= 2)
Generated query for critical alert: max((sum(rate(src_search_streaming_latency_seconds_sum{source=~"searchblitz.*"}[2h])) / sum(rate(src_search_streaming_latency_seconds_count{source=~"searchblitz.*"}[2h]))) >= 3)
90th percentile successful sentinel search duration over 2h
Descriptions
- warning frontend: 5s+ 90th percentile successful sentinel search duration over 2h for 15m0s
- critical frontend: 10s+ 90th percentile successful sentinel search duration over 2h for 3h30m0s
Next steps
- Look at the breakdown by query to determine if a specific query type is being affected
- Check for high CPU usage on zoekt-webserver
- Check Honeycomb for unusual activity
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_90th_percentile_successful_sentinel_duration_over_2h",
"critical_frontend_90th_percentile_successful_sentinel_duration_over_2h"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (label_replace(rate(src_search_response_latency_seconds_bucket{source=~"searchblitz.*",status="success"}[2h]), "source", "$1", "source", "searchblitz_(.*)")))) >= 5)
Generated query for critical alert: max((histogram_quantile(0.9, sum by (le) (label_replace(rate(src_search_response_latency_seconds_bucket{source=~"searchblitz.*",status="success"}[2h]), "source", "$1", "source", "searchblitz_(.*)")))) >= 10)
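The generated queries above estimate a percentile from cumulative histogram buckets. The sketch below illustrates the idea, assuming linear interpolation inside the bucket that contains the target rank. The function name and bucket values are hypothetical, and this is a simplified illustration rather than the exact Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket. Assumes 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical latency buckets in seconds.
latency_buckets = [(1.0, 50), (5.0, 80), (10.0, 95), (math.inf, 100)]
p90 = histogram_quantile(0.9, latency_buckets)
print(f"p90={p90:.2f}s warning={p90 >= 5} critical={p90 >= 10}")
```

The estimate can only be as precise as the bucket boundaries allow, so a reported value close to a threshold is worth confirming against the per-query breakdown.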
90th percentile successful sentinel stream latency over 2h
Descriptions
- warning frontend: 4s+ 90th percentile successful sentinel stream latency over 2h for 15m0s
- critical frontend: 6s+ 90th percentile successful sentinel stream latency over 2h for 3h30m0s
Next steps
- Look at the breakdown by query to determine if a specific query type is being affected
- Check for high CPU usage on zoekt-webserver
- Check Honeycomb for unusual activity
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_frontend_90th_percentile_sentinel_stream_latency_over_2h",
"critical_frontend_90th_percentile_sentinel_stream_latency_over_2h"
]
Managed by the Sourcegraph Search team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.9, sum by (le) (label_replace(rate(src_search_streaming_latency_seconds_bucket{source=~"searchblitz.*"}[2h]), "source", "$1", "source", "searchblitz_(.*)")))) >= 4)
Generated query for critical alert: max((histogram_quantile(0.9, sum by (le) (label_replace(rate(src_search_streaming_latency_seconds_bucket{source=~"searchblitz.*"}[2h]), "source", "$1", "source", "searchblitz_(.*)")))) >= 6)
disk space remaining by instance
Descriptions
- warning gitserver: less than 15% disk space remaining by instance
- critical gitserver: less than 10% disk space remaining by instance for 10m0s
Next steps
- On a warning alert, you may want to provision more disk space: Sourcegraph may be about to start evicting repositories due to disk pressure, which may result in decreased performance, users having to wait for repositories to clone, etc.
- On a critical alert, you need to provision more disk space: Sourcegraph should be evicting repositories from disk, but is either filling up faster than it can evict, or there is an issue with the janitor job.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_disk_space_remaining",
"critical_gitserver_disk_space_remaining"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 15)
Generated query for critical alert: min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 10)
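The two generated queries above compare the percentage of free disk space against 15% and 10%. A minimal sketch of that calculation follows; the function names and byte values are hypothetical, and the 10m duration on the critical alert is not modelled.

```python
def disk_space_remaining_percent(available_bytes, total_bytes):
    """Free disk space as a percentage of total disk space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100 * available_bytes / total_bytes


def disk_severity(percent, warning=15, critical=10):
    """Map a free-space percentage to the alert level it would match."""
    if percent < critical:
        return "critical"
    if percent < warning:
        return "warning"
    return "ok"


# Hypothetical gitserver volume: 120 GiB free out of 1000 GiB.
gib = 1024 ** 3
pct = disk_space_remaining_percent(120 * gib, 1000 * gib)
print(f"{pct:.1f}% remaining -> {disk_severity(pct)}")
```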
git commands running on each gitserver instance
Descriptions
- warning gitserver: 50+ git commands running on each gitserver instance for 2m0s
- critical gitserver: 100+ git commands running on each gitserver instance for 5m0s
Next steps
- Check if the problem may be an intermittent and temporary peak using the "Container monitoring" section at the bottom of the Git Server dashboard.
- Single container deployments: Consider upgrading to a Docker Compose deployment which offers better scalability and resource isolation.
- Kubernetes and Docker Compose: Check that you are running a similar number of git server replicas and that their CPU/memory limits are allocated according to what is shown in the Sourcegraph resource estimator.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_running_git_commands",
"critical_gitserver_running_git_commands"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((sum by (instance, cmd) (src_gitserver_exec_running)) >= 50)
Generated query for critical alert: max((sum by (instance, cmd) (src_gitserver_exec_running)) >= 100)
repository clone queue size
Descriptions
- warning gitserver: 25+ repository clone queue size
Next steps
- If you just added several repositories, the warning may be expected.
- Check which repositories need cloning, by visiting e.g. https://sourcegraph.example.com/site-admin/repositories?filter=not-cloned
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_repository_clone_queue_size"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((sum(src_gitserver_clone_queue)) >= 25)
repository existence check queue size
Descriptions
- warning gitserver: 25+ repository existence check queue size
Next steps
- Check the code host status indicator for errors: on the Sourcegraph app homepage, when signed in as an admin click the cloud icon in the top right corner of the page.
- Check if the issue continues to happen after 30 minutes; it may be temporary.
- Check the gitserver logs for more information.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_repository_existence_check_queue_size"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((sum(src_gitserver_lsremote_queue)) >= 25)
frontend-internal API error responses every 5m by route
Descriptions
- warning gitserver: 2%+ frontend-internal API error responses every 5m by route for 5m0s
Next steps
- Single-container deployments: Check docker logs $CONTAINER_ID for logs starting with repo-updater that indicate requests to the frontend service are failing.
- Kubernetes:
  - Confirm that kubectl get pods shows the frontend pods are healthy.
  - Check kubectl logs gitserver for logs indicating request failures to frontend or frontend-internal.
- Docker Compose:
  - Confirm that docker ps shows the frontend-internal container is healthy.
  - Check docker logs gitserver for logs indicating request failures to frontend or frontend-internal.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_frontend_internal_api_error_responses"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2..",job="gitserver"}[5m])) / ignoring (category) group_left () sum(increase(src_frontend_internal_request_duration_seconds_count{job="gitserver"}[5m]))) >= 2)
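The generated query above divides the non-2xx responses of each route category by the total number of frontend-internal requests made by gitserver. The sketch below shows that ratio per category. It multiplies by 100 to match the 2%+ wording in the description; the generated query does not show that scaling, so treat it as an assumption. The function name, category names, and counts are hypothetical.

```python
def error_percent_by_category(error_counts, total_requests):
    """Share of all requests that failed, per route category, as a percentage.

    error_counts: {category: number of non-2xx responses in the window}
    total_requests: all requests in the same window, across categories.
    """
    if total_requests == 0:
        return {}
    return {
        category: 100 * errors / total_requests
        for category, errors in error_counts.items()
    }


# Hypothetical 5m window: 100 requests in total, 4 of them failed.
percentages = error_percent_by_category({"route-a": 3, "route-b": 1}, 100)
firing = {category: pct for category, pct in percentages.items() if pct >= 2}
print(firing)
```

Because each category is divided by the overall total, a route with little traffic needs a large share of failures before it reaches the threshold on its own.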
mean blocked seconds per conn request
Descriptions
- warning gitserver: 0.05s+ mean blocked seconds per conn request for 10m0s
- critical gitserver: 0.1s+ mean blocked seconds per conn request for 15m0s
Next steps
- Increase SRC_PGSQL_MAX_OPEN and, if needed, give the database more memory.
- Scale up Postgres memory and CPUs. See our scaling guide.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_mean_blocked_seconds_per_conn_request",
"critical_gitserver_mean_blocked_seconds_per_conn_request"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.05)
Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.1)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning gitserver: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the gitserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_container_cpu_usage"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^gitserver.*"}) >= 99)
container memory usage by instance
Descriptions
- warning gitserver: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the gitserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_container_memory_usage"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^gitserver.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning gitserver: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the gitserver service.
- Docker Compose: Consider increasing cpus: of the gitserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^gitserver.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning gitserver: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the gitserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^gitserver.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning gitserver: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the gitserver container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_container_oomkill_events_total"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^gitserver.*"})) >= 1)
maximum active goroutines
Descriptions
- warning gitserver: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_go_goroutines"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*gitserver"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning gitserver: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_gitserver_go_gc_duration_seconds"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*gitserver"})) >= 2)
percentage pods available
Descriptions
- critical gitserver: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod gitserver (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p gitserver.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_gitserver_pods_available_percentage"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*gitserver"}) / count by (app) (up{app=~".*gitserver"}) * 100) <= 90)
number of requests waiting on the global mutex
Descriptions
- warning github-proxy: 100+ number of requests waiting on the global mutex for 5m0s
Next steps
- Check github-proxy logs for network connection issues.
- Check GitHub status.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_github_proxy_waiting_requests"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max(github_proxy_waiting_requests)) >= 100)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning github-proxy: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the github-proxy container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_container_cpu_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^github-proxy.*"}) >= 99)
container memory usage by instance
Descriptions
- warning github-proxy: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the github-proxy container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_container_memory_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^github-proxy.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning github-proxy: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the github-proxy service.
- Docker Compose: Consider increasing cpus: of the github-proxy container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^github-proxy.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning github-proxy: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the github-proxy service.
- Docker Compose: Consider increasing memory: of the github-proxy container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^github-proxy.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning github-proxy: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the github-proxy container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^github-proxy.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning github-proxy: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the github-proxy container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^github-proxy.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning github-proxy: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the github-proxy container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_container_oomkill_events_total"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^github-proxy.*"})) >= 1)
maximum active goroutines
Descriptions
- warning github-proxy: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_go_goroutines"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*github-proxy"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning github-proxy: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_github-proxy_go_gc_duration_seconds"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*github-proxy"})) >= 2)
percentage pods available
Descriptions
- critical github-proxy: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod github-proxy (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p github-proxy.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_github-proxy_pods_available_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*github-proxy"}) / count by (app) (up{app=~".*github-proxy"}) * 100) <= 90)
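The arithmetic behind the query above can be sketched in Python as follows. This is only an illustration with hypothetical function names: each entry in up_values stands for the up metric of one pod (1 when reachable, 0 otherwise), and the sketch ignores the requirement that the condition holds for 10m0s.

```python
def pods_available_percent(up_values):
    """Percentage of pods reporting up == 1, mirroring sum(up) / count(up) * 100."""
    if not up_values:
        raise ValueError("no pods reported")
    return 100.0 * sum(up_values) / len(up_values)


def pods_available_critical(up_values, threshold=90.0):
    """True when availability is at or below the threshold used in the query (<= 90)."""
    return pods_available_percent(up_values) <= threshold
```

For example, with four replicas of which one is down, availability is 75%, which meets the critical condition.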
active connections
Descriptions
- warning postgres: less than 5 active connections for 5m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_connections"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: min((sum by (job) (pg_stat_activity_count{datname!~"template.*|postgres|cloudsqladmin"}) or sum by (job) (pg_stat_activity_count{datname!~"template.*|cloudsqladmin",job="codeinsights-db"})) <= 5)
connection in use
Descriptions
- warning postgres: 80%+ connection in use for 5m0s
- critical postgres: 100%+ connection in use for 5m0s
Next steps
- Consider increasing max_connections of the database instance, learn more
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_usage_connections_percentage",
"critical_postgres_usage_connections_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (job) (pg_stat_activity_count) / (sum by (job) (pg_settings_max_connections) - sum by (job) (pg_settings_superuser_reserved_connections)) * 100) >= 80)
Generated query for critical alert: max((sum by (job) (pg_stat_activity_count) / (sum by (job) (pg_settings_max_connections) - sum by (job) (pg_settings_superuser_reserved_connections)) * 100) >= 100)
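To make the two queries above concrete, here is a minimal Python sketch of the same calculation: active connections divided by the connections available to regular users (max_connections minus the superuser-reserved connections), expressed as a percentage. The function names are hypothetical, and the sketch ignores the 5m0s duration requirement.

```python
def connection_usage_percent(active, max_connections, superuser_reserved):
    """Percentage of non-reserved connections currently in use."""
    usable = max_connections - superuser_reserved
    if usable <= 0:
        raise ValueError("no connections available to regular users")
    return 100.0 * active / usable


def connection_alert_level(percent):
    """Map a usage percentage to the alert level from the descriptions above."""
    if percent >= 100:
        return "critical"
    if percent >= 80:
        return "warning"
    return None
```

With max_connections set to 103 and 3 reserved connections, 80 active connections give exactly 80% usage, which is enough for the warning level.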
maximum transaction durations
Descriptions
- warning postgres: 0.3s+ maximum transaction durations for 5m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_transaction_durations"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (job) (pg_stat_activity_max_tx_duration{datname!~"template.*|postgres|cloudsqladmin",job!="codeintel-db"}) or sum by (job) (pg_stat_activity_max_tx_duration{datname!~"template.*|cloudsqladmin",job="codeinsights-db"})) >= 0.3)
database availability
Descriptions
- critical postgres: less than 0 database availability for 5m0s
Next steps
- Kubernetes:
  - Determine if the pod was OOM killed using kubectl describe pod (pgsql|codeintel-db|codeinsights) (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
  - Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p (pgsql|codeintel-db|codeinsights).
  - Check if there is any OOMKILL event using the provisioning panels.
  - Check kernel logs using dmesg for OOMKILL events on worker nodes.
- Docker Compose:
  - Determine if the container was OOM killed using docker inspect -f '{{json .State}}' (pgsql|codeintel-db|codeinsights) (look for "OOMKilled":true) and, if so, consider increasing the memory limit of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
  - Check the logs before the container restarted to see if there are panic: messages or similar using docker logs (pgsql|codeintel-db|codeinsights) (note this will include logs from the previous and currently running container).
  - Check if there is any OOMKILL event using the provisioning panels.
  - Check kernel logs using dmesg for OOMKILL events.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_postgres_postgres_up"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((pg_up) <= 0)
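The Docker Compose check described above can also be scripted. The following Python sketch assumes you have already captured the output of docker inspect -f '{{json .State}}' for the container as a string. The function name and the sample state are hypothetical.

```python
import json


def was_oom_killed(state_json):
    """Return True if the container state JSON reports "OOMKilled": true."""
    state = json.loads(state_json)
    # A missing field is treated as "not OOM killed" in this sketch.
    return bool(state.get("OOMKilled", False))


# Hypothetical sample of a container state after an out-of-memory kill.
sample_state = '{"Status": "exited", "Running": false, "OOMKilled": true, "ExitCode": 137}'
```

Calling was_oom_killed(sample_state) returns True for the sample above. In that case, consider increasing the memory limit as described in the next steps.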
invalid indexes (unusable by the query planner)
Descriptions
- critical postgres: 1+ invalid indexes (unusable by the query planner)
Next steps
- Drop and re-create the invalid index - please contact Sourcegraph to supply the index definition.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_postgres_invalid_indexes"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: sum((max by (relname) (pg_invalid_index_count)) >= 1)
errors scraping postgres exporter
Descriptions
- warning postgres: 1+ errors scraping postgres exporter for 5m0s
Next steps
- Ensure the Postgres exporter can access the Postgres database. Also, check the Postgres exporter logs for errors.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_pg_exporter_err"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((pg_exporter_last_scrape_error) >= 1)
active schema migration
Descriptions
- critical postgres: 1+ active schema migration for 5m0s
Next steps
- The database migration has been in progress for 5 or more minutes - please contact Sourcegraph if this persists.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_postgres_migration_in_progress"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: max((pg_sg_migration_status) >= 1)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning postgres: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the (pgsql|codeintel-db|codeinsights) service.
- Docker Compose: Consider increasing cpus: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning postgres: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the (pgsql|codeintel-db|codeinsights) service.
- Docker Compose: Consider increasing memory: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning postgres: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning postgres: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^(pgsql|codeintel-db|codeinsights).*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning postgres: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the (pgsql|codeintel-db|codeinsights) container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_postgres_container_oomkill_events_total"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^(pgsql|codeintel-db|codeinsights).*"})) >= 1)
percentage pods available
Descriptions
- critical postgres: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod (pgsql|codeintel-db|codeinsights) (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p (pgsql|codeintel-db|codeinsights).
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_postgres_pods_available_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*(pgsql|codeintel-db|codeinsights)"}) / count by (app) (up{app=~".*(pgsql|codeintel-db|codeinsights)"}) * 100) <= 90)
unprocessed upload record queue longest time in queue
Descriptions
- critical precise-code-intel-worker: 18000s+ unprocessed upload record queue longest time in queue
Next steps
- An alert here could be indicative of a few things: an upload surfacing a pathological performance characteristic, precise-code-intel-worker being underprovisioned for the required upload processing throughput, or a higher replica count being required for the volume of uploads.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_precise-code-intel-worker_codeintel_upload_queued_max_age"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for critical alert: max((max(src_codeintel_upload_queued_duration_seconds_total{job=~"^precise-code-intel-worker.*"})) >= 18000)
frontend-internal API error responses every 5m by route
Descriptions
- warning precise-code-intel-worker: 2%+ frontend-internal API error responses every 5m by route for 5m0s
Next steps
- Single-container deployments: Check docker logs $CONTAINER_ID for logs starting with repo-updater that indicate requests to the frontend service are failing.
- Kubernetes:
  - Confirm that kubectl get pods shows the frontend pods are healthy.
  - Check kubectl logs precise-code-intel-worker for logs indicating request failures to frontend or frontend-internal.
- Docker Compose:
  - Confirm that docker ps shows the frontend-internal container is healthy.
  - Check docker logs precise-code-intel-worker for logs indicating request failures to frontend or frontend-internal.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_frontend_internal_api_error_responses"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2..",job="precise-code-intel-worker"}[5m])) / ignoring (category) group_left () sum(increase(src_frontend_internal_request_duration_seconds_count{job="precise-code-intel-worker"}[5m]))) >= 2)
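The shape of the query above, per-category error counts divided by the total request count across all categories, can be sketched in Python as follows. The function name and the sample categories are hypothetical. Returning an empty result when there were no requests is a choice made for this sketch, and the comparison against the alert threshold is left to the caller.

```python
def error_ratio_by_category(error_counts, total_requests):
    """Divide each category's non-2xx request count by the total request count.

    error_counts maps a category to the increase in non-2xx responses over the
    window; total_requests is the increase in all responses over the same window.
    """
    if total_requests <= 0:
        return {}
    return {category: count / total_requests for category, count in error_counts.items()}


# Hypothetical counts over one 5m window.
ratios = error_ratio_by_category({"category-a": 5, "category-b": 0}, 100)
worst = max(ratios.values())
```

Because the query takes the max over all categories, only the worst category (here, worst) decides whether the alert condition is met.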
mean blocked seconds per conn request
Descriptions
- warning precise-code-intel-worker: 0.05s+ mean blocked seconds per conn request for 10m0s
- critical precise-code-intel-worker: 0.1s+ mean blocked seconds per conn request for 15m0s
Next steps
- Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
- Scale up Postgres memory / cpus. See our scaling guide.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_mean_blocked_seconds_per_conn_request",
"critical_precise-code-intel-worker_mean_blocked_seconds_per_conn_request"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.05)
Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.1)
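As an illustration of the two queries above, this Python sketch divides the time spent blocked waiting for a connection by the number of connection requests that had to wait, both measured as increases over the same 5m window. The function names are hypothetical. Returning None when no request waited is a choice made for this sketch, and the 10m0s and 15m0s duration requirements are ignored.

```python
def mean_blocked_seconds(blocked_seconds_increase, waited_for_increase):
    """Mean seconds blocked per connection request that had to wait."""
    if waited_for_increase <= 0:
        return None  # no request waited in this window
    return blocked_seconds_increase / waited_for_increase


def blocked_alert_level(mean_seconds):
    """Map the mean to the thresholds from the descriptions above."""
    if mean_seconds is None:
        return None
    if mean_seconds >= 0.1:
        return "critical"
    if mean_seconds >= 0.05:
        return "warning"
    return None
```

For example, 3 seconds of blocking spread over 60 waiting requests gives a mean of 0.05s, which reaches the warning threshold.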
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning precise-code-intel-worker: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the precise-code-intel-worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_container_cpu_usage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^precise-code-intel-worker.*"}) >= 99)
container memory usage by instance
Descriptions
- warning precise-code-intel-worker: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the precise-code-intel-worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_container_memory_usage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^precise-code-intel-worker.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning precise-code-intel-worker: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the precise-code-intel-worker service.
- Docker Compose: Consider increasing cpus: of the precise-code-intel-worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^precise-code-intel-worker.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning precise-code-intel-worker: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the precise-code-intel-worker service.
- Docker Compose: Consider increasing memory: of the precise-code-intel-worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^precise-code-intel-worker.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning precise-code-intel-worker: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the precise-code-intel-worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^precise-code-intel-worker.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning precise-code-intel-worker: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the precise-code-intel-worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^precise-code-intel-worker.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning precise-code-intel-worker: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the precise-code-intel-worker container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_container_oomkill_events_total"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^precise-code-intel-worker.*"})) >= 1)
maximum active goroutines
Descriptions
- warning precise-code-intel-worker: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_go_goroutines"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*precise-code-intel-worker"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning precise-code-intel-worker: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_go_gc_duration_seconds"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*precise-code-intel-worker"})) >= 2)
percentage pods available
Descriptions
- critical precise-code-intel-worker: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod precise-code-intel-worker (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p precise-code-intel-worker.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_precise-code-intel-worker_pods_available_percentage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*precise-code-intel-worker"}) / count by (app) (up{app=~".*precise-code-intel-worker"}) * 100) <= 90)
redis-store availability
Descriptions
- critical redis: less than 1 redis-store availability for 10s
Next steps
- Ensure redis-store is running
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_redis_redis-store_up"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((redis_up{app="redis-store"}) < 1)
redis-cache availability
Descriptions
- critical redis: less than 1 redis-cache availability for 10s
Next steps
- Ensure redis-cache is running
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_redis_redis-cache_up"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((redis_up{app="redis-cache"}) < 1)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning redis: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the redis-cache service.
- Docker Compose: Consider increasing cpus: of the redis-cache container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^redis-cache.*"}[1d])) >= 80)
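To make the "90th percentile over 1d" part of the query above more concrete, here is a simplified sketch of how a quantile over a window of samples can be estimated with linear interpolation between the two nearest ranks. The function name and the sample values are illustrative assumptions; the real alert additionally requires the condition to hold for 336h0m0s (14 days), which this sketch does not model.

```python
import math


def quantile_over_window(q: float, samples: list[float]) -> float:
    """Estimate the q-quantile of samples by interpolating between ranks."""
    if not samples:
        return math.nan
    values = sorted(samples)
    rank = q * (len(values) - 1)          # fractional position in the sorted list
    lower = math.floor(rank)
    upper = min(len(values) - 1, lower + 1)
    weight = rank - lower                  # how far we are towards the upper value
    return values[lower] * (1 - weight) + values[upper] * weight


# Hypothetical CPU usage percentages sampled over one day.
cpu_samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p90 = quantile_over_window(0.9, cpu_samples)
exceeds_threshold = p90 >= 80  # same comparison as the generated query
```

With these samples the 90th percentile lands between 90 and 100, so the threshold comparison is true even though most samples are below 80.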
container memory usage (1d maximum) by instance
Descriptions
- warning redis: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the redis-cache service.
- Docker Compose: Consider increasing memory: of the redis-cache container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-cache.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning redis: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the redis-cache container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^redis-cache.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning redis: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the redis-cache container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-cache.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning redis: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the redis-cache container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_container_oomkill_events_total"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^redis-cache.*"})) >= 1)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning redis: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the redis-store service.
- Docker Compose: Consider increasing cpus: of the redis-store container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^redis-store.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning redis: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the redis-store service.
- Docker Compose: Consider increasing memory: of the redis-store container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-store.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning redis: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the redis-store container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^redis-store.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning redis: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the redis-store container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^redis-store.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning redis: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the redis-store container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_redis_container_oomkill_events_total"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^redis-store.*"})) >= 1)
percentage pods available
Descriptions
- critical redis: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod redis-cache (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p redis-cache.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_redis_pods_available_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*redis-cache"}) / count by (app) (up{app=~".*redis-cache"}) * 100) <= 90)
percentage pods available
Descriptions
- critical redis: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod redis-store (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p redis-store.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_redis_pods_available_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*redis-store"}) / count by (app) (up{app=~".*redis-store"}) * 100) <= 90)
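The "percentage pods available" queries above divide the number of healthy targets by the total number of targets for an app. The following sketch reproduces that arithmetic for a single instant; the function names and sample data are hypothetical, and the 10m0s duration requirement of the alert is not modeled.

```python
def pods_available_percentage(up_samples: list[int]) -> float:
    """Mirror sum(up) / count(up) * 100 for one app (1 = up, 0 = down)."""
    if not up_samples:
        raise ValueError("no samples: the series is absent for this app")
    return 100 * sum(up_samples) / len(up_samples)


def threshold_reached(up_samples: list[int], threshold: float = 90.0) -> bool:
    """Apply the same <= comparison as the generated query."""
    return pods_available_percentage(up_samples) <= threshold


# Nine of ten pods healthy: exactly 90%, which already satisfies "<= 90".
nine_of_ten = [1] * 9 + [0]
print(pods_available_percentage(nine_of_ten), threshold_reached(nine_of_ten))
```

Note that with a single replica, any downtime drops the value to 0%, so the comparison is satisfied as soon as the pod is unavailable.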
number of worker instances running the codeintel-upload-janitor job
Descriptions
- warning worker: less than 1 number of worker instances running the codeintel-upload-janitor job for 1m0s
- critical worker: less than 1 number of worker instances running the codeintel-upload-janitor job for 5m0s
Next steps
- Ensure your instance defines a worker container such that:
  - WORKER_JOB_ALLOWLIST contains "codeintel-upload-janitor" (or "all"), and
  - WORKER_JOB_BLOCKLIST does not contain "codeintel-upload-janitor"
- Ensure that such a container is not failing to start or stay active
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_worker_job_codeintel-upload-janitor_count",
"critical_worker_worker_job_codeintel-upload-janitor_count"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: (min((sum(src_worker_jobs{job="worker",job_name="codeintel-upload-janitor"})) < 1)) or (absent(sum(src_worker_jobs{job="worker",job_name="codeintel-upload-janitor"})) == 1)
Generated query for critical alert: (min((sum(src_worker_jobs{job="worker",job_name="codeintel-upload-janitor"})) < 1)) or (absent(sum(src_worker_jobs{job="worker",job_name="codeintel-upload-janitor"})) == 1)
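The two conditions on WORKER_JOB_ALLOWLIST and WORKER_JOB_BLOCKLIST described in the next steps above can be checked mechanically. The sketch below assumes both variables hold comma-separated job names; that format, and the helper name job_enabled, are assumptions made for illustration only.

```python
def job_enabled(job: str, allowlist: str, blocklist: str) -> bool:
    """Return True if the job is allowed (by name or via "all") and not blocked.

    Both arguments are assumed to be comma-separated lists of job names.
    """
    allowed = {name.strip() for name in allowlist.split(",") if name.strip()}
    blocked = {name.strip() for name in blocklist.split(",") if name.strip()}
    return (job in allowed or "all" in allowed) and job not in blocked


# Allowed through "all" and not blocked.
print(job_enabled("codeintel-upload-janitor", "all", ""))
# Allowed through "all" but explicitly blocked.
print(job_enabled("codeintel-upload-janitor", "all", "codeintel-upload-janitor"))
```

The same check applies to the other worker jobs in this reference by changing the job name.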
number of worker instances running the codeintel-commitgraph-updater job
Descriptions
- warning worker: less than 1 number of worker instances running the codeintel-commitgraph-updater job for 1m0s
- critical worker: less than 1 number of worker instances running the codeintel-commitgraph-updater job for 5m0s
Next steps
- Ensure your instance defines a worker container such that:
  - WORKER_JOB_ALLOWLIST contains "codeintel-commitgraph-updater" (or "all"), and
  - WORKER_JOB_BLOCKLIST does not contain "codeintel-commitgraph-updater"
- Ensure that such a container is not failing to start or stay active
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_worker_job_codeintel-commitgraph-updater_count",
"critical_worker_worker_job_codeintel-commitgraph-updater_count"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: (min((sum(src_worker_jobs{job="worker",job_name="codeintel-commitgraph-updater"})) < 1)) or (absent(sum(src_worker_jobs{job="worker",job_name="codeintel-commitgraph-updater"})) == 1)
Generated query for critical alert: (min((sum(src_worker_jobs{job="worker",job_name="codeintel-commitgraph-updater"})) < 1)) or (absent(sum(src_worker_jobs{job="worker",job_name="codeintel-commitgraph-updater"})) == 1)
number of worker instances running the codeintel-autoindexing-scheduler job
Descriptions
- warning worker: less than 1 number of worker instances running the codeintel-autoindexing-scheduler job for 1m0s
- critical worker: less than 1 number of worker instances running the codeintel-autoindexing-scheduler job for 5m0s
Next steps
- Ensure your instance defines a worker container such that:
  - WORKER_JOB_ALLOWLIST contains "codeintel-autoindexing-scheduler" (or "all"), and
  - WORKER_JOB_BLOCKLIST does not contain "codeintel-autoindexing-scheduler"
- Ensure that such a container is not failing to start or stay active
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_worker_job_codeintel-autoindexing-scheduler_count",
"critical_worker_worker_job_codeintel-autoindexing-scheduler_count"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: (min((sum(src_worker_jobs{job="worker",job_name="codeintel-autoindexing-scheduler"})) < 1)) or (absent(sum(src_worker_jobs{job="worker",job_name="codeintel-autoindexing-scheduler"})) == 1)
Generated query for critical alert: (min((sum(src_worker_jobs{job="worker",job_name="codeintel-autoindexing-scheduler"})) < 1)) or (absent(sum(src_worker_jobs{job="worker",job_name="codeintel-autoindexing-scheduler"})) == 1)
repository queue longest time in queue
Descriptions
- critical worker: 3600s+ repository queue longest time in queue
Next steps
- An alert here generally indicates underprovisioned worker instance(s), an underprovisioned main Postgres instance, or both.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_worker_codeintel_commit_graph_queued_max_age"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for critical alert: max((max(src_codeintel_commit_graph_queued_duration_seconds_total{job=~"^worker.*"})) >= 3600)
insights queue size that is not utilized (not processing)
Descriptions
- warning worker: 0+ insights queue size that is not utilized (not processing) for 30m0s
Next steps
- Verify that the code insights worker job has started successfully. Restart the worker service and monitor the startup logs, looking for worker panics.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_insights_queue_unutilized_size"
]
Managed by the Sourcegraph Code Insights team.
Technical details
Generated query for warning alert: max((max(src_query_runner_worker_total{job=~"^worker.*"}) > 0 and on (job) sum by (op) (increase(src_workerutil_dbworker_store_insights_query_runner_jobs_store_total{job=~"^worker.*",op="Dequeue"}[5m])) < 1) > 0)
frontend-internal API error responses every 5m by route
Descriptions
- warning worker: 2%+ frontend-internal API error responses every 5m by route for 5m0s
Next steps
- Single-container deployments: Check docker logs $CONTAINER_ID for logs starting with repo-updater that indicate requests to the frontend service are failing.
- Kubernetes:
  - Confirm that kubectl get pods shows the frontend pods are healthy.
  - Check kubectl logs worker for logs indicating request failures to frontend or frontend-internal.
- Docker Compose:
  - Confirm that docker ps shows the frontend-internal container is healthy.
  - Check docker logs worker for logs indicating request failures to frontend or frontend-internal.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_frontend_internal_api_error_responses"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2..",job="worker"}[5m])) / ignoring (category) group_left () sum(increase(src_frontend_internal_request_duration_seconds_count{job="worker"}[5m]))) >= 2)
mean blocked seconds per conn request
Descriptions
- warning worker: 0.05s+ mean blocked seconds per conn request for 10m0s
- critical worker: 0.1s+ mean blocked seconds per conn request for 15m0s
Next steps
- Increase SRC_PGSQL_MAX_OPEN and, if needed, give the database more memory.
- Scale up Postgres memory/CPUs. See our scaling guide.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_mean_blocked_seconds_per_conn_request",
"critical_worker_mean_blocked_seconds_per_conn_request"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.05)
Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.1)
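Both queries above divide the increase in blocked seconds by the increase in the number of connection requests that had to wait, then compare the result with 0.05 (warning) and 0.1 (critical). A worked sketch of that arithmetic follows; the function names and input numbers are illustrative assumptions, and the 10m0s and 15m0s duration requirements are not modeled.

```python
from typing import Optional


def mean_blocked_seconds(blocked_seconds_increase: float,
                         waited_for_increase: float) -> Optional[float]:
    """Mean time a connection request spent blocked in the window."""
    if waited_for_increase == 0:
        return None  # nothing waited in this window, so there is no mean to report
    return blocked_seconds_increase / waited_for_increase


def severity(mean: Optional[float]) -> str:
    """Map the mean onto the thresholds used by the two generated queries."""
    if mean is None:
        return "ok"
    if mean >= 0.1:
        return "critical"
    if mean >= 0.05:
        return "warning"
    return "ok"


# Hypothetical 5m window: 12s of total blocking across 200 waiting requests.
print(severity(mean_blocked_seconds(12.0, 200.0)))
```

In this example the mean is 0.06s per request, which crosses the warning threshold but not the critical one.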
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning worker: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_container_cpu_usage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^worker.*"}) >= 99)
container memory usage by instance
Descriptions
- warning worker: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_container_memory_usage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^worker.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning worker: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the worker service.
- Docker Compose: Consider increasing cpus: of the worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^worker.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning worker: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the worker service.
- Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^worker.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning worker: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^worker.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning worker: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^worker.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning worker: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the worker container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_container_oomkill_events_total"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^worker.*"})) >= 1)
maximum active goroutines
Descriptions
- warning worker: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_go_goroutines"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*worker"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning worker: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_worker_go_gc_duration_seconds"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*worker"})) >= 2)
percentage pods available
Descriptions
- critical worker: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod worker (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p worker.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_worker_pods_available_percentage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*worker"}) / count by (app) (up{app=~".*worker"}) * 100) <= 90)
time since oldest sync
Descriptions
- critical repo-updater: 32400s+ time since oldest sync for 10m0s
Next steps
- An alert here indicates that no code host connections have synced in at least 9h0m0s. This could point to a configuration issue with your code host connections or to networking issues affecting communication with your code hosts.
- Check the code host status indicator (cloud icon in top right of Sourcegraph homepage) for errors.
- Make sure external services do not have invalid tokens by navigating to them in the web UI and clicking save. If there are no errors, they are valid.
- Check the repo-updater logs for errors about syncing.
- Confirm that outbound network connections are allowed where repo-updater is deployed.
- Check back in an hour to see if the issue has resolved itself.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_src_repoupdater_max_sync_backoff"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: max((max(src_repoupdater_max_sync_backoff)) >= 32400)
site level external service sync error rate
Descriptions
- warning repo-updater: 0.5+ site level external service sync error rate for 10m0s
- critical repo-updater: 1+ site level external service sync error rate for 10m0s
Next steps
- An alert here indicates errors syncing site level repo metadata with code hosts. This could point to a configuration issue with your code host connections or to networking issues affecting communication with your code hosts.
- Check the code host status indicator (cloud icon in top right of Sourcegraph homepage) for errors.
- Make sure external services do not have invalid tokens by navigating to them in the web UI and clicking save. If there are no errors, they are valid.
- Check the repo-updater logs for errors about syncing.
- Confirm that outbound network connections are allowed where repo-updater is deployed.
- Check back in an hour to see if the issue has resolved itself.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_src_repoupdater_syncer_sync_errors_total",
"critical_repo-updater_src_repoupdater_syncer_sync_errors_total"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max by (family) (rate(src_repoupdater_syncer_sync_errors_total{owner!="user",reason!="internal_rate_limit",reason!="invalid_npm_path"}[5m]))) > 0.5)
Generated query for critical alert: max((max by (family) (rate(src_repoupdater_syncer_sync_errors_total{owner!="user",reason!="internal_rate_limit",reason!="invalid_npm_path"}[5m]))) > 1)
repo metadata sync was started
Descriptions
- warning repo-updater: less than 0 repo metadata sync was started for 9h0m0s
Next steps
- Check repo-updater logs for errors.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_syncer_sync_start"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max by (family) (rate(src_repoupdater_syncer_start_sync{family="Syncer.SyncExternalService"}[9h]))) <= 0)
95th repositories sync duration
Descriptions
- warning repo-updater: 30s+ 95th repositories sync duration for 5m0s
Next steps
- Check that the network latency between Sourcegraph and the code host is reasonable (<50ms).
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_syncer_sync_duration"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.95, max by (le, family, success) (rate(src_repoupdater_syncer_sync_duration_seconds_bucket[1m])))) >= 30)
95th repositories source duration
Descriptions
- warning repo-updater: 30s+ 95th repositories source duration for 5m0s
Next steps
- Check that the network latency between Sourcegraph and the code host is reasonable (<50ms).
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_source_duration"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.95, max by (le) (rate(src_repoupdater_source_duration_seconds_bucket[1m])))) >= 30)
repositories synced
Descriptions
- warning repo-updater: less than 0 repositories synced for 9h0m0s
Next steps
- Check network connectivity to code hosts
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_syncer_synced_repos"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max(rate(src_repoupdater_syncer_synced_repos_total[1m]))) <= 0)
repositories sourced
Descriptions
- warning repo-updater: less than 0 repositories sourced for 9h0m0s
Next steps
- Check network connectivity to code hosts
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_sourced_repos"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max(rate(src_repoupdater_source_repos_total[1m]))) <= 0)
repositories purge failed
Descriptions
- warning repo-updater: 0+ repositories purge failed for 5m0s
Next steps
- Check repo-updater's connectivity with gitserver, and check the gitserver logs.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_purge_failed"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max(rate(src_repoupdater_purge_failed[1m]))) > 0)
repositories scheduled due to hitting a deadline
Descriptions
- warning repo-updater: less than 0 repositories scheduled due to hitting a deadline for 9h0m0s
Next steps
- Check repo-updater logs.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_sched_auto_fetch"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max(rate(src_repoupdater_sched_auto_fetch[1m]))) <= 0)
repositories managed by the scheduler
Descriptions
- warning repo-updater: less than 0 repositories managed by the scheduler for 10m0s
Next steps
- Check repo-updater logs. This alert is expected to fire if there are no user-added code hosts.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_sched_known_repos"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max(src_repoupdater_sched_known_repos)) <= 0)
rate of growth of update queue length over 5 minutes
Descriptions
- critical repo-updater: 0+ rate of growth of update queue length over 5 minutes for 2h0m0s
Next steps
- Check repo-updater logs for indications that the queue is not being processed. The queue length should trend downwards over time as items are sent to gitserver.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_sched_update_queue_length"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: max((max(deriv(src_repoupdater_sched_update_queue_length[5m]))) > 0)
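The query above relies on PromQL's deriv, which estimates a per-second slope by fitting a line through the samples in the window. As a rough illustration only, the following Python sketch computes a least-squares slope over hypothetical queue-length samples; the sample values and the function name are assumptions made for the example:

```python
def deriv(samples):
    """Least-squares slope of (timestamp_seconds, value) samples, per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


# Hypothetical queue lengths sampled every 60s over a 5 minute window.
growing = [(0, 100), (60, 110), (120, 125), (180, 130), (240, 150)]
draining = [(0, 150), (60, 130), (120, 125), (180, 110), (240, 100)]

print(deriv(growing) > 0)   # a growing queue meets the alert condition
print(deriv(draining) > 0)  # a draining queue does not
```

A positive slope on its own does not fire the alert; the condition has to hold for the full 2h0m0s stated in the description.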
scheduler loops
Descriptions
- warning repo-updater: less than 0 scheduler loops for 9h0m0s
Next steps
- Check repo-updater logs for errors. This alert is expected to fire if there are no user-added code hosts.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_sched_loops"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max(rate(src_repoupdater_sched_loops[1m]))) <= 0)
repos that haven't been fetched in more than 8 hours
Descriptions
- warning repo-updater: 1+ repos that haven't been fetched in more than 8 hours for 25m0s
Next steps
- Check repo-updater logs for errors. Check for rows in gitserver_repos where LastError is not an empty string.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_src_repoupdater_stale_repos"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max(src_repoupdater_stale_repos)) >= 1)
repositories schedule error rate
Descriptions
- critical repo-updater: 1+ repositories schedule error rate for 25m0s
Next steps
- Check repo-updater logs for errors
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_sched_error"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: max((max(rate(src_repoupdater_sched_error[1m]))) >= 1)
time gap between least and most up to date permissions
Descriptions
- warning repo-updater: 259200s+ time gap between least and most up to date permissions for 5m0s
Next steps
- Increase the API rate limit to GitHub, GitLab or Bitbucket Server.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_perms_syncer_perms"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((max by (type) (src_repoupdater_perms_syncer_perms_gap_seconds)) >= 259200)
number of entities with stale permissions
Descriptions
- warning repo-updater: 100+ number of entities with stale permissions for 5m0s
Next steps
- Increase the API rate limit to GitHub, GitLab or Bitbucket Server.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_perms_syncer_stale_perms"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((max by (type) (src_repoupdater_perms_syncer_stale_perms)) >= 100)
number of entities with no permissions
Descriptions
- warning repo-updater: 100+ number of entities with no permissions for 5m0s
Next steps
- Enabled permissions for the first time: Wait a few minutes and see if the number goes down.
- Otherwise: Increase the API rate limit to GitHub, GitLab or Bitbucket Server.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_perms_syncer_no_perms"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((max by (type) (src_repoupdater_perms_syncer_no_perms)) >= 100)
number of entities with outdated permissions
Descriptions
- warning repo-updater: 100+ number of entities with outdated permissions for 5m0s
Next steps
- Enabled permissions for the first time: Wait a few minutes and see if the number goes down.
- Otherwise: Increase the API rate limit to GitHub, GitLab or Bitbucket Server.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_perms_syncer_outdated_perms"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((max by (type) (src_repoupdater_perms_syncer_outdated_perms)) >= 100)
95th permissions sync duration
Descriptions
- warning repo-updater: 30s+ 95th permissions sync duration for 5m0s
Next steps
- Check that the network latency between Sourcegraph and the code host is reasonable (<50ms).
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_perms_syncer_sync_duration"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((histogram_quantile(0.95, max by (le, type) (rate(src_repoupdater_perms_syncer_sync_duration_seconds_bucket[1m])))) >= 30)
permissions sync queued items
Descriptions
- warning repo-updater: 100+ permissions sync queued items for 5m0s
Next steps
- Enabled permissions for the first time: Wait a few minutes and see if the number goes down.
- Otherwise: Increase the API rate limit to GitHub, GitLab or Bitbucket Server.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_perms_syncer_queue_size"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for warning alert: max((max(src_repoupdater_perms_syncer_queue_size)) >= 100)
permissions sync error rate
Descriptions
- critical repo-updater: 1+ permissions sync error rate for 1m0s
Next steps
- Check the network connectivity between Sourcegraph and the code host.
- Check if API rate limit quota is exhausted on the code host.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_perms_syncer_sync_errors"
]
Managed by the Sourcegraph Identity and Access Management team.
Technical details
Generated query for critical alert: max((max by (type) (ceil(rate(src_repoupdater_perms_syncer_sync_errors_total[1m])))) >= 1)
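Because the query wraps the rate in ceil, any non-zero error rate in the window rounds up to 1 and meets the threshold. The following Python sketch illustrates that arithmetic under a simplifying assumption: the rate is approximated as errors divided by the window length, ignoring the extrapolation Prometheus applies. The function name is hypothetical.

```python
import math


def perms_sync_error_condition(error_increase, window_seconds=60):
    """Return True if ceil(errors per second) reaches the threshold of 1."""
    rate = error_increase / window_seconds  # simplified stand-in for rate()
    return math.ceil(rate) >= 1


print(perms_sync_error_condition(0))  # no errors in the window
print(perms_sync_error_condition(1))  # a single error in the window
```

In other words, the alert is effectively "any permissions sync error", sustained for the 1m0s stated in the description.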
the total number of external services
Descriptions
- critical repo-updater: 20000+ the total number of external services for 1h0m0s
Next steps
- Check for spikes in the number of external services, which could indicate abuse.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_src_repoupdater_external_services_total"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: max((max(src_repoupdater_external_services_total)) >= 20000)
the total number of queued sync jobs
Descriptions
- warning repo-updater: 100+ the total number of queued sync jobs for 1h0m0s
Next steps
- Check if jobs are failing to sync: SELECT * FROM external_service_sync_jobs WHERE state = 'errored';
- Increase the number of workers using the repoConcurrentExternalServiceSyncers site config.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_repoupdater_queued_sync_jobs_total"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max(src_repoupdater_queued_sync_jobs_total)) >= 100)
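To show what the errored-jobs check returns, here is a self-contained Python sketch that runs the same kind of query against an in-memory SQLite table. This is only a stand-in: the real table lives in the Sourcegraph Postgres database, and the two-column schema and the rows below are hypothetical, with only the table name and the state column taken from the next steps above.

```python
import sqlite3

# In-memory stand-in for the real database; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE external_service_sync_jobs (id INTEGER PRIMARY KEY, state TEXT)"
)
conn.executemany(
    "INSERT INTO external_service_sync_jobs (state) VALUES (?)",
    [("queued",), ("errored",), ("completed",), ("errored",)],
)

# Same filter as the check in the next steps, ordered for stable output.
errored = conn.execute(
    "SELECT id, state FROM external_service_sync_jobs "
    "WHERE state = 'errored' ORDER BY id"
).fetchall()
print(errored)
```

If rows come back from the real table, the jobs are failing rather than merely waiting, and adding workers alone is unlikely to drain the queue.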
the total number of completed sync jobs
Descriptions
- warning repo-updater: 100000+ the total number of completed sync jobs for 1h0m0s
Next steps
- Check repo-updater logs. Jobs older than 1 day should have been removed.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_repoupdater_completed_sync_jobs_total"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max(src_repoupdater_completed_sync_jobs_total)) >= 100000)
the percentage of external services that have failed their most recent sync
Descriptions
- warning repo-updater: 10%+ the percentage of external services that have failed their most recent sync for 1h0m0s
Next steps
- Check repo-updater logs and check code host connectivity.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_repoupdater_errored_sync_jobs_percentage"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max(src_repoupdater_errored_sync_jobs_percentage)) > 10)
remaining calls to GitHub graphql API before hitting the rate limit
Descriptions
- warning repo-updater: less than 250 remaining calls to GitHub graphql API before hitting the rate limit
Next steps
- Consider creating a new token for the indicated resource (the name label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_github_graphql_rate_limit_remaining"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max by (name) (src_github_rate_limit_remaining_v2{resource="graphql"})) <= 250)
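The query groups the remaining-calls metric by the name label and fires when the lowest of those series is at or below 250. As a minimal sketch of how to identify which token needs attention, the following Python filters a hypothetical mapping of name label to remaining calls; the names and values are made up for the example:

```python
def tokens_near_limit(remaining_by_name, threshold=250):
    """Return the name labels whose remaining calls are at or below threshold."""
    return sorted(
        name
        for name, remaining in remaining_by_name.items()
        if remaining <= threshold
    )


# Hypothetical values of the remaining-calls metric, keyed by the name label.
remaining = {"github-main": 4200, "github-bot": 180, "github-mirror": 250}
print(tokens_near_limit(remaining))
```

The same filter applies to the rest and search variants of this alert, with thresholds of 250 and 5 respectively.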
remaining calls to GitHub rest API before hitting the rate limit
Descriptions
- warning repo-updater: less than 250 remaining calls to GitHub rest API before hitting the rate limit
Next steps
- Consider creating a new token for the indicated resource (the name label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_github_rest_rate_limit_remaining"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max by (name) (src_github_rate_limit_remaining_v2{resource="rest"})) <= 250)
remaining calls to GitHub search API before hitting the rate limit
Descriptions
- warning repo-updater: less than 5 remaining calls to GitHub search API before hitting the rate limit
Next steps
- Consider creating a new token for the indicated resource (the name label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_github_search_rate_limit_remaining"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: min((max by (name) (src_github_rate_limit_remaining_v2{resource="search"})) <= 5)
remaining calls to GitLab rest API before hitting the rate limit
Descriptions
- critical repo-updater: less than 30 remaining calls to GitLab rest API before hitting the rate limit
Next steps
- Try restarting the pod to get a different public IP.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_gitlab_rest_rate_limit_remaining"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: min((max by (name) (src_gitlab_rate_limit_remaining{resource="rest"})) <= 30)
frontend-internal API error responses every 5m by route
Descriptions
- warning repo-updater: 2%+ frontend-internal API error responses every 5m by route for 5m0s
Next steps
- Single-container deployments: Check docker logs $CONTAINER_ID for logs starting with repo-updater that indicate requests to the frontend service are failing.
- Kubernetes:
  - Confirm that kubectl get pods shows the frontend pods are healthy.
  - Check kubectl logs repo-updater for logs indicating request failures to frontend or frontend-internal.
- Docker Compose:
  - Confirm that docker ps shows the frontend-internal container is healthy.
  - Check docker logs repo-updater for logs indicating request failures to frontend or frontend-internal.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_frontend_internal_api_error_responses"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2..",job="repo-updater"}[5m])) / ignoring (category) group_left () sum(increase(src_frontend_internal_request_duration_seconds_count{job="repo-updater"}[5m]))) >= 2)
mean blocked seconds per conn request
Descriptions
- warning repo-updater: 0.05s+ mean blocked seconds per conn request for 10m0s
- critical repo-updater: 0.1s+ mean blocked seconds per conn request for 15m0s
Next steps
- Increase SRC_PGSQL_MAX_OPEN, giving more memory to the database if needed.
- Scale up Postgres memory / CPUs. See our scaling guide.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_mean_blocked_seconds_per_conn_request",
"critical_repo-updater_mean_blocked_seconds_per_conn_request"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="repo-updater"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="repo-updater"}[5m]))) >= 0.05)
Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="repo-updater"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="repo-updater"}[5m]))) >= 0.1)
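Both queries divide the increase in blocked seconds by the increase in connection requests that had to wait, over the same 5 minute window, and compare the result with 0.05 (warning) or 0.1 (critical). The following Python sketch reproduces that arithmetic for a single app_name and db_name pair; the function names and input numbers are hypothetical:

```python
def mean_blocked_seconds(blocked_seconds_increase, waited_for_increase):
    """Mean seconds blocked per waiting connection request, or None if no waits."""
    if waited_for_increase == 0:
        return None  # nothing waited, so the ratio is undefined
    return blocked_seconds_increase / waited_for_increase


def alert_level(mean_blocked):
    """Map the ratio to the thresholds used by the queries above."""
    if mean_blocked is None:
        return "ok"
    if mean_blocked >= 0.1:
        return "critical"
    if mean_blocked >= 0.05:
        return "warning"
    return "ok"


# Hypothetical 5 minute increases: 6 seconds blocked across 100 waits.
print(alert_level(mean_blocked_seconds(6.0, 100)))
```

The alerts additionally require the condition to hold for 10m0s (warning) or 15m0s (critical), which this sketch does not model.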
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning repo-updater: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the repo-updater container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_container_cpu_usage"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^repo-updater.*"}) >= 99)
container memory usage by instance
Descriptions
- critical repo-updater: 90%+ container memory usage by instance for 10m0s
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the repo-updater container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_container_memory_usage"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^repo-updater.*"}) >= 90)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning repo-updater: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the repo-updater service.
- Docker Compose: Consider increasing cpus: of the repo-updater container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^repo-updater.*"}[1d])) >= 80)
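The query takes the 90th percentile of the CPU usage samples collected over one day and compares it with 80. As a rough sketch, the following Python computes a quantile by interpolating linearly between neighboring sorted samples. That interpolation rule is an assumption about how Prometheus evaluates quantile_over_time, not something this reference states, and the sample values are hypothetical.

```python
import math


def quantile(q, values):
    """Quantile with linear interpolation between neighboring ranks."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical CPU usage percentages sampled over one day.
samples = [35, 40, 42, 55, 60, 71, 78, 83, 88, 95]
p90 = quantile(0.9, samples)
print(p90 >= 80)  # whether this day's samples meet the alert condition
```

Because the alert uses the 90th percentile rather than the maximum, short spikes do not trigger it; usage has to be high for a sizeable part of the day, and the condition must persist for 336h0m0s.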
container memory usage (1d maximum) by instance
Descriptions
- warning repo-updater: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the repo-updater service.
- Docker Compose: Consider increasing memory: of the repo-updater container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^repo-updater.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning repo-updater: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the repo-updater container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^repo-updater.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning repo-updater: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the repo-updater container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^repo-updater.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning repo-updater: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the repo-updater container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_container_oomkill_events_total"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^repo-updater.*"})) >= 1)
maximum active goroutines
Descriptions
- warning repo-updater: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_go_goroutines"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*repo-updater"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning repo-updater: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_repo-updater_go_gc_duration_seconds"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*repo-updater"})) >= 2)
percentage pods available
Descriptions
- critical repo-updater: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod repo-updater (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p repo-updater.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_repo-updater_pods_available_percentage"
]
Managed by the Sourcegraph Repo Management team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*repo-updater"}) / count by (app) (up{app=~".*repo-updater"}) * 100) <= 90)
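The query divides the number of repo-updater targets reporting up by the total number of targets and compares the percentage with 90. A minimal Python sketch of that calculation, using hypothetical up values (1 for a healthy pod, 0 otherwise) and a hypothetical function name:

```python
def pods_available_percentage(up_values):
    """Percentage of targets whose up value is 1."""
    if not up_values:
        raise ValueError("no targets")
    return 100 * sum(up_values) / len(up_values)


# Hypothetical deployment with three replicas, one of them down.
percentage = pods_available_percentage([1, 1, 0])
print(percentage <= 90)  # whether the alert condition is met
```

With ten or fewer replicas, a single unavailable pod is enough to meet the condition, so for small deployments this alert effectively means "any pod down for 10m0s".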
unindexed search request errors every 5m by code
Descriptions
- warning searcher: 5%+ unindexed search request errors every 5m by code for 5m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_unindexed_search_request_errors"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((sum by (code) (increase(searcher_service_request_total{code!="200",code!="canceled"}[5m])) / ignoring (code) group_left () sum(increase(searcher_service_request_total[5m])) * 100) >= 5)
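As a rough illustration of what the query above computes, this Python sketch takes per-code request counts for one 5m window and reports which error codes reach the 5% threshold. The input dictionary and function names are hypothetical; only the exclusions (200 and canceled) and the threshold come from the query.

```python
def error_percentage_by_code(request_counts):
    """Map each failing code to its share of all requests, in percent.

    request_counts is a hypothetical dict of code -> request count over the
    5m window, for example {"200": 950, "500": 50}.
    """
    total = sum(request_counts.values())  # the denominator counts every code
    if total == 0:
        return {}
    return {
        code: count / total * 100
        for code, count in request_counts.items()
        if code not in ("200", "canceled")  # excluded by the query's matchers
    }


def codes_over_threshold(request_counts, threshold=5):
    shares = error_percentage_by_code(request_counts)
    return sorted(code for code, pct in shares.items() if pct >= threshold)


print(codes_over_threshold({"200": 900, "500": 100, "canceled": 500}))
print(codes_over_threshold({"200": 990, "500": 10}))
```

Canceled requests are not treated as errors, but they still count toward the total, so a burst of cancellations lowers the computed error percentage.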
requests per second over 10m
Descriptions
- warning searcher: 5+ requests per second over 10m
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_replica_traffic"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((sum by (instance) (rate(searcher_service_request_total[10m]))) >= 5)
mean blocked seconds per conn request
Descriptions
- warning searcher: 0.05s+ mean blocked seconds per conn request for 10m0s
- critical searcher: 0.1s+ mean blocked seconds per conn request for 15m0s
Next steps
- Increase SRC_PGSQL_MAX_OPEN, and give the database more memory if needed.
- Scale up Postgres memory/CPUs. See our scaling guide.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_mean_blocked_seconds_per_conn_request",
"critical_searcher_mean_blocked_seconds_per_conn_request"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.05)
Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.1)
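The two queries above share the same ratio and differ only in the threshold. A minimal Python sketch of that ratio, assuming you already have the two 5m increases as plain numbers (the function names are illustrative):

```python
def mean_blocked_seconds(blocked_seconds_increase, waited_for_increase):
    """Average time a connection request spent blocked during the window.

    Returns None when no request had to wait, since the ratio is undefined.
    """
    if waited_for_increase == 0:
        return None
    return blocked_seconds_increase / waited_for_increase


def threshold_level(mean, warning=0.05, critical=0.1):
    # Thresholds taken from the generated queries; the "for" durations
    # (10m0s and 15m0s) are not modelled here.
    if mean is None:
        return None
    if mean >= critical:
        return "critical"
    if mean >= warning:
        return "warning"
    return None


# 12s blocked across 100 waiting requests is 0.12s per request.
print(threshold_level(mean_blocked_seconds(12.0, 100)))
# 6s blocked across 100 waiting requests is 0.06s per request.
print(threshold_level(mean_blocked_seconds(6.0, 100)))
```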
frontend-internal API error responses every 5m by route
Descriptions
- warning searcher: 2%+ frontend-internal API error responses every 5m by route for 5m0s
Next steps
- Single-container deployments: Check docker logs $CONTAINER_ID for logs starting with repo-updater that indicate requests to the frontend service are failing.
- Kubernetes:
  - Confirm that kubectl get pods shows the frontend pods are healthy.
  - Check kubectl logs searcher for logs indicating request failures to frontend or frontend-internal.
- Docker Compose:
  - Confirm that docker ps shows the frontend-internal container is healthy.
  - Check docker logs searcher for logs indicating request failures to frontend or frontend-internal.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_frontend_internal_api_error_responses"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2..",job="searcher"}[5m])) / ignoring (category) group_left () sum(increase(src_frontend_internal_request_duration_seconds_count{job="searcher"}[5m]))) >= 2)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning searcher: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the searcher container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_container_cpu_usage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^searcher.*"}) >= 99)
container memory usage by instance
Descriptions
- warning searcher: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_container_memory_usage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^searcher.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning searcher: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the searcher service.
- Docker Compose: Consider increasing cpus: of the searcher container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^searcher.*"}[1d])) >= 80)
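The long-term alert looks at the 90th percentile of a day's samples rather than the peak, so brief spikes do not trip it. The Python sketch below shows the idea, assuming linear interpolation between neighboring sorted samples; treat that interpolation detail and the sample data as assumptions for illustration.

```python
import math


def quantile(q, values):
    """Quantile of values using linear interpolation between sorted samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def over_provisioning_threshold(cpu_percent_samples, threshold=80):
    # Mirrors quantile_over_time(0.9, ...[1d]) >= 80 for one container.
    return quantile(0.9, cpu_percent_samples) >= threshold


# One spike among ten samples: the 90th percentile stays well below 80.
print(over_provisioning_threshold([50] * 9 + [95]))
# Two high samples out of ten: the 90th percentile reaches the high value.
print(over_provisioning_threshold([50] * 8 + [95, 95]))
```

The alert additionally requires the condition to hold for 336h0m0s (14 days), which this sketch does not model.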
container memory usage (1d maximum) by instance
Descriptions
- warning searcher: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the searcher service.
- Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^searcher.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning searcher: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the searcher container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^searcher.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning searcher: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^searcher.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning searcher: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the searcher container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_container_oomkill_events_total"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^searcher.*"})) >= 1)
maximum active goroutines
Descriptions
- warning searcher: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_go_goroutines"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*searcher"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning searcher: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_searcher_go_gc_duration_seconds"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*searcher"})) >= 2)
percentage pods available
Descriptions
- critical searcher: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod searcher (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p searcher.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_searcher_pods_available_percentage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*searcher"}) / count by (app) (up{app=~".*searcher"}) * 100) <= 90)
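If you prefer to check for OOM kills programmatically rather than reading kubectl describe output, one option is to inspect the pod's JSON. The sketch below assumes the JSON comes from kubectl get pod with the -o json flag and that the termination reason appears under status.containerStatuses[].lastState.terminated.reason; the sample document is fabricated for illustration.

```python
import json


def oom_killed_containers(pod):
    """Return names of containers whose last termination reason was OOMKilled.

    pod is a dict parsed from the pod's JSON representation.
    """
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status.get("name"))
    return names


# Hypothetical, heavily trimmed pod document used only to exercise the function.
sample = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "searcher",
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "sidecar", "lastState": {}}
    ]
  }
}
""")

print(oom_killed_containers(sample))
```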
mean blocked seconds per conn request
Descriptions
- warning symbols: 0.05s+ mean blocked seconds per conn request for 10m0s
- critical symbols: 0.1s+ mean blocked seconds per conn request for 15m0s
Next steps
- Increase SRC_PGSQL_MAX_OPEN, and give the database more memory if needed.
- Scale up Postgres memory/CPUs. See our scaling guide.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_mean_blocked_seconds_per_conn_request",
"critical_symbols_mean_blocked_seconds_per_conn_request"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="symbols"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="symbols"}[5m]))) >= 0.05)
Generated query for critical alert: max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="symbols"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="symbols"}[5m]))) >= 0.1)
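When an alert has both a warning and a critical level, both identifiers go into the same observability.silenceAlerts array. The Python sketch below builds that fragment, assuming the identifiers follow the level_service_name pattern seen in the examples on this page; the helper name is hypothetical.

```python
import json


def silence_alerts_fragment(service, alert_name, levels=("warning", "critical")):
    """Build a JSON object holding the silenceAlerts entries for one alert."""
    # Assumed naming pattern: <level>_<service>_<alert_name>.
    ids = [f"{level}_{service}_{alert_name}" for level in levels]
    return json.dumps({"observability.silenceAlerts": ids}, indent=2)


print(silence_alerts_fragment("symbols", "mean_blocked_seconds_per_conn_request"))
```

The output is a standalone JSON object; in practice the observability.silenceAlerts key belongs inside your existing site configuration rather than in a separate document.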
frontend-internal API error responses every 5m by route
Descriptions
- warning symbols: 2%+ frontend-internal API error responses every 5m by route for 5m0s
Next steps
- Single-container deployments: Check docker logs $CONTAINER_ID for logs starting with repo-updater that indicate requests to the frontend service are failing.
- Kubernetes:
  - Confirm that kubectl get pods shows the frontend pods are healthy.
  - Check kubectl logs symbols for logs indicating request failures to frontend or frontend-internal.
- Docker Compose:
  - Confirm that docker ps shows the frontend-internal container is healthy.
  - Check docker logs symbols for logs indicating request failures to frontend or frontend-internal.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_frontend_internal_api_error_responses"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((sum by (category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2..",job="symbols"}[5m])) / ignoring (category) group_left () sum(increase(src_frontend_internal_request_duration_seconds_count{job="symbols"}[5m]))) >= 2)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning symbols: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the symbols container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_container_cpu_usage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^symbols.*"}) >= 99)
container memory usage by instance
Descriptions
- warning symbols: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the symbols container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_container_memory_usage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^symbols.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning symbols: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the symbols service.
- Docker Compose: Consider increasing cpus: of the symbols container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^symbols.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning symbols: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the symbols service.
- Docker Compose: Consider increasing memory: of the symbols container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^symbols.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning symbols: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the symbols container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^symbols.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning symbols: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the symbols container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^symbols.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning symbols: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the symbols container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_container_oomkill_events_total"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^symbols.*"})) >= 1)
maximum active goroutines
Descriptions
- warning symbols: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_go_goroutines"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (instance) (go_goroutines{job=~".*symbols"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning symbols: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_symbols_go_gc_duration_seconds"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (instance) (go_gc_duration_seconds{job=~".*symbols"})) >= 2)
percentage pods available
Descriptions
- critical symbols: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod symbols (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p symbols.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_symbols_pods_available_percentage"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*symbols"}) / count by (app) (up{app=~".*symbols"}) * 100) <= 90)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning syntect-server: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the syntect-server container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_syntect-server_container_cpu_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^syntect-server.*"}) >= 99)
container memory usage by instance
Descriptions
- warning syntect-server: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_syntect-server_container_memory_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^syntect-server.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning syntect-server: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the syntect-server service.
- Docker Compose: Consider increasing cpus: of the syntect-server container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_syntect-server_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^syntect-server.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning syntect-server: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the syntect-server service.
- Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_syntect-server_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^syntect-server.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning syntect-server: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the syntect-server container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_syntect-server_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^syntect-server.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning syntect-server: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_syntect-server_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^syntect-server.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning syntect-server: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the syntect-server container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_syntect-server_container_oomkill_events_total"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^syntect-server.*"})) >= 1)
percentage pods available
Descriptions
- critical syntect-server: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod syntect-server (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p syntect-server.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_syntect-server_pods_available_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*syntect-server"}) / count by (app) (up{app=~".*syntect-server"}) * 100) <= 90)
average resolve revision duration over 5m
Descriptions
- warning zoekt: 15s+ average resolve revision duration over 5m
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_average_resolve_revision_duration"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((sum(rate(resolve_revision_seconds_sum[5m])) / sum(rate(resolve_revision_seconds_count[5m]))) >= 15)
the number of repositories we failed to get indexing options over 5m
Descriptions
- warning zoekt: 100+ the number of repositories we failed to get indexing options over 5m for 5m0s
- critical zoekt: 100+ the number of repositories we failed to get indexing options over 5m for 35m0s
Next steps
- View error rates on gitserver and frontend to identify the root cause.
- Roll back the frontend/gitserver deployment if the errors are due to a bad code change.
- View error logs for getIndexOptions via the net/trace debug interface. For example, click on an indexed-search-indexer- entry on https://sourcegraph.com/-/debug/, then click on Traces. Replace sourcegraph.com with your instance address.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_get_index_options_error_increase",
"critical_zoekt_get_index_options_error_increase"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((sum(increase(get_index_options_error_total[5m]))) >= 100)
Generated query for critical alert: max((sum(increase(get_index_options_error_total[5m]))) >= 100)
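The warning and critical queries above are identical; according to the descriptions, the two levels differ only in how long the condition must hold (5m0s versus 35m0s). The following is a minimal sketch of that behaviour, assuming a regular evaluation interval and assuming the condition must hold continuously, as with a Prometheus `for` clause. The function name and the sample values are hypothetical.

```python
from typing import List, Optional

THRESHOLD = 100           # ">= 100" in both generated queries
WARNING_FOR_S = 5 * 60    # "for 5m0s" in the warning description
CRITICAL_FOR_S = 35 * 60  # "for 35m0s" in the critical description


def severity(samples: List[float], interval_s: int) -> Optional[str]:
    """Return the highest severity firing after the last evaluation.

    samples holds one value of the alert expression per evaluation,
    oldest first; interval_s is the assumed time between evaluations.
    """
    # Count how many of the most recent evaluations breached in a row.
    streak = 0
    for value in reversed(samples):
        if value < THRESHOLD:
            break
        streak += 1
    if streak == 0:
        return None
    # Time between the first and the last breaching evaluation.
    held_for_s = (streak - 1) * interval_s
    if held_for_s >= CRITICAL_FOR_S:
        return "critical"
    if held_for_s >= WARNING_FOR_S:
        return "warning"
    return None


# Hypothetical series evaluated once per minute.
print(severity([150] * 6, interval_s=60))   # held for 5 minutes
print(severity([150] * 36, interval_s=60))  # held for 35 minutes
```

A single evaluation below the threshold resets the streak, so a brief recovery clears both levels in this sketch.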
indexed search request errors every 5m by code
Descriptions
- warning zoekt: 5%+ indexed search request errors every 5m by code for 5m0s
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_indexed_search_request_errors"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((sum by (code) (increase(src_zoekt_request_duration_seconds_count{code!~"2.."}[5m])) / ignoring (code) group_left () sum(increase(src_zoekt_request_duration_seconds_count[5m])) * 100) >= 5)
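As a rough illustration of the arithmetic in the query above, the sketch below divides each non-2xx request count by the total request count and compares the largest share against the 5% threshold. The function names and the example counts are hypothetical, and the `code` values are assumed to look like HTTP status codes.

```python
import re
from typing import Dict

THRESHOLD_PERCENT = 5  # ">= 5" in the generated query


def error_percent_by_code(requests_by_code: Dict[str, float]) -> Dict[str, float]:
    """Share of all requests, in percent, for every code not matching 2.."""
    total = sum(requests_by_code.values())
    if total == 0:
        return {}
    return {
        code: 100 * count / total
        for code, count in requests_by_code.items()
        # Mirrors code!~"2.." (the selector regex is fully anchored).
        if re.fullmatch(r"2..", code) is None
    }


def breaches_threshold(requests_by_code: Dict[str, float]) -> bool:
    """True when any single error code reaches the threshold."""
    shares = error_percent_by_code(requests_by_code)
    return any(share >= THRESHOLD_PERCENT for share in shares.values())


# Hypothetical 5-minute request counts per code.
counts = {"200": 940, "500": 50, "503": 10}
print(error_percent_by_code(counts))
print(breaches_threshold(counts))
```

Because the percentage is computed per code, several error codes that each stay below 5% do not breach the threshold in this sketch, even if their sum exceeds it.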
process memory map areas percentage used (per instance)
Descriptions
- warning zoekt: 60%+ process memory map areas percentage used (per instance)
- critical zoekt: 80%+ process memory map areas percentage used (per instance)
Next steps
- If you are running out of memory map areas, you could resolve this by:
  - Creating additional Zoekt replicas: This spreads the shards across more replicas, so each individual replica holds fewer shards. This, in turn, decreases the number of memory map areas that a single replica needs to create in order to load its shards into memory.
  - Increasing the virtual memory subsystem's "max_map_count" parameter, which defines the upper limit of memory areas a process can use. The exact instructions for tuning this parameter can differ depending on your environment. See https://kernel.org/doc/Documentation/sysctl/vm.txt for more information.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_memory_map_areas_percentage_used",
"critical_zoekt_memory_map_areas_percentage_used"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max(((proc_metrics_memory_map_current_count / proc_metrics_memory_map_max_limit) * 100) >= 60)
Generated query for critical alert: max(((proc_metrics_memory_map_current_count / proc_metrics_memory_map_max_limit) * 100) >= 80)
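To make the percentage concrete, here is a minimal sketch of the calculation together with one way to inspect the inputs by hand on a Linux host. It assumes that the current count corresponds to the number of lines in /proc/<pid>/maps and that the limit is the value of /proc/sys/vm/max_map_count; whether the exported metrics are collected in exactly this way is an assumption. All function names are hypothetical.

```python
from pathlib import Path
from typing import Tuple

WARNING_PERCENT = 60   # ">= 60" in the warning query
CRITICAL_PERCENT = 80  # ">= 80" in the critical query


def map_area_percent(current_count: int, max_limit: int) -> float:
    """Percentage of the memory map area limit that is in use."""
    if max_limit <= 0:
        raise ValueError("max_limit must be positive")
    return 100 * current_count / max_limit


def classify(percent: float) -> str:
    """Map a percentage onto the alert levels described above."""
    if percent >= CRITICAL_PERCENT:
        return "critical"
    if percent >= WARNING_PERCENT:
        return "warning"
    return "ok"


def read_linux_map_usage(pid: int) -> Tuple[int, int]:
    """Read (current, limit) from procfs. Linux only; an assumed data source."""
    with open(f"/proc/{pid}/maps") as maps:
        current = sum(1 for _ in maps)  # one line per mapped area
    limit = int(Path("/proc/sys/vm/max_map_count").read_text())
    return current, limit


# Hypothetical values: 600 mapped areas against a limit of 1000.
print(classify(map_area_percent(600, 1000)))
```

On a Linux host, `map_area_percent(*read_linux_map_usage(pid))` would report the figure for a given process ID.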
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning zoekt: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the zoekt-indexserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_container_cpu_usage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-indexserver.*"}) >= 99)
container memory usage by instance
Descriptions
- warning zoekt: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the zoekt-indexserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_container_memory_usage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^zoekt-indexserver.*"}) >= 99)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning zoekt: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the zoekt-webserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_container_cpu_usage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-webserver.*"}) >= 99)
container memory usage by instance
Descriptions
- warning zoekt: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the zoekt-webserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_container_memory_usage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^zoekt-webserver.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning zoekt: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the zoekt-indexserver service.
- Docker Compose: Consider increasing cpus: of the zoekt-indexserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-indexserver.*"}[1d])) >= 80)
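The long-term provisioning query above takes the 90th percentile of the CPU usage samples from the last day and compares it with 80. The sketch below reproduces that calculation with linear interpolation between neighbouring samples, which is how quantile_over_time is understood to behave; treat that detail as an assumption. The function names and sample values are hypothetical.

```python
import math
from typing import Sequence

QUANTILE = 0.9          # first argument of quantile_over_time
THRESHOLD_PERCENT = 80  # ">= 80" in the generated query


def quantile(q: float, values: Sequence[float]) -> float:
    """q-quantile of values, interpolating linearly between samples."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)        # fractional index into ordered
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower                # distance from the lower sample
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def breaches_threshold(cpu_percent_samples: Sequence[float]) -> bool:
    """True when the 90th percentile of the samples reaches the threshold."""
    return quantile(QUANTILE, cpu_percent_samples) >= THRESHOLD_PERCENT


# Hypothetical day of samples: mostly idle with a few busy periods.
samples = [20.0] * 95 + [95.0] * 5
print(quantile(QUANTILE, samples))
print(breaches_threshold(samples))
```

Using a high percentile rather than the maximum means short spikes, like the five busy samples here, do not trip the alert on their own.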
container memory usage (1d maximum) by instance
Descriptions
- warning zoekt: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the zoekt-indexserver service.
- Docker Compose: Consider increasing memory: of the zoekt-indexserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^zoekt-indexserver.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning zoekt: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the zoekt-indexserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-indexserver.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning zoekt: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the zoekt-indexserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^zoekt-indexserver.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning zoekt: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the zoekt-indexserver container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_container_oomkill_events_total"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^zoekt-indexserver.*"})) >= 1)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning zoekt: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the zoekt-webserver service.
- Docker Compose: Consider increasing cpus: of the zoekt-webserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-webserver.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning zoekt: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the zoekt-webserver service.
- Docker Compose: Consider increasing memory: of the zoekt-webserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^zoekt-webserver.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning zoekt: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the zoekt-webserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^zoekt-webserver.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning zoekt: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the zoekt-webserver container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^zoekt-webserver.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning zoekt: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the zoekt-webserver container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_zoekt_container_oomkill_events_total"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^zoekt-webserver.*"})) >= 1)
percentage pods available
Descriptions
- critical zoekt: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod indexed-search (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p indexed-search.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_zoekt_pods_available_percentage"
]
Managed by the Sourcegraph Search Core team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*indexed-search"}) / count by (app) (up{app=~".*indexed-search"}) * 100) <= 90)
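The pods-available percentage in the query above is the number of targets reporting up == 1 divided by the number of targets, per app, multiplied by 100. A minimal sketch of that arithmetic follows; the function names and the sample data are hypothetical. Note that the query compares with `<= 90`, so exactly 90% also counts as a breach.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

THRESHOLD_PERCENT = 90  # "<= 90" in the generated query


def pods_available_percent(up_samples: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Per-app percentage of targets whose `up` value is 1.

    up_samples yields (app, up) pairs, one per scraped target.
    """
    healthy: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for app, up in up_samples:
        healthy[app] += up  # sum by (app) (up)
        total[app] += 1     # count by (app) (up)
    return {app: 100 * healthy[app] / total[app] for app in total}


def breaches_threshold(up_samples: Iterable[Tuple[str, int]]) -> bool:
    """True when any app is at or below the threshold."""
    percentages = pods_available_percent(up_samples)
    return any(value <= THRESHOLD_PERCENT for value in percentages.values())


# Hypothetical scrape: two of three indexed-search pods are up.
scrape = [("indexed-search", 1), ("indexed-search", 1), ("indexed-search", 0)]
print(pods_available_percent(scrape))
print(breaches_threshold(scrape))
```

With a small replica count, a single unavailable pod is enough to cross the threshold, which is why the alert also requires the condition to hold for 10m0s.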
average prometheus rule group evaluation duration over 10m by rule group
Descriptions
- warning prometheus: 30s+ average prometheus rule group evaluation duration over 10m by rule group
Next steps
- Check the Container monitoring (not available on server) panels and try increasing resources for Prometheus if necessary.
- If the rule group taking a long time to evaluate belongs to /sg_prometheus_addons, try reducing the complexity of any custom Prometheus rules provided.
- If the rule group taking a long time to evaluate belongs to /sg_config_prometheus, please open an issue.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_prometheus_rule_eval_duration"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (rule_group) (avg_over_time(prometheus_rule_group_last_duration_seconds[10m]))) >= 30)
failed prometheus rule evaluations over 5m by rule group
Descriptions
- warning prometheus: 0+ failed prometheus rule evaluations over 5m by rule group
Next steps
- Check Prometheus logs for messages related to rule group evaluation (generally with log field component="rule manager").
- If the rule group failing to evaluate belongs to /sg_prometheus_addons, ensure any custom Prometheus configuration provided is valid.
- If the rule group failing to evaluate belongs to /sg_config_prometheus, please open an issue.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_prometheus_rule_eval_failures"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (rule_group) (rate(prometheus_rule_evaluation_failures_total[5m]))) > 0)
alertmanager notification latency over 1m by integration
Descriptions
- warning prometheus: 1s+ alertmanager notification latency over 1m by integration
Next steps
- Check the Container monitoring (not available on server) panels and try increasing resources for Prometheus if necessary.
- Ensure that your observability.alerts configuration (in site configuration) is valid.
- Check if the relevant alert integration service is experiencing downtime or issues.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_alertmanager_notification_latency"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (integration) (rate(alertmanager_notification_latency_seconds_sum[1m]))) >= 1)
failed alertmanager notifications over 1m by integration
Descriptions
- warning prometheus: 0+ failed alertmanager notifications over 1m by integration
Next steps
- Ensure that your observability.alerts configuration (in site configuration) is valid.
- Check if the relevant alert integration service is experiencing downtime or issues.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_alertmanager_notification_failures"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (integration) (rate(alertmanager_notifications_failed_total[1m]))) > 0)
prometheus configuration reload status
Descriptions
- warning prometheus: less than 1 prometheus configuration reload status
Next steps
- Check Prometheus logs for messages related to configuration loading.
- Ensure any custom configuration you have provided to Prometheus is valid.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_prometheus_config_status"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: min((prometheus_config_last_reload_successful) < 1)
alertmanager configuration reload status
Descriptions
- warning prometheus: less than 1 alertmanager configuration reload status
Next steps
- Ensure that your observability.alerts configuration (in site configuration) is valid.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_alertmanager_config_status"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: min((alertmanager_config_last_reload_successful) < 1)
prometheus tsdb failures by operation over 1m by operation
Descriptions
- warning prometheus: 0+ prometheus tsdb failures by operation over 1m by operation
Next steps
- Check Prometheus logs for messages related to the failing operation.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_prometheus_tsdb_op_failure"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((increase(label_replace({__name__=~"prometheus_tsdb_(.*)_failed_total"}, "operation", "$1", "__name__", "(.+)s_failed_total")[5m:1m])) > 0)
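The query above selects metrics by name and uses label_replace to derive an `operation` label from the metric name. The sketch below mimics that with Python regular expressions, assuming both patterns are fully anchored as Prometheus regexes are. The function name is hypothetical; the example metric names are used only for illustration.

```python
import re
from typing import Optional

# Patterns copied from the generated query.
SELECTOR = re.compile(r"prometheus_tsdb_(.*)_failed_total")
REPLACE = re.compile(r"(.+)s_failed_total")


def operation_label(metric_name: str) -> Optional[str]:
    """Value of the derived `operation` label, or None if none is set."""
    # The metric must match the __name__ selector to be part of the query.
    if SELECTOR.fullmatch(metric_name) is None:
        return None
    # label_replace leaves the series unchanged when its regex does not match.
    match = REPLACE.fullmatch(metric_name)
    return match.group(1) if match else None


for name in (
    "prometheus_tsdb_compactions_failed_total",
    "prometheus_tsdb_head_truncations_failed_total",
    "prometheus_tsdb_reloads_failures_total",
):
    print(name, "->", operation_label(name))
```

In this sketch the derived label keeps the `prometheus_tsdb_` prefix and drops only the trailing `s_failed_total`, since the capture group spans everything before that suffix.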
prometheus scrapes that exceed the sample limit over 10m
Descriptions
- warning prometheus: 0+ prometheus scrapes that exceed the sample limit over 10m
Next steps
- Check Prometheus logs for messages related to target scrape failures.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_prometheus_target_sample_exceeded"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m])) > 0)
prometheus scrapes rejected due to duplicate timestamps over 10m
Descriptions
- warning prometheus: 0+ prometheus scrapes rejected due to duplicate timestamps over 10m
Next steps
- Check Prometheus logs for messages related to target scrape failures.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_prometheus_target_sample_duplicate"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[10m])) > 0)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning prometheus: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the prometheus container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_container_cpu_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^prometheus.*"}) >= 99)
container memory usage by instance
Descriptions
- warning prometheus: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the prometheus container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_container_memory_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^prometheus.*"}) >= 99)
container cpu usage total (90th percentile over 1d) across all cores by instance
Descriptions
- warning prometheus: 80%+ container cpu usage total (90th percentile over 1d) across all cores by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the Deployment.yaml for the prometheus service.
- Docker Compose: Consider increasing cpus: of the prometheus container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_provisioning_container_cpu_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((quantile_over_time(0.9, cadvisor_container_cpu_usage_percentage_total{name=~"^prometheus.*"}[1d])) >= 80)
container memory usage (1d maximum) by instance
Descriptions
- warning prometheus: 80%+ container memory usage (1d maximum) by instance for 336h0m0s
Next steps
- Kubernetes: Consider increasing memory limits in the Deployment.yaml for the prometheus service.
- Docker Compose: Consider increasing memory: of the prometheus container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_provisioning_container_memory_usage_long_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^prometheus.*"}[1d])) >= 80)
container cpu usage total (5m maximum) across all cores by instance
Descriptions
- warning prometheus: 90%+ container cpu usage total (5m maximum) across all cores by instance for 30m0s
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the prometheus container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_provisioning_container_cpu_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_cpu_usage_percentage_total{name=~"^prometheus.*"}[5m])) >= 90)
container memory usage (5m maximum) by instance
Descriptions
- warning prometheus: 90%+ container memory usage (5m maximum) by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the prometheus container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_provisioning_container_memory_usage_short_term"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max_over_time(cadvisor_container_memory_usage_percentage_total{name=~"^prometheus.*"}[5m])) >= 90)
container OOMKILL events total by instance
Descriptions
- warning prometheus: 1+ container OOMKILL events total by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the prometheus container in docker-compose.yml.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_prometheus_container_oomkill_events_total"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((max by (name) (container_oom_events_total{name=~"^prometheus.*"})) >= 1)
percentage pods available
Descriptions
- critical prometheus: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod prometheus (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p prometheus.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_prometheus_pods_available_percentage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*prometheus"}) / count by (app) (up{app=~".*prometheus"}) * 100) <= 90)
executor active handlers
Descriptions
- critical executor: 0 active executor handlers and > 0 queue size for 5m0s
Next steps
- Check the state of any compute VMs; they may be taking longer than expected to boot.
- Make sure the executors appear under Site Admin > Executors.
- Check the Grafana dashboard section for APIClient: it should show frequent requests to Dequeue and Heartbeat, and those requests must not fail.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_executor_executor_handlers"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Custom query for critical alert: min(((sum(src_executor_processor_handlers{sg_job=~"^sourcegraph-executors.*"}) or vector(0)) == 0 and (sum by (queue) (src_executor_total{job=~"^sourcegraph-executors.*"})) > 0) <= 0)
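The condition in the alert description (0 active executor handlers and > 0 queue size) can be sketched as below. This follows the description rather than the exact PromQL label matching; the function name and the queue names are hypothetical, and the "for 5m0s" duration is not modeled.

```python
def executors_stalled(active_handlers, queue_sizes):
    """Return True when no handler is active while at least one queue has work.

    active_handlers: per-executor handler counts (may be empty if no executor reports).
    queue_sizes: mapping of queue name to the number of queued jobs.
    """
    # An empty list sums to 0, similar in spirit to the "or vector(0)" fallback.
    total_handlers = sum(active_handlers)
    has_queued_work = any(size > 0 for size in queue_sizes.values())
    return total_handlers == 0 and has_queued_work


# Hypothetical values, for illustration only.
print(executors_stalled([], {"queue-a": 12}))      # no executors reporting, work queued
print(executors_stalled([0, 0], {"queue-a": 0}))   # idle executors, nothing queued
print(executors_stalled([2, 1], {"queue-a": 12}))  # executors busy with queued work
```

Only the first case corresponds to the alert: jobs are waiting, but nothing is picking them up.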
executor operation error rate over 5m
Descriptions
- critical executor: 100%+ executor operation error rate over 5m for 1h0m0s
Next steps
- Determine the cause of failure from the auto-indexing job logs in the site-admin page.
- This alert fires if all executor jobs have been failing for the past hour. The alert will continue for up to 5 hours until the error rate is no longer 100%, even if there are no running jobs in that time, as the problem is not known to be resolved until jobs start succeeding again.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_executor_executor_processor_error_rate"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Custom query for critical alert: max((last_over_time(sum(increase(src_executor_processor_errors_total{sg_job=~"^sourcegraph-executors.*"}[5m]))[5h:]) / (last_over_time(sum(increase(src_executor_processor_total{sg_job=~"^sourcegraph-executors.*"}[5m]))[5h:]) + last_over_time(sum(increase(src_executor_processor_errors_total{sg_job=~"^sourcegraph-executors.*"}[5m]))[5h:])) * 100) >= 100)
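The ratio in the custom query, errors / (total + errors) * 100, can be sketched as follows. This is an illustration of the arithmetic only: the function name and sample values are hypothetical, and the `last_over_time(...[5h:])` wrapper that carries the last known value forward for up to 5 hours is not modeled.

```python
def error_rate_percent(total_increase, errors_increase):
    """Mirror errors / (total + errors) * 100 from the query.

    Both arguments are 5-minute counter increases. Returns None when nothing
    was counted at all, because the ratio is undefined in that case.
    """
    denominator = total_increase + errors_increase
    if denominator == 0:
        return None
    return 100 * errors_increase / denominator


# Hypothetical 5-minute increases, for illustration only.
print(error_rate_percent(3, 1))  # some errors, below the threshold
print(error_rate_percent(0, 4))  # only errors counted, meets the >= 100 threshold
print(error_rate_percent(0, 0))  # nothing counted, ratio undefined
```

With non-negative inputs, this ratio reaches 100 only when the total increase is zero while errors are still being counted.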
maximum active goroutines
Descriptions
- warning executor: 10000+ maximum active goroutines for 10m0s
Next steps
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_executor_go_goroutines"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (sg_instance) (go_goroutines{sg_job=~".*sourcegraph-executors"})) >= 10000)
maximum go garbage collection duration
Descriptions
- warning executor: 2s+ maximum go garbage collection duration
Next steps
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_executor_go_gc_duration_seconds"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for warning alert: max((max by (sg_instance) (go_gc_duration_seconds{sg_job=~".*sourcegraph-executors"})) >= 2)
repository queue longest time in queue
Descriptions
- critical codeintel-uploads: 3600s+ repository queue longest time in queue
Next steps
- An alert here generally indicates underprovisioned worker instance(s), an underprovisioned main Postgres instance, or both.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_codeintel-uploads_codeintel_commit_graph_queued_max_age"
]
Managed by the Sourcegraph Code intelligence team.
Technical details
Generated query for critical alert: max((max(src_codeintel_commit_graph_queued_duration_seconds_total)) >= 3600)
usage data exporter operation error rate over 5m
Descriptions
- warning telemetry: 0%+ usage data exporter operation error rate over 5m for 30m0s
Next steps
- Involve the Cloud team to inspect logs of the managed instance to determine error sources.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_telemetry_telemetry_job_error_rate"
]
Managed by the Sourcegraph Data & Analytics team.
Technical details
Generated query for warning alert: max((sum by (op) (increase(src_telemetry_job_errors_total{job=~"^worker.*"}[5m])) / (sum by (op) (increase(src_telemetry_job_total{job=~"^worker.*"}[5m])) + sum by (op) (increase(src_telemetry_job_errors_total{job=~"^worker.*"}[5m]))) * 100) > 0)
utilized percentage of maximum throughput
Descriptions
- warning telemetry: 90%+ utilized percentage of maximum throughput for 30m0s
Next steps
- Throughput utilization is high. This could be a signal that this instance is producing too many events for the export job to keep up. Configure more throughput using the maxBatchSize option.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_telemetry_telemetry_job_utilized_throughput"
]
Managed by the Sourcegraph Data & Analytics team.
Technical details
Generated query for warning alert: max((rate(src_telemetry_job_total{op="SendEvents"}[1h]) / on () group_right () src_telemetry_job_max_throughput * 100) > 90)
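The utilization figure in the generated query is the observed per-second rate divided by the configured maximum throughput, times 100. A minimal sketch, assuming both values are expressed in the same unit (the source does not state the unit of `src_telemetry_job_max_throughput`); the function name and sample numbers are hypothetical.

```python
def utilized_throughput_percent(events_in_window, window_seconds, max_throughput):
    """Mirror rate(...[1h]) / max_throughput * 100.

    events_in_window: counter increase over the window.
    window_seconds: window length in seconds (3600 for a 1h range).
    max_throughput: assumed to be in events per second.
    """
    if window_seconds <= 0 or max_throughput <= 0:
        raise ValueError("window_seconds and max_throughput must be positive")
    events_per_second = events_in_window / window_seconds
    return 100 * events_per_second / max_throughput


# Hypothetical values: 342000 events in one hour against a maximum of 100 per second.
utilization = utilized_throughput_percent(342_000, 3600, 100)
print(utilization, utilization > 90)
```

The warning condition is met when this value stays above 90 for 30 minutes.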
spans refused per receiver
Descriptions
- warning otel-collector: 1+ spans refused per receiver for 5m0s
Next steps
- Check the logs of the collector and the configuration of the receiver.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_otel-collector_otel_span_refused"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (receiver) (rate(otelcol_receiver_refused_spans[1m]))) > 1)
span export failures by exporter
Descriptions
- warning otel-collector: 1+ span export failures by exporter for 5m0s
Next steps
- Check the configuration of the exporter and whether the service being exported to is up.
- More help interpreting this metric is available in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_otel-collector_otel_span_export_failures"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((sum by (exporter) (rate(otelcol_exporter_send_failed_spans[1m]))) > 1)
container cpu usage total (1m average) across all cores by instance
Descriptions
- warning otel-collector: 99%+ container cpu usage total (1m average) across all cores by instance
Next steps
- Kubernetes: Consider increasing CPU limits in the relevant Deployment.yaml.
- Docker Compose: Consider increasing cpus: of the otel-collector container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_otel-collector_container_cpu_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_cpu_usage_percentage_total{name=~"^otel-collector.*"}) >= 99)
container memory usage by instance
Descriptions
- warning otel-collector: 99%+ container memory usage by instance
Next steps
- Kubernetes: Consider increasing the memory limit in the relevant Deployment.yaml.
- Docker Compose: Consider increasing memory: of the otel-collector container in docker-compose.yml.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"warning_otel-collector_container_memory_usage"
]
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for warning alert: max((cadvisor_container_memory_usage_percentage_total{name=~"^otel-collector.*"}) >= 99)
percentage pods available
Descriptions
- critical otel-collector: less than 90% percentage pods available for 10m0s
Next steps
- Determine if the pod was OOM killed using kubectl describe pod otel-collector (look for OOMKilled: true) and, if so, consider increasing the memory limit in the relevant Deployment.yaml.
- Check the logs before the container restarted to see if there are panic: messages or similar using kubectl logs -p otel-collector.
- Learn more about the related dashboard panel in the dashboards reference.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
"observability.silenceAlerts": [
"critical_otel-collector_pods_available_percentage"
]
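Each alert in this reference shows its own single-entry "observability.silenceAlerts" array, but the setting is one array in the site configuration, so silencing several alerts means merging their identifiers into it. A minimal sketch of building that merged fragment; the helper name is hypothetical, and the identifiers are the otel-collector ones listed in this reference.

```python
import json


def silence_alerts_fragment(alert_ids):
    """Build a site configuration fragment that silences the given alerts.

    Duplicates are dropped while preserving the original order.
    """
    unique_ids = list(dict.fromkeys(alert_ids))
    return json.dumps({"observability.silenceAlerts": unique_ids}, indent=2)


fragment = silence_alerts_fragment([
    "warning_otel-collector_container_cpu_usage",
    "warning_otel-collector_container_memory_usage",
    "critical_otel-collector_pods_available_percentage",
    "warning_otel-collector_container_cpu_usage",  # repeated on purpose
])
print(fragment)
```

The printed object is meant to be merged into the existing site configuration by hand, not pasted over it.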
Managed by the Sourcegraph Cloud DevOps team.
Technical details
Generated query for critical alert: min((sum by (app) (up{app=~".*otel-collector"}) / count by (app) (up{app=~".*otel-collector"}) * 100) <= 90)