Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dvue/k8s otel alert policies #2558

Merged
merged 8 commits into from
Oct 17, 2024

Conversation

douavue
Copy link
Contributor

@douavue douavue commented Sep 17, 2024

Summary

A concise description of the changes being introduced. Please review the pre-merge checklist section to validate this pull request is ready for review and merge. If it is not ready, please mark the pull request as a draft.

The owners of this repo are not experts in the subject matter of the quickstarts. We review for the quickstart to be functional and for security risks. If you are seeking feedback on the content of the quickstart, please seek out a subject matter expert. If you are not an internal NR contributor, we can do our best to connect you with a content reviewer.

Pre merge checklist

  • Did you check you NRQL syntax? - Does it work?
  • Did you include a Data source and Documentation reference?
  • Are all documentation links publicly accessible?
  • Did you check your descriptive content for voice and tone?
  • Did you check your descriptive content for spelling and grammar errors?
  • Did you review your content with a subject matter expert? (e.g. a Browser agent quickstart is reviewed with a member of the Browser Agent team)

Dashboards

  • Does the PR contain a screenshot for each of your dashboards?
  • Do your screenshots show data?
  • Has the sanitization script been run on each dashboard?

Alerts

  • Did you check that your alerts actually work?
  • Are you trying to create standalone alerts? Standalone alerts are deprecated. They should only be included in quickstarts.

@CLAassistant
Copy link

CLAassistant commented Sep 17, 2024

CLA assistant check
All committers have signed the CLA.

@douavue
Copy link
Contributor Author

douavue commented Sep 17, 2024

Marking as Ready, but pending additional changes.
DO NOT MERGE.

@douavue douavue marked this pull request as ready for review September 17, 2024 19:28
Copy link
Contributor

@sjyothi54 sjyothi54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@douavue I have added some comments and Could you please ensure consistency in the title "Kubernetes (OpenTelemetry)" throughout the quickstart.

Kubernetes is an open-source container-orchestration system for automating
computer application deployment, scaling, and management.
url: >-
https://docs.newrelic.com/docs/integrations/host-integrations/host-integrations-list/kubernetes-monitoring-integration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
https://docs.newrelic.com/docs/integrations/host-integrations/host-integrations-list/kubernetes-monitoring-integration
https://docs.newrelic.com/docs/kubernetes-pixie/kubernetes-integration/installation/k8s-otel/#install

I noticed that we have a Kubernetes OpenTelemetry-specific document.

url: >-
https://docs.newrelic.com/docs/integrations/host-integrations/host-integrations-list/kubernetes-monitoring-integration
dataSourceIds:
- kubernetes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- kubernetes
- kubernetes-opentelemetry

Since this quickstart is different from Kubernetes, you need to create a new datasource: kubernetes-opentelemetry. Here is the example: https://github.com/newrelic/newrelic-quickstarts/tree/release/data-sources/apigee-api

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sjyothi54. Do the dataSourceId and dashboards properties require a value or can I remove them from the config?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @douavue, dashboards are optional, but it is best practice to provide them in order to show the customer what are the metrics they get in dashboard. Since this is a new quickstart, datasourceIds is required. Please provide datasourceIds for the Kubernets opentelemetry, it is used to create install steps in production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjyothi54 - I pushed an update here. We are not including a dashboard for now. Can you please review again? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjyothi54 we don't provide a dashboard, because the K8s OTel instrumentation enables all the regular K8s UIs so no additional dashboards are needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjyothi54 we don't provide a dashboard, because the K8s OTel instrumentation enables all the regular K8s UIs so no additional dashboards are needed.

@mangulonr @douavue Thanks for the clarification

Comment on lines 51 to 52
dashboards:
- kubernetes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dashboards:
- kubernetes
dashboards:
- kubernetes

Since you attached the existing Kubernetes dashboard, will this dashboard include the metrics that we receive in Kubernetes OpenTelemetry?

sjyothi54
sjyothi54 previously approved these changes Oct 8, 2024
Copy link
Contributor

@sjyothi54 sjyothi54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@douavue Looks good.

Copy link
Contributor

@RamanaReddy8801 RamanaReddy8801 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @douavue, please add the icon attribute in the file quickstarts/kubernetes-opentelemetry/config.yml and make sure the logo image is related to Kubernetes.

primary:
link:
url: https://docs.newrelic.com/docs/kubernetes-pixie/kubernetes-integration/installation/k8s-otel/#install
icon: logo.png
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
icon: logo.png
icon: logo.svg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@douavue
Copy link
Contributor Author

douavue commented Oct 8, 2024

@sjyothi54 - Updated the logo here.

What's required to for this step? Thanks for all your help.

# Operator used to compare against the threshold.
operator: ABOVE
# Value that triggers a violation
threshold: 90
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description says Alert when container is being throttled > 25% but the threshold value is 90. I think it should be 25?


# NRQL query
nrql:
query: "from Metric max(container_memory_working_set_bytes) / filter(max(kube_pod_container_resource_limits), where resource = 'memory') where k8s.cluster.name in ('YOUR_CLUSTER_NAME') and k8s.namespace.name in ('YOUR_NAMESPACE_NAME') facet k8s.container.name, k8s.pod.name, k8s.namespace.name, k8s.cluster.name"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing select. It should be from Metric select max(container_memory_working_set_bytes)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't get this one to work because I can't find Metrics where they contain container_memory_working_set_bytes and k8s.container.name at the same time. It could be a skill issue tho.


# NRQL query
nrql:
query: "from Metric select latest(k8s.node.cpu.utilization) where k8s.cluster.name in ('YOUR_CLUSTER_NAME') facet k8s.node.name, k8s.cluster.name"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe latest(k8s.node.cpu.utilization) would get you the last CPU utilization for the 5 minute window, which could be a brief spike where the other 4 minutes were relatively low. You might want average(k8s.node.cpu.utilization) based on the description.


# NRQL query
nrql:
query: "from Metric select latest(k8s.node.memory.working_set) / filter(latest(kube_node_status_allocatable), WHERE resource = 'memory') where k8s.cluster.name in ('YOUR_CLUSTER_NAME') facet k8s.node.name, k8s.cluster.name"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as CPU. Do you want average(k8s.node.memory.working_set)?

# Operator used to compare against the threshold.
operator: ABOVE
# Value that triggers a violation
threshold: 90
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query returns a value 0 to 1. We would need to multiply the select value by 100 or divide this by 100.


# NRQL query
nrql:
query: "from Metric select max(k8s.node.filesystem.usage) / max(k8s.node.filesystem.capacity) where k8s.cluster.name in ('YOUR_CLUSTER_NAME') facet k8s.node.name, k8s.cluster.name"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description says average but the query uses max.


# NRQL query
nrql:
query: "from Metric if(latest(condition) = 'Ready', 0, 1) where k8s.cluster.name in ('YOUR_CLUSTER_NAME') facet k8s.node.name, k8s.cluster.name"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing a select in from Metric select if(latest(condition)


# NRQL query
nrql:
query: "FROM Metric select filter(uniqueCount(k8s.pod.name), where phase = 'Running' AND (metricName = 'kube_pod_status_phase' AND kube_pod_status_phase ['latest'] = 1) and created_by_kind != 'Job' ) / filter(latest(kube_node_status_allocatable), WHERE resource = 'cpu' ) * 100 as 'Pod Capacity %' where k8s.node.name != '' and k8s.node.name is not null and k8s.cluster.name in ('YOUR_CLUSTER_NAME') facet k8s.node.name, k8s.cluster.name"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe WHERE resource = 'cpu' should be WHERE resource = 'pods'

# Operator used to compare against the threshold.
operator: ABOVE
# Value that triggers a violation
threshold: 0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description says Alert when more than 5 pods so I think this should be 5.


# NRQL query
nrql:
query: "from Metric latest(kube_statefulset_replicas) - latest(kube_statefulset_status_replicas_ready) where k8s.cluster.name in ('YOUR_CLUSTER_NAME') and k8s.namespace.name in ('YOUR_NAMESPACE_NAME') facet k8s.statefulset.name, k8s.namespace.name, k8s.cluster.name"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing select in from Metric latest(kube_statefulset_replicas)

Copy link
Contributor

@RamanaReddy8801 RamanaReddy8801 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nr-mlosier nr-mlosier merged commit c5ed531 into newrelic:release Oct 17, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants