Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hardware-metrics to semantic conventions #2518

Merged
merged 7 commits into from
Aug 9, 2022
Merged

Add hardware-metrics to semantic conventions #2518

merged 7 commits into from
Aug 9, 2022

Conversation

bertysentry
Copy link
Contributor

@bertysentry bertysentry commented May 2, 2022

Changes

Added hardware metrics to semantic conventions

Related to issue #2340

@bertysentry bertysentry requested review from a team May 2, 2022 17:35
Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bertysentry Thank you for the PR.

The linked issue has not been discussed yet. I think we likely want to have hardware metrics in the spec, but I would want to hear from other spec approvers if they agree that it is a good idea to have semantic conventions for the hardware metrics.

If the community feels these do belong to the spec then we will need a detailed review which may take some time due to the PR's size. Unless there is a reason that all of these must be merged together it may be easier to make progress if this is presented as a few separate PRs, one per hardware area (once there is a general agreement that we do want hardware metrics in the spec).

A few generic questions/comments:

  • Please rebase from main and fix the build.
  • Do we want hw. as a prefix and not system.? The linked issue talks about adding e.g. system.sensor.temp. Also, do we want abbreviated hw. or hardware.? Any arguments in favour of against any of these approaches?
  • How does this proposal compare with any existing industry standards for hardware metrics reporting (if any exist)?

Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major suggestions:

  1. Use standard units.
  2. Consider/Explore putting metadata as Resource instead of Attributes (most of the data won't change - unless the hardware changed).
  3. Reference to other standards (e.g. S.M.A.R.T.).

I've approved the PR because all of the above suggestions can be addressed in follow up PRs. The overall direction looks good to me.

@reyang reyang added the area:semantic-conventions Related to semantic conventions label May 2, 2022
@bertysentry
Copy link
Contributor Author

Thank you for the time you spent on this review and your comments, @tigrannajaryan

Let me answer your questions:

Do we want hw. as a prefix and not system.? The linked issue talks about adding e.g. system.sensor.temp.

Having something separate from system. is important because system represents the operating system and its performance metrics, which probably runs in a virtual machine while hw represents a physical host only.

Example: at the hardware level, you don't consider CPU time utilization (as it's purely an OS metric). So system.cpu.* and hw.cpu. are 2 very different beasts. system.cpu represents a CPU core, while hw.cpu represents the "real" CPU package (which includes several cores).

If we mix system and hw, we will be mixing things that are different in nature and the metrics will be difficult to leverage in dashboards.

Other examples are disks (the physical disk behind a RAID controller are different from the disks seen by the operating system) and network interfaces (the OS has loopback and plenty of virtual adapters for aggregation, VPNs, etc.).

Do we want abbreviated hw. or hardware.? Any arguments in favour of against any of these approaches?

Currently, we have system (full word) and db (abbreviation), so both are acceptable. I tend to prefer hw because it's shorter while still perfectly understandable. Since it's shorter, it consumes less energy and thus emits less carbon 😊

How does this proposal compare with any existing industry standards for hardware metrics reporting (if any exist)?

My company has been developing and maintaining a hardware monitoring software for 18 years, that covers all models from all manufacturers (Brocade, Cisco, Dell-EMC, HP, Huawei, HP, NetApp, Oracle, Pure, Synology, etc. I can't list them all). We've dealt with all sorts of hardware instrumentation standards, from specific SNMP MIBs to DMTF's WBEM, to IPMI, to SMI-S, to Redfish and SMASH, to proprietary REST APIs, you name it. See https://www.sentrysoftware.com/docs/hardware-connectors/23/platform-requirements.html for a full list of we got covered.

We've rewritten our proprietary hardware monitoring software as an OpenTelemetry Collector distribution (available for free, not yet open source because this takes some time). We've taken some time to redefine the metrics that can be collected following OpenTelemetry semantic convention, and that's why we're proposing this.

With our experience, we are pretty confident that the proposed metrics cover 95% of the cases and are good enough to represent any system available out there. You can trust us on this 😎 (although we're open to discuss, obviously, and that's what this PR is for).

Note: The current version of our product doesn't strictly follow the proposed convention. We want to agree on something before we spend time updating our code and all dashboards that rely on these metrics.

@bertysentry
Copy link
Contributor Author

@reyang Thank you for your time and the review!

Let me reply to a few comments:

Use standard units

Done!

Consider/Explore putting metadata as Resource instead of Attributes

You are right and I would happily add that to the convention, but I don't think we can have a Resource belonging to another Resource. In the current state of the semantic convention, all metrics are supposed to be attached to a host Resource (with the host.name attribute, notably). If we wanted to attach the disk metrics to a disk Resource, we would need to duplicate all the attributes of the host Resource into the disk Resource. Is this something that you recommend? In which case, I'll update the doc accordingly (and our code as well).

Reference to other standards (e.g. S.M.A.R.T.).

Done for S.M.A.R.T.

@arminru arminru added the spec:metrics Related to the specification/metrics directory label May 3, 2022
@bertysentry
Copy link
Contributor Author

@tigrannajaryan Could you please re-execute the workflow? All checks should pass now. Thank you!

@bertysentry
Copy link
Contributor Author

Thank you @tigrannajaryan for running the checks.

Unfortunately, markdown-link-check fails with this error:

FILE: ./specification/metrics/semantic_conventions/hardware-metrics.md
  [✖] https://schemas.dmtf.org/wbem/cim-html/2.49.0/CIM_Battery.html

  ERROR: 1 dead links found!

  23 links checked.
  [✖] https://schemas.dmtf.org/wbem/cim-html/2.49.0/CIM_Battery.html → Status: 0
make: *** [Makefile:31: markdown-link-check] Error 1
Error: Process completed with exit code 2.

But this check runs just fine in my environment and the offending link is valid! So I'm not sure what to do here. Re-run the checks, hoping for better luck? 🤷‍♂️

@tigrannajaryan
Copy link
Member

But this check runs just fine in my environment and the offending link is valid! So I'm not sure what to do here. Re-run the checks, hoping for better luck? 🤷‍♂️

Network error maybe. I will re-run.

@bertysentry
Copy link
Contributor Author

@tigrannajaryan @pirgeo Thank you for your comments. I pushed an update to remove the non-standard unit (and moved it to the description). Please let me know if you are okay with this change.

@bertysentry
Copy link
Contributor Author

@tigrannajaryan @pirgeo Please let me know if you're okay with the latest changes so I can resolve the corresponding conversations. Also, who else should I ask for a review? Thank you! 😊

@tigrannajaryan
Copy link
Member

I think the main complication we have with this PR is the enum metrics which are modelled as Gauge. We have a few options to move forward:

  1. We can decide that representing enums as Gauges is good enough for now. Add some "TODO" comments for the metrics to call attention that these are very likely to be changed. The entire document is marked "Experimental" anyway but drawing attention to the parts most likely to be broken is still useful.
  2. We can delete the enum metrics from this PR and merge the rest. Wait until Does OpenTelemetry need "enum" metric type? #1711 is resolved and then add the enum metrics in a separate PR. This is a good option if we believe the non-enum metrics are still useful enough to warrant being in the spec even if the enum ones are not.
  3. We can wait until Does OpenTelemetry need "enum" metric type? #1711 is resolved and block this PR until then. This is probably the least desirable option since it will block the PR for ages.

I am personally leaning towards option 3.

@bertysentry
Copy link
Contributor Author

bertysentry commented May 11, 2022

@tigrannajaryan Thank you for your comment (and again, sorry I haven't seen it earlier, I'm not getting any notification, it's weird).

I vote for option 1. As you said, the document is still in experimental state, so breaking changes are expected (it regularly happens with the Otel SDKs). We can add a comment for all Enum-like metrics; "HIGHLY SUBJECT TO CHANGE".

I personally commit to updating this semantic convention once the Enum metric #1711 issue is settled.

@tigrannajaryan
Copy link
Member

@open-telemetry/specs-metrics-approvers we need you opinion on this PR, especially on the question about enums.

@pirgeo
Copy link
Member

pirgeo commented May 13, 2022

I am not sure if there is a way of introducing an Enum data type without changing the data type here and breaking the semantic conventions (see my comment on #1711 (comment)). Given that, I don't think we can mark these semantic conventions as stable before the Enum data type is out. Therefore, I am also leaning towards 3, since these semantic conventions will have to stay experimental while we wait until we can break them.

@bertysentry
Copy link
Contributor Author

@pirgeo @tigrannajaryan I updated the hardware-metrics semantic convention to remove the offending Enum metrics, and replaced with proper Attributes.

Let me know of any remaining issue! Thank you again for the discussion, comments and suggestions!

Copy link
Member

@pirgeo pirgeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this. I see your points for Enums, and agree that they would be a better fit here. Until they are available, however, I believe this is a valid solution that works with the current state of the specification and is in line with the other semantic conventions. Thank you!

@github-actions github-actions bot removed the Stale label Aug 1, 2022
@bertysentry bertysentry requested a review from a team August 1, 2022 10:09
@bertysentry
Copy link
Contributor Author

@jack-berg PR updated with your suggestion. Please approve, thank you!

@bertysentry
Copy link
Contributor Author

@jack-berg Thank you for the approval! As a member of the open-telemetry/instr-wg group, do you have a way to approve as this group?

image

@jack-berg
Copy link
Member

Unfortunately I'm not a memeber!

@bertysentry
Copy link
Contributor Author

@bertysentry
Copy link
Contributor Author

@tigrannajaryan How can we fix this? @jack-berg approved the PR, but as he's not really a member of the instr-wg group, his approval doesn't unlock the merge of this PR. Thanks!

CHANGELOG.md Outdated Show resolved Hide resolved
@bertysentry
Copy link
Contributor Author

@tigrannajaryan PR ready to be merged, thank you all for your help, reviews and suggestions! 😊

@tigrannajaryan
Copy link
Member

@open-telemetry/specs-approvers This PR formally has the required number of approvals. Since the PR is non-trivial let's going to keep it open for 2 more days to give time to object and if we don't hear anything let's merge.

@bogdandrutu since you are now back please comment if you have any additional thoughts.

@bertysentry
Copy link
Contributor Author

2 days later... 😁

Copy link
Member

@arminru arminru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some non-blocking comments asking for clarification below.
Sorry for being late to the party - feel free to address them in a follow-up PR instead.

@bertysentry
Copy link
Contributor Author

@arminru All suggestions have been taken into account, and branch rebased on main. Please review the few items open for discussion, thank you! 😊

@arminru arminru merged commit 943243e into open-telemetry:main Aug 9, 2022
@arminru arminru mentioned this pull request Aug 9, 2022
@bertysentry bertysentry deleted the add-hardware-metrics-semantic branch August 9, 2022 17:06
@bertysentry
Copy link
Contributor Author

😭 <-- tears of joy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants