Add specs for host.id and profiler registration message (#853)

---------

Co-authored-by: Joel Höner <[email protected]>
JonasKunz and athre0z authored Mar 27, 2024
1 parent 36a5abd commit 6bd4119
Showing 2 changed files with 28 additions and 5 deletions.
8 changes: 8 additions & 0 deletions specs/agents/metadata.md
@@ -13,6 +13,7 @@ The process for proposing new metadata fields is detailed
System metadata relates to the host/container in which the service being monitored is running:

- hostname
- host.id
- architecture
- operating system
- container ID
@@ -75,6 +76,13 @@ hostname if `configured_hostname` is not provided.
Agents that are APM-Server-version-aware, or that are compatible only with versions >= 7.4, should
use the new fields wherever applicable.

#### Host.id

APM agents MAY collect the `host.id` as a unique identifier for the host.
If they collect it, it MUST be conformant to the [OpenTelemetry SemConv for `host.id`](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/).

If the APM agent performs correlation of its spans/transactions with universal profiling data, it MUST send the `host.id` (see the [profiling integration spec](universal-profiling-integration.md#profiler-registration-message)) as part of the metadata. The APM agent MAY solely rely on the `host.id` provided by the profiling host agent in that case.
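
The interplay between an agent-collected `host.id` and the one supplied by the profiling host agent (the registration message spec requires a warning on mismatch) could be sketched roughly as follows. This is only an illustration: `agent_metadata`, `apply_profiler_host_id` and `HOST_ID_MAX` are hypothetical names and not part of any agent API or of this spec.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define HOST_ID_MAX 256 /* hypothetical buffer size, not defined by the spec */

/* Hypothetical agent-side metadata state; an empty string means "no host.id known yet". */
typedef struct {
    char host_id[HOST_ID_MAX];
} agent_metadata;

/*
 * Applies a host.id received from the profiling host agent.
 * Returns true if the agent adopted the profiler's value,
 * false if it kept its own value or the input was unusable.
 */
bool apply_profiler_host_id(agent_metadata *meta, const char *profiler_host_id) {
    if (profiler_host_id == NULL || profiler_host_id[0] == '\0') {
        return false;
    }
    if (meta->host_id[0] != '\0') {
        /* The agent already sends its own host.id: keep it, but warn on mismatch. */
        if (strcmp(meta->host_id, profiler_host_id) != 0) {
            fprintf(stderr, "warning: host.id mismatch (agent: %s, profiler: %s)\n",
                    meta->host_id, profiler_host_id);
        }
        return false;
    }
    if (strlen(profiler_host_id) >= HOST_ID_MAX) {
        return false; /* does not fit into the illustrative buffer */
    }
    /* No own host.id collected: start sending the profiler's value. */
    strcpy(meta->host_id, profiler_host_id);
    return true;
}
```

An agent that does not collect `host.id` itself would start with an empty `host_id` and adopt the first value it receives; an agent with its own value never overwrites it.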

#### Container/Kubernetes metadata

On Linux, the container ID and some of the Kubernetes metadata can be extracted by parsing `/proc/self/cgroup`. For each line in the file, we split the line according to the format "hierarchy-ID:controller-list:cgroup-path", extracting the "cgroup-path" part. We then attempt to extract information according to the following algorithm:
25 changes: 20 additions & 5 deletions specs/agents/universal-profiling-integration.md
@@ -83,6 +83,8 @@ transaction-id | uint8[8]
* *span-id*: The W3C span id of the currently active span
* *transaction-id*: The W3C span id of the currently active transaction (=the local root span)

APM-agents MAY start populating the thread-local storage only after receiving a host agent [registration message](#profiler-registration-message).

### Concurrency-safe Updates

The profiler might interrupt a thread and take a profiling sample while that thread is in the process of updating the contents of the shared thread local storage. Fortunately, we have the following guarantees about this interruption:
@@ -136,10 +138,6 @@ And here is how to read messages in a non-blocking way:
```c
size_t readProfilerSocketMessages(uint8_t* outputBuffer, size_t bufferSize) {
    if(profilerSocket == -1) {
        return raiseExceptionAndReturn(jniEnv, -1, "No profiler socket active!");
    }
    int n = recv(profilerSocket, outputBuffer, bufferSize, 0);
    if (n == -1) {
        if(errno == EAGAIN || errno == EWOULDBLOCK) {
@@ -173,6 +171,23 @@ All messages have the following layout:
* *message-type* : An ID uniquely identifying the type (and therefore payload structure) of the message.
* *minor-version* : The version number for the given *message-type*. This value is incremented when new fields are added to the payload while preserving the *message-type* (non-breaking changes). For breaking changes, a new *message-type* must be used.

## Profiler Registration Message

Whenever the profiling host agent starts communicating for the first time with a process running an APM Agent, it MUST send this message.
This message lets the APM-agent know that a profiler is actually active on the current host. Note that an APM-agent may receive this message zero, one, or several times: zero times if no host agent is active, once if a host agent is active, and several times if a host agent is restarted during the lifetime of the APM-agent.

The *message-type* is `2` and the current *minor-version* is `1`.

The payload layout is as follows:
Name | Data type
--------------------- | -------------
samples-delay-ms | uint32
host-id | utf8-str

* *samples-delay-ms*: A sane upper bound, in milliseconds, of the time the profiling host agent usually takes between collecting a stacktrace and writing it to the APM-agent via the [messaging socket](#cpu-profiler-trace-correlation-message). The APM-agent will assume that all profiling data related to a span has been written to the socket if the span ended at least the provided duration ago. Note that this value doesn't need to be a hard guarantee, but it should cover the 99% case so that profiling data isn't distorted in the expected case.
* *host-id*: The [`host.id` resource attribute](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/) used for the profiling data by this profiling host agent. If an APM-agent is already sending a `host.id`, it MUST print a warning if the `host.id` is different and otherwise ignore the value received from the host agent. A mismatch will lead to certain correlation features (e.g. cost and CO2 consumption) not working. If an agent does not collect the `host.id` by itself, it MUST start sending the `host.id` after receiving it from the profiler host agent to ensure the aforementioned correlation features work correctly.
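
As an illustration of the payload above, the following is a minimal parsing sketch. It makes assumptions that this section does not state: that `utf8-str` is encoded as a `uint32` byte length followed by the raw bytes, that integers use the native byte order of the host, and that `payload` points just past the common message header. The names `profiler_registration`, `parse_profiler_registration`, `span_samples_complete` and `HOST_ID_CAPACITY` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HOST_ID_CAPACITY 256 /* hypothetical limit, not defined by the spec */

typedef struct {
    uint32_t samples_delay_ms;
    char host_id[HOST_ID_CAPACITY]; /* NUL-terminated copy of the host-id */
} profiler_registration;

/*
 * Parses the registration payload (the bytes following the message header).
 * ASSUMPTIONS: native byte order; utf8-str = uint32 byte length + bytes.
 * Trailing bytes are tolerated, since newer minor versions may append fields.
 */
bool parse_profiler_registration(const uint8_t *payload, size_t len,
                                 profiler_registration *out) {
    uint32_t delay_ms;
    uint32_t str_len;
    const size_t fixed = 2 * sizeof(uint32_t);

    if (len < fixed) {
        return false; /* too short for the delay and the string length */
    }
    memcpy(&delay_ms, payload, sizeof(uint32_t));
    memcpy(&str_len, payload + sizeof(uint32_t), sizeof(uint32_t));
    if (str_len > len - fixed || str_len >= HOST_ID_CAPACITY) {
        return false; /* string exceeds the payload or the local buffer */
    }
    out->samples_delay_ms = delay_ms;
    memcpy(out->host_id, payload + fixed, str_len);
    out->host_id[str_len] = '\0';
    return true;
}

/*
 * Returns true once the profiling data for a span that ended at span_end_ms
 * can be assumed to have been fully written to the socket.
 */
bool span_samples_complete(uint64_t span_end_ms, uint64_t now_ms,
                           uint32_t samples_delay_ms) {
    return now_ms >= span_end_ms && now_ms - span_end_ms >= samples_delay_ms;
}
```

Copying fields with `memcpy` instead of casting the buffer avoids unaligned reads, which matters because the payload starts at an arbitrary offset inside the receive buffer.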


## CPU Profiler Trace Correlation Message

Whenever the profiler is able to correlate a CPU stacktrace sample with an APM trace (see [this section](#thread-local-storage-layout)), it sends the ID of the stacktrace back to the APM agent.
@@ -188,6 +203,6 @@ stack-trace-id | uint8[16]
count | uint16

* *trace-id*: The APM W3C trace id of the trace which was active for the given profiling samples
* *transaction-id*: The APM W3C span id of the transaction which was active for the given profiling samples
* *stack-trace-id*: The unique ID assigned by the profiler to the captured stacktrace. This ID is stored in Elasticsearch in base64 URL-safe encoding by the universal profiling solution.
* *count*: The number of samples observed since the last report for the (*trace-id*, *transaction-id*, *stack-trace-id*) combination.
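
A decoding sketch for this payload might look as follows. Only the `stack-trace-id` and `count` rows of the table are visible in this hunk, so the sizes of *trace-id* (16 bytes) and *transaction-id* (8 bytes) are assumptions based on the W3C trace id and span id lengths, as is the native byte order of `count`. The names `trace_correlation` and `parse_trace_correlation` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decoded form of the correlation payload; field sizes partly assumed (see above). */
typedef struct {
    uint8_t trace_id[16];       /* assumed: W3C trace id length */
    uint8_t transaction_id[8];  /* assumed: W3C span id length */
    uint8_t stack_trace_id[16];
    uint16_t count;
} trace_correlation;

/* Size of the tightly packed wire payload under the assumptions above. */
enum { TRACE_CORRELATION_PAYLOAD_SIZE = 16 + 8 + 16 + 2 };

/*
 * Decodes the payload (the bytes following the message header).
 * Fields are copied one by one, so struct padding does not matter.
 */
bool parse_trace_correlation(const uint8_t *payload, size_t len,
                             trace_correlation *out) {
    if (len < TRACE_CORRELATION_PAYLOAD_SIZE) {
        return false;
    }
    memcpy(out->trace_id, payload, 16);
    memcpy(out->transaction_id, payload + 16, 8);
    memcpy(out->stack_trace_id, payload + 24, 16);
    memcpy(&out->count, payload + 40, sizeof(uint16_t));
    return true;
}
```

Because `count` covers the samples since the last report, a consumer would typically add it to a running total keyed by the (*trace-id*, *transaction-id*, *stack-trace-id*) combination rather than overwrite a previous value.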
