
snmp_packet stats are all zeros #950

Closed
candlerb opened this issue Aug 16, 2023 · 5 comments · Fixed by #971
@candlerb (Contributor)

Host operating system: output of uname -a

Linux prometheus 5.4.0-153-generic #170-Ubuntu SMP Fri Jun 16 13:43:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

snmp_exporter version: output of snmp_exporter -version

snmp_exporter, version 0.23.0 (branch: HEAD, revision: 20657d2f486a1ac237238827beed97fa67eb8611)
  build user:       root@f2310b8dd92c
  build date:       20230816-09:56:45
  go version:       go1.20.7
  platform:         linux/amd64
  tags:             netgo

What device/snmpwalk OID are you using?

If this is a new device, please link to the MIB(s).

What did you do that produced an error?

A simple scrape of the /snmp endpoint, repeated multiple times.

What did you expect to see?

The snmp_packet_duration_seconds histogram and the snmp_packets_total counter to increment across scrapes.

(Note: these metrics appear to be new; I tried 0.21 and they don't exist there at all.)

What did you see instead?

All metrics are zero, even after repeated scrapes:

# HELP snmp_packet_duration_seconds A histogram of latencies for SNMP packets.
# TYPE snmp_packet_duration_seconds histogram
snmp_packet_duration_seconds_bucket{le="0.0001"} 0
snmp_packet_duration_seconds_bucket{le="0.0002"} 0
snmp_packet_duration_seconds_bucket{le="0.0004"} 0
snmp_packet_duration_seconds_bucket{le="0.0008"} 0
snmp_packet_duration_seconds_bucket{le="0.0016"} 0
snmp_packet_duration_seconds_bucket{le="0.0032"} 0
snmp_packet_duration_seconds_bucket{le="0.0064"} 0
snmp_packet_duration_seconds_bucket{le="0.0128"} 0
snmp_packet_duration_seconds_bucket{le="0.0256"} 0
snmp_packet_duration_seconds_bucket{le="0.0512"} 0
snmp_packet_duration_seconds_bucket{le="0.1024"} 0
snmp_packet_duration_seconds_bucket{le="0.2048"} 0
snmp_packet_duration_seconds_bucket{le="0.4096"} 0
snmp_packet_duration_seconds_bucket{le="0.8192"} 0
snmp_packet_duration_seconds_bucket{le="1.6384"} 0
snmp_packet_duration_seconds_bucket{le="+Inf"} 0
snmp_packet_duration_seconds_sum 0
snmp_packet_duration_seconds_count 0
# HELP snmp_packet_retries_total Number of SNMP packet retries.
# TYPE snmp_packet_retries_total counter
snmp_packet_retries_total 0
# HELP snmp_packets_total Number of SNMP packet sent, including retries.
# TYPE snmp_packets_total counter
snmp_packets_total 0
@candlerb

I think the problem is that for every request, a brand new Registry is created and a brand new Collector put in it:
https://github.com/prometheus/snmp_exporter/blob/v0.23.0/main.go#L126-L128

I'm not entirely sure why it's done that way, but I'm guessing it's because collector.Collect() can't take parameters.

But this means that c.metrics needs to be shared between collector.Collector instances; at its simplest, it could be made global.
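
To illustrate the pattern I mean (a rough sketch, not the exporter's actual code): because each /snmp request gets its own registry and its own collector, any metrics owned by them start from scratch on every scrape and can never accumulate.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// perRequestHandler mimics the shape of the /snmp handler: a fresh Registry
// (and, in the real exporter, a fresh Collector) is built for every request,
// so any metrics owned by it are discarded as soon as the request ends.
func perRequestHandler(w http.ResponseWriter, r *http.Request) {
	registry := prometheus.NewRegistry()

	// Hypothetical per-request counter: created, incremented, scraped, then
	// thrown away together with the registry.
	packets := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "snmp_packets_total",
		Help: "Number of SNMP packets sent, including retries.",
	})
	registry.MustRegister(packets)
	packets.Inc() // only ever reflects this one request

	promhttp.HandlerFor(registry, promhttp.HandlerOpts{}).ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/snmp", perRequestHandler)
	http.ListenAndServe(":9116", nil)
}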

@candlerb commented Aug 17, 2023

Also, I notice that main.go declares some more metrics: snmpDuration and snmpRequestErrors.

AFAICS, these are intended to create the metrics snmp_collection_duration_seconds and snmp_request_errors_total, but those don't appear in the output either. I think this is because they are registered in the global registry, not in the per-request registry.
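
To illustrate the mismatch (a rough sketch, assuming the usual client_golang behaviour rather than the exporter's exact code): promauto registers with prometheus.DefaultRegisterer, which is what promhttp.Handler() gathers on /metrics, while the /snmp handler gathers only from its own freshly created registry.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// promauto.NewCounter registers with prometheus.DefaultRegisterer, so this
// counter only shows up where the default registry is gathered (/metrics),
// not in the per-request registry used by /snmp.
var snmpRequestErrors = promauto.NewCounter(prometheus.CounterOpts{
	Name: "snmp_request_errors_total",
	Help: "Errors in requests to the SNMP exporter",
})

func main() {
	// Gathers the default registry: snmp_request_errors_total appears here.
	http.Handle("/metrics", promhttp.Handler())

	// Gathers only its own registry: the counter never appears here unless
	// it is registered explicitly.
	http.HandleFunc("/snmp", func(w http.ResponseWriter, r *http.Request) {
		registry := prometheus.NewRegistry()
		promhttp.HandlerFor(registry, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	})

	http.ListenAndServe(":9116", nil)
}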

@candlerb commented Aug 17, 2023

The fix to the latter problem is pretty straightforward:

diff --git a/main.go b/main.go
index c28316f..69a1ce4 100644
--- a/main.go
+++ b/main.go
@@ -27,7 +27,6 @@ import (
        "github.com/go-kit/log"
        "github.com/go-kit/log/level"
        "github.com/prometheus/client_golang/prometheus"
-       "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "github.com/prometheus/common/promlog"
        "github.com/prometheus/common/promlog/flag"
@@ -50,14 +49,14 @@ var (
        toolkitFlags = webflag.AddFlags(kingpin.CommandLine, ":9116")

        // Metrics about the SNMP exporter itself.
-       snmpDuration = promauto.NewSummaryVec(
+       snmpDuration = prometheus.NewSummaryVec(
                prometheus.SummaryOpts{
                        Name: "snmp_collection_duration_seconds",
                        Help: "Duration of collections by the SNMP exporter",
                },
                []string{"auth", "module"},
        )
-       snmpRequestErrors = promauto.NewCounter(
+       snmpRequestErrors = prometheus.NewCounter(
                prometheus.CounterOpts{
                        Name: "snmp_request_errors_total",
                        Help: "Errors in requests to the SNMP exporter",
@@ -124,6 +123,8 @@ func handler(w http.ResponseWriter, r *http.Request, logger log.Logger) {

        start := time.Now()
        registry := prometheus.NewRegistry()
+       registry.MustRegister(snmpDuration)
+       registry.MustRegister(snmpRequestErrors)
        c := collector.New(r.Context(), target, auth, module, logger, registry)
        registry.MustRegister(c)
        // Delegate http serving to Prometheus client library, which will call collector.Collect.

However, it generates an explosion of N×M metrics, one for every possible combination of auth and module. In my case that's only 5 auths and 22 modules, i.e. 110 combinations, each producing a sum and a count (220 series), but I don't think that will scale.

I think the snmp_collection_duration_seconds metric overlaps with the new per-module stats that #945 is introducing, so it could probably be dropped, although it does measure the total time of both scraping and emitting metrics.

The snmp_request_errors_total metric is a single value and refers to errors in the incoming HTTP request (a missing or unknown target, auth, or module); it's reasonable to keep.

candlerb added commits to candlerb/snmp_exporter that referenced this issue Aug 17–18, 2023
SuperQ added a commit that referenced this issue Aug 28, 2023
Pass the default metrics registry into the collector rather than the
ephemeral registry. This makes sure the internal metrics are preserved
between scrapes.

Fixes: #950

Signed-off-by: SuperQ <[email protected]>
SuperQ added a commit that referenced this issue Aug 28, 2023
Move exporter metric definitions and registration out of the collector
package and into the main package. This fixes metrics always being empty
due to the ephemeral collector registry.

Fixes: #950

Signed-off-by: SuperQ <[email protected]>
@candlerb

These were fixed in the 0.24.0 release, but I think there are still some rough edges.

Breaking changes

This wasn't called out in the release notes, but some metrics have (sensibly) moved from /snmp to the /metrics endpoint.

This includes the snmp_packet_duration_seconds_* histogram (although it was all zeros anyway), snmp_packet_retries_total, snmp_packets_total, snmp_request_errors_total, and snmp_unexpected_pdu_type_total.

Cardinality of snmp_collection_duration_xxx

The /metrics endpoint now reveals the previously hidden snmp_collection_duration histograms. These suffer from N×M cardinality, because every combination of (auth, module) is listed, even combinations that are never used.

root@prometheus:~# curl -fsS 'localhost:9116/metrics' | grep '^snmp_packet_duration' | wc -l
18
root@prometheus:~# curl -fsS 'localhost:9116/metrics' | grep '^snmp_collection_duration' | wc -l
1540

Can I suggest this be changed to grouping on just (module)? If you happen to use different auth credentials for different devices, it doesn't necessarily have a meaningful mapping to different classes of device.

(Aside: I wonder if passing an explicit "class" scrape parameter, which defaults to the module name, might be a useful addition. You could then collect statistics specifically for a given target or group of targets. But the same effect could be achieved by simply duplicating a particular module under a different name)
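
For concreteness, here's a rough sketch of the module-only grouping suggested above (my suggestion, not the shipped definition), based on the SummaryVec from the earlier diff:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Keying only on module keeps the series count proportional to the number of
// modules, rather than auths x modules.
var snmpDuration = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name: "snmp_collection_duration_seconds",
		Help: "Duration of collections by the SNMP exporter",
	},
	[]string{"module"},
)

func main() {
	start := time.Now()
	// ... run the collection for a given module ...
	snmpDuration.WithLabelValues("if_mib").Observe(time.Since(start).Seconds())
}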

Error returns

If there is a collection error, it appears that no message is returned over HTTP, nor is it logged (i.e. journalctl -eu snmp_exporter shows no error).

I thought it did before, although I could be mistaken.

Here are two examples:

Invalid module name

# curl -vfsS 'localhost:9116/snmp?target=gw1&module=JUNK&auth=workshop_v3'
*   Trying 127.0.0.1:9116...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9116 (#0)
> GET /snmp?target=gw1&module=JUNK&auth=workshop_v3 HTTP/1.1
> Host: localhost:9116
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
* The requested URL returned error: 400 Bad Request
* Closing connection 0
curl: (22) The requested URL returned error: 400 Bad Request

We get a 400 status code, but no message explaining the error.

Modules with overlapping metrics

Here, modules if_mib and mikrotik both collect the ifTable metrics.

# curl -vfsS 'localhost:9116/snmp?target=gw1&module=if_mib,mikrotik&auth=workshop_v3'
*   Trying 127.0.0.1:9116...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9116 (#0)
> GET /snmp?target=gw1&module=if_mib,mikrotik&auth=workshop_v3 HTTP/1.1
> Host: localhost:9116
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
* The requested URL returned error: 500 Internal Server Error
* Closing connection 0
curl: (22) The requested URL returned error: 500 Internal Server Error

Again, we get a 500 status code, but no error message.
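
For comparison, here's a rough sketch of the behaviour I'd expect (my assumption, not the exporter's actual handler): write the reason into the response body alongside the status code and log it, so both curl and journalctl show something useful.

package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
)

// handleError writes the failure reason into the HTTP response body and logs
// it as well, alongside the status code.
func handleError(w http.ResponseWriter, logger log.Logger, status int, msg string) {
	level.Error(logger).Log("msg", "scrape failed", "reason", msg, "status", status)
	http.Error(w, msg, status)
}

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	http.HandleFunc("/snmp", func(w http.ResponseWriter, r *http.Request) {
		module := r.URL.Query().Get("module")
		if module == "JUNK" { // stand-in for "module not found in the config"
			handleError(w, logger, http.StatusBadRequest, fmt.Sprintf("unknown module %q", module))
			return
		}
		// ... normal scrape path ...
	})
	http.ListenAndServe(":9116", nil)
}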

@SuperQ (Member) commented Aug 29, 2023

@candlerb Let's open separate new issues for those items.

stephan-windischmann-sky pushed a commit to stephan-windischmann-sky/snmp_exporter that referenced this issue Oct 27, 2023