
snmp_packet stats are all zeros #950

Closed
candlerb opened this issue Aug 16, 2023 · 5 comments · Fixed by #971
@candlerb (Contributor)

Host operating system: output of uname -a

Linux prometheus 5.4.0-153-generic #170-Ubuntu SMP Fri Jun 16 13:43:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

snmp_exporter version: output of snmp_exporter -version

snmp_exporter, version 0.23.0 (branch: HEAD, revision: 20657d2f486a1ac237238827beed97fa67eb8611)
  build user:       root@f2310b8dd92c
  build date:       20230816-09:56:45
  go version:       go1.20.7
  platform:         linux/amd64
  tags:             netgo

What device/snmpwalk OID are you using?

If this is a new device, please link to the MIB(s).

What did you do that produced an error?

A simple scrape of the /snmp endpoint, repeated multiple times.

What did you expect to see?

The snmp_packet_duration_seconds histogram and the snmp_packets_total counter to increment across scrapes.

(Note: these metrics appear to be new; I tried 0.21 and they don't exist there at all.)

What did you see instead?

All metrics are zero, even after repeated scrapes:

# HELP snmp_packet_duration_seconds A histogram of latencies for SNMP packets.
# TYPE snmp_packet_duration_seconds histogram
snmp_packet_duration_seconds_bucket{le="0.0001"} 0
snmp_packet_duration_seconds_bucket{le="0.0002"} 0
snmp_packet_duration_seconds_bucket{le="0.0004"} 0
snmp_packet_duration_seconds_bucket{le="0.0008"} 0
snmp_packet_duration_seconds_bucket{le="0.0016"} 0
snmp_packet_duration_seconds_bucket{le="0.0032"} 0
snmp_packet_duration_seconds_bucket{le="0.0064"} 0
snmp_packet_duration_seconds_bucket{le="0.0128"} 0
snmp_packet_duration_seconds_bucket{le="0.0256"} 0
snmp_packet_duration_seconds_bucket{le="0.0512"} 0
snmp_packet_duration_seconds_bucket{le="0.1024"} 0
snmp_packet_duration_seconds_bucket{le="0.2048"} 0
snmp_packet_duration_seconds_bucket{le="0.4096"} 0
snmp_packet_duration_seconds_bucket{le="0.8192"} 0
snmp_packet_duration_seconds_bucket{le="1.6384"} 0
snmp_packet_duration_seconds_bucket{le="+Inf"} 0
snmp_packet_duration_seconds_sum 0
snmp_packet_duration_seconds_count 0
# HELP snmp_packet_retries_total Number of SNMP packet retries.
# TYPE snmp_packet_retries_total counter
snmp_packet_retries_total 0
# HELP snmp_packets_total Number of SNMP packet sent, including retries.
# TYPE snmp_packets_total counter
snmp_packets_total 0
@candlerb

I think the problem is that for every request, a brand new Registry is created and a brand new Collector put in it:
https://github.com/prometheus/snmp_exporter/blob/v0.23.0/main.go#L126-L128

I'm not entirely sure why it's done that way, but I'm guessing it's because collector.Collect() can't take parameters.

But this means that c.metrics needs to be shared between collector.Collector instances; at its simplest, it could be made global.
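
To illustrate the pattern I mean (a rough sketch, not the exporter's actual code): because each /snmp request gets its own registry and its own collector, any metrics owned by them start from scratch on every scrape and can never accumulate.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// perRequestHandler mimics the shape of the /snmp handler: a fresh Registry
// (and, in the real exporter, a fresh Collector) is built for every request,
// so any metrics owned by it are discarded as soon as the request ends.
func perRequestHandler(w http.ResponseWriter, r *http.Request) {
	registry := prometheus.NewRegistry()

	// Hypothetical per-request counter: created, incremented, scraped, then
	// thrown away together with the registry.
	packets := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "snmp_packets_total",
		Help: "Number of SNMP packets sent, including retries.",
	})
	registry.MustRegister(packets)
	packets.Inc() // only ever reflects this one request

	promhttp.HandlerFor(registry, promhttp.HandlerOpts{}).ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/snmp", perRequestHandler)
	http.ListenAndServe(":9116", nil)
}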

@candlerb commented Aug 17, 2023

Also, I notice that main.go declares some more metrics: snmpDuration and snmpRequestErrors.

AFAICS, these are intended to create the metrics snmp_collection_duration_seconds and snmp_request_errors_total, but those don't appear in the output either. I think this is because they are registered in the global registry, not in the per-request registry.
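
To illustrate the mismatch (a rough sketch, assuming the usual client_golang behaviour rather than the exporter's exact code): promauto registers with prometheus.DefaultRegisterer, which is what promhttp.Handler() gathers on /metrics, while the /snmp handler gathers only from its own freshly created registry.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// promauto.NewCounter registers with prometheus.DefaultRegisterer, so this
// counter only shows up where the default registry is gathered (/metrics),
// not in the per-request registry used by /snmp.
var snmpRequestErrors = promauto.NewCounter(prometheus.CounterOpts{
	Name: "snmp_request_errors_total",
	Help: "Errors in requests to the SNMP exporter",
})

func main() {
	// Gathers the default registry: snmp_request_errors_total appears here.
	http.Handle("/metrics", promhttp.Handler())

	// Gathers only its own registry: the counter never appears here unless
	// it is registered explicitly.
	http.HandleFunc("/snmp", func(w http.ResponseWriter, r *http.Request) {
		registry := prometheus.NewRegistry()
		promhttp.HandlerFor(registry, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	})

	http.ListenAndServe(":9116", nil)
}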

@candlerb commented Aug 17, 2023

The fix to the latter problem is pretty straightforward:

diff --git a/main.go b/main.go
index c28316f..69a1ce4 100644
--- a/main.go
+++ b/main.go
@@ -27,7 +27,6 @@ import (
        "github.com/go-kit/log"
        "github.com/go-kit/log/level"
        "github.com/prometheus/client_golang/prometheus"
-       "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "github.com/prometheus/common/promlog"
        "github.com/prometheus/common/promlog/flag"
@@ -50,14 +49,14 @@ var (
        toolkitFlags = webflag.AddFlags(kingpin.CommandLine, ":9116")

        // Metrics about the SNMP exporter itself.
-       snmpDuration = promauto.NewSummaryVec(
+       snmpDuration = prometheus.NewSummaryVec(
                prometheus.SummaryOpts{
                        Name: "snmp_collection_duration_seconds",
                        Help: "Duration of collections by the SNMP exporter",
                },
                []string{"auth", "module"},
        )
-       snmpRequestErrors = promauto.NewCounter(
+       snmpRequestErrors = prometheus.NewCounter(
                prometheus.CounterOpts{
                        Name: "snmp_request_errors_total",
                        Help: "Errors in requests to the SNMP exporter",
@@ -124,6 +123,8 @@ func handler(w http.ResponseWriter, r *http.Request, logger log.Logger) {

        start := time.Now()
        registry := prometheus.NewRegistry()
+       registry.MustRegister(snmpDuration)
+       registry.MustRegister(snmpRequestErrors)
        c := collector.New(r.Context(), target, auth, module, logger, registry)
        registry.MustRegister(c)
        // Delegate http serving to Prometheus client library, which will call collector.Collect.

However, it generates an explosion of N×M metrics, one for every possible combination of auth and module. In my case that's only 5 auths and 22 modules, i.e. 110 combinations, each producing a sum and a count (220 series), but I don't think that will scale.

I think the snmp_collection_duration_seconds metric overlaps with the new per-module stats that #945 is introducing, so it could probably be dropped, although it does measure the total time of both scraping and emitting metrics.

The snmp_request_errors_total metric is a single value and refers to errors in the incoming HTTP request (a missing or unknown target, auth, or module); it's reasonable to keep.

candlerb added commits to candlerb/snmp_exporter that referenced this issue Aug 17–18, 2023
SuperQ added a commit that referenced this issue Aug 28, 2023
Pass the default metrics registry into the collector rather than the
ephemeral registry. This makes sure the internal metrics are preserved
between scrapes.

Fixes: #950

Signed-off-by: SuperQ <[email protected]>
SuperQ added a commit that referenced this issue Aug 28, 2023
Move exporter metric definitions and registration out of the collector
package and into the main package. This fixes metrics always being empty
due to the ephemeral collector registry.

Fixes: #950

Signed-off-by: SuperQ <[email protected]>
@candlerb

These were fixed in the 0.24.0 release, but I think there are still some rough edges.

Breaking changes

This wasn't called out in the release notes, but some metrics have (sensibly) moved from /snmp to the /metrics endpoint.

This includes the snmp_packet_duration_seconds_* histogram (although it was all zeros anyway), snmp_packet_retries_total, snmp_packets_total, snmp_request_errors_total, and snmp_unexpected_pdu_type_total.

Cardinality of snmp_collection_duration_xxx

The /metrics endpoint now reveals the previously hidden snmp_collection_duration histograms. These suffer from N×M cardinality, because every combination of (auth, module) is listed, even combinations that are never used.

root@prometheus:~# curl -fsS 'localhost:9116/metrics' | grep '^snmp_packet_duration' | wc -l
18
root@prometheus:~# curl -fsS 'localhost:9116/metrics' | grep '^snmp_collection_duration' | wc -l
1540

Can I suggest this be changed to grouping on just (module)? If you happen to use different auth credentials for different devices, it doesn't necessarily have a meaningful mapping to different classes of device.

(Aside: I wonder if passing an explicit "class" scrape parameter, which defaults to the module name, might be a useful addition. You could then collect statistics specifically for a given target or group of targets. But the same effect could be achieved by simply duplicating a particular module under a different name)
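
For concreteness, here's a rough sketch of the module-only grouping suggested above (my suggestion, not the shipped definition), based on the SummaryVec from the earlier diff:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Keying only on module keeps the series count proportional to the number of
// modules, rather than auths x modules.
var snmpDuration = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name: "snmp_collection_duration_seconds",
		Help: "Duration of collections by the SNMP exporter",
	},
	[]string{"module"},
)

func main() {
	start := time.Now()
	// ... run the collection for a given module ...
	snmpDuration.WithLabelValues("if_mib").Observe(time.Since(start).Seconds())
}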

Error returns

If there is a collection error, it appears that no message is returned over HTTP, nor is it logged (i.e. journalctl -eu snmp_exporter shows no error).

I thought it did before, although I could be mistaken.

Here are two examples:

Invalid module name

# curl -vfsS 'localhost:9116/snmp?target=gw1&module=JUNK&auth=workshop_v3'
*   Trying 127.0.0.1:9116...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9116 (#0)
> GET /snmp?target=gw1&module=JUNK&auth=workshop_v3 HTTP/1.1
> Host: localhost:9116
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
* The requested URL returned error: 400 Bad Request
* Closing connection 0
curl: (22) The requested URL returned error: 400 Bad Request

We get a 400 status code, but no message explaining the error.

Modules with overlapping metrics

Here, modules if_mib and mikrotik both collect the ifTable metrics.

# curl -vfsS 'localhost:9116/snmp?target=gw1&module=if_mib,mikrotik&auth=workshop_v3'
*   Trying 127.0.0.1:9116...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9116 (#0)
> GET /snmp?target=gw1&module=if_mib,mikrotik&auth=workshop_v3 HTTP/1.1
> Host: localhost:9116
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
* The requested URL returned error: 500 Internal Server Error
* Closing connection 0
curl: (22) The requested URL returned error: 500 Internal Server Error

Again, we get a 500 status code, but no error message.
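
For comparison, here's a rough sketch of the behaviour I'd expect (my assumption, not the exporter's actual handler): write the reason into the response body alongside the status code and log it, so both curl and journalctl show something useful.

package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
)

// handleError writes the failure reason into the HTTP response body and logs
// it as well, alongside the status code.
func handleError(w http.ResponseWriter, logger log.Logger, status int, msg string) {
	level.Error(logger).Log("msg", "scrape failed", "reason", msg, "status", status)
	http.Error(w, msg, status)
}

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	http.HandleFunc("/snmp", func(w http.ResponseWriter, r *http.Request) {
		module := r.URL.Query().Get("module")
		if module == "JUNK" { // stand-in for "module not found in the config"
			handleError(w, logger, http.StatusBadRequest, fmt.Sprintf("unknown module %q", module))
			return
		}
		// ... normal scrape path ...
	})
	http.ListenAndServe(":9116", nil)
}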

@SuperQ (Member) commented Aug 29, 2023

@candlerb Let's open separate new issues for those items.

stephan-windischmann-sky pushed a commit to stephan-windischmann-sky/snmp_exporter that referenced this issue Oct 27, 2023