RodrigoVillar
Contributor

RodrigoVillar commented Oct 12, 2025

Why this should be merged

As mentioned in #4362, splitting up the Prometheus server and collector is ideal for clients of Firewood who want access to VM metrics but would prefer to use their own monitoring stack.

Although #4362 also discusses the option of hardcoding the metrics server port, I've opted to split this out into a separate PR.

How this works

Changes the METRICS_ENABLED parameter to METRICS_MODE, which has the following three options:

  • disabled: starts neither the Prometheus server nor the collector
  • server-only: starts only the Prometheus server
  • full: starts both the Prometheus server and collector

I've opted to keep a single environment variable rather than splitting METRICS_ENABLED into two, since there should never be a case for starting the Prometheus collector without the Prometheus server. In the server-only case, the metrics endpoint URL is logged; in the full case, a Grafana URL is logged.
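A minimal Go sketch of how the three modes map onto the two components. The metricsMode string type and the constant names are assumptions for illustration and may differ from the PR's definitions; the shouldStartServer/shouldStartCollector method names match the collectRegistry call site quoted later in this review, but their bodies here are guesses.

```go
package main

import "fmt"

// metricsMode mirrors the three METRICS_MODE values.
// Type and constant names are illustrative assumptions, not the PR's code.
type metricsMode string

const (
	metricsDisabled   metricsMode = "disabled"
	metricsServerOnly metricsMode = "server-only"
	metricsFull       metricsMode = "full"
)

// shouldStartServer reports whether the Prometheus server should run.
// Both server-only and full need the server.
func (m metricsMode) shouldStartServer() bool {
	return m == metricsServerOnly || m == metricsFull
}

// shouldStartCollector reports whether the Prometheus collector should run.
// Only full starts it, so the collector never runs without the server.
func (m metricsMode) shouldStartCollector() bool {
	return m == metricsFull
}

func main() {
	for _, m := range []metricsMode{metricsDisabled, metricsServerOnly, metricsFull} {
		fmt.Printf("%-11s server=%-5t collector=%t\n", m, m.shouldStartServer(), m.shouldStartCollector())
	}
}
```

Because only full enables the collector, there is no value of the single variable that yields "collector without server", which is the invariant the paragraph above relies on.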

How this was tested

CI

Need to be documented in RELEASES.md?

No

RodrigoVillar
Contributor Author

The full option is enabled in CI as seen here: https://github.com/ava-labs/avalanchego/actions/runs/18465753103/job/52607232591?pr=4415#step:4:968

When running locally with METRICS_MODE=server-only, you get the following:

goos: darwin
goarch: arm64
pkg: github.com/ava-labs/avalanchego/tests/reexecute/c
cpu: Apple M1 Max
BenchmarkReexecuteRange
BenchmarkReexecuteRange/[1,1000]-Config-default-Runner-dev
[10-13|08:45:03.588] INFO c-chain-reexecution c/vm_reexecute_test.go:616 metrics endpoint available {"url": "http://127.0.0.1:62042/ext/metrics"}

When running locally with METRICS_MODE=full, you get the following:

goos: darwin
goarch: arm64
pkg: github.com/ava-labs/avalanchego/tests/reexecute/c
cpu: Apple M1 Max
BenchmarkReexecuteRange
BenchmarkReexecuteRange/[1,1000]-Config-default-Runner-dev
[10-13|08:46:22.299] INFO prometheus tmpnet/monitor_processes.go:370 collector already running {"cmd": "prometheus"}
[10-13|08:46:22.299] INFO prometheus tmpnet/monitor_processes.go:610 waiting for collector readiness {"cmd": "prometheus", "url": "http://127.0.0.1:9090/-/ready", "logPath": "/Users/rodrigo.villar/.tmpnet/prometheus/prometheus.log"}
[10-13|08:46:22.300] INFO prometheus tmpnet/monitor_processes.go:634 collector ready {"cmd": "prometheus"}
[10-13|08:46:22.300] INFO prometheus tmpnet/monitor_processes.go:60 To stop: tmpnetctl stop-metrics-collector
[10-13|08:46:22.300] INFO c-chain-reexecution c/vm_reexecute_test.go:665 metrics available via grafana {"url": "https://grafana-poc.avax-dev.network/d/Gl1I20mnk/c-chain?&var-filter=network_uuid%7C%3D%7C8ce59182-3fcf-4a8d-8272-eeadbcea1537&var-filter=is_ephemeral_node%7C%3D%7Cfalse&from=1760359582300&to=now"}

RodrigoVillar marked this pull request as ready for review October 13, 2025 13:00
Copilot AI review requested due to automatic review settings October 13, 2025 13:00
Contributor

Copilot AI left a comment

Pull Request Overview

This PR decouples the metrics server and collector functionality by replacing the boolean METRICS_ENABLED parameter with a more granular METRICS_MODE parameter. This allows users to run only the Prometheus server without the collector, enabling access to VM metrics while using their own monitoring stack.

  • Introduces a new metricsMode type with three values: disabled, server-only, and full
  • Replaces METRICS_ENABLED with METRICS_MODE across all configuration files and scripts
  • Updates the collectRegistry function to conditionally start the collector based on the metrics mode

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

  • tests/reexecute/c/vm_reexecute_test.go: implements the new metricsMode type with validation and updates function signatures
  • tests/reexecute/c/README.md: updates documentation to explain the three metrics mode options
  • scripts/benchmark_cchain_range.sh: changes the flag from metrics-enabled to metrics-mode
  • Taskfile.yml: updates the default value from "false" to "disabled" for the new metrics mode
  • .github/actions/c-chain-reexecution-benchmark/action.yml: sets CI to use the "full" metrics mode instead of boolean true

Comment on lines 106 to 120
func (m *metricsMode) Set(s string) error {
	s = strings.ToLower(strings.TrimSpace(s))

	switch s {
	case "disabled":
		*m = MetricsDisabled
	case "server-only":
		*m = MetricsServerOnly
	case "full":
		*m = MetricsFull
	default:
		return fmt.Errorf("invalid metrics mode: %s (valid options: disabled, server-only, full)", s)
	}
	return nil
}
Collaborator

aaronbuchwald Oct 13, 2025

Could we just use a string here and perhaps an alias type rather than implementing Set?

Contributor Author

Simplified the metricsMode type here: a7cb056
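For reference, one possible shape of the string-plus-type suggestion: the flag is registered as a plain string and validated after parsing, with no flag.Value Set method. Strictly speaking this uses a defined type (type metricsMode string) rather than a Go type alias, which keeps the mode constants distinct from arbitrary strings. This is a hypothetical sketch, not the contents of a7cb056; the metrics-mode Go flag name and the parseMetricsMode helper are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// metricsMode is a string-based defined type, so no Set method is needed.
// Names are hypothetical and not necessarily what a7cb056 uses.
type metricsMode string

const (
	metricsDisabled   metricsMode = "disabled"
	metricsServerOnly metricsMode = "server-only"
	metricsFull       metricsMode = "full"
)

// parseMetricsMode normalizes and validates a raw flag value after flag parsing.
func parseMetricsMode(raw string) (metricsMode, error) {
	m := metricsMode(strings.ToLower(strings.TrimSpace(raw)))
	switch m {
	case metricsDisabled, metricsServerOnly, metricsFull:
		return m, nil
	default:
		return "", fmt.Errorf("invalid metrics mode %q (valid options: disabled, server-only, full)", raw)
	}
}

func main() {
	fs := flag.NewFlagSet("reexecute", flag.ContinueOnError)
	// Registered as an ordinary string flag; the flag name is an assumption.
	raw := fs.String("metrics-mode", string(metricsDisabled), "one of: disabled, server-only, full")
	if err := fs.Parse([]string{"-metrics-mode=server-only"}); err != nil {
		panic(err)
	}
	mode, err := parseMetricsMode(*raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("metrics mode:", mode)
}
```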

Comment on lines 230 to 231
if metricsMode.shouldStartServer() {
	collectRegistry(b, log, "c-chain-reexecution", prefixGatherer, labels, metricsMode.shouldStartCollector())
Collaborator

Can we clean this up a little bit? It seems odd that we decompose metricsMode into two separate booleans, using one for the if condition here and passing the other in as an argument.

Contributor Author

Cleaned up the logic for starting the metrics server vs starting the metrics server and collector here: 9e319d0
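A hypothetical sketch of the kind of cleanup being asked for: a single switch on the mode decides what to start, so the mode is never split into two booleans at the call site. startMetrics, startServer, and startCollector are illustrative names and not necessarily what 9e319d0 does; the server callback returns its endpoint so the collector knows what to scrape.

```go
package main

import "fmt"

type metricsMode string

const (
	metricsDisabled   metricsMode = "disabled"
	metricsServerOnly metricsMode = "server-only"
	metricsFull       metricsMode = "full"
)

// startMetrics is a hypothetical single entry point: one switch decides
// whether to start nothing, only the server, or the server plus collector.
func startMetrics(mode metricsMode, startServer func() string, startCollector func(endpoint string)) error {
	switch mode {
	case metricsDisabled:
		return nil
	case metricsServerOnly:
		startServer()
		return nil
	case metricsFull:
		endpoint := startServer()
		startCollector(endpoint)
		return nil
	default:
		return fmt.Errorf("unknown metrics mode %q", mode)
	}
}

func main() {
	var started []string
	startServer := func() string {
		started = append(started, "server")
		return "http://127.0.0.1:12345/ext/metrics" // placeholder endpoint
	}
	startCollector := func(endpoint string) {
		started = append(started, "collector scraping "+endpoint)
	}
	if err := startMetrics(metricsFull, startServer, startCollector); err != nil {
		panic(err)
	}
	fmt.Println(started)
}
```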

RodrigoVillar changed the title from feat(reexecution/c): decouple metrics server and collector to feat(reexecute/c): decouple metrics server and collector Oct 14, 2025
- `METRICS_MODE=disabled`: no metrics are available.
- `METRICS_MODE=server-only`: starts a Prometheus server exporting VM metrics. A
link to the metrics endpoint is logged during execution.
- `METRICS_MODE=full`: starts both a Prometheus server exporting VM metrics and
Collaborator

Given that these names are not very self-explanatory (it's not clear what server-only and full refer to without the descriptions), I think it would be better to simply configure the two separately and require that the Prometheus server be enabled whenever Grafana is enabled.

It's fine imo for the metrics server to be enabled and Grafana disabled by default, since that won't require setting any extra credentials.

Contributor Author

Done: 5160738

If running locally, metrics collection can be customized via the following parameters:

- `METRICS_SERVER_ENABLED`: starts a Prometheus server exporting VM metrics.
- `METRICS_COLLECTOR_ENABLED`: starts a Prometheus collector (if enabled, then `METRICS_SERVER_ENABLED` must be enabled as well).
Contributor

Why isn't this just implicit?

Contributor Author

On its own, I'm not opposed to METRICS_COLLECTOR_ENABLED=true implicitly setting METRICS_SERVER_ENABLED=true as well. However, since this PR will be followed up by #4418 (which adds the ability to configure a port for the metrics server), I think this becomes confusing: it isn't clear what happens when METRICS_COLLECTOR_ENABLED=true and METRICS_PORT=X are both set without reading the description of METRICS_COLLECTOR_ENABLED.

This could be fixed by renaming METRICS_COLLECTOR_ENABLED to METRICS_SERVER_AND_COLLECTOR_ENABLED, but this looks similar to a previous iteration of this PR which received this review comment: #4415 (comment)

Labels

None yet

Projects

Status: No status

4 participants