[improve][broker] PIP-307: Add monitoring metrics for graceful closure of producers/consumers #21854

Merged
40 commits merged into apache:master from pip-307-metrics on Jan 10, 2024

dragosvictor
Contributor

@dragosvictor dragosvictor commented Jan 4, 2024

PIP: PIP-307

Motivation

This PR adds metrics to monitor the broker behavior during the graceful closure of producers and consumers.

Modifications

This PR proposes the following new broker-level metrics:

  • brk_lb_unload_latency: exposes a histogram of total time spent (in milliseconds) unloading a topic (from state RELEASE to state FREE or OWNED) on the source broker
  • brk_lb_release_latency: exposes a histogram of milliseconds spent in the RELEASE state on the source broker
  • brk_lb_assign_latency: exposes a histogram of milliseconds spent in the ASSIGN state on the destination broker
  • brk_lb_disconnect_latency: exposes a histogram of milliseconds spent in the disconnected state on the source broker
  • brk_lb_ignored_ack_total: a gauge exposing the total number of consumer message ACKs ignored during topic unloading on the source broker
  • brk_lb_ignored_send_total: a gauge exposing the total number of producer messages ignored during topic unloading on the source broker
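
The definition of brk_lb_unload_latency above (time from entering RELEASE until reaching FREE or OWNED) can be illustrated with a minimal sketch. This is not the broker's implementation: the class name, method names, and the injectable millisecond clock are all hypothetical, and states are represented as plain strings.

```python
import time


class UnloadLatencyTracker:
    """Hypothetical helper, not Pulsar code: times RELEASE -> FREE/OWNED per bundle."""

    def __init__(self, clock_ms=None):
        # clock_ms returns the current time in milliseconds; injectable for tests.
        self._clock_ms = clock_ms or (lambda: time.monotonic() * 1000.0)
        self._release_started = {}  # bundle name -> time RELEASE was entered

    def on_state(self, bundle, state):
        """Record a state change; return the unload latency in ms when one completes."""
        now = self._clock_ms()
        if state == "RELEASE":
            self._release_started[bundle] = now
            return None
        if state in ("FREE", "OWNED"):
            started = self._release_started.pop(bundle, None)
            if started is not None:
                return now - started
        return None  # other states, or FREE/OWNED without a preceding RELEASE
```

A caller would feed every state change into on_state and pass each non-None return value to the corresponding histogram.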

For the histogram metrics, the buckets are 1, 10, 100, 200, 1000 milliseconds.
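
To show how these bucket bounds turn observations into the cumulative le="..." counts of a Prometheus histogram, here is a small sketch. It is an illustration only and does not use the Prometheus client library; the class name is hypothetical, and the five observed values are invented so that their totals match the brk_lb_assign_latency_ms sample in this description (count 5, sum 64).

```python
import bisect

BUCKETS_MS = [1, 10, 100, 200, 1000]  # upper bounds from this PR; +Inf is implicit


class LatencyHistogram:
    """Hypothetical stand-in for a Prometheus histogram with fixed buckets."""

    def __init__(self, bounds=BUCKETS_MS):
        self.bounds = list(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0
        self.n = 0

    def observe(self, ms):
        # bisect_left finds the first bound >= ms, i.e. the smallest "le" bucket.
        self.counts[bisect.bisect_left(self.bounds, ms)] += 1
        self.total += ms
        self.n += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as exposed via le labels."""
        running, out = 0, []
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out


h = LatencyHistogram()
for value_ms in (5, 7, 9, 20, 23):  # invented observations, see note above
    h.observe(value_ms)
print(h.cumulative(), h.n, h.total)
```

Because the counts are cumulative, every bucket at or above the largest observation reports the same value as the +Inf bucket and the _count series.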

A sample reading of these metrics from the /metrics/ endpoint follows. Note that in this output the gauge metrics appear with the prefix pulsar_ instead of brk_, and the histogram metrics carry an _ms suffix. For consistency, this description sticks to the brk_ names for now, even though this may change soon.

# TYPE brk_lb_assign_latency_ms histogram
brk_lb_assign_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1.0"} 0.0
brk_lb_assign_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="10.0"} 3.0
brk_lb_assign_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="100.0"} 5.0
brk_lb_assign_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="200.0"} 5.0
brk_lb_assign_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1000.0"} 5.0
brk_lb_assign_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="+Inf"} 5.0
brk_lb_assign_latency_ms_count{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 5.0
brk_lb_assign_latency_ms_sum{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 64.0
brk_lb_assign_latency_ms_created{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 1.704483941683E9
# TYPE brk_lb_disconnect_latency_ms histogram
brk_lb_disconnect_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1.0"} 2.0
brk_lb_disconnect_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="10.0"} 2.0
brk_lb_disconnect_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="100.0"} 2.0
brk_lb_disconnect_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="200.0"} 2.0
brk_lb_disconnect_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1000.0"} 2.0
brk_lb_disconnect_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="+Inf"} 2.0
brk_lb_disconnect_latency_ms_count{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 2.0
brk_lb_disconnect_latency_ms_sum{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 0.0
brk_lb_disconnect_latency_ms_created{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 1.704484142359E9
# TYPE brk_lb_unload_latency_ms histogram
brk_lb_unload_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1.0"} 0.0
brk_lb_unload_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="10.0"} 0.0
brk_lb_unload_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="100.0"} 2.0
brk_lb_unload_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="200.0"} 2.0
brk_lb_unload_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1000.0"} 2.0
brk_lb_unload_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="+Inf"} 2.0
brk_lb_unload_latency_ms_count{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 2.0
brk_lb_unload_latency_ms_sum{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 68.0
brk_lb_unload_latency_ms_created{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 1.704484142359E9
# TYPE brk_lb_release_latency_ms histogram
brk_lb_release_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1.0"} 0.0
brk_lb_release_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="10.0"} 0.0
brk_lb_release_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="100.0"} 2.0
brk_lb_release_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="200.0"} 2.0
brk_lb_release_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="1000.0"} 2.0
brk_lb_release_latency_ms_bucket{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading",le="+Inf"} 2.0
brk_lb_release_latency_ms_count{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 2.0
brk_lb_release_latency_ms_sum{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 36.0
brk_lb_release_latency_ms_created{cluster="pulsar-mini",broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",metric="bundleUnloading"} 1.704484142341E9

# TYPE pulsar_lb_ignored_ack_total gauge
pulsar_lb_ignored_ack_total{cluster="pulsar-mini", broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local", metric="bundleUnloading"} 32
# TYPE pulsar_lb_ignored_send_total gauge
pulsar_lb_ignored_send_total{cluster="pulsar-mini", broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local", metric="bundleUnloading"} 69
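
As a reading aid for the sample above, the mean latency of each histogram is its _sum divided by its _count. The sketch below parses a few of the sample lines and computes that ratio. The function name and the regular expression are assumptions made for illustration, and the parser handles only the simple line shape shown here, with one series per metric name.

```python
import re

# Matches lines of the form: name{labels} value
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{[^}]*\}\s+(\S+)$')

LABELS = ('cluster="pulsar-mini",'
          'broker="pulsar-mini-broker-0.pulsar-mini-broker.pulsar.svc.cluster.local:8080",'
          'metric="bundleUnloading"')

# _count and _sum lines copied from the sample reading in this description.
SAMPLE = "\n".join([
    "# TYPE brk_lb_assign_latency_ms histogram",
    "brk_lb_assign_latency_ms_count{" + LABELS + "} 5.0",
    "brk_lb_assign_latency_ms_sum{" + LABELS + "} 64.0",
    "# TYPE brk_lb_unload_latency_ms histogram",
    "brk_lb_unload_latency_ms_count{" + LABELS + "} 2.0",
    "brk_lb_unload_latency_ms_sum{" + LABELS + "} 68.0",
])


def mean_latency_ms(exposition, metric):
    """Return sum/count for a histogram, or None if it has no observations."""
    values = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and TYPE/HELP comments
        match = SAMPLE_LINE.match(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    count = values.get(metric + "_count", 0.0)
    if count == 0:
        return None
    return values.get(metric + "_sum", 0.0) / count


print(mean_latency_ms(SAMPLE, "brk_lb_assign_latency_ms"))  # 64.0 / 5.0
print(mean_latency_ms(SAMPLE, "brk_lb_unload_latency_ms"))  # 68.0 / 2.0
```

For the sample reading this gives a mean of 12.8 ms for assign and 34.0 ms for unload.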

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Modified test org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImplTest#testGetMetrics to validate ignored (ack/send) count metric values
  • Modified test org.apache.pulsar.broker.stats.PrometheusMetricsTest#testBundlesMetrics to validate the existence of new latency metrics

Manually verified that the metrics exposed are correct (see sample reading above) using a k8s deployment of Pulsar.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: dragosvictor#4

@github-actions github-actions bot added the doc-not-needed label ("Your PR changes do not impact docs") Jan 4, 2024
@dragosvictor dragosvictor marked this pull request as ready for review January 4, 2024 17:39
@github-actions github-actions bot added the doc-required label ("Your PR changes impact docs and you will update later.") and removed the doc-not-needed label ("Your PR changes do not impact docs") Jan 5, 2024
@codecov-commenter

codecov-commenter commented Jan 5, 2024

Codecov Report

Attention: 33 lines in your changes are missing coverage. Please review.

Comparison is base (cea5c93) 73.59% compared to head (f9ce88b) 73.58%.
Report is 3 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #21854      +/-   ##
============================================
- Coverage     73.59%   73.58%   -0.02%     
- Complexity    32323    32336      +13     
============================================
  Files          1858     1859       +1     
  Lines        138174   138273      +99     
  Branches      15148    15155       +7     
============================================
+ Hits         101696   101751      +55     
- Misses        28608    28643      +35     
- Partials       7870     7879       +9     
Flag        Coverage Δ
inttests    24.13% <1.37%> (-0.12%) ⬇️
systests    23.87% <5.51%> (+0.08%) ⬆️
unittests   72.86% <77.24%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...dbalance/extensions/ExtensibleLoadManagerImpl.java 79.86% <100.00%> (+0.31%) ⬆️
...xtensions/channel/ServiceUnitStateChannelImpl.java 84.46% <ø> (ø)
...alance/extensions/manager/StateChangeListener.java 100.00% <100.00%> (ø)
...r/loadbalance/extensions/models/UnloadCounter.java 93.84% <ø> (ø)
...ervice/AbstractDispatcherSingleActiveConsumer.java 90.55% <100.00%> (ø)
...ulsar/broker/stats/prometheus/metrics/Summary.java 100.00% <ø> (ø)
...rg/apache/pulsar/client/impl/TopicListWatcher.java 67.14% <100.00%> (+0.47%) ⬆️
...rg/apache/pulsar/broker/service/AbstractTopic.java 87.86% <0.00%> (-0.15%) ⬇️
...lance/extensions/channel/StateChangeListeners.java 84.61% <80.00%> (-3.62%) ⬇️
...va/org/apache/pulsar/broker/service/ServerCnx.java 72.13% <70.58%> (-0.27%) ⬇️
... and 2 more

... and 67 files with indirect coverage changes

@heesung-sn heesung-sn added this to the 3.2.0 milestone Jan 5, 2024
@merlimat merlimat merged commit 28ed48e into apache:master Jan 10, 2024
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Jan 10, 2024
@Technoboy-
Contributor

Hi @heesung-sn, we're now releasing 3.2, so all new features need to be labeled with milestone 3.3.

@heesung-sn
Contributor

I see. Thanks for the correction.

@dragosvictor dragosvictor deleted the pip-307-metrics branch January 31, 2024 19:46