Add consumer_lag in Kafka consumergroup metricset #14822
ChrsMark merged 10 commits into elastic:master
Signed-off-by: chrismark <chrismarkou92@gmail.com>
jsoriano
left a comment
@ChrsMark thanks for taking this! I think that this solves one of the main issues we had with this module. We can decide in future PRs if we could further refactor this to have a single client, but I think that we still need to connect to non-leaders to monitor certain things.
Could you also take a look at the dashboard? The consumer lag visualization can surely be simplified with this new field 🙂
| } | ||
| // GetPartitionOffsetFromTheLeader fetches the OffsetNewest from the leader. | ||
| func (b *Broker) GetPartitionOffsetFromTheLeader(topic string, partitionID int32) (int64, error) { |
Nit. Use Fetch for consistency with other methods here.
| func (b *Broker) GetPartitionOffsetFromTheLeader(topic string, partitionID int32) (int64, error) { | |
| func (b *Broker) FetchPartitionOffsetFromTheLeader(topic string, partitionID int32) (int64, error) { |
| testEvent("group1", "topic1", 0, common.MapStr{ | ||
| "client": clientMeta(0), | ||
| "offset": int64(10), | ||
| "offset": int64(10), "consumer_lag": int64(42) - int64(10), |
Nit.
| "offset": int64(10), "consumer_lag": int64(42) - int64(10), | |
| "offset": int64(10), | |
| "consumer_lag": int64(42) - int64(10), |
| b.id = other.ID() | ||
| b.advertisedAddr = other.Addr() | ||
| c, err := getClusteWideClient(b.Addr(), b.cfg) |
I was thinking that we could use this client for everything, but no, we may still need to fetch offsets from non-leader partition replicas for monitoring purposes.
Though we can also use client.Leader(topic, partitionID) for that.
+1 on revisiting this in a follow-up PR focused on refactoring.
| for topic, partitions := range ret.off.Blocks { | ||
| for partition, info := range partitions { | ||
| partitionOffset, err := getPartitionOffsetFromTheLeader(b, topic, partition) |
Something to explore here.
I guess the partition offset here may always end up slightly ahead of the group offset, because we first get the group offset and then the partition offset. Between the two operations the partition offset may have changed.
Starting with version 4 of ListOffsets (the API method used to get partition offsets), it is possible to indicate a current_leader_epoch to retrieve "old" metadata.
Starting with version 5 of OffsetFetch (the API method used to get consumer group offsets), the response contains a leader_epoch field.
I wonder if we can use the leader_epoch contained in the response of OffsetFetch when available to query for the offset of the partition in the same epoch. This way we could have a more accurate value.
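The ordering concern above can be illustrated with a toy sketch. No Kafka is involved: the partition type and the sampleLag helper are hypothetical stand-ins, and the numbers are made up. It only shows that reading the group offset first and the partition offset second can overestimate the lag when traffic arrives between the two reads.

```go
package main

import "fmt"

// partition is a hypothetical in-memory stand-in for a Kafka partition.
type partition struct {
	logEnd    int64 // newest offset in the partition
	committed int64 // offset committed by the consumer group
}

// sampleLag reads the group offset first and the partition offset second,
// calling between() in the gap to model activity between the two reads.
func sampleLag(p *partition, between func(*partition)) int64 {
	groupOffset := p.committed
	between(p)
	partitionOffset := p.logEnd
	return partitionOffset - groupOffset
}

func main() {
	p := &partition{logEnd: 42, committed: 10}

	// Between the reads, 5 messages are produced and 5 are consumed.
	lag := sampleLag(p, func(p *partition) {
		p.logEnd += 5
		p.committed += 5
	})

	fmt.Println("sampled lag:", lag)                 // 47 - 10 = 37
	fmt.Println("actual lag:", p.logEnd-p.committed) // 47 - 15 = 32
}
```

In this sketch the sampled value is 5 higher than the lag at either instant, which is the kind of drift that querying both offsets for the same epoch is meant to reduce.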
What are the implications for the support matrix? I'm not very familiar with API version vs Kafka versioning.
> What are the implications for the support matrix? I'm not very familiar with API version vs Kafka versioning.
All messages in the Kafka protocol are versioned, and each client and broker can support a different range of versions for each message. There is a message (ApiVersionsRequest) to query the versions supported by the broker; we could use it to decide whether we can use the epoch-aware methods.
> @jsoriano not sure if this can be achieved with the current implementation of GetOffset
No, we would need to forge our own request, as we do to request partition offsets from the leader. Or we could contribute support for these versions to Sarama. 🙂
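A minimal sketch of the version check discussed above, assuming the broker's supported ranges have already been obtained from an ApiVersions response. The versionRange type and both helper functions are hypothetical and not part of Sarama or Beats; the API keys (2 for ListOffsets, 9 for OffsetFetch) come from the Kafka protocol. The client library would also need to support these request versions, which this sketch does not cover.

```go
package main

import "fmt"

// Kafka protocol API keys for the two requests involved.
const (
	apiKeyListOffsets int16 = 2
	apiKeyOffsetFetch int16 = 9
)

// versionRange is a hypothetical holder for the min/max version a broker
// reports for one API key in its ApiVersions response.
type versionRange struct{ Min, Max int16 }

// supports reports whether the broker accepts version want of the given API.
func supports(ranges map[int16]versionRange, key, want int16) bool {
	r, ok := ranges[key]
	return ok && r.Min <= want && want <= r.Max
}

// canUseLeaderEpoch is true when the broker accepts ListOffsets v4
// (current_leader_epoch in the request) and OffsetFetch v5
// (leader_epoch in the response).
func canUseLeaderEpoch(ranges map[int16]versionRange) bool {
	return supports(ranges, apiKeyListOffsets, 4) &&
		supports(ranges, apiKeyOffsetFetch, 5)
}

func main() {
	older := map[int16]versionRange{
		apiKeyListOffsets: {0, 2},
		apiKeyOffsetFetch: {0, 3},
	}
	newer := map[int16]versionRange{
		apiKeyListOffsets: {0, 5},
		apiKeyOffsetFetch: {0, 5},
	}
	fmt.Println(canUseLeaderEpoch(older), canUseLeaderEpoch(newer)) // false true
}
```

When the check fails, the metricset could fall back to the non-epoch-aware requests, so the support matrix would not need to change.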
| - name: consumer_lag | ||
| type: long | ||
| description: consumer lag for partition/topic |
Probably worth explaining what this is, since it is an important metric: the difference between the partition offset and the consumer offset.
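As a minimal sketch of that definition (the function name is illustrative and not taken from the PR; the values are the ones used in the test event above):

```go
package main

import "fmt"

// consumerLag returns how far a consumer group is behind the newest offset
// of a partition: partition offset minus the group's committed offset.
func consumerLag(partitionOffset, groupOffset int64) int64 {
	return partitionOffset - groupOffset
}

func main() {
	// Partition offset 42, group offset 10, as in the test event.
	fmt.Println(consumerLag(42, 10)) // 32
}
```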
@jsoriano, @mtojek, @exekias thank you all for reviewing! Almost everything is now addressed. @jsoriano regarding some special points:
Let me know what you think!
Regarding @ChrsMark's last comment: this PR is already a medium-sized one, so I'm in favor of moving the refactoring and the dashboards to two independent follow-up PRs. Thanks for addressing the comments. LGTM!
jsoriano
left a comment
Yep, dashboard and further refactors can be left for future changes.
Failing tests are unrelated and already addressed in #14849.
(cherry picked from commit 23aaf5c)
This PR adds the consumer_lag field in the consumergroup metricset. This is calculated by subtracting groupOffset from partitionOffset for a partition-topic pair. partitionOffset is retrieved from the cluster directly using https://github.com/Shopify/sarama/blob/afedecade3c6d8e99ab6dfeeea7814bf800b90a4/client.go#L62
May resolve #3608.
Signed-off-by: chrismark chrismarkou92@gmail.com