
[FEATURE REQUEST]: As a CF Operator, I expect that when I observe the "AppInstanceExceededLogRateLimitCount" metric, I can see the app instance details that caused this value to be incremented so I can take action with the app owner #457

@sunjayBhatia

Description

Is your feature request related to a problem?

Right now, AppInstanceExceededLogRateLimitCount is emitted (here) as a per-cell tagged counter metric. The loggregator agent adds only the cell-specific tags to the metric, nothing more, resulting in metrics that look like this:

deployment:"cf" job:"diego-cell" index:"0e98fd00-47b2-4589-94f0-385f78b3a04d" ip:"10.0.1.12" tags:<key:"instance_id" value:"0e98fd00-47b2-4589-94f0-385f78b3a04d" > tags:<key:"source_id" value:"rep" > counterEvent:<name:"AppInstanceExceededLogRateLimitCount" delta:1 total:206 >

As a result, operators must do additional work to figure out which app instance on the emitting cell is the actual culprit of the chatty logging. This is not straightforward, and we can do better so that operators can easily identify problematic apps (and even individual app instances).

Describe the solution you'd like

In the log rate limit reporter, we already have access to the metric tags set on the app's desired LRP and use them to push our own log line into the app's log stream. We should be able to do the same thing with the AppInstanceExceededLogRateLimitCount metric: tag it accordingly so that the value emitted is a per-app-instance metric rather than a per-cell metric.

We can potentially just add a new variant of our IncrementCounter method that attaches tags to the envelope we want to send, using this option (see the sketch below).

  • Diego repo
  • executor
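
A minimal sketch of what that tag-aware counter variant could look like, assuming a go-loggregator-style ingress client. The IncrementCounterWithTags name, the client wrapper, and the package/import version are assumptions (the real IncrementCounter sits behind Diego's logging client wrapper); EmitCounter, WithDelta, and WithEnvelopeTag are the upstream go-loggregator option API:

```go
package metrics

// Sketch only: IncrementCounterWithTags and the client wrapper are
// hypothetical. EmitCounter, WithDelta, and WithEnvelopeTag are
// go-loggregator's option-style API (import path/version may differ).

import (
	loggregator "code.cloudfoundry.org/go-loggregator"
)

const AppInstanceExceededLogRateLimitCount = "AppInstanceExceededLogRateLimitCount"

type client struct {
	ingress *loggregator.IngressClient
}

// IncrementCounterWithTags emits a counter envelope with a delta of 1 and the
// given tags attached, so the metric carries per-app-instance identity (e.g.
// the metric tags from the app's desired LRP) instead of only the cell-level
// tags added by the loggregator agent.
func (c *client) IncrementCounterWithTags(name string, tags map[string]string) {
	opts := []loggregator.EmitCounterOption{loggregator.WithDelta(1)}
	for k, v := range tags {
		// Explicit conversion keeps the generic envelope-tag option usable
		// as a counter option.
		opts = append(opts, loggregator.EmitCounterOption(loggregator.WithEnvelopeTag(k, v)))
	}
	c.ingress.EmitCounter(name, opts...)
}
```

The caller in the log rate limit reporter would then pass the container's metric tags from the desired LRP, something like c.IncrementCounterWithTags(AppInstanceExceededLogRateLimitCount, lrpMetricTags); the exact plumbing for lrpMetricTags is TBD.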

Describe alternatives you've considered

  • An alternative is to implement the emission logic here instead, keeping it in the log streamer package. That package already has reporters that periodically emit per-container (app instance) metrics.
    • Maybe implement another metric reporter that is available to the log streamer.
    • IMO, moving the actual call to the loggregator/metron client out to a goroutine separate from the main functionality is "safer": the way this code is currently written, if the call to loggregator blocks, the work we care about is blocked. I believe we use the parallel-goroutine metric emission pattern a lot for exactly this purpose, and to keep metric emission logic/periodicity in one place so we don't have to re-implement periodic metrics over and over (see the sketch after this list).
  • If the "counter" type metric does not work, we can try to use a gauge style metric instead
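To make that last point concrete, here is a minimal sketch of the parallel-goroutine pattern under assumed names (limitExceededReporter, counterEmitter, Tick/Run are all hypothetical, not existing executor code): the log streamer records exceedances in memory, and a separate reporter goroutine periodically flushes tagged deltas to the metron client, so a blocked loggregator call never stalls log streaming.

```go
package logstreamer

// Sketch only, illustrating the "parallel goroutine for metric emission"
// pattern described above; none of these names exist in the executor.

import (
	"sync"
	"time"
)

// counterEmitter stands in for the metron/loggregator client. If a call here
// blocks, only the reporter goroutine is affected, never the log streamer.
type counterEmitter interface {
	IncrementCounterWithDelta(name string, delta uint64, tags map[string]string) error
}

type limitExceededReporter struct {
	emitter  counterEmitter
	interval time.Duration

	mu     sync.Mutex
	counts map[string]uint64            // app-instance key -> pending delta
	tags   map[string]map[string]string // app-instance key -> metric tags
}

func newLimitExceededReporter(e counterEmitter, interval time.Duration) *limitExceededReporter {
	return &limitExceededReporter{
		emitter:  e,
		interval: interval,
		counts:   map[string]uint64{},
		tags:     map[string]map[string]string{},
	}
}

// Tick is called by the log streamer whenever an app instance exceeds its log
// rate limit. It only touches in-memory state, so it can never block on
// loggregator.
func (r *limitExceededReporter) Tick(instanceKey string, metricTags map[string]string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[instanceKey]++
	r.tags[instanceKey] = metricTags
}

// Run flushes pending deltas on its own goroutine at a fixed interval,
// keeping emission logic and periodicity in one place.
func (r *limitExceededReporter) Run(stop <-chan struct{}) {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			r.flush()
		}
	}
}

func (r *limitExceededReporter) flush() {
	r.mu.Lock()
	pending, pendingTags := r.counts, r.tags
	r.counts = map[string]uint64{}
	r.tags = map[string]map[string]string{}
	r.mu.Unlock()

	for key, delta := range pending {
		// A failed or slow emission only delays metrics, not log streaming.
		_ = r.emitter.IncrementCounterWithDelta("AppInstanceExceededLogRateLimitCount", delta, pendingTags[key])
	}
}
```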
