Allow access to Shoryuken utilization metrics #672

Closed
cjlarose opened this issue Jul 10, 2021 · 6 comments · Fixed by #673

Comments

@cjlarose
Collaborator

Related: #671

The idea here is to create some sort of public API such that users can query for Shoryuken's current runtime state in terms of utilization. Ideally, such an API should consider the possibility that Shoryuken is using multiple processing groups and allow users to discern which utilization metrics are associated with which group.

The most obvious time that users might want access to this data is, of course, when that data changes. One option would be to provide the information to middleware directly. This would allow users to build something akin to sidekiq-statsd.

class StatsMiddleware
  def call(worker_instance, _queue, _sqs_msg, _body)
    # `manager` here is the proposed accessor for the processing group's
    # manager, which would expose the group's utilization metrics.
    manager = worker_instance.manager
    puts manager.group_name
    puts manager.running?
    puts manager.busy_processors
    puts manager.max_processors
    yield
  end
end

Shoryuken.configure_server do |config|
  config.server_middleware do |chain|
    chain.add StatsMiddleware
  end
end

This is a little bit awkward, though: users would be notified whenever a new job is picked up (busy_processors is incremented), but they wouldn't be notified whenever a processor becomes available (busy_processors is decremented), because while middleware is executing, a processor is necessarily still in use.

Another option would be to expose some callbacks that are guaranteed to be executed any time that the utilization metrics change. I think this ultimately gives users the greatest flexibility on how they want to use the data. For example:

Shoryuken.configure_server do |config|
  config.on(:manager_startup) do |event|
    puts event.group_name
    puts event.processor_metrics.running?
    puts event.processor_metrics.busy_processors
    puts event.processor_metrics.max_processors
  end

  config.on(:worker_assignment) do |event|
    puts event.group_name
    puts event.processor_metrics.running?
    puts event.processor_metrics.busy_processors
    puts event.processor_metrics.max_processors
  end

  config.on(:worker_complete) do |event|
    puts event.group_name
    puts event.processor_metrics.running?
    puts event.processor_metrics.busy_processors
    puts event.processor_metrics.max_processors
  end
end
@rbroemeling

@cjlarose Do you have a preferred direction for this functionality?

@cjlarose
Collaborator Author

cjlarose commented Jul 13, 2021

I'm experimenting with adding this functionality by developing a shoryuken-statsd gem alongside whatever we'll need in Shoryuken itself. That work is here: https://github.com/cjlarose/shoryuken-statsd

That way I can be confident that we'll have the right hooks in Shoryuken so that folks can build their own metrics integration if they need to. Out of curiosity @rbroemeling, would you be interested in statsd integration specifically, or do you expect to use a different platform/protocol?

@cjlarose
Collaborator Author

cjlarose commented Jul 13, 2021

Opened #673 as a draft. I ended up adding a new event called :utilization_update instead of forcing folks to subscribe to a bunch of different events for all the times that the utilization metrics would change. Let me know if that new event would work for your use case.
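
For anyone following along, here's a rough sketch of what subscribing to that event could look like. I'm reusing the field names from the sketches earlier in this issue, so treat the exact payload shape as an assumption until #673 settles:

Shoryuken.configure_server do |config|
  config.on(:utilization_update) do |event|
    # Payload accessors below mirror the earlier sketches in this issue;
    # the final shape is whatever lands in #673.
    busy = event.processor_metrics.busy_processors
    max  = event.processor_metrics.max_processors
    puts "[#{event.group_name}] #{busy}/#{max} processors busy"
  end
end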

@rbroemeling

@cjlarose We're specifically looking for Datadog metrics (i.e., to send metrics to dogstatsd). So, they're statsd-compatible, but we might want to add our own implementation as well so that we can use some of Datadog's specific implementation details and enhancements on the metrics that we'll be reporting from Shoryuken.

I like the idea of writing an event system into Shoryuken that ensures that people can easily implement their own statistics gathering if/when necessary; that's a great plan.

@rbroemeling

Your draft PR looks good, @cjlarose. One concern that I have is that in extreme cases this could cause storms of stats updates (i.e., assume each loop retrieves 10 messages, then each loop will "storm" 20 statsd packets into the statsd listener). In high-load cases, doing two stats reports per job (i.e., one on assignment, one on completion) might be an unnecessary amount of statsd load.

Brainstorming some other metrics that might be interesting (though I'm not positive that these fit with the utilization_update event); a strawman sketch of what these could look like follows the list:

  • number of SQS messages retrieved from AWS on the last call
  • duration of main dispatch loop (one iteration took XXms)
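
Purely as a strawman, events for those could hypothetically look something like this (none of these event names or fields exist today; they're only for illustration):

Shoryuken.configure_server do |config|
  # Hypothetical event and field names -- illustration only, nothing here is implemented.
  config.on(:dispatch_loop_complete) do |event|
    puts event.group_name
    puts event.messages_retrieved # SQS messages fetched by the last receive call
    puts event.duration_ms        # how long this dispatch iteration took
  end
end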

@cjlarose
Collaborator Author

Awesome to hear that you're interested in Datadog specifically because that's what I was targeting when I was experimenting with the shoryuken-statsd project.

One concern that I have is that in extreme cases this could cause storms of stats updates (i.e., assume each loop retrieves 10 messages, then each loop will "storm" 20 statsd packets into the statsd listener).

This is something I thought of, but one thing to consider is that clients like dogstatsd don't send 1 UDP packet for every metric update: instead, a bunch of updates are buffered internally and then the whole buffer is flushed in one big packet (depending on the network MTU). I think I might just try to get an MVP working first, and then we can adjust accordingly. Either way, I think it's possible to defer the responsibility of throttling/debouncing/batching from Shoryuken to the client. Plus, there might be some clients that actually do want to be notified on every update, so we should at least give them that option in case they need it.
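
To make the batching point concrete, here's a rough sketch of a subscriber built on the dogstatsd-ruby client. The metric names, tag, and event payload here are assumptions for illustration only, and the exact buffering/flush behavior depends on the client version (recent versions buffer updates and send them in larger packets rather than one UDP packet per metric):

require 'datadog/statsd' # dogstatsd-ruby

statsd = Datadog::Statsd.new('localhost', 8125)

Shoryuken.configure_server do |config|
  config.on(:utilization_update) do |event|
    # Field names follow the sketches above; the real payload is defined by #673.
    tags = ["shoryuken_group:#{event.group_name}"]
    statsd.gauge('shoryuken.processors.busy', event.processor_metrics.busy_processors, tags: tags)
    statsd.gauge('shoryuken.processors.max', event.processor_metrics.max_processors, tags: tags)
  end
end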

Brainstorming some other metrics that might be interesting (though I'm not positive that these fit with the utilization_update event):

  • number of SQS messages retrieved from AWS on the last call
  • duration of main dispatch loop (one iteration took XXms)

I've been thinking about some of the same ideas, too! I think what I'll do is try to wrap up shoryuken-statsd's MVP, which would just be the utilization metrics, and then we can go from there.
