Worker lambda metrics #734

Merged
merged 15 commits into from
Aug 31, 2022

Contributor

@DavidLawes DavidLawes commented Aug 26, 2022

What does this change?

This adds metrics for the worker lambdas (the harvester and the senders) to give us a longer-term view of the performance of these services.

The change has been deployed to CODE and the metrics have been added to the notifications dashboard:

Harvester:
Screenshot 2022-08-26 at 15 15 28
^ The type dimension lets us view the processingTime metric separately for breakingNews notifications and for other types.

Sender lambdas:
Screenshot 2022-08-30 at 11 35 22
^ The platform and type dimensions let us view processingTime metrics for ios and android, each split further into breakingNews notifications and other types.

@DavidLawes
Contributor Author

DavidLawes commented Aug 26, 2022

cc @michaelwmcnamara @jacobwinch

This is a first attempt at adding embedded metrics to the worker lambdas. The format of the _aws log entry is pretty much copied and pasted from your PR.

The current state of this PR doesn't attempt to abstract the common functionality yet because I ran into a strange error:

  • we can see that the metrics are being successfully parsed by AWS (we see the metrics and their values in the AWS console)
  • in the CloudWatch logs I can see the corresponding log message
  • however, in Kibana I never see the log message that contains the _aws object (which is disconcerting as a dev, because some information isn't being shown, and it made me second-guess whether something is wrong)

Screenshot 2022-08-26 at 15 23 58

Screenshot 2022-08-26 at 15 25 27

Did you come across this situation when creating your original PR?

EDIT: I've tested again this morning with the same code and I now see the logs in Kibana and the metrics in AWS, so maybe there was a temporary issue in the ELK stack somewhere (or I was being impatient).

"Timestamp" -> end.toEpochMilli,
"CloudWatchMetrics" -> List(Map(
"Namespace" -> s"Notifications/${env.stage}/workers",
"Dimensions" -> List(List("platform")),
Contributor Author

@waisingyiu @frankie297
Dimensions let us view metrics at a finer granularity. For the sender lambdas, I wonder whether it would be useful to see the metric by breakingNews/other? E.g. we could get metrics for:
ios + breakingNews
ios + other
android + breakingNews
android + other

Or, maybe just knowing the processing time for android vs ios will be enough. What do you think?

Contributor

I think it may be useful when we need to look at the performance of a particular breaking news notification. We could show metrics for breaking news only, and associate a metric with a particular breaking news notification based on the time. Please ignore me if CloudWatch metrics should not be used for this purpose.

Contributor Author

Thanks @waisingyiu :) If I recall correctly, providing the notificationId as a metric dimension would incur additional cost (ref: https://docs.google.com/document/d/10AnOZ4MLjuTO7mXaySoVO2SwmroLmns4gGmsfY7egmw/edit#heading=h.ajh00ba9hm68). I think if we needed to analyse the performance of a specific notificationId we'd need to rely on our logs+kibana dashboard (I could've misinterpreted this though!).

For the metrics, I was hoping to generate a minimum (static) subset of dimensions that would allow us to analyse performance/trends. Based on your response I think there would be value in providing an additional dimension for the sender lambda metrics:

  • platform: ios or android
  • type: breakingNews or other


Contributor

@jacobwinch jacobwinch Aug 30, 2022

I think if we needed to analyse the performance of a specific notificationId we'd need to rely on our logs+kibana dashboard (I could've misinterpreted this though!).

I think this is the correct approach 👍

I also agree that adding platform and type as dimensions would give us some extra benefits without increasing costs too much.
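
As a rough illustration of the two-dimension setup discussed in this thread, the sketch below builds a log entry in CloudWatch embedded metric format with platform and type as dimensions. The namespace pattern and the processingTime metric name come from this PR; the object name, the typeDimension helper, the "Milliseconds" unit and the plain-string notification type are assumptions made for the example, not the code that was merged.

```scala
import java.time.Instant

object SenderMetricExample {
  // Hypothetical helper: collapses the notification type into the two
  // static buckets discussed above, so the set of dimension values stays small.
  def typeDimension(notificationType: String): String =
    if (notificationType == "breakingNews") "breakingNews" else "other"

  // Builds the log entry. CloudWatch reads the "_aws" metadata and then looks
  // up each dimension name and metric name as a top-level key of the same entry.
  def logEntry(
      stage: String,
      platform: String,
      notificationType: String,
      processingTimeMs: Long,
      end: Instant
  ): Map[String, Any] =
    Map[String, Any](
      "_aws" -> Map[String, Any](
        "Timestamp" -> end.toEpochMilli,
        "CloudWatchMetrics" -> List(Map[String, Any](
          "Namespace" -> s"Notifications/$stage/workers",
          // One dimension set containing both keys: platform + type
          "Dimensions" -> List(List("platform", "type")),
          // The unit is an assumption for this sketch
          "Metrics" -> List(Map("Name" -> "processingTime", "Unit" -> "Milliseconds"))
        ))
      ),
      "platform" -> platform,
      "type" -> typeDimension(notificationType),
      "processingTime" -> processingTimeMs
    )
}
```

Because every distinct combination of dimension values becomes its own metric, keeping both dimensions to two values each gives four metrics per sender namespace, whereas a per-notification identifier would create a new one for every notification.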

@jacobwinch
Contributor

Did you come across this situation when creating your original PR?

I don't remember facing this problem. We'd expect all Lambda logs to show up in ELK pretty quickly, certainly within a minute or two.

EDIT: I've tested again this morning with the same code and I now see the logs in Kibana and the metrics in AWS, so maybe there was a temporary issue in the ELK stack somewhere (or I was being impatient).

Let's keep an eye on things when this is merged and investigate further if it happens again!

"harvester.notificationProcessingEndTime.string" -> end.toString,
), "Finished processing notification event")
)
records.foreach {
Contributor Author

At the moment the harvester batchSize is 1. I think if we increased this, then logging in the finally block wouldn't give us accurate metrics, because a single log entry would cover every record in the batch. Not something that I think I need to address now, just a consideration for the future.

@DavidLawes
Contributor Author

I think for now I'd like to:

  • push these changes into prod so we can start collecting metrics
  • consider possible refactoring of the embedded cloudwatch metric as a subsequent PR (just abstracting the CloudWatchMetrics Map to a common function didn't increase readability, for me at least, and I wonder if there should be higher levels of abstraction to ensure that dimensions + metricName have corresponding keys in the log object too)
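
One possible shape for the higher-level abstraction mentioned in the last bullet is sketched below. It takes the dimensions as a single Map and derives both the Dimensions list inside _aws and the matching top-level keys from it, so the two cannot drift apart. The EmbeddedMetric object, its entry function and the fixed "Milliseconds" unit are hypothetical names and choices for illustration only; nothing here is taken from the merged code.

```scala
object EmbeddedMetric {
  // Hypothetical helper: builds an embedded-metric log entry in which the
  // dimension names declared under "_aws" always have matching top-level keys.
  def entry(
      namespace: String,
      dimensions: Map[String, String],
      metricName: String,
      value: Long,
      timestampMs: Long
  ): Map[String, Any] = {
    // Guard against a metric name that would overwrite a dimension or the metadata
    require(metricName != "_aws" && !dimensions.contains(metricName),
      s"metric name '$metricName' clashes with another key")
    require(!dimensions.contains("_aws"), "'_aws' cannot be used as a dimension name")

    val aws = Map[String, Any](
      "Timestamp" -> timestampMs,
      "CloudWatchMetrics" -> List(Map[String, Any](
        "Namespace" -> namespace,
        // Dimension names come from the same Map that supplies the values
        "Dimensions" -> List(dimensions.keys.toList.sorted),
        // The unit is an assumption for this sketch
        "Metrics" -> List(Map("Name" -> metricName, "Unit" -> "Milliseconds"))
      ))
    )

    // Dimension values and the metric value sit at the top level of the entry
    dimensions ++ Map[String, Any]("_aws" -> aws, metricName -> value)
  }
}
```

A caller would pass something like Map("platform" -> "android", "type" -> "breakingNews") and merge the result into whatever other fields the lambda already logs.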

@DavidLawes DavidLawes marked this pull request as ready for review August 30, 2022 10:56
Contributor

@waisingyiu waisingyiu left a comment


Thank you for your great work! I have just two queries:

  1. Will the app notifications dashboard in ELK continue to work after this PR?

  2. I notice that the startTime comes from the sentTimestamp attribute of the event record rather than being set at the start of the function. Does AWS populate this attribute, or the service upstream? What time exactly is it?

Happy to approve. Thanks David.

@DavidLawes
Contributor Author

DavidLawes commented Aug 30, 2022

Thank you for your great work! I have just two queries:

  1. Will the app notifications dashboard in ELK continue to work after this PR?
  2. I notice that the startTime comes from the sentTimestamp attribute of the event record rather than being set at the start of the function. Does AWS populate this attribute, or the service upstream? What time exactly is it?

Happy to approve. Thanks David.

Thanks!

The apps notifications dashboard in ELK will continue to work as expected after this PR.

About the sentTimestamp: this is set by AWS when the message lands on the queue. The suggestion was to use this time to measure the total time taken to process a message (i.e. total time = time spent on the queue before processing + time spent processing the message in a lambda). I think this way we'll get a better feel for how the overall system is performing (ref: https://docs.google.com/document/d/10AnOZ4MLjuTO7mXaySoVO2SwmroLmns4gGmsfY7egmw/edit#heading=h.n166j1upsg46).
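
The calculation described above can be sketched as follows. This is a minimal example that assumes the queue is SQS and that the record's attributes are available as a Map[String, String] in which "SentTimestamp" holds epoch milliseconds as a string; the TotalProcessingTime object and its function names are hypothetical.

```scala
import java.time.{Duration, Instant}
import scala.util.Try

object TotalProcessingTime {
  // Reads the time the message landed on the queue.
  // Returns None if the attribute is missing or is not a number.
  def sentAt(attributes: Map[String, String]): Option[Instant] =
    attributes
      .get("SentTimestamp")
      .flatMap(raw => Try(raw.toLong).toOption)
      .map(millis => Instant.ofEpochMilli(millis))

  // Total time = time waiting on the queue + time spent in the lambda,
  // measured from the sent timestamp to the moment processing finished.
  def totalMillis(attributes: Map[String, String], end: Instant): Option[Long] =
    sentAt(attributes).map(start => Duration.between(start, end).toMillis)
}
```

Computing the value per record, rather than once per invocation, would also keep the figure meaningful if the batch size were ever raised above 1.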

Hope that makes sense!

@DavidLawes DavidLawes merged commit 6ca8a1b into main Aug 31, 2022
@DavidLawes DavidLawes deleted the dlawes/worker-lambda-metrics branch August 31, 2022 09:57