Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add "flow logger" + instrument upload flow with it #47

Merged
merged 4 commits into from
Aug 16, 2023

Conversation

matt-codecov
Copy link
Contributor

@matt-codecov matt-codecov commented Aug 2, 2023

tl;dr:

  • new logging util, sends metrics to sentry like "time from the start of Task A to the end of Task B"
  • two things to review:
    • CheckpointLogger and its tests
    • UploadFlow and the instrumentation of UploadTask, UploadFinisherTask, and NotifyTask (and tests)
  • i'm experimenting here, feedback or better ideas are appreciated if you have them

tested end-to-end, see this dashboard in sentry with widgets for each of the subflows created:
Screenshot 2023-08-11 at 1 44 03 PM

will see about creating a dashboard in the real Codecov org when this is deployed to staging

related issue: codecov/engineering-team#84

CheckpointLogger + UploadFlow

there are currently two things i want to measure that don't really fit in sentry transaction traces or statsd:

  • the time the entire set of UploadProcessorTasks takes to run, median + p75 + p95
  • the time between when we first receive an upload and when we send a notification about it, median + p75 + p95

you can figure these out for specific samples in sentry, but it didn't seem like you can aggregate/dashboard them effectively. and statsd timers are very code-local and can't capture these high-level event intervals. so i'm playing with a new thing

class MyFlow(Enum)
    A = auto()
    B = auto()

checkpoints = CheckpointLogger(MyFlow)
checkpoints.log(MyFlow.A)
checkpoints.log(MyFlow.B)
checkpoints.submit_subflow("a_to_b", MyFlow.A, MyFlow.B)

the above snippet will record timestamps for A and B and then submit the interval to sentry as a custom performance metric on the current transaction with the name "a_to_b". in worker, the current transaction is (i think) the task you're in when you submit the subflow. create a CheckpointLogger at the start of your flow and pass it as an argument to functions/tasks until you reach the end of your flow.

i've instrumented the upload flow and am hoping to make a dashboard in sentry with these metrics:

  • UploadTask: time_before_processing, initial_processing_duration
  • UploadFinisherTask: batch_processing_duration, total_processing_duration
  • NotifyTask: notification_latency

batch_processing_duration is the "all the UploadProcessorTasks" metric and notification_latency is the end-to-end metric.

in the future, i am interested in a version of this that provides reliability metrics for these flows too. basically i want to add funnel analysis: of the N users who enter this flow (upload task), what % end in a success state (notifications sent), and what's the breakdown of where they fall out of the funnel (failed to process uploads, too many retries, no valid credentials)? let me know if you have any ideas :)

OPEN Q: is there anything i should know about clock synchronization between hosts in our deployments? the higher percentiles of these metrics are up there in the minutes so clock drift of a few seconds is probably fine, but i'm kind of hand-waving here

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

Copy link
Contributor

@giovanni-guidini giovanni-guidini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, so I'll leave my ✅ , but interested on other people's opinions too.

I don't know of any ready made tools to do this, but I think it's clean implementation.

tasks/notify.py Outdated
@@ -50,6 +50,7 @@ async def run_async(
commitid: str,
current_yaml=None,
empty_upload=None,
checkpoints=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be an instance of CheckpointLogger? I believe all these task arguments need to be serializable in some way (not sure if JSON is used or something else) - so just wanted to make sure this was tested in some sort of E2E fashion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah great callout, i think the tests don't actually exercise passing this between tasks

i was at least planning to create a sentry project and try to get a dashboard populated end-to-end before merging, but i'll also see if i can write a proper integration test that passes this between tasks in CI

@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Merging #47 (aacbd27) into main (76c780f) will decrease coverage by 0.01%.
The diff coverage is 98.22%.

Changes have been made to critical files, which contain lines commonly executed in production. Learn more

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
- Coverage   98.59%   98.59%   -0.01%     
==========================================
  Files         357      359       +2     
  Lines       26004    26275     +271     
==========================================
+ Hits        25639    25905     +266     
- Misses        365      370       +5     
Flag Coverage Δ
integration 98.56% <98.22%> (-0.01%) ⬇️
latest-uploader-overall 98.56% <98.22%> (-0.01%) ⬇️
onlysomelabels 98.58% <98.22%> (-0.01%) ⬇️
unit 98.56% <98.22%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 97.29% <94.73%> (-0.03%) ⬇️
OutsideTasks 98.34% <99.48%> (+0.01%) ⬆️
Files Changed Coverage Δ
tasks/upload_finisher.py 97.11% <80.00%> (-1.83%) ⬇️
tasks/upload.py Critical 98.21% <90.90%> (-0.83%) ⬇️
helpers/checkpoint_logger.py 98.33% <98.33%> (ø)
conftest.py 94.44% <100.00%> (+0.47%) ⬆️
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/notify.py Critical 98.78% <100.00%> (+0.02%) ⬆️
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_task.py 99.44% <100.00%> (+0.01%) ⬆️
Related Entrypoints
run/app.tasks.upload.Upload

@matt-codecov matt-codecov dismissed giovanni-guidini’s stale review August 11, 2023 20:45

made significant changes since review

@matt-codecov
Copy link
Contributor Author

@scott-codecov was right, the CheckpointLogger object was not serializable. reworked it so you pass the serializable payload between tasks and recreate the object on either side

also got a dashboard created in a spare sentry org, proving the basic idea?

Copy link
Contributor

@scott-codecov scott-codecov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really great @matt-codecov - excited to see it in action!

@matt-codecov matt-codecov merged commit 5df32f9 into main Aug 16, 2023
9 of 12 checks passed
@matt-codecov matt-codecov deleted the matt/add-upload-flow-logs branch August 16, 2023 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants