feat: add "flow logger" + instrument upload flow with it #47

matt-codecov · 2023-08-02T02:51:45Z

tl;dr:

new logging util, sends metrics to sentry like "time from the start of Task A to the end of Task B"
two things to review:
- CheckpointLogger and its tests
- UploadFlow and the instrumentation of UploadTask, UploadFinisherTask, and NotifyTask (and tests)
i'm experimenting here, feedback or better ideas are appreciated if you have them

tested end-to-end, see this dashboard in sentry with widgets for each of the subflows created:

will see about creating a dashboard in the real Codecov org when this is deployed to staging

related issue: codecov/engineering-team#84

`CheckpointLogger` + `UploadFlow`

there are currently two things i want to measure that don't really fit in sentry transaction traces or statsd:

the time the entire set of UploadProcessorTasks takes to run, median + p75 + p95
the time between when we first receive an upload and when we send a notification about it, median + p75 + p95

you can figure these out for specific samples in sentry, but it didn't seem like you can aggregate/dashboard them effectively. and statsd timers are very code-local and can't capture these high-level event intervals. so i'm playing with a new thing

class MyFlow(Enum)
    A = auto()
    B = auto()

checkpoints = CheckpointLogger(MyFlow)
checkpoints.log(MyFlow.A)
checkpoints.log(MyFlow.B)
checkpoints.submit_subflow("a_to_b", MyFlow.A, MyFlow.B)

the above snippet will record timestamps for A and B and then submit the interval to sentry as a custom performance metric on the current transaction with the name "a_to_b". in worker, the current transaction is (i think) the task you're in when you submit the subflow. create a CheckpointLogger at the start of your flow and pass it as an argument to functions/tasks until you reach the end of your flow.

i've instrumented the upload flow and am hoping to make a dashboard in sentry with these metrics:

UploadTask: time_before_processing, initial_processing_duration
UploadFinisherTask: batch_processing_duration, total_processing_duration
NotifyTask: notification_latency

batch_processing_duration is the "all the UploadProcessorTasks" metric and notification_latency is the end-to-end metric.

in the future, i am interested in a version of this that provides reliability metrics for these flows too. basically i want to add funnel analysis: of the N users who enter this flow (upload task), what % end in a success state (notifications sent), and what's the breakdown of where they fall out of the funnel (failed to process uploads, too many retries, no valid credentials)? let me know if you have any ideas :)

OPEN Q: is there anything i should know about clock synchronization between hosts in our deployments? the higher percentiles of these metrics are up there in the minutes so clock drift of a few seconds is probably fine, but i'm kind of hand-waving here

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

giovanni-guidini

I like this, so I'll leave my ✅ , but interested on other people's opinions too.

I don't know of any ready made tools to do this, but I think it's clean implementation.

scott-codecov · 2023-08-02T13:55:55Z

tasks/notify.py

@@ -50,6 +50,7 @@ async def run_async(
        commitid: str,
        current_yaml=None,
        empty_upload=None,
+        checkpoints=None,


Would this be an instance of CheckpointLogger? I believe all these task arguments need to be serializable in some way (not sure if JSON is used or something else) - so just wanted to make sure this was tested in some sort of E2E fashion.

ah great callout, i think the tests don't actually exercise passing this between tasks

i was at least planning to create a sentry project and try to get a dashboard populated end-to-end before merging, but i'll also see if i can write a proper integration test that passes this between tasks in CI

codecov · 2023-08-11T16:07:56Z

Codecov Report

Merging #47 (aacbd27) into main (76c780f) will decrease coverage by 0.01%.
The diff coverage is 98.22%.

Changes have been made to critical files, which contain lines commonly executed in production. Learn more

@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
- Coverage   98.59%   98.59%   -0.01%     
==========================================
  Files         357      359       +2     
  Lines       26004    26275     +271     
==========================================
+ Hits        25639    25905     +266     
- Misses        365      370       +5

Flag	Coverage Δ
integration	`98.56% <98.22%> (-0.01%)`	⬇️
latest-uploader-overall	`98.56% <98.22%> (-0.01%)`	⬇️
onlysomelabels	`98.58% <98.22%> (-0.01%)`	⬇️
unit	`98.56% <98.22%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
NonTestCode	`97.29% <94.73%> (-0.03%)`	⬇️
OutsideTasks	`98.34% <99.48%> (+0.01%)`	⬆️

Files Changed	Coverage Δ
tasks/upload_finisher.py	`97.11% <80.00%> (-1.83%)`	⬇️
tasks/upload.py Critical	`98.21% <90.90%> (-0.83%)`	⬇️
helpers/checkpoint_logger.py	`98.33% <98.33%> (ø)`
conftest.py	`94.44% <100.00%> (+0.47%)`	⬆️
helpers/tests/unit/test_checkpoint_logger.py	`100.00% <100.00%> (ø)`
tasks/notify.py Critical	`98.78% <100.00%> (+0.02%)`	⬆️
tasks/tests/unit/test_notify_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_finisher_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_task.py	`99.44% <100.00%> (+0.01%)`	⬆️

Related Entrypoints
run/app.tasks.upload.Upload

made significant changes since review

matt-codecov · 2023-08-11T20:47:12Z

@scott-codecov was right, the CheckpointLogger object was not serializable. reworked it so you pass the serializable payload between tasks and recreate the object on either side

also got a dashboard created in a spare sentry org, proving the basic idea?

scott-codecov

This looks really great @matt-codecov - excited to see it in action!

feat: add "flow logger" + instrument upload flow with it

716078f

matt-codecov requested review from trent-codecov, dana-yaish, giovanni-guidini, scott-codecov, joseph-sentry and rfox-codecov August 2, 2023 02:51

giovanni-guidini previously approved these changes Aug 2, 2023

View reviewed changes

scott-codecov reviewed Aug 2, 2023

View reviewed changes

Merge branch 'main' into matt/add-upload-flow-logs

581381b

rework so checkpoint data is serialized between tasks

73f311b

scott-codecov approved these changes Aug 16, 2023

View reviewed changes

Merge branch 'main' into matt/add-upload-flow-logs

aacbd27

matt-codecov merged commit 5df32f9 into main Aug 16, 2023
9 of 12 checks passed

matt-codecov deleted the matt/add-upload-flow-logs branch August 16, 2023 17:16

matt-codecov mentioned this pull request Oct 25, 2023

Implement uploads success metric codecov/engineering-team#167

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add "flow logger" + instrument upload flow with it #47

feat: add "flow logger" + instrument upload flow with it #47

matt-codecov commented Aug 2, 2023 •

edited

Loading

giovanni-guidini left a comment

scott-codecov Aug 2, 2023

matt-codecov Aug 3, 2023

codecov bot commented Aug 11, 2023 •

edited

Loading

matt-codecov commented Aug 11, 2023

scott-codecov left a comment

feat: add "flow logger" + instrument upload flow with it #47

feat: add "flow logger" + instrument upload flow with it #47

Conversation

matt-codecov commented Aug 2, 2023 • edited Loading

tl;dr:

CheckpointLogger + UploadFlow

Legal Boilerplate

giovanni-guidini left a comment

Choose a reason for hiding this comment

scott-codecov Aug 2, 2023

Choose a reason for hiding this comment

matt-codecov Aug 3, 2023

Choose a reason for hiding this comment

codecov bot commented Aug 11, 2023 • edited Loading

Codecov Report

matt-codecov commented Aug 11, 2023

scott-codecov left a comment

Choose a reason for hiding this comment

matt-codecov commented Aug 2, 2023 •

edited

Loading

`CheckpointLogger` + `UploadFlow`

codecov bot commented Aug 11, 2023 •

edited

Loading