feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics #100

Merged: 13 commits merged into main on Oct 4, 2023


matt-codecov
Contributor

this PR adds some features to CheckpointLogger:

  • @subflows() decorator
    • define subflow metrics alongside the checkpoints themselves instead of peppered around the code
    • auto-submit a pre-defined subflow when you log its end checkpoint
  • @success_events() and @failure_events() decorators
    • designate "completed successfully" and "completed with error" terminal checkpoints for your flow
    • automatically create a subflow from the first checkpoint in a flow to each of these terminal checkpoints (if one wasn't manually created)
  • @reliability_counters decorator
    • increment a counter for each checkpoint logged ({flow}.events.{checkpoint})
    • increment special counters for reliability metrics
      • {flow}.total.begun when the first checkpoint is logged
      • {flow}.total.failed when any @failure_events() checkpoint is logged
      • {flow}.total.succeeded when any @success_events() checkpoint is logged
      • {flow}.total.ended when any terminal checkpoint (success or failure) is logged
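
a minimal sketch of the counter naming rules listed above, to make them concrete. `counters_for`, `SUCCESS_EVENTS` and `FAILURE_EVENTS` are hypothetical names for illustration, not the PR's API, and the sketch assumes "first checkpoint" is something the caller already knows (passed in as `is_first`):

```python
from enum import Enum, auto


class Foo(Enum):
    BEGIN = auto()
    CHECKPOINT = auto()
    FAILURE = auto()
    SUCCESS = auto()


# Hypothetical stand-ins for what @success_events / @failure_events record.
SUCCESS_EVENTS = {Foo.SUCCESS}
FAILURE_EVENTS = {Foo.FAILURE}


def counters_for(checkpoint, is_first):
    """Return the counter names to increment when `checkpoint` is logged."""
    flow = type(checkpoint).__name__
    # Every logged checkpoint gets its own event counter.
    names = [f"{flow}.events.{checkpoint.name}"]
    if is_first:
        names.append(f"{flow}.total.begun")
    if checkpoint in FAILURE_EVENTS:
        names.append(f"{flow}.total.failed")
    if checkpoint in SUCCESS_EVENTS:
        names.append(f"{flow}.total.succeeded")
    # Any terminal checkpoint, success or failure, also counts as "ended".
    if checkpoint in FAILURE_EVENTS | SUCCESS_EVENTS:
        names.append(f"{flow}.total.ended")
    return names
```

in this sketch, logging `Foo.FAILURE` yields the event counter plus `Foo.total.failed` and `Foo.total.ended`, which matches the bullets above.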

some reliability metrics this enables:

  • {flow}.total.ended / {flow}.total.begun: should be 1, but if it isn't that suggests there are exit conditions that you haven't instrumented
  • {flow}.total.succeeded / {flow}.total.begun: success rate. denominator maybe could be ended instead
  • {flow}.total.failed / {flow}.total.begun: failure rate. denominator maybe could be ended instead
  • {flow}.events.SPECIFIC_ERROR / {flow}.total.failed: % of failures caused by SPECIFIC_ERROR
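
a rough sketch of how these ratios could be computed once the counters have been read back from wherever they are stored. `reliability_ratios` and the result keys are hypothetical, and the sample counts are made up purely for illustration:

```python
def reliability_ratios(counters, flow, error=None):
    """Compute the ratios listed above from a {counter name: count} mapping."""

    def ratio(numerator, denominator):
        # Return None rather than dividing by zero when nothing was counted.
        return numerator / denominator if denominator else None

    begun = counters.get(f"{flow}.total.begun", 0)
    ended = counters.get(f"{flow}.total.ended", 0)
    succeeded = counters.get(f"{flow}.total.succeeded", 0)
    failed = counters.get(f"{flow}.total.failed", 0)

    ratios = {
        # Below 1 suggests exit conditions that are not instrumented.
        "ended_over_begun": ratio(ended, begun),
        "success_rate": ratio(succeeded, begun),
        "failure_rate": ratio(failed, begun),
    }
    if error is not None:
        # Share of failures attributed to one specific error checkpoint.
        specific = counters.get(f"{flow}.events.{error}", 0)
        ratios[f"{error}_share_of_failures"] = ratio(specific, failed)
    return ratios


# Made-up counts, for illustration only.
sample_counters = {
    "Foo.total.begun": 100,
    "Foo.total.ended": 95,
    "Foo.total.succeeded": 80,
    "Foo.total.failed": 15,
    "Foo.events.SPECIFIC_ERROR": 5,
}
sample_ratios = reliability_ratios(sample_counters, "Foo", error="SPECIFIC_ERROR")
```

with these made-up counts, `ended_over_begun` comes out below 1, which is the signal that some exit path is not logging a terminal checkpoint.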

this PR also instruments more of the exit conditions of the upload flow. there are more non-error exit conditions (no pending jobs, not sending notifs, stale PR head) as well as error exit conditions (no valid bot, git service errors, too many retries...)

sentry is still the backend for subflow durations and statsd is still the backend for the reliability counters, but i'm looking to change both in future iterations.

before

class Foo(Enum):
    BEGIN = auto()
    CHECKPOINT = auto()
    SUCCESS = auto()

checkpoints = checkpoints_from_kwargs(Foo, kwargs).log(Foo.BEGIN)
...
# explicitly adds `first_checkpoints` metric to sentry transaction
# that's all
checkpoints.log(Foo.CHECKPOINT).submit_subflow("first_checkpoints", Foo.BEGIN, Foo.CHECKPOINT)
...

the checkpoints are defined in one place, but the relationships between them (the subflows) are peppered throughout the code.

now

@success_events('SUCCESS')
@failure_events('FAILURE')
@subflows(
    ('time_to_checkpoint', 'BEGIN', 'CHECKPOINT'),
    ('time_to_success', 'BEGIN', 'SUCCESS'),
    # implicitly creates a subflow for all terminal metrics
    # ('Foo_BEGIN_to_FAILURE', 'BEGIN', 'FAILURE')
)
@reliability_counters
class Foo(Enum):
    BEGIN = auto()
    CHECKPOINT = auto()
    FAILURE = auto()
    SUCCESS = auto()

# implicitly increments `Foo.events.BEGIN`
# implicitly increments `Foo.total.begun`
checkpoints = checkpoints_from_kwargs(Foo, kwargs).log(Foo.BEGIN)
...
# implicitly increments `Foo.events.CHECKPOINT`
# implicitly adds the pre-defined `time_to_checkpoint` metric to sentry transaction
checkpoints.log(Foo.CHECKPOINT)
...
# implicitly increments `Foo.events.FAILURE`
# implicitly increments `Foo.total.failed`
# implicitly increments `Foo.total.ended`
# implicitly adds the implicitly-defined `Foo_BEGIN_to_FAILURE` metric to sentry transaction
checkpoints.log(Foo.FAILURE)

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

@codecov

codecov bot commented Sep 15, 2023

Codecov Report

Merging #100 (1dcc4e3) into main (518f466) will increase coverage by 0.00%.
The diff coverage is 98.93%.


@@           Coverage Diff            @@
##             main     #100    +/-   ##
========================================
  Coverage   98.46%   98.46%            
========================================
  Files         369      371     +2     
  Lines       27233    27495   +262     
========================================
+ Hits        26814    27073   +259     
- Misses        419      422     +3     
Flag Coverage Δ
integration 98.43% <98.93%> (+<0.01%) ⬆️
latest-uploader-overall 98.43% <98.93%> (+<0.01%) ⬆️
unit 98.43% <98.93%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 97.07% <98.34%> (+0.01%) ⬆️
OutsideTasks 98.22% <98.81%> (+<0.01%) ⬆️
Files Coverage Δ
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/notify.py 98.83% <100.00%> (+0.04%) ⬆️
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_task.py 99.44% <100.00%> (+<0.01%) ⬆️
tasks/upload.py 98.26% <100.00%> (+<0.01%) ⬆️
tasks/upload_finisher.py 97.22% <100.00%> (+0.07%) ⬆️
helpers/checkpoint_logger/__init__.py 97.81% <97.81%> (ø)
Related Entrypoints
run/app.tasks.notify.Notify
run/app.tasks.upload.Upload
run/app.tasks.upload.UploadFinisher

Contributor

@giovanni-guidini giovanni-guidini left a comment


I like the idea of defining the success/failure events and the subflows in the metrics definition directly. Quite the complex piece of code. Exquisite job making it so flexible and easy to use 👏👏👏 .

Looks good in general, just some points I'd like to get your opinion on first

Some (small) comments I would add here would be:

  • I think UploadFlow should be defined in a separate file. Maybe helpers/flows/upload.py or something.
  • It looks like @subflows should be defined after @failure_events and @success_events. If I'm wrong just skip the rest. The docs and examples hint as much, but I wonder if the success/failure decorators should check whether subflows have already been defined, and at least emit a warning that subflows might be missing
  • Do checkpoints have a type that we can maybe type hint?... Looks like they're just strings tho :E probably just my own biases towards types talking here, but maybe we can define an explicit Flow type (in lieu of (str, Enum))

@matt-codecov
Contributor Author

I think UploadFlow should be defined in a separate file

you're right, i was lazy lol. your way also makes it clearer people can write their own flows

It looks like that @subflows should be defined after @failure_events and @success_events

great catch! i should at least make this explicit in the doc comments so i don't ruin anyone's day if they miss it. thanks!

Do checkpoints have a type that we can maybe type hint?

this question made me finally start reading about python typing lol. thinking out loud:

enum values are instances of the enum class. apparently python has generics now (???) so i can maybe do something like:

from typing import Generic, Self, TypeVar
T = TypeVar("T")

class CheckpointLogger(Generic[T]):
    def __init__(self, data=None, strict=False):
        # a `{}` default would be shared across every instance
        self.data = data if data is not None else {}
        self.strict = strict

    def log(self, checkpoint: T, ignore_repeats=False, kwargs=None) -> Self:
        ...

checkpoints = CheckpointLogger[UploadFlow]()
checkpoints.log(UploadFlow.UPLOAD_TASK_BEGIN)

i can play around with it. idk if there's any weirdness using it from an otherwise untyped file
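
a self-contained variant of the idea above, as a sketch only. `TypedCheckpointLogger` is a hypothetical, stripped-down class (not the PR's `CheckpointLogger`), `PROCESSING_BEGIN` is an invented checkpoint, and the return type is spelled as a string instead of `Self` so it does not depend on Python 3.11:

```python
import time
from enum import Enum, auto
from typing import Dict, Generic, Optional, TypeVar

# Bound to Enum so a type checker rejects non-enum checkpoints.
T = TypeVar("T", bound=Enum)


class TypedCheckpointLogger(Generic[T]):
    """Hypothetical, simplified logger used only to illustrate the typing."""

    def __init__(self, data: Optional[Dict[T, float]] = None) -> None:
        self.data: Dict[T, float] = {} if data is None else data

    def log(self, checkpoint: T) -> "TypedCheckpointLogger[T]":
        # Record a monotonic timestamp and return self so calls can chain.
        self.data[checkpoint] = time.monotonic()
        return self


class UploadFlow(Enum):
    UPLOAD_TASK_BEGIN = auto()
    PROCESSING_BEGIN = auto()  # invented for this example


checkpoints = TypedCheckpointLogger[UploadFlow]()
checkpoints.log(UploadFlow.UPLOAD_TASK_BEGIN).log(UploadFlow.PROCESSING_BEGIN)
```

at runtime the type parameter is not enforced; only a static checker such as mypy would flag `checkpoints.log("not a checkpoint")`.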

@matt-codecov
Contributor Author

so static typing is awkward when i insert new functions at runtime :P i'll keep playing with it but i might file an issue to revisit it later if i can't figure it out

@codecov-staging

codecov-staging bot commented Oct 3, 2023

Codecov Report

Merging #100 (1dcc4e3) into main (518f466) will increase coverage by 0.04%.
The diff coverage is 98.57%.


@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   93.25%   93.30%   +0.04%     
==========================================
  Files         346      347       +1     
  Lines       26889    27091     +202     
==========================================
+ Hits        25075    25276     +201     
- Misses       1814     1815       +1     
Flag Coverage Δ
integration 93.30% <98.57%> (+0.04%) ⬆️
latest-uploader-overall 93.30% <98.57%> (+0.04%) ⬆️
unit 93.30% <98.57%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 94.01% <97.79%> (+0.05%) ⬆️
OutsideTasks 96.74% <98.81%> (+0.02%) ⬆️
Files Coverage Δ
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 88.50% <100.00%> (+0.05%) ⬆️
tasks/tests/unit/test_upload_task.py 57.38% <100.00%> (+0.07%) ⬆️
tasks/upload.py 80.00% <100.00%> (+0.96%) ⬆️
tasks/upload_finisher.py 97.19% <100.00%> (+0.08%) ⬆️
tasks/notify.py 93.02% <88.88%> (-0.32%) ⬇️
helpers/checkpoint_logger/__init__.py 97.81% <97.81%> (ø)

@codecov-qa

codecov-qa bot commented Oct 3, 2023

Codecov Report

Merging #100 (1dcc4e3) into main (518f466) will increase coverage by 0.04%.
The diff coverage is 98.57%.


@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   93.25%   93.30%   +0.04%     
==========================================
  Files         346      347       +1     
  Lines       26889    27091     +202     
==========================================
+ Hits        25075    25276     +201     
- Misses       1814     1815       +1     
Flag Coverage Δ
integration 93.30% <98.57%> (+0.04%) ⬆️
latest-uploader-overall ?
unit 93.30% <98.57%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 94.01% <97.79%> (+0.05%) ⬆️
OutsideTasks 96.74% <98.81%> (+0.02%) ⬆️
Files Coverage Δ
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 88.50% <100.00%> (+0.05%) ⬆️
tasks/tests/unit/test_upload_task.py 57.38% <100.00%> (+0.07%) ⬆️
tasks/upload.py 80.00% <100.00%> (+0.96%) ⬆️
tasks/upload_finisher.py 97.19% <100.00%> (+0.08%) ⬆️
tasks/notify.py 93.02% <88.88%> (-0.32%) ⬇️
helpers/checkpoint_logger/__init__.py 97.81% <97.81%> (ø)

@matt-codecov matt-codecov merged commit 06ee6c8 into main Oct 4, 2023
26 of 28 checks passed
@matt-codecov matt-codecov deleted the matt/checkpoint-logger-v2 branch October 4, 2023 18:32