Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics #152

Merged
merged 3 commits into from
Oct 20, 2023

Conversation

matt-codecov
Copy link
Contributor

undo of #151 / redo of #100 with bug fixed

top commit is just the bugfix, bottom commit is #100 unmodified

the bug was: we serialize dicts of checkpoints data between worker tasks, and that requires BaseFlow to inherit from str. #100 forgot and removed that inheritance. this PR adds that inheritance back, adds a regression test, and makes the serialized representation of a checkpoint include the enum name so we can actually validate them as intended

original description from #100

this PR adds some features to CheckpointLogger:

  • @subflows() decorator
    • define subflow metrics alongside the checkpoints themselves instead of peppered around the code
    • auto-submit a pre-defined subflow when you log its end checkpoint
  • @success_events() and @failure_events() decorators
    • designate "completed successfully" and "completed with error" terminal checkpoints for your flow
    • automatically create a subflow from the first checkpoint in a flow to each of these terminal checkpoints (if one wasn't manually created)
  • @reliability_counters decorator
    • increment a counter for each checkpoint logged ({flow}.events.{checkpoint})
    • increment special counters for reliability metrics
      • {flow}.total.begun when the first checkpoint is logged
      • {flow}.total.failed when any @failure_events() checkpoint is logged
      • {flow}.total.succeeded when any @success_events() checkpoint is logged
      • {flow}.total.ended when any terminal checkpoint (success of failure) is logged

some reliability metrics this enables:

  • {flow}.total.ended / {flow}.total.begun: should be 1, but if it isn't that suggests there are exit conditions that you haven't instrumented
  • {flow}.total.succeeded / {flow}.total.begun: success rate. denominator maybe could be ended instead
  • {flow}.total.failed / {flow}.total.begun: failure rate. denominator maybe could be ended instead
  • {flow}.events.SPECIFIC_ERROR / {flow}.total.failed: % of failures caused by SPECIFIC_ERROR

this PR also instruments more of the exit conditions of the upload flow. there are more non-error exit conditions (no pending jobs, not sending notifs, stale PR head) as well as error exit conditions (no valid bot, git service errors, too many retries...)

sentry is still the backend for subflow durations and statsd is still the backend for the reliability counters, but i'm looking to change both in future iterations.

before

class Foo(Enum):
    BEGIN = auto()
    CHECKPOINT = auto
    SUCCESS = auto()

checkpoints = checkpoints_from_kwargs(Foo, kwargs).log(Foo.BEGIN)
...
# explicitly adds `first_checkpoint` metric to sentry transaction
# that's all
checkpoints.log(Foo.CHECKPOINT).submit_subflow("first_checkpoints", Foo.BEGIN, Foo.CHECKPOINT)
...

the checkpoints are defined in one place, but the relationships between them (the subflows) are peppered throughout the code.

now

@success_events('SUCCESS')
@failure_events('FAILURE')
@subflows(
    ('time_to_checkpoint', 'BEGIN', 'SUCCESS')
    ('time_to_success', 'BEGIN', 'SUCCESS'),
    # implicitly creates a subflow for all terminal metrics
    # ('Foo_BEGIN_to_FAILURE', 'BEGIN', 'FAILURE')
)
@reliability_counters
class Foo(Enum):
    BEGIN = auto()
    CHECKPOINT = auto()
    FAILURE = auto()
    SUCCESS = auto()

# implicitly increments `Foo.events.BEGIN`
# implicitly increments `Foo.total.begun`
checkpoints = checkpoints_from_kwargs(Foo, kwargs).log(Foo.BEGIN)
...
# implicitly increments `Foo.events.CHECKPOINT`
# implicitly adds the pre-defined `time_to_checkpoint` metric to sentry transaction
checkpoints.log(Foo.CHECKPOINT)
...
# implicitly increments `Foo.events.FAILURE`
# implicitly increments `Foo.total.failed`
# implicitly increments `Foo.total.ended`
# implicitly adds the implicitly-defined `Foo_BEGIN_to_FAILURE` metric to sentry transaction
checkpoints.log(Foo.FAILURE)

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

@matt-codecov matt-codecov changed the title Matt/checkpoint logger v2.1 (redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics Oct 18, 2023
@codecov-staging
Copy link

codecov-staging bot commented Oct 18, 2023

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will increase coverage by 0.00%.
The diff coverage is 98.96%.

Impacted file tree graph

@@           Coverage Diff            @@
##             main     #152    +/-   ##
========================================
  Coverage   98.43%   98.43%            
========================================
  Files         346      347     +1     
  Lines       27069    27281   +212     
========================================
+ Hits        26645    26855   +210     
- Misses        424      426     +2     
Flag Coverage Δ
integration 98.43% <98.96%> (+<0.01%) ⬆️
latest-uploader-overall 98.43% <98.96%> (+<0.01%) ⬆️
unit 98.43% <98.96%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 96.99% <98.38%> (+0.01%) ⬆️
OutsideTasks 98.23% <98.86%> (+<0.01%) ⬆️
Files Coverage Δ
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/notify.py 98.83% <100.00%> (+0.04%) ⬆️
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_task.py 99.44% <100.00%> (+<0.01%) ⬆️
tasks/upload.py 98.26% <100.00%> (+<0.01%) ⬆️
tasks/upload_finisher.py 95.45% <100.00%> (+0.12%) ⬆️
helpers/checkpoint_logger/__init__.py 97.88% <97.88%> (ø)

@codecov
Copy link

codecov bot commented Oct 18, 2023

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will decrease coverage by 0.06%.
The diff coverage is 99.73%.

Changes have been made to critical files, which contain lines commonly executed in production. Learn more

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
- Coverage   98.46%   98.40%   -0.06%     
==========================================
  Files         373      373              
  Lines       27665    27860     +195     
==========================================
+ Hits        27239    27417     +178     
- Misses        426      443      +17     
Flag Coverage Δ
integration 98.43% <98.96%> (+<0.01%) ⬆️
latest-uploader-overall 98.43% <98.96%> (+<0.01%) ⬆️
unit 98.43% <98.96%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 96.92% <99.63%> (-0.13%) ⬇️
OutsideTasks 98.23% <98.86%> (+<0.01%) ⬆️
Files Coverage Δ
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/notify.py Critical 98.83% <100.00%> (+0.04%) ⬆️
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_task.py 99.45% <100.00%> (+<0.01%) ⬆️
tasks/upload.py Critical 98.27% <100.00%> (+<0.01%) ⬆️
tasks/upload_finisher.py 95.49% <100.00%> (+0.12%) ⬆️
helpers/checkpoint_logger/__init__.py 99.55% <99.55%> (+1.71%) ⬆️
Related Entrypoints
run/app.tasks.upload.Upload
run/app.tasks.upload.UploadFinisher
run/app.tasks.notify.Notify

@codecov-qa
Copy link

codecov-qa bot commented Oct 18, 2023

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will increase coverage by 0.00%.
The diff coverage is 98.96%.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##             main     #152    +/-   ##
========================================
  Coverage   98.43%   98.43%            
========================================
  Files         346      347     +1     
  Lines       27069    27281   +212     
========================================
+ Hits        26645    26855   +210     
- Misses        424      426     +2     
Flag Coverage Δ
integration 98.43% <98.96%> (+<0.01%) ⬆️
latest-uploader-overall 98.43% <98.96%> (+<0.01%) ⬆️
unit 98.43% <98.96%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 96.99% <98.38%> (+0.01%) ⬆️
OutsideTasks 98.23% <98.86%> (+<0.01%) ⬆️
Files Coverage Δ
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/notify.py 98.83% <100.00%> (+0.04%) ⬆️
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_task.py 99.44% <100.00%> (+<0.01%) ⬆️
tasks/upload.py 98.26% <100.00%> (+<0.01%) ⬆️
tasks/upload_finisher.py 95.45% <100.00%> (+0.12%) ⬆️
helpers/checkpoint_logger/__init__.py 97.88% <97.88%> (ø)

@codecov-public-qa
Copy link

codecov-public-qa bot commented Oct 18, 2023

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will increase coverage by 0.00%.
The diff coverage is 98.96%.

Impacted file tree graph

@@           Coverage Diff            @@
##             main     #152    +/-   ##
========================================
  Coverage   98.43%   98.43%            
========================================
  Files         346      347     +1     
  Lines       27069    27281   +212     
========================================
+ Hits        26645    26855   +210     
- Misses        424      426     +2     
Flag Coverage Δ
integration 98.43% <98.96%> (+<0.01%) ⬆️
latest-uploader-overall 98.43% <98.96%> (+<0.01%) ⬆️
unit 98.43% <98.96%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 96.99% <98.38%> (+0.01%) ⬆️
OutsideTasks 98.23% <98.86%> (+<0.01%) ⬆️
Files Coverage Δ
helpers/checkpoint_logger/flows.py 100.00% <100.00%> (ø)
helpers/tests/unit/test_checkpoint_logger.py 100.00% <100.00%> (ø)
tasks/notify.py 98.83% <100.00%> (+0.04%) ⬆️
tasks/tests/unit/test_notify_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_finisher_task.py 100.00% <100.00%> (ø)
tasks/tests/unit/test_upload_task.py 99.44% <100.00%> (+<0.01%) ⬆️
tasks/upload.py 98.26% <100.00%> (+<0.01%) ⬆️
tasks/upload_finisher.py 95.45% <100.00%> (+0.12%) ⬆️
helpers/checkpoint_logger/__init__.py 97.88% <97.88%> (ø)

into the enum's value (e.g. "MEMBER_NAME" -> "MyEnum.MEMBER_NAME")
"""
value = f"{cls.__name__}.{value}"
return super().__new__(cls, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this where the de-serialization magic is happening such that we can have a string like "MyEnum.A" and it get's deserialized back into the actual enum member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think technically it doesn't deserialize into the enum member, it deserializes into a str which can be downcast to the enum member which inherits a comparison function from str, but yes

each enum value is an instance of the enum class that gets created at startup, and it appears _generate_next_value_() determines what gets passed into __new__() when using auto() to declare the enum member. so we get the member name ("A") from _generate_next_value_(), we prepend the class name ("MyEnum") in __new__(), and then we pass that value to the superclass constructors (i think this calls __new__() from both str and Enum)

defining __str__() doesn't work, the json encoder is looking specifically for a str or subclass thereof, it doesn't try to cast

@matt-codecov matt-codecov merged commit 7e263ca into main Oct 20, 2023
27 of 30 checks passed
@matt-codecov matt-codecov deleted the matt/checkpoint-logger-v2.1 branch October 20, 2023 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants