(redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics #152

matt-codecov · 2023-10-18T23:37:31Z

undo of #151 / redo of #100 with bug fixed

top commit is just the bugfix, bottom commit is #100 unmodified

the bug was: we serialize dicts of checkpoints data between worker tasks, and that requires BaseFlow to inherit from str. #100 forgot and removed that inheritance. this PR adds that inheritance back, adds a regression test, and makes the serialized representation of a checkpoint include the enum name so we can actually validate them as intended

original description from #100

this PR adds some features to CheckpointLogger:

@subflows() decorator
- define subflow metrics alongside the checkpoints themselves instead of peppered around the code
- auto-submit a pre-defined subflow when you log its end checkpoint
@success_events() and @failure_events() decorators
- designate "completed successfully" and "completed with error" terminal checkpoints for your flow
- automatically create a subflow from the first checkpoint in a flow to each of these terminal checkpoints (if one wasn't manually created)
@reliability_counters decorator
- increment a counter for each checkpoint logged ({flow}.events.{checkpoint})
- increment special counters for reliability metrics
  - {flow}.total.begun when the first checkpoint is logged
  - {flow}.total.failed when any @failure_events() checkpoint is logged
  - {flow}.total.succeeded when any @success_events() checkpoint is logged
  - {flow}.total.ended when any terminal checkpoint (success of failure) is logged

some reliability metrics this enables:

{flow}.total.ended / {flow}.total.begun: should be 1, but if it isn't that suggests there are exit conditions that you haven't instrumented
{flow}.total.succeeded / {flow}.total.begun: success rate. denominator maybe could be ended instead
{flow}.total.failed / {flow}.total.begun: failure rate. denominator maybe could be ended instead
{flow}.events.SPECIFIC_ERROR / {flow}.total.failed: % of failures caused by SPECIFIC_ERROR

this PR also instruments more of the exit conditions of the upload flow. there are more non-error exit conditions (no pending jobs, not sending notifs, stale PR head) as well as error exit conditions (no valid bot, git service errors, too many retries...)

sentry is still the backend for subflow durations and statsd is still the backend for the reliability counters, but i'm looking to change both in future iterations.

before

class Foo(Enum):
    BEGIN = auto()
    CHECKPOINT = auto
    SUCCESS = auto()

checkpoints = checkpoints_from_kwargs(Foo, kwargs).log(Foo.BEGIN)
...
# explicitly adds `first_checkpoint` metric to sentry transaction
# that's all
checkpoints.log(Foo.CHECKPOINT).submit_subflow("first_checkpoints", Foo.BEGIN, Foo.CHECKPOINT)
...

the checkpoints are defined in one place, but the relationships between them (the subflows) are peppered throughout the code.

now

@success_events('SUCCESS')
@failure_events('FAILURE')
@subflows(
    ('time_to_checkpoint', 'BEGIN', 'SUCCESS')
    ('time_to_success', 'BEGIN', 'SUCCESS'),
    # implicitly creates a subflow for all terminal metrics
    # ('Foo_BEGIN_to_FAILURE', 'BEGIN', 'FAILURE')
)
@reliability_counters
class Foo(Enum):
    BEGIN = auto()
    CHECKPOINT = auto()
    FAILURE = auto()
    SUCCESS = auto()

# implicitly increments `Foo.events.BEGIN`
# implicitly increments `Foo.total.begun`
checkpoints = checkpoints_from_kwargs(Foo, kwargs).log(Foo.BEGIN)
...
# implicitly increments `Foo.events.CHECKPOINT`
# implicitly adds the pre-defined `time_to_checkpoint` metric to sentry transaction
checkpoints.log(Foo.CHECKPOINT)
...
# implicitly increments `Foo.events.FAILURE`
# implicitly increments `Foo.total.failed`
# implicitly increments `Foo.total.ended`
# implicitly adds the implicitly-defined `Foo_BEGIN_to_FAILURE` metric to sentry transaction
checkpoints.log(Foo.FAILURE)

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

… counters, more UploadFlow metrics" (#151)" This reverts commit c2e6f61.

codecov-staging · 2023-10-18T23:42:43Z

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will increase coverage by 0.00%.
The diff coverage is 98.96%.

@@           Coverage Diff            @@
##             main     #152    +/-   ##
========================================
  Coverage   98.43%   98.43%            
========================================
  Files         346      347     +1     
  Lines       27069    27281   +212     
========================================
+ Hits        26645    26855   +210     
- Misses        424      426     +2

Flag	Coverage Δ
integration	`98.43% <98.96%> (+<0.01%)`	⬆️
latest-uploader-overall	`98.43% <98.96%> (+<0.01%)`	⬆️
unit	`98.43% <98.96%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
NonTestCode	`96.99% <98.38%> (+0.01%)`	⬆️
OutsideTasks	`98.23% <98.86%> (+<0.01%)`	⬆️

Files	Coverage Δ
helpers/checkpoint_logger/flows.py	`100.00% <100.00%> (ø)`
helpers/tests/unit/test_checkpoint_logger.py	`100.00% <100.00%> (ø)`
tasks/notify.py	`98.83% <100.00%> (+0.04%)`	⬆️
tasks/tests/unit/test_notify_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_finisher_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_task.py	`99.44% <100.00%> (+<0.01%)`	⬆️
tasks/upload.py	`98.26% <100.00%> (+<0.01%)`	⬆️
tasks/upload_finisher.py	`95.45% <100.00%> (+0.12%)`	⬆️
helpers/checkpoint_logger/__init__.py	`97.88% <97.88%> (ø)`

codecov · 2023-10-18T23:42:54Z

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will decrease coverage by 0.06%.
The diff coverage is 99.73%.

Changes have been made to critical files, which contain lines commonly executed in production. Learn more

@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
- Coverage   98.46%   98.40%   -0.06%     
==========================================
  Files         373      373              
  Lines       27665    27860     +195     
==========================================
+ Hits        27239    27417     +178     
- Misses        426      443      +17

Flag	Coverage Δ
integration	`98.43% <98.96%> (+<0.01%)`	⬆️
latest-uploader-overall	`98.43% <98.96%> (+<0.01%)`	⬆️
unit	`98.43% <98.96%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
NonTestCode	`96.92% <99.63%> (-0.13%)`	⬇️
OutsideTasks	`98.23% <98.86%> (+<0.01%)`	⬆️

Files	Coverage Δ
helpers/checkpoint_logger/flows.py	`100.00% <100.00%> (ø)`
helpers/tests/unit/test_checkpoint_logger.py	`100.00% <100.00%> (ø)`
tasks/notify.py Critical	`98.83% <100.00%> (+0.04%)`	⬆️
tasks/tests/unit/test_notify_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_finisher_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_task.py	`99.45% <100.00%> (+<0.01%)`	⬆️
tasks/upload.py Critical	`98.27% <100.00%> (+<0.01%)`	⬆️
tasks/upload_finisher.py	`95.49% <100.00%> (+0.12%)`	⬆️
helpers/checkpoint_logger/__init__.py	`99.55% <99.55%> (+1.71%)`	⬆️

Related Entrypoints
run/app.tasks.upload.Upload
run/app.tasks.upload.UploadFinisher
run/app.tasks.notify.Notify

codecov-qa · 2023-10-18T23:43:01Z

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will increase coverage by 0.00%.
The diff coverage is 98.96%.

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #152    +/-   ##
========================================
  Coverage   98.43%   98.43%            
========================================
  Files         346      347     +1     
  Lines       27069    27281   +212     
========================================
+ Hits        26645    26855   +210     
- Misses        424      426     +2

Flag	Coverage Δ
integration	`98.43% <98.96%> (+<0.01%)`	⬆️
latest-uploader-overall	`98.43% <98.96%> (+<0.01%)`	⬆️
unit	`98.43% <98.96%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
NonTestCode	`96.99% <98.38%> (+0.01%)`	⬆️
OutsideTasks	`98.23% <98.86%> (+<0.01%)`	⬆️

Files	Coverage Δ
helpers/checkpoint_logger/flows.py	`100.00% <100.00%> (ø)`
helpers/tests/unit/test_checkpoint_logger.py	`100.00% <100.00%> (ø)`
tasks/notify.py	`98.83% <100.00%> (+0.04%)`	⬆️
tasks/tests/unit/test_notify_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_finisher_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_task.py	`99.44% <100.00%> (+<0.01%)`	⬆️
tasks/upload.py	`98.26% <100.00%> (+<0.01%)`	⬆️
tasks/upload_finisher.py	`95.45% <100.00%> (+0.12%)`	⬆️
helpers/checkpoint_logger/__init__.py	`97.88% <97.88%> (ø)`

codecov-public-qa · 2023-10-18T23:43:17Z

Codecov Report

Merging #152 (80baa5d) into main (b201e1a) will increase coverage by 0.00%.
The diff coverage is 98.96%.

@@           Coverage Diff            @@
##             main     #152    +/-   ##
========================================
  Coverage   98.43%   98.43%            
========================================
  Files         346      347     +1     
  Lines       27069    27281   +212     
========================================
+ Hits        26645    26855   +210     
- Misses        424      426     +2

Flag	Coverage Δ
integration	`98.43% <98.96%> (+<0.01%)`	⬆️
latest-uploader-overall	`98.43% <98.96%> (+<0.01%)`	⬆️
unit	`98.43% <98.96%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
NonTestCode	`96.99% <98.38%> (+0.01%)`	⬆️
OutsideTasks	`98.23% <98.86%> (+<0.01%)`	⬆️

Files	Coverage Δ
helpers/checkpoint_logger/flows.py	`100.00% <100.00%> (ø)`
helpers/tests/unit/test_checkpoint_logger.py	`100.00% <100.00%> (ø)`
tasks/notify.py	`98.83% <100.00%> (+0.04%)`	⬆️
tasks/tests/unit/test_notify_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_finisher_task.py	`100.00% <100.00%> (ø)`
tasks/tests/unit/test_upload_task.py	`99.44% <100.00%> (+<0.01%)`	⬆️
tasks/upload.py	`98.26% <100.00%> (+<0.01%)`	⬆️
tasks/upload_finisher.py	`95.45% <100.00%> (+0.12%)`	⬆️
helpers/checkpoint_logger/__init__.py	`97.88% <97.88%> (ø)`

scott-codecov · 2023-10-19T17:58:43Z

helpers/checkpoint_logger/__init__.py

+        into the enum's value (e.g. "MEMBER_NAME" -> "MyEnum.MEMBER_NAME")
+        """
+        value = f"{cls.__name__}.{value}"
+        return super().__new__(cls, value)


Is this where the de-serialization magic is happening such that we can have a string like "MyEnum.A" and it get's deserialized back into the actual enum member?

i think technically it doesn't deserialize into the enum member, it deserializes into a str which can be downcast to the enum member which inherits a comparison function from str, but yes

each enum value is an instance of the enum class that gets created at startup, and it appears _generate_next_value_() determines what gets passed into __new__() when using auto() to declare the enum member. so we get the member name ("A") from _generate_next_value_(), we prepend the class name ("MyEnum") in __new__(), and then we pass that value to the superclass constructors (i think this calls __new__() from both str and Enum)

defining __str__() doesn't work, the json encoder is looking specifically for a str or subclass thereof, it doesn't try to cast

matt-codecov added 2 commits October 18, 2023 16:26

Revert "Revert "feat: CheckpointLogger v2: cleaner usage, reliability…

503eb97

… counters, more UploadFlow metrics" (#151)" This reverts commit c2e6f61.

fix bug with serializing between tasks and add test

d9fb63b

matt-codecov requested review from trent-codecov, dana-yaish, giovanni-guidini, scott-codecov and joseph-sentry October 18, 2023 23:37

matt-codecov changed the title ~~Matt/checkpoint logger v2.1~~ (redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics Oct 18, 2023

scott-codecov reviewed Oct 19, 2023

View reviewed changes

scott-codecov approved these changes Oct 20, 2023

View reviewed changes

Merge branch 'main' into matt/checkpoint-logger-v2.1

80baa5d

matt-codecov merged commit 7e263ca into main Oct 20, 2023
27 of 30 checks passed

matt-codecov deleted the matt/checkpoint-logger-v2.1 branch October 20, 2023 22:02

matt-codecov mentioned this pull request Oct 25, 2023

Implement uploads success metric codecov/engineering-team#167

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics #152

(redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics #152

matt-codecov commented Oct 18, 2023

codecov-staging bot commented Oct 18, 2023 •

edited

Loading

codecov bot commented Oct 18, 2023 •

edited

Loading

codecov-qa bot commented Oct 18, 2023 •

edited

Loading

codecov-public-qa bot commented Oct 18, 2023 •

edited

Loading

scott-codecov Oct 19, 2023

matt-codecov Oct 19, 2023

(redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics #152

(redo) feat: CheckpointLogger v2: cleaner usage, reliability counters, more UploadFlow metrics #152

Conversation

matt-codecov commented Oct 18, 2023

original description from #100

before

now

Legal Boilerplate

codecov-staging bot commented Oct 18, 2023 • edited Loading

Codecov Report

codecov bot commented Oct 18, 2023 • edited Loading

Codecov Report

codecov-qa bot commented Oct 18, 2023 • edited Loading

Codecov Report

codecov-public-qa bot commented Oct 18, 2023 • edited Loading

Codecov Report

scott-codecov Oct 19, 2023

Choose a reason for hiding this comment

matt-codecov Oct 19, 2023

Choose a reason for hiding this comment

codecov-staging bot commented Oct 18, 2023 •

edited

Loading

codecov bot commented Oct 18, 2023 •

edited

Loading

codecov-qa bot commented Oct 18, 2023 •

edited

Loading

codecov-public-qa bot commented Oct 18, 2023 •

edited

Loading