Skip to content

chore: retro-review PR #102 — test(tariff): Hypothesis fuzz the five tariff_engine invariants (Phase 11 PR-18)#134

Closed
Artic0din wants to merge 1 commit into
claude-retro/pr-102-basefrom
claude-retro/pr-102-head
Closed

chore: retro-review PR #102 — test(tariff): Hypothesis fuzz the five tariff_engine invariants (Phase 11 PR-18)#134
Artic0din wants to merge 1 commit into
claude-retro/pr-102-basefrom
claude-retro/pr-102-head

Conversation

@Artic0din

Copy link
Copy Markdown
Owner

Synthetic retro-review of merged PR #102. Diff = original 0bc35b0..7e92212 (same as original PR). Do NOT merge; close after @claude review posts.

…e 11 PR-18)

PR-18 — Final v2.0 GA plank. Property-based tests cover the five
invariants from v2 research § 7.3 for tariff_engine pure functions
(calc_stepped_cost, get_stepped_import_rate, get_current_tou_period).

- 1. Monotonic stepped cost: cost(k1) <= cost(k2) for k1 <= k2.
- 2. Threshold equality: cost(threshold) == threshold * step1_rate.
- 3. Step composition above threshold: cost = step1_cost +
     (k - threshold) * step2_rate.
- 4. Stepped rate dichotomy: returned rate is exactly step1 or step2.
- 5. TOU period closure: returned period name is in the supplied
     dict OR "unknown"; rate matches the period when found; full-day
     coverage produces no "unknown".

≥ 200 fuzzed examples per invariant via hypothesis. 9 test classes.
1054 total pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Artic0din

Copy link
Copy Markdown
Owner Author

@claude review

@claude

claude Bot commented May 23, 2026

Copy link
Copy Markdown

Claude finished @Artic0din's task in 16m 51s —— View job


Code Review — PR #102 Retro: Hypothesis Fuzz Tests for Tariff Engine

Todo

  • Load tools and post initial comment
  • Read changed files (tests/test_tariff_engine_hypothesis.py, CHANGELOG.md)
  • Diff against base branch
  • Analyze test structure, correctness, and coverage
  • Post full review findings

Overall the test suite is well-structured and the invariant coverage is sensible. Two production-code issues and one test-design concern need attention before this can be considered a clean fuzz harness.


Bug: Boundary semantics mismatch between calc_stepped_cost and get_stepped_import_rate

This is the most important finding. The two functions disagree on what happens at k == threshold:

tariff_engine.py:83calc_stepped_cost

if total_kwh <= threshold:          # ≤  → AT threshold, charges step1_rate
    return total_kwh * tariff["step1_rate"]

tariff_engine.py:75get_stepped_import_rate

if daily_kwh < threshold:           # <  → AT threshold, returns step2_rate
    return tariff["step1_rate"]
return tariff["step2_rate"]

Invariant 2 (TestStepCostAtThreshold) correctly asserts that calc_stepped_cost(tariff, threshold) == threshold * step1. But TestSteppedRateDichotomy.test_at_or_above_threshold_returns_step2 simultaneously asserts that get_stepped_import_rate(tariff, threshold) returns step2. These encode contradictory boundary semantics as both correct. The TariffEngine.update() loop at tariff_engine.py:329 uses get_stepped_import_rate with the accumulated daily kWh, so a household that has imported exactly threshold kWh will be charged at step2_rate for their next unit — consistent with calc_stepped_cost above threshold — but the marginal rate sensor exposed via current_import_rate_c_kwh (line 385) will also show step2_rate. This is probably intentional ("you've crossed the threshold"), but the test file should document this explicitly rather than silently encoding both. Fix this →


Test design: return vs assume() causes silent under-coverage

tests/test_tariff_engine_hypothesis.py:109-110:

if k <= threshold:
    return  # Different invariant covers ≤ threshold.

Same pattern at lines 143-144 and 153-154.

Using return marks the example as passed rather than as a discard, so Hypothesis does not attempt to generate a replacement. With _kwh in [0, 200] and _threshold in [0.01, 50], roughly 75% of (k, threshold) pairs have k ≤ threshold, meaning TestStepCostAboveThreshold and the two split tests in TestSteppedRateDichotomy each effectively exercise around 50 genuinely constrained examples out of the 200 requested. The correct idiom is:

from hypothesis import assume
assume(k > threshold)

This causes Hypothesis to count only valid draws toward max_examples, so the 200-example budget is spent entirely on the interesting region. Fix this →


Minor: Floating-point tolerance inconsistency

Invariant 2 (test_cost_at_threshold_uses_only_step1, line 92) uses 1e-9. Invariant 3 (test_above_threshold_step_composition, line 114) uses 1e-6. Both are single multiplications of bounded floats (max ~200 × 200 = 40 000) so IEEE 754 round-trip error is at most a few ULPs (~4e-12), making 1e-9 safely tight. The inconsistency is harmless but worth aligning to a single standard (either 1e-9 everywhere or math.isclose(rel_tol=1e-9) for scale-independent comparison).


Positive notes

  • The deadline=None setting is correct — CI wall-clock variance would otherwise produce flaky failures on slow runners.
  • _threshold = st.floats(min_value=0.01, ...) with the explanatory comment is good defensive design; it correctly excludes the degenerate threshold=0 case and explains why.
  • _basic_tou_periods with an overnight midnight-crossing window ("23:00""07:00") directly exercises the start > end branch at tariff_engine.py:64-67 — good edge-case selection.
  • test_full_day_coverage_no_unknown as a fixture self-check is a nice pattern; it catches gaps in _basic_tou_periods before the real property tests run.
  • The docstring mapping each numbered invariant to the test class is clear and should stay.

Summary

Finding Severity File Lines
Boundary < vs inconsistency silently encoded as correct Medium tariff_engine.py 75, 83
return instead of assume() undercuts effective example count Low test_tariff_engine_hypothesis.py 109, 143, 153
Tolerance inconsistency 1e-9 vs 1e-6 Cosmetic test_tariff_engine_hypothesis.py 92, 114

The boundary mismatch is the one worth following up on — whether it's intentional or a subtle bug in get_stepped_import_rate, the intent should be documented in a comment or resolved with a consistent operator. Everything else is clean.

@claude

claude Bot commented May 23, 2026

Copy link
Copy Markdown

Claude encountered an error after 8m 17s —— View job


I'll analyze this and get back to you.

@Artic0din

Copy link
Copy Markdown
Owner Author

Retro-review complete. Findings (if any) tracked in fix/retro-review-batch (#143) or noted in .planning/retro-review-findings.md. Closing synthetic PR + cleaning branches.

@Artic0din Artic0din closed this May 23, 2026
@Artic0din Artic0din deleted the claude-retro/pr-102-head branch May 23, 2026 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant