fix(test): prevent TestDailyFees from overshooting second proving window#13551

Open
wjmelements wants to merge 2 commits intomasterfrom
fix/daily-fees-flaky-test
Conversation

@wjmelements
Contributor

Summary

  • Fixes a flaky TestDailyFees failure where the post-PoST wait could advance the chain past a sector's second proving window close, causing paymentsPast > 1 and a mismatch in checkFeeRecords
  • Replaces head.Height()+WPoStChallengeWindow+10 with dlInfo.Close+5 — a targeted wait to just after the current deadline window closes, which is always within the current proving period
  • Verified with 10 consecutive local passes

Root cause

After feePostWg.Wait(), the chain head could already be near the end of the proving period (e.g. epoch 5890 with the period ending at 5960). Adding a full WPoStChallengeWindow+10 (70 epochs) pushed past epoch 5960 — the close of deadline 0 in the second period — making paymentsPast = 2 for those sectors and causing checkFeeRecords to fail.

dlInfo.Close+5 instead targets a point just 5 epochs past the close of the current deadline window. Because the wait starts from within that window, it advances at most WPoStChallengeWindow+5 = 65 epochs, and because dlInfo.Close always lies inside the current proving period, the target stays short of deadline 0's second-period close (epoch 5960 in the failing run).
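As a sanity check, the epoch arithmetic from the failing run can be replayed in a tiny standalone program. The concrete numbers (60-epoch WPoStChallengeWindow, head at 5890, second-window close at 5960) come from the report above; the deadline close at 5900 is an assumed illustrative value, not taken from lotus:

```go
package main

import "fmt"

func main() {
	// Assumed itest parameters from the failure report.
	const wPoStChallengeWindow = 60
	head := 5890              // chain head after feePostWg.Wait()
	secondWindowClose := 5960 // close of deadline 0's second proving window

	// Old wait: a fixed offset from wherever the head happens to be.
	oldTarget := head + wPoStChallengeWindow + 10
	fmt.Println(oldTarget, oldTarget >= secondWindowClose) // 5960 true: overshoots

	// New wait: anchored to the current deadline's close, which by
	// definition lies inside the current proving period.
	dlClose := 5900 // assumed close of the deadline containing epoch 5890
	newTarget := dlClose + 5
	fmt.Println(newTarget, newTarget >= secondWindowClose) // 5905 false: safe
}
```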

Replace the fixed head+WPoStChallengeWindow+10 wait with a targeted wait
to dlInfo.Close+5. The old approach could advance past the close of a
sector's second proving window (making paymentsPast > 1) when the chain
head was already near the end of the proving period after feePostWg.Wait().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FilOz Mar 17, 2026
@wjmelements wjmelements requested review from rjan90 and rvagg and removed request for rjan90 March 17, 2026 01:21
@wjmelements
Contributor Author

See also #12973 (comment)

Comment thread on itests/daily_fees_test.go:

```diff
 dlInfo, err := client.StateMinerProvingDeadline(ctx, mminer.ActorAddr, types.EmptyTSK)
 req.NoError(err)
-client.WaitTillChain(ctx, kit.HeightAtLeast(head.Height()+miner.WPoStChallengeWindow()+10))
+client.WaitTillChain(ctx, kit.HeightAtLeast(dlInfo.Close+5))
```
Contributor
I wouldn't block on this, but I think there is still a race here, as is the sad story for many lotus itests. We really need a way to halt mining when querying state to actually get rid of them. WaitTillChain is used everywhere but is inherently racy.

For example, if the PoSt lands on the last epoch of the deadline and your call to grab dlInfo only executes an epoch later, once this first proving period is over, then you again lose the race and are waiting until after the 2nd proving period.

But setting aside the race, which is a systemic problem, this is a cleaner way to express what we are doing, so I support merging this.

Contributor
We should use dlInfo.Close+1 since it's the first epoch the condition should be met on, and we are now being more exact.

Contributor Author
I saw a failure when I did dlInfo.Close+1. Claude explains:

> The most likely reason: in Filecoin, the deadline cron — which actually burns the fees — runs as part of the block execution AT dlInfo.Close. But in the test's simulated chain, there can be null rounds around the close epoch. If there's a null round at dlInfo.Close, no block is produced at that exact epoch, and the cron gets deferred to the first block produced after it.

So even with +5 we might still see some failures. I just haven't seen any yet.

Member
I think the problem is slightly more subtle than that: null-round crons are run before messages anyway, so there's an effective catch-up in place, but they do contribute to the problem here. I'm pretty sure just using dlInfo.Close would be enough, because the miner is enrolled at close - 1. The problem is that all our APIs use ParentState on the tipset you call them with, so with null rounds you're reaching into the parent before the null rounds. Cron has run, but you're still going back into history.

A properly robust way to deal with this might be to wait for 1 non-null tipset, then wait for 1 more after that, so ParentState is always a post-close state:

```go
ts := client.WaitTillChain(ctx, kit.HeightAtLeast(dlInfo.Close)) // or use +1, whatever
client.WaitTillChain(ctx, kit.HeightAtLeast(ts.Height()+1))
```

Pretty janky, might be something we could put into the test kit though because this kind of problem isn't atypical and the solution might apply elsewhere.
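The effect of that two-step wait can be sketched against a toy chain with null rounds. Here `tipset` and `waitTillHeight` are hypothetical stand-ins for the lotus test kit's tipsets and WaitTillChain, not real lotus APIs:

```go
package main

import "fmt"

// tipset is a toy tipset carrying only a height.
type tipset struct{ height int }

// waitTillHeight returns the first non-null tipset at or above h,
// mimicking what a WaitTillChain(HeightAtLeast(h)) style call observes.
func waitTillHeight(chain []tipset, h int) tipset {
	for _, ts := range chain {
		if ts.height >= h {
			return ts
		}
	}
	return chain[len(chain)-1]
}

func main() {
	// Epochs 3 and 4 are null rounds: no tipset exists at those heights.
	chain := []tipset{{0}, {1}, {2}, {5}, {6}}
	closeEpoch := 3 // the deadline close falls on a null round

	// Step 1: first non-null tipset at or past the close (height 5).
	ts := waitTillHeight(chain, closeEpoch)
	// Step 2: one more tipset after that, so a ParentState-based query
	// against it looks at height 5, which is already post-close.
	ts2 := waitTillHeight(chain, ts.height+1)
	fmt.Println(ts.height, ts2.height) // 5 6
}
```

Waiting only for step 1 would leave a query's ParentState pointing at height 2, before the close, which is the history-reaching problem described above.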

Member
`client.WaitTillStateIncludes(condition)`

@github-project-automation github-project-automation Bot moved this from 📌 Triage to ✔️ Approved by reviewer in FilOz Mar 19, 2026