Skip to content

Comments

Allocate just one ClosureTable entry per self-send#2663

Merged
mergify[bot] merged 65 commits intomasterfrom
gabor/cleanup
Jul 29, 2021
Merged

Allocate just one ClosureTable entry per self-send#2663
mergify[bot] merged 65 commits intomasterfrom
gabor/cleanup

Conversation

@ggreif
Copy link
Contributor

@ggreif ggreif commented Jul 16, 2021

This PR cleans up the ClosureTable entries in the last-resort cleanup callback. Since potentially all call and reply state changes of a canister can be rolled back by carefully engineering the traps, the initiator of a message send must only allocate exactly one slot in the closure table. Formerly self-sends would leak a slot when the receiver trapped and the reject callback too. See #2663 (comment).

Motivation: fixes #2640

ggreif added 3 commits July 13, 2021 14:57
undortunatrely I still see
```
trace: Warning: `stdenv.lib` is deprecated and will be removed in the next release. Please use `lib` instead. For more information see NixOS/nixpkgs#108938
```
... but I don't see the `ping! ...` log line :-(
@dfinity-ci
Copy link

This PR does not affect the produced WebAssembly code.

ggreif added 16 commits July 16, 2021 16:19
but I cannot see yet how oneshot is involved
still no closures lost by oneshot
These are copies from `run-drun/ok/count-callbacks.*`, but
strangely running `accept` won't refresh them!?
Fun thing is that these appear for other tests than the one _designed
to trigger_ it.

The plot thickens...
must be a Motoko test from an inner directory
this triggers for
```
make -C test/run-drun commit-on-await.only
make -C test/run-drun await-without-yield.only
```
but strangely not for
```
make -C test/run-drun oneshot-callbacks.only
```
It remains to be investigated why this is the case.

Also I see `ping` being rejected (or at least the reject callback
being called) for one-shot sends.
this is the current trace:
oneshot-callbacks: [tc] [comp] [comp-ref] [valid] [valid-ref] [ic-ref-run]
--- oneshot-callbacks.ic-ref-run (expected)
+++ oneshot-callbacks.ic-ref-run (actual)
@@ -4,19 +4,67 @@
 ← replied: ()
 → update go()
 debug.print: go 0: 0
+debug.print: remember_closure: 0
+debug.print: remember_closure: 1
 debug.print: ping! 2
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
 debug.print: go 1: 1
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: remember_closure: 1
+debug.print: remember_closure: 0
 debug.print: ping! 2
-← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.26-14.38"
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: go 2: 1
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: IN REJECT callback
+← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.98-14.110"
 → update go()
 debug.print: go 0: 1
+debug.print: remember_closure: 0
+debug.print: remember_closure: 2
 debug.print: ping! 3
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
 debug.print: go 1: 2
+debug.print: recall_closure: 2
+debug.print: recall_closure: 2 OUT
+debug.print: remember_closure: 2
+debug.print: remember_closure: 0
 debug.print: ping! 3
-← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.26-14.38"
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 2
+debug.print: recall_closure: 2 OUT
+debug.print: go 2: 2
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: IN REJECT callback
+← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.98-14.110"
 → update go()
 debug.print: go 0: 2
+debug.print: remember_closure: 0
+debug.print: remember_closure: 3
 debug.print: ping! 4
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
 debug.print: go 1: 3
+debug.print: recall_closure: 3
+debug.print: recall_closure: 3 OUT
+debug.print: remember_closure: 3
+debug.print: remember_closure: 0
 debug.print: ping! 4
-← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.26-14.38"
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 3
+debug.print: recall_closure: 3 OUT
+debug.print: go 2: 3
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: IN REJECT callback
+← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.98-14.110"
Some tests failed:
oneshot-callbacks.mo

I *think* I understand this by now :-)
…leak

This doesn't handle the case yet where the reject traps, but
the testcase doesn't implement that just yet.

Here comes the evidence trace:
```
oneshot-callbacks: [tc] [comp] [comp-ref] [valid] [valid-ref] [ic-ref-run]
--- oneshot-callbacks.ic-ref-run (expected)
+++ oneshot-callbacks.ic-ref-run (actual)
@@ -4,19 +4,73 @@
 ← replied: ()
 → update go()
 debug.print: go 0: 0
+debug.print: remember_closure: 0
+debug.print: remember_closure: 1
 debug.print: ping! 2
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
 debug.print: go 1: 1
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: remember_closure: 1
+debug.print: remember_closure: 0
 debug.print: ping! 2
-← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.26-14.38"
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: go 2: 1
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: IN SELF REJECT callback
+← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.98-14.110"
 → update go()
-debug.print: go 0: 1
-debug.print: ping! 3
-debug.print: go 1: 2
-debug.print: ping! 3
-← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.26-14.38"
+debug.print: go 0: 0
+debug.print: remember_closure: 1
+debug.print: remember_closure: 0
+debug.print: ping! 2
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: go 1: 1
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: remember_closure: 0
+debug.print: remember_closure: 1
+debug.print: ping! 2
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: go 2: 1
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: IN SELF REJECT callback
+← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.98-14.110"
 → update go()
-debug.print: go 0: 2
-debug.print: ping! 4
-debug.print: go 1: 3
-debug.print: ping! 4
-← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.26-14.38"
+debug.print: go 0: 0
+debug.print: remember_closure: 0
+debug.print: remember_closure: 1
+debug.print: ping! 2
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: go 1: 1
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: remember_closure: 1
+debug.print: remember_closure: 0
+debug.print: ping! 2
+debug.print: RECALLING, but this may trap and gets undone
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: go 2: 1
+debug.print: recall_closure: 0
+debug.print: recall_closure: 0 OUT
+debug.print: recall_closure: 1
+debug.print: recall_closure: 1 OUT
+debug.print: IN SELF REJECT callback
+← rejected (RC_CANISTER_REJECT): canister trapped: EvalTrapError region:0xXXX-0xXXX "canister trapped explicitly: assertion failed at oneshot-callbacks.mo:14.98-14.110"
Some tests failed:
oneshot-callbacks.mo
```
instead, perpetuate the argument directly
@ggreif
Copy link
Contributor Author

ggreif commented Jul 22, 2021

We have several problems currently.
a) non-self-send async methods may invoke a reply/reject continuation that traps. The index allocated in the ClosureTable and what it points to will never be deallocated.
b) Explicit self-sends allocate ClosureTable index (for the future) that is supposed to be removed on the other side of the call. If that traps, the sender-side becomes the owner of the future's slot. Then the reject callback gets invoked and can (additionally) pop it. The trouble starts when the reject callback also traps. Then the cleanup callback will be invoked and must clean up two slots. But the problem is that if the reply callback trapped it only needs to clean-up one slot. Alas, the cleanup callback has no information whatsoever what happened.

For a) the solution is easy, pop the one slot in the cleanup callback.
For b) we have to consolidate the two slots to one (pointing to a 3-element array) in which the other side can peek to fetch the future. This reduces the problem to a), for which we have a solution.

@crusso
Copy link
Contributor

crusso commented Jul 22, 2021

To avoid serializing/deserializing the result of the future, if wonder if it would make sense to transfer it in a fourth slot of that array?

@ggreif
Copy link
Contributor Author

ggreif commented Jul 22, 2021

To avoid serializing/deserializing the result of the future, if wonder if it would make sense to transfer it it in a fourth slot of that array?

Yep, I was thinking overwriting the third position of the array (i.e. the future) since that is not needed any more. But sending the payload back in heap changes the semantics a bit, so I'd only do that for single, small results.

Update: this is #2679 now.

ggreif added 4 commits July 23, 2021 13:05
this means _the other side_ never owns a table slot and must simply
peek into the array to fetch the future's closure

cleanup becomes the same as a foreign send's
@ggreif ggreif requested a review from crusso July 29, 2021 11:11
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

(Good you refactored the ic_(self)_call too - wasn't actually asking for that ;->)

Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

(Good you refactored the ic_(self)_call too - wasn't actually asking for that ;->)

(One formatting change only)

review feedback
@ggreif ggreif added the automerge-squash When ready, merge (using squash) label Jul 29, 2021
@mergify mergify bot merged commit 7c95059 into master Jul 29, 2021
@mergify mergify bot deleted the gabor/cleanup branch July 29, 2021 12:28
@mergify mergify bot removed the automerge-squash When ready, merge (using squash) label Jul 29, 2021
mergify bot pushed a commit that referenced this pull request Jul 30, 2021
This is a wholesale rename, in the spirit of #2663 (comment)
@nomeata
Copy link
Contributor

nomeata commented Aug 9, 2021

Thanks for this! I had quite some trouble figuring out what’s happening based on the patch only, but the comment in #2663 (comment) helped. Maybe worth expanding the code comments a bit, so that future readers (who don’t have the benefit of the PR discussion) can make sense of it? Probably one comprehensive GHC-style code node that explains how async self-calls works would be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Possible memory leaks through closure table, induced by faulting callbacks.

5 participants