Skip to content

Algod: Modify EvalTracer design and improve testing for failures#5071

Merged
jannotti merged 10 commits into
algorand:masterfrom
jasonpaulos:evaltracer-failure-handling
Feb 15, 2023
Merged

Algod: Modify EvalTracer design and improve testing for failures#5071
jannotti merged 10 commits into
algorand:masterfrom
jasonpaulos:evaltracer-failure-handling

Conversation

@jasonpaulos
Copy link
Copy Markdown
Contributor

Summary

Follow up to #4438, which introduced the new EvalTracer interface for tracing transaction and program evaluation.

This PR improves how evaluation failures are reported to EvalTracers. Prior to this, if a failure occurred in a deeply nested inner transaction, evaluation would immediately stop, and the AfterOpcode and AfterProgram tracer methods would get invoked with the error, starting with the deeply nested inner transaction, and unwinding up the layers to the top-level transaction which encompasses the error. The issue is that the AfterTxn and AfterTxnGroup methods would not get invoked at all during failures, so an EvalTracer would have no idea that the context ever changed from the inner transaction which spawned the failure.

This PR reconciles the situation by making it so that AfterTxn and AfterTxnGroup also get called when an error happens in an inner transaction context. Now EvalTracers will see the correct context changes when evaluation halts because of an error.

(You might be wondering why we want EvalTracers to get called after an error happens at all, and the answer to this is so that they can gather detailed information about the failure, such as partially-completed ApplyDeltas).

Test Plan

New tests are added which cover transaction and program failures. Specifically, the mocktracer package now contains test scenarios which cover many possible failures in an app call and its inner transactions.

ad.EvalDelta = transactions.EvalDelta{}
return errors.New("Approval program failed")
}
ad.EvalDelta = cx.txn.EvalDelta
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real ledger does not set EvalDeltas if an error happens, however the test ledger implicitly does right now. This is because ad is pointer to cx.txn.ApplyData, so this line is a noop.

I added some lines above this to explicitly clear the AD's EvalDelta if a failure happens, and that's good enough to get my tests to pass.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add this comment inline

Comment thread ledger/internal/eval.go
// along with go-algorand. If not, see <https://www.gnu.org/licenses/>.

package logic
package logic_test
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing to the logic_test package, we can now import the mocktracer package without circular import issues!

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 28, 2023

Codecov Report

Merging #5071 (0b50423) into master (a688c23) will increase coverage by 0.02%.
The diff coverage is 95.45%.

@@            Coverage Diff             @@
##           master    #5071      +/-   ##
==========================================
+ Coverage   53.44%   53.46%   +0.02%     
==========================================
  Files         431      431              
  Lines       54364    54373       +9     
==========================================
+ Hits        29056    29073      +17     
+ Misses      23053    23043      -10     
- Partials     2255     2257       +2     
Impacted Files Coverage Δ
data/transactions/logic/debugger.go 68.75% <ø> (+1.56%) ⬆️
data/transactions/logic/tracer.go 57.14% <0.00%> (+19.64%) ⬆️
data/transactions/logic/eval.go 90.40% <100.00%> (-0.08%) ⬇️
ledger/internal/eval.go 49.67% <100.00%> (+0.19%) ⬆️
ledger/blockqueue.go 82.25% <0.00%> (-2.69%) ⬇️
ledger/tracker.go 74.26% <0.00%> (-0.85%) ⬇️
ledger/testing/randomAccounts.go 56.26% <0.00%> (-0.62%) ⬇️
ledger/acctupdates.go 68.99% <0.00%> (-0.25%) ⬇️
network/wsNetwork.go 64.98% <0.00%> (-0.19%) ⬇️
... and 6 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bbroder-algo
Copy link
Copy Markdown
Contributor

What about a second paradigm where you separate out errors into its own class, i.e partial-ApplyData on an error is stored in a special erroredApplyData, and instead of calling AfterTxn or AfterProgram you go straight to an AfterError method.

@jasonpaulos
Copy link
Copy Markdown
Contributor Author

@bbroder-algo having a separate function for errors, like AfterError, was something I considered, but I decided against it for these reasons:

  • It's necessary to trigger an error at every txn depth, since the errors will contain the partial ApplyData for that level of inner txn. E.g. if we want to know the ApplyData of a child txn and its parent at failure, the tracer method must be invoked twice.
  • If a tracer is interested in listening to an event, it probably also wants to know if that event failed. So it seems simpler to report errors in the After* hooks rather than introducing a new set of hooks that most tracers would want to implement anyway.

@jasonpaulos jasonpaulos requested a review from jannotti January 30, 2023 17:48
Copy link
Copy Markdown
Contributor

@jannotti jannotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fine - I tried to pay most of my attention to places I thought might change or slow down execution. I put relatively less effort into ensuring I believe the traces are correct.

One place you might to test is that all the handling is correct if an opcode panics. That seems like a challenging case, especially an opcode in an inner txn.

TestPanic shows nice trick to have such a test.

Comment thread ledger/internal/eval.go Outdated
Comment thread data/transactions/logic/debugger_eval_test.go Outdated
Comment thread data/transactions/logic/eval.go Outdated
Comment on lines +860 to +869
// Due to the LIFO behavior of defer statements, if we want the tracer to catch panics, we
// need to have a `recover` defer statement defined after the tracer-invoking defer
// statement. However, since it's possible for the tracer itself to panic, we still want to
// keep the outmost `recover` defer statement defined at the top of this function.
defer func() {
if x := recover(); x != nil {
err = makePanicError(x, cx.EvalParams, "Eval")
pass = false
}
}()
Copy link
Copy Markdown
Contributor Author

@jasonpaulos jasonpaulos Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely looking for feedback on this. As the comment mentions, go defer statements are executed LIFO, so without this additional recovery, the above defer which invokes tracer methods will not see any PanicErrors, as those are only created by the defer at the top of this function.

By adding another defer layer that can recover from panics, we make them visible to the tracer.

An alternative here would be to just move the single recovery defer from the beginning of the function to after we define the tracer deferred logic. I opted against it since I didn't want to decrease coverage of potential panicking code, especially since it's in theory possible for tracer methods themselves to panic.

(There's a new test case, TestEvalWithTracerPanic, which confirm a tracer method panicking gets caught during eval.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I follow what's going on, but let me ask in order to be sure. You are trying to catch the panic "early", and build the stack trace so that you have it available to give out to the tracer. Might you simply call the Tracer with a simpler error here, and then panic again with x?

That is, only one defer, something like:

		defer func() {
           x := recover();
           if x != nil {
				// A panic error occurred during the eval loop. Report it now.
				cx.Tracer.AfterOpcode(cx, fmt.Errorf("some nonsense"))
			}

			// Ensure we update the tracer before exiting
			cx.Tracer.AfterProgram(cx, err)

            if x != nil {
                panic(x)
            }
		}()

I recognize this gives AfterOpcode slightly less information to go on. But I'm trying to make it so that a panic is always handled through the single stack trace creating handler. Done this way, when there's a Tracer, we build the stack here, and we don't go through the existing panic handler, since we're no longer panicking.

My motivation is that I think I can see a future where we really want to preserve the difference of "normal" error returns from evaluation vs a panic, if/when we consider supporting a itxn_try_submit which I think should continue to fail for panics, rather than cleanly reporting 0 for a failed txn.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I implemented it in 912d5de, let me know if you see any issues

@jasonpaulos jasonpaulos requested a review from jannotti February 2, 2023 17:03
Copy link
Copy Markdown
Contributor

@algochoi algochoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I mostly went through the tests using a debugger to double-check that they behave as expected. I'll defer the logic/eval nuances to JJ.

Comment thread ledger/internal/eval_test.go
Comment thread data/transactions/logic/eval.go
// possible during each inner transaction, as well as before all inners, between the two inner
// groups, and after all inners. For app call failures, there are scenarios for both rejection and
// runtime errors, which should invoke tracer hooks slightly differently.
func GetTestScenarios() map[string]TestScenarioGenerator {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice.

@bbroder-algo
Copy link
Copy Markdown
Contributor

testing is excellent, passes for me, covers the cases discussed in the PR, I spent additional time staring at the changes to eval and the new panic handling/code segment relocations.

@bbroder-algo bbroder-algo self-requested a review February 15, 2023 00:13
Copy link
Copy Markdown
Contributor

@bbroder-algo bbroder-algo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

if pe, ok := err.(PanicError); ok {
require.Equal(t, panicString, pe.PanicValue)
pes := pe.Error()
require.True(t, strings.Contains(pes, "panic"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's require.Contains() if you'd like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants