Conversation
|
| Branch | refactor/nano/error-handling-2 |
| Testbed | ubuntu-22.04 |
Click to view all benchmark results
| Benchmark | Latency | Benchmark Result minutes (m) (Result Δ%) | Lower Boundary minutes (m) (Limit %) | Upper Boundary minutes (m) (Limit %) |
|---|---|---|---|---|
| sync-v2 (up to 20000 blocks) | 📈 view plot 🚷 view threshold | 1.64 m(-0.04%)Baseline: 1.64 m | 1.47 m (90.04%) | 1.80 m (90.87%) |
8979377 to
b96dc40
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1321 +/- ##
==========================================
- Coverage 85.65% 85.59% -0.07%
==========================================
Files 424 425 +1
Lines 32095 32156 +61
Branches 4994 5045 +51
==========================================
+ Hits 27492 27524 +32
- Misses 3603 3623 +20
- Partials 1000 1009 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
d3075f7 to
22fc416
Compare
98969a4 to
cef01ac
Compare
cef01ac to
57122df
Compare
| if type(value) is bytes: | ||
| value = value.hex() | ||
| except Exception as e: | ||
| except (Exception, NCInternalException) as e: |
There was a problem hiding this comment.
Even if there's a bug in our code, I guess we should return an NCValueErrorResponse (instead of failing the whole execution). What do you think?
There was a problem hiding this comment.
I didn't get it, isn't that the current behavior?
hathor/p2p/states/hello.py
Outdated
| try: | ||
| remote_sync_versions = _parse_sync_versions(data) | ||
| except HathorError as e: | ||
| except (HathorError, __NCTransactionFail__) as e: |
There was a problem hiding this comment.
Please see #1321 (comment). However I did remove it from here in 4bfc918 because for this case it doesn't make sense, indeed.
| try: | ||
| self.vertex_handler.on_new_block(blk, deps=[]) | ||
| except HathorError: | ||
| except (HathorError, __NCTransactionFail__): |
There was a problem hiding this comment.
This method ends up calling verification on transactions, which can raise nano-related exceptions.
Before, all nano exceptions inherited from NCError(HathorError), and now they inherit from __NCTransactionFail__(BaseException). It's not possible to make __NCTransactionFail__ inherit from HathorError because it has to be a BaseException and not an Exception (we could change HathorError to inherit from BaseException instead of Exception too, but that could have unwanted side effects).
In an attempt to be safe and keep the previous behavior, I grepped all usages of except HathorError and added the __NCTransactionFail__ here, too. Before, nano exceptions were caught because they inherited from HathorError, but with the hierarchy change they wouldn't be caught here anymore. So I had to catch the new hierarchy too.
There was a problem hiding this comment.
Shouldn't we prevent __NCTransactionFail__ from leaking outside of the "nano code"?
| try: | ||
| yield self.sync_agent.on_block_complete(blk, vertex_list) | ||
| except HathorError as e: | ||
| except (HathorError, __NCTransactionFail__) as e: |
| try: | ||
| self._verification_service.validate_full(vertex, reject_locked_reward=reject_locked_reward) | ||
| except HathorError as e: | ||
| except (HathorError, __NCTransactionFail__) as e: |
| except NCFail as e: | ||
| except __NCTransactionFail__ as e: | ||
| # These are the exception types that will make an NC transaction fail. | ||
| # Any other exception will bubble up and crash the full node. |
There was a problem hiding this comment.
Even if there's a bug that generates an exception, is crashing the full node the best solution? I guess we could safely log that an unhandled exception happened in our code and void the transaction. In this case, the fix would need to be activated using the feature activation service.
There was a problem hiding this comment.
Wouldn't that defeat the purpose of this PR? What it tries to do is make sure NCs fail only when the exception is not a bug from our code. Even then, it doesn't fully accomplish this task because of the Risks and Discussion section in the PR description.
If we accept the fact that we'll fail NCs with exceptions from our own bugs, isn't that the same as doing the except Exception on the execution entrypoint?
Also, it's worth mentioning that failing NCs with our bugs is more than just an inconvenience for the users: it could make blueprints reach unrecoverable states that would cause locked funds in contracts, until fixed with feature activation.
57122df to
deb8a99
Compare
| try: | ||
| self.vertex_handler.on_new_block(blk, deps=[]) | ||
| except HathorError: | ||
| except (HathorError, __NCTransactionFail__): |
There was a problem hiding this comment.
Shouldn't we prevent __NCTransactionFail__ from leaking outside of the "nano code"?
# Conflicts: # hathor/nanocontracts/blueprint_env.py
4bfc918 to
3a08627
Compare
|
This PR became too complex and most of its changes were unnecessary. It's replaced by the simpler #1468 which makes only the essential changes. |
Motivation
Current exception handling on Nano Contract execution has a few problems. It wraps any
Exceptionthrown during user code execution (that is, blueprint code) in anNCFail, and later anyNCFailmarks a tx execution as failed. This means bugs in our code would become wrapped exceptions and fail tx executions instead of crashing the full node, which would be the expected behavior.This PR addresses this by wrapping exceptions in special types not only when we call user code from internal Hathor code, but also on the other way around, when internal Hathor code is called from user code. This means we intercept exceptions on all boundaries between internal and user code, and handle them accordingly.
When an exception is a bug in our code, it's wrapped in an
__NCUnhandledInternalException__. Since this type inherits fromBaseExceptionbut not fromException, it bubbles up until it crashes the full node, as expected. Other exceptions that are part of the normal flow of NC execution are wrapped in a type that inherits from__NCTransactionFail__, which will cause the caller transaction to fail execution during consensus. A more detailed explanation is in the newerror_handlingmodule. There's also some discussion on gray-area below.Review Notes
error_handling.pyandnanocontracts/exceptions.pyfiles, to understand the new model.metered_exec.pywhich are the innermost code handling exceptions when user code is executed withexec().blueprint_env.pywhich is the main file with internal code called from user code.Acceptance Criteria
error_handlingmodel for NC execution.__NCTransactionFail__instead of anNCFail.@internal_code_called_from_user_code.user_code_called_from_internal_code.NCInternalException.NCFailis now just an alias toNCUserException.Rationale and Alternatives
I considered using a
Resulterror type in #1311, but it got too complex and I didn't feel it had the robustness we needed. Maybe if it was created like this from the beginning, it could be a good solution. But changing it now, guaranteeing correctness, is hard. Or maybe it just isn't suitable for Python.Risks and Discussion
There's a risk of balancing between being too restrictive or too permissive with exceptions, which can either
Before this PR, ALL exceptions raised during contract execution would become part of the blockchain state forever through the tx execution state. This problem has been mitigated, because now only exceptions with specific types become part of the tx failure. Any other exceptions, for example from bugs or asserts, will now crash the full node as expected.
This means we can still incorrectly make a bug part of the blockchain state if we raise an exception with the wrong type, for example. The opposite is also true, that is, we can incorrectly raise a non-tx-failing exception when we should, which will crash the full node instead of failing the tx.
Serialization
The serialization system is too extensive and raises a lot of exceptions internally. Instead of refactoring the whole module to use specific exceptions, I just handled them on the boundary since NC execution only calls it in a single place (not quite true, as explained below).
For that, I identified it can raise 3 exception types:
SerializationError, ValueError, TypeError. This means that any other exceptions raised in the serialization module will crash the full node instead of becoming part of the blockchain. If there are any other known exceptions that can be raised, we should include them in this handling. But it's hard to do better than grepping for raises.It also means that any
ValueErrororTypeErrorraised by Python itself will also become a reason for tx execution failure, even if it's caused by a bug in internal code. Ideally we should refactor the module to use specific custom exceptions.Dealing with boundaries
Calling user code from internal Hathor code
This is the simpler case, because there are only a few places where user code is called. They are contained in
metered_exec.pyand happen through calls to stdlib'scompile()andexec(). We simply catch any raisedExceptions with the new@user_code_called_from_internal_codedecorator and can assume they're unhandled exceptions or bugs in user code, making the caller transaction fail.Calling internal Hathor code from user code
This case has a lot more nuance and is partially an open discussion. There are multiple places where user code calls internal code, and ideally all of them should be decorated with
@internal_code_called_from_user_code, but that's not that simple as explained below:@internal_code_called_from_user_codedecorator to wrap all methods. This means any raised__NCTransactionFail__will cause the tx to fail, and any other exceptions will cause the full node to crash.asserts validating user input. If we use the decorator, the user would be able to crash the full node. We can either leave it like this and assume the risk of our own bugs causing tx failures, or refactor the module so it can't raise any exceptions caused by user input. This would have to be done in a separate PR.NCLoggeris analogous to the RNG, with the difference that it's way simpler. For this reason, I did use the decorator in its methods, after reviewing its code searching for exceptions that could be triggered by user input, and not finding any. There's a risk that assumption is incorrect, in which case there are ways for users to crash the full node.__get__,__set__,__del__,__getitem__, etc. These methods indirectly call the serialization system which can raise a lot of exceptions, as explained above. These exceptions can be caused by user input and are not normalized to__NCTransactionFail__. This means we cannot use the decorator, and all exceptions raised by these methods will fail the tx. There's an extensive API surface that is indirectly called by these methods that would have to be reviewed carefully to allow using the decorator. For now, we assume the risk that any bug in that part of the code will cause txs to fail, becoming part of the blockchain.DequeStorageContainer'sappend,pop, etc, that are analogous to attributes and are not decorated, for now.Checklist
master, confirm this code is production-ready and can be included in future releases as soon as it gets merged