fix(nano): improve error handling #1321

Closed
glevco wants to merge 2 commits into master from refactor/nano/error-handling-2

@glevco glevco commented Jul 15, 2025

Motivation

Current exception handling on Nano Contract execution has a few problems. It wraps any Exception thrown during user code execution (that is, blueprint code) in an NCFail, and later any NCFail marks a tx execution as failed. This means bugs in our code would become wrapped exceptions and fail tx executions instead of crashing the full node, which would be the expected behavior.

This PR addresses this by wrapping exceptions in special types not only when we call user code from internal Hathor code, but also the other way around, when internal Hathor code is called from user code. This means we intercept exceptions at all boundaries between internal and user code, and handle them accordingly.

When an exception is a bug in our code, it's wrapped in an __NCUnhandledInternalException__. Since this type inherits from BaseException but not from Exception, it bubbles up until it crashes the full node, as expected. Other exceptions that are part of the normal flow of NC execution are wrapped in a type that inherits from __NCTransactionFail__, which will cause the caller transaction to fail execution during consensus. A more detailed explanation is in the new error_handling module. The gray areas are discussed below.

Review Notes

  • Begin by reviewing the new error_handling.py and nanocontracts/exceptions.py files, to understand the new model.
  • Review the changes in metered_exec.py which are the innermost code handling exceptions when user code is executed with exec().
  • Review blueprint_env.py which is the main file with internal code called from user code.
  • Review all the rest.

Acceptance Criteria

  • Implement new error_handling model for NC execution.
  • Add custom checks to prevent mistakes with new error handling model.
  • Block consensus now fails a tx when it catches an __NCTransactionFail__ instead of an NCFail.
  • Decorate all syscalls with @internal_code_called_from_user_code.
  • Wrap calls to user code with user_code_called_from_internal_code.
  • Fix incorrect raises of Python exceptions or user exceptions.
  • Refactor NC exception hierarchy so all of them inherit from NCInternalException.
  • NCFail is now just an alias to NCUserException.
  • Normalize exception handling on serialization calls from consensus code.

Rationale and Alternatives

I considered using a Result error type in #1311, but it got too complex and I didn't feel it had the robustness we needed. Maybe if it was created like this from the beginning, it could be a good solution. But changing it now, guaranteeing correctness, is hard. Or maybe it just isn't suitable for Python.

Risks and Discussion

There's a trade-off between being too restrictive and too permissive with exceptions. Erring in either direction can:

  1. Make NC transactions fail with exceptions that are bugs in internal Hathor code, becoming part of the blockchain, or
  2. Make the full node crash with exceptions that can be caused by user code.

Before this PR, ALL exceptions raised during contract execution would become part of the blockchain state forever through the tx execution state. This problem has been mitigated, because now only exceptions with specific types become part of the tx failure. Any other exceptions, for example from bugs or asserts, will now crash the full node as expected.

This means we can still incorrectly make a bug part of the blockchain state if we raise an exception with the wrong type, for example. The opposite is also true: we can incorrectly raise a non-tx-failing exception where a tx-failing one was needed, which will crash the full node instead of failing the tx.
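As a rough sketch of the behavior described in this section, the execution entrypoint only treats one family of types as a tx failure and lets everything else propagate. The function execute_nc_tx, the NCInvalidAction subclass, and the string results are hypothetical, chosen for illustration; only the __NCTransactionFail__ name comes from this PR.

```python
class __NCTransactionFail__(BaseException):
    """Base type for exceptions that fail the tx (body is an assumption)."""


class NCInvalidAction(__NCTransactionFail__):
    """Hypothetical example of an expected, tx-failing error."""


def execute_nc_tx(run):
    """Hypothetical entrypoint: run the contract call and report its outcome."""
    try:
        run()
    except __NCTransactionFail__:
        # Only this family becomes part of the tx execution state.
        return "failed"
    return "success"


def expected_failure():
    raise NCInvalidAction("action not allowed")


def internal_bug():
    # An assert tripping in internal code is not a tx failure.
    assert False, "invariant violated"
```

With this shape, execute_nc_tx(expected_failure) returns "failed", while execute_nc_tx(internal_bug) lets the AssertionError propagate to the caller, which in the full node would mean a crash.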

Serialization

The serialization system is too extensive and raises a lot of exceptions internally. Instead of refactoring the whole module to use specific exceptions, I just handled them on the boundary since NC execution only calls it in a single place (not quite true, as explained below).

For that, I identified it can raise 3 exception types: SerializationError, ValueError, TypeError. This means that any other exceptions raised in the serialization module will crash the full node instead of becoming part of the blockchain. If there are any other known exceptions that can be raised, we should include them in this handling. But it's hard to do better than grepping for raises.

It also means that any ValueError or TypeError raised by Python itself will also become a reason for tx execution failure, even if it's caused by a bug in internal code. Ideally we should refactor the module to use specific custom exceptions.
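Normalizing at the boundary could look roughly like the sketch below. The SerializationError class here is a local stand-in for the real one, and NCSerializationFail and call_serialization are hypothetical names; the point is only that the three listed types are converted while anything else passes through untouched.

```python
class SerializationError(Exception):
    """Local stand-in for the serialization module's own error type."""


class __NCTransactionFail__(BaseException):
    """Base type for exceptions that fail the tx (body is an assumption)."""


class NCSerializationFail(__NCTransactionFail__):
    """Hypothetical tx-failing wrapper for serialization errors."""


def call_serialization(fn, *args):
    """Call into serialization code, normalizing the three known types."""
    try:
        return fn(*args)
    except (SerializationError, ValueError, TypeError) as e:
        raise NCSerializationFail(f"{type(e).__name__}: {e}") from e
```

For example, call_serialization(bytes.fromhex, "zz") raises NCSerializationFail because bytes.fromhex raises ValueError, whereas a KeyError raised inside the wrapped call is not converted and keeps propagating as a KeyError.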

Dealing with boundaries

Calling user code from internal Hathor code

This is the simpler case, because there are only a few places where user code is called. They are contained in metered_exec.py and happen through calls to stdlib's compile() and exec(). We simply catch any raised Exceptions with the new @user_code_called_from_internal_code decorator and can assume they're unhandled exceptions or bugs in user code, making the caller transaction fail.
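A minimal sketch of such a decorator follows, assuming NCUserException derives from __NCTransactionFail__ (the exact hierarchy in the PR may differ). The run_blueprint function is hypothetical and stands in for the compile() and exec() calls in metered_exec.py.

```python
import functools


class __NCTransactionFail__(BaseException):
    """Base type for exceptions that fail the tx (body is an assumption)."""


class NCUserException(__NCTransactionFail__):
    """Assumed wrapper for anything raised by user (blueprint) code."""


def user_code_called_from_internal_code(fn):
    """Wrap any Exception escaping user code in a tx-failing type."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            raise NCUserException(f"{type(e).__name__}: {e}") from e
    return wrapper


@user_code_called_from_internal_code
def run_blueprint(source, env):
    """Hypothetical stand-in for the compile()/exec() calls."""
    exec(compile(source, "<blueprint>", "exec"), env)
```

Because the wrapper catches Exception and not BaseException, a tx-failing exception raised deeper in the call stack (for example by a syscall) passes through unchanged instead of being wrapped a second time.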

Calling internal Hathor code from user code

This case has a lot more nuance and is partially an open discussion. There are multiple places where user code calls internal code, and ideally all of them should be decorated with @internal_code_called_from_user_code, but it's not that simple, as explained below:

  1. Syscalls: This is the most straightforward one. We use the new @internal_code_called_from_user_code decorator to wrap all methods. This means any raised __NCTransactionFail__ will cause the tx to fail, and any other exceptions will cause the full node to crash.
  2. RNG: This module also has methods that are called from user code. For now, the decorator is not used. This means any exception raised in this module will cause the tx to fail, including bugs in our own code. The reason I didn't use the decorator for now is that many of the exceptions thrown in these methods can be caused by users. For example, we have asserts validating user input. If we used the decorator, the user would be able to crash the full node. We can either leave it like this and assume the risk of our own bugs causing tx failures, or refactor the module so it can't raise any exceptions caused by user input. This would have to be done in a separate PR.
  3. Logging: The NCLogger is analogous to the RNG, with the difference that it's way simpler. For this reason, I did use the decorator in its methods, after reviewing its code searching for exceptions that could be triggered by user input, and not finding any. There's a risk that assumption is incorrect, in which case there are ways for users to crash the full node.
  4. Attributes: This is likely the hardest case, when we use methods such as __get__, __set__, __del__, __getitem__, etc. These methods indirectly call the serialization system which can raise a lot of exceptions, as explained above. These exceptions can be caused by user input and are not normalized to __NCTransactionFail__. This means we cannot use the decorator, and all exceptions raised by these methods will fail the tx. There's an extensive API surface that is indirectly called by these methods that would have to be reviewed carefully to allow using the decorator. For now, we assume the risk that any bug in that part of the code will cause txs to fail, becoming part of the blockchain.
  5. Field Methods: There are field methods such as DequeStorageContainer's append, pop, etc, that are analogous to attributes and are not decorated, for now.

Checklist

  • If you are requesting a merge into master, confirm this code is production-ready and can be included in future releases as soon as it gets merged

@glevco glevco self-assigned this Jul 15, 2025
@glevco glevco requested review from jansegre and msbrogli as code owners July 15, 2025 16:22
@glevco glevco moved this from Todo to In Progress (WIP) in Hathor Network Jul 15, 2025
github-actions bot commented Jul 15, 2025

🐰 Bencher Report

Branch: refactor/nano/error-handling-2
Testbed: ubuntu-22.04

Benchmark | Latency: Benchmark Result, minutes (Result Δ%) | Lower Boundary, minutes (Limit %) | Upper Boundary, minutes (Limit %)
sync-v2 (up to 20000 blocks) | 1.64 m (-0.04%), baseline 1.64 m | 1.47 m (90.04%) | 1.80 m (90.87%)

@glevco glevco force-pushed the refactor/nano/error-handling-2 branch 2 times, most recently from 8979377 to b96dc40 Compare July 16, 2025 00:20
@glevco glevco changed the title from "Refactor/nano/error handling 2" to "fix(nano): improve error handling" Jul 16, 2025

codecov bot commented Jul 16, 2025

Codecov Report

❌ Patch coverage is 94.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.59%. Comparing base (5c883a7) to head (3a08627).
⚠️ Report is 147 commits behind head on master.

Files with missing lines | Patch % | Lines
hathor/nanocontracts/error_handling.py | 93.75% | 1 Missing and 1 partial ⚠️
hathor/nanocontracts/method.py | 85.71% | 2 Missing ⚠️
hathor/nanocontracts/context.py | 66.66% | 1 Missing ⚠️
hathor/nanocontracts/nc_exec_logs.py | 87.50% | 1 Missing ⚠️
hathor/nanocontracts/runner/runner.py | 80.00% | 1 Missing ⚠️
hathor/p2p/sync_v2/blockchain_streaming_client.py | 50.00% | 1 Missing ⚠️
hathor/p2p/sync_v2/transaction_streaming_client.py | 50.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1321      +/-   ##
==========================================
- Coverage   85.65%   85.59%   -0.07%     
==========================================
  Files         424      425       +1     
  Lines       32095    32156      +61     
  Branches     4994     5045      +51     
==========================================
+ Hits        27492    27524      +32     
- Misses       3603     3623      +20     
- Partials     1000     1009       +9     

@glevco glevco force-pushed the refactor/nano/error-handling-2 branch 5 times, most recently from d3075f7 to 22fc416 Compare July 16, 2025 21:26
@glevco glevco force-pushed the refactor/nano/error-handling-2 branch 4 times, most recently from 98969a4 to cef01ac Compare July 17, 2025 13:43
@glevco glevco moved this from In Progress (WIP) to In Progress (Done) in Hathor Network Jul 17, 2025
@glevco glevco force-pushed the refactor/nano/error-handling-2 branch from cef01ac to 57122df Compare July 17, 2025 18:58
      if type(value) is bytes:
          value = value.hex()
- except Exception as e:
+ except (Exception, NCInternalException) as e:
Member

Even if there's a bug in our code, I guess we should return an NCValueErrorResponse (instead of failing the whole execution). What do you think?

Contributor Author

I didn't get it, isn't that the current behavior?

  try:
      remote_sync_versions = _parse_sync_versions(data)
- except HathorError as e:
+ except (HathorError, __NCTransactionFail__) as e:
Member

How can this happen?

Contributor Author

Please see #1321 (comment). However I did remove it from here in 4bfc918 because for this case it doesn't make sense, indeed.

  try:
      self.vertex_handler.on_new_block(blk, deps=[])
- except HathorError:
+ except (HathorError, __NCTransactionFail__):
Member

How can this happen here?

Contributor Author

This method ends up calling verification on transactions, which can raise nano-related exceptions.

Before, all nano exceptions inherited from NCError(HathorError), and now they inherit from __NCTransactionFail__(BaseException). It's not possible to make __NCTransactionFail__ inherit from HathorError because it has to be a BaseException and not an Exception (we could change HathorError to inherit from BaseException instead of Exception too, but that could have unwanted side effects).

In an attempt to be safe and keep the previous behavior, I grepped all usages of except HathorError and added the __NCTransactionFail__ here, too. Before, nano exceptions were caught because they inherited from HathorError, but with the hierarchy change they wouldn't be caught here anymore. So I had to catch the new hierarchy too.

Member

Shouldn't we prevent __NCTransactionFail__ from leaking outside of the "nano code"?

  try:
      yield self.sync_agent.on_block_complete(blk, vertex_list)
- except HathorError as e:
+ except (HathorError, __NCTransactionFail__) as e:
Member

How can this happen here?

Contributor Author

Please see #1321 (comment)

  try:
      self._verification_service.validate_full(vertex, reject_locked_reward=reject_locked_reward)
- except HathorError as e:
+ except (HathorError, __NCTransactionFail__) as e:
Member

How can this happen here?

Contributor Author

Please see #1321 (comment)

- except NCFail as e:
+ except __NCTransactionFail__ as e:
+     # These are the exception types that will make an NC transaction fail.
+     # Any other exception will bubble up and crash the full node.
Member

Even if there's a bug that generates an exception, is crashing the full node the best solution? I guess we could safely log that an unhandled exception happened in our code and void the transaction. In this case, the fix would need to be activated using the feature activation service.

Contributor Author

Wouldn't that defeat the purpose of this PR? What it tries to do is make sure NCs fail only when the exception is not a bug from our code. Even then, it doesn't fully accomplish this task because of the Risks and Discussion section in the PR description.

If we accept the fact that we'll fail NCs with exceptions from our own bugs, isn't that the same as doing the except Exception on the execution entrypoint?

Also, it's worth mentioning that failing NCs with our bugs is more than just an inconvenience for the users: it could make blueprints reach unrecoverable states that would cause locked funds in contracts, until fixed with feature activation.

@github-project-automation github-project-automation bot moved this from In Progress (Done) to In Review (WIP) in Hathor Network Jul 22, 2025
@glevco glevco force-pushed the refactor/nano/error-handling-2 branch from 57122df to deb8a99 Compare July 25, 2025 19:39
jansegre previously approved these changes Jul 28, 2025
glevco added 2 commits July 29, 2025 13:13
# Conflicts:
#	hathor/nanocontracts/blueprint_env.py
@glevco glevco force-pushed the refactor/nano/error-handling-2 branch from 4bfc918 to 3a08627 Compare July 29, 2025 16:13
@glevco glevco moved this from In Review (WIP) to In Progress (WIP) in Hathor Network Aug 26, 2025

glevco commented Oct 15, 2025

This PR became too complex and most of its changes were unnecessary. It's replaced by the simpler #1468 which makes only the essential changes.

@glevco glevco closed this Oct 15, 2025
@github-project-automation github-project-automation bot moved this from In Progress (WIP) to Waiting to be deployed in Hathor Network Oct 15, 2025
@glevco glevco mentioned this pull request Oct 15, 2025
1 task
@jansegre jansegre moved this from Waiting to be deployed to Done in Hathor Network Oct 15, 2025