Skip to content
This repository has been archived by the owner on Jul 15, 2018. It is now read-only.

Explain Determinism #56

Closed
ethanfrey opened this issue Jan 19, 2017 · 3 comments
Closed

Explain Determinism #56

ethanfrey opened this issue Jan 19, 2017 · 3 comments
Assignees
Labels

Comments

@ethanfrey
Copy link
Contributor

ethanfrey commented Jan 19, 2017

This is more a question and a proposal to enhance documentation. Please give feedback here and then we can make some good docs out of it.

I often hear @jaekwon and @ebuchman saying that the hardest part of blockchain programming is the determinism. And I thought it was the p2p networking and bft consensus layer. Anyway, this brings me to the point that we probably need to make this very explicit and document what forms indeterminism can creep into our system.

In general, one assumes arbitrary op codes on a cpu are deterministic. I mean 2+2=4 and push, pop, jump, add, etc are all quite deterministic actions. Where then does the indeterminism come from? (short answer: syscall)

Below is my first attempt to explain sources of indeterminism, please expand, and comment:

  1. Exceptional failures of the hardware
  • cosmic rays flip bits and the cpu overheated causing a short in the addition registers. This can happen, but should be regarded as so rare and unpredictable as not warrant much more concern.
  1. Relying on some node-dependent state
  • calls to /dev/urandom or syscall.random should be different on every machine
  • calls to time, timeofday, etc. will also be different on other machines, even if one tries to sync the clocks well
  1. Relying on unspecified implementation details
  • Mathematical functions are deterministic in that if f(x) = y, that is true over all time and space
  • However, many common packages hold a lesser rule. The same code may be deterministic (produce the same result if called 100 times), but there is no guarantee that v1.8.1 and v1.8.2 of the library will produce the same results. Especially, if client/servers use different libraries and/or languages to implement this function.
  • Race conditions (depending on the timing of concurrent processes)
  • Floating point numbers can easily introduce indeterminism, which is theoretically possible but tricky to avoid.
  • Another example is json encoding (there are many valid json encodings for one struct), and a simple changing of the spacing will break determinism
    • json decoding, however, should be deterministic (with the exception of floats, see above)
  • The order of items when iterating through a map/hashtable/dict/whatever, is very dependent on the implementation, and should be considered random.
    • However, if you just iterate to get a total count, which is invariant on the ordering of the items, then your code is once again deterministic
  1. Depending on results from external systems
  • Filesystems can be full unpredictably, this can introduce non-deterministic failures (but generally not false values except for condition 1 above)
  • Networking calls can also introduce not just non-deterministic failures, but also non-deterministic successes. If I make a REST call on some API, even if there is no networking error, it may return different results each time.
  1. Other categories???

Another way to break these down is:

  1. Sources that may introduce non-deterministic failures
  • As long as your code handles these non-deterministic errors properly, leading to aborting the operation, and restarting from the beginning of the block (last known deterministic state), you should be able to handle this in a way that a system will remain deadlocked until the failure is resolved, rather than produce inconsistent results.
  • Careful coding can limit problems from non-deterministic failures, but we should give examples of how to do this safely, and also try to avoid these cases as much as possible
  1. Sources that may introduce non-deterministic successes
  • These must be avoided at all costs, and should be well documented for all developers
  • It would be great to build a tool (like go vet) that checks code for these cases
  • Main danger zones: network calls, time, urandom (duh), floats, json encoding, iteration order over maps, ...

Okay, done for now, would love some feedback here. And maybe we can make a wiki page out of the results from this issue.

@melekes
Copy link
Contributor

melekes commented Jan 19, 2017

Great write up! I think this article http://www.ocoudert.com/blog/2011/05/30/how-to-make-software-deterministic/#memory_address is worth reading as well.

But should be building a tool (like go vet) or our own language (+10000 hours of work) a priority? To draw the analogy with RethinkDB (post-mortem), isn't it like chasing correctness instead of a usecase? Or it will be our main advantage?

@zramsay zramsay self-assigned this Aug 24, 2017
@zramsay zramsay removed their assignment Oct 3, 2017
@ebuchman
Copy link
Contributor

Let's make sure we have a clear place in our docs for talking about this. Then we can close this issue @zramsay

@zramsay
Copy link
Contributor

zramsay commented Mar 6, 2018

tendermint/tendermint#1280 provides a place to discuss ;)

@zramsay zramsay closed this as completed Mar 6, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants