From 607364af5dab41f743a9492b8c2c617bd60c1880 Mon Sep 17 00:00:00 2001
From: Alex Crichton
Date: Wed, 14 Aug 2024 12:25:52 -0700
Subject: [PATCH] Fill out an initial Overview.md

This is mostly a transcription of the presentation we made to the CG this
week. I've also taken the time to flesh out a description of the current state
of WebAssembly, at least relative to Wasmtime, with some links and words.
---
 proposals/128-bit-arithmetic/Overview.md | 277 ++++++++++++++++++++++-
 1 file changed, 273 insertions(+), 4 deletions(-)

diff --git a/proposals/128-bit-arithmetic/Overview.md b/proposals/128-bit-arithmetic/Overview.md
index 7e1e9c7415..84c9928255 100644
--- a/proposals/128-bit-arithmetic/Overview.md
+++ b/proposals/128-bit-arithmetic/Overview.md
@@ -2,19 +2,221 @@

 ## Motivation

-... TODO ...
+There are a number of use cases for 128-bit numbers and arithmetic in source
+languages today, such as:
+
+* Arbitrary precision math - many languages have a bignum-style library which
+  provides arbitrary precision integers, for example libgmp in C, built-in
+  integers in Python, `BigInt` in JS, etc. Big integers have a range of
+  specific applications as well, which can include being integral portions of
+  cryptographic algorithms.
+
+* Checking for overflow - some programs may want to check for overflow when
+  performing arithmetic operations, such as seeing if a 64-bit addition
+  overflowed. 128-bit arithmetic can be used to detect these sorts of
+  situations.
+
+* Niche bit tricks - some PRNGs use 128-bit integer state for efficient storage
+  and calculation of the next state. 128-bit integers have also been used for
+  hash table indexing.
+
+Today, however, these use cases of 128-bit integers are significantly slower in
+WebAssembly than they are on native platforms. The performance gap can range
+from 2-7x slower than native at this time.
+
+The goal of this proposal is to close this performance gap between native and
+WebAssembly by adding new instructions which enable more efficient lowerings of
+128-bit arithmetic operations.
+
+### WebAssembly today with 128-bit arithmetic
+
+[This is an example](https://godbolt.org/z/fMdjqvEaq) of what LLVM emits today
+for 128-bit operations in source languages. Notably:
+
+* `i64.add128` - the equivalent 128-bit addition today expands to three `add`
+  instructions plus comparisons (a sketch of this shape of code appears below).
+* `i64.sub128` - same as `i64.add128`, but with `sub` instructions.
+* `i64.mul128` - the equivalent 128-bit multiplication today uses the
+  `__multi3` libcall, which is significantly slower than performing the
+  operation inline.
+
+For the same code [this is what native platforms
+emit](https://godbolt.org/z/65d45ff5K). Notably:
+
+* x86\_64 - addition/subtraction use `adc` and `sbb` to tightly couple the two
+  additions/subtractions together and avoid moving the flags register into a
+  general purpose register. Multiplication uses the native `mul` instruction,
+  which produces a 128-bit result and is much more efficient than the
+  implementation of `__multi3`.
+* aarch64 - addition/subtraction also use `adc` and `sbc` like x86\_64.
+  Multiplication uses `umulh` to generate the upper bits of a multiplication
+  and can efficiently use `madd` as well. This is a much more compact sequence
+  than `__multi3`.
+* riscv64 - this architecture notably does not have overflow flags and the
+  generated code looks quite similar to the WebAssembly. Multiplication,
+  however, has access to `mulhu`, which WebAssembly does not easily provide.
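+
+To make the first listing above concrete, the following is a hand-written
+sketch, in WebAssembly text format, of roughly the shape of code that a 128-bit
+addition lowers to today without this proposal (the function and local names
+here are purely illustrative): the low halves are added, the carry is recovered
+with an unsigned comparison, and the carry is then folded into the sum of the
+high halves.
+
+```wasm
+(module
+  ;; Roughly today's lowering of a 128-bit addition, without this proposal.
+  (func $add128_today
+    (param $lhs_lo i64) (param $lhs_hi i64)
+    (param $rhs_lo i64) (param $rhs_hi i64)
+    (result i64 i64)
+    (local $lo i64)
+    ;; Low 64 bits of the result.
+    (local.set $lo (i64.add (local.get $lhs_lo) (local.get $rhs_lo)))
+    (local.get $lo)
+    ;; High 64 bits: lhs_hi + rhs_hi + carry, where the carry out of the low
+    ;; halves is recovered with the unsigned comparison `lo < lhs_lo`.
+    (i64.add
+      (i64.add (local.get $lhs_hi) (local.get $rhs_hi))
+      (i64.extend_i32_u (i64.lt_u (local.get $lo) (local.get $lhs_lo))))
+  )
+)
+```
+
+It is this comparison-based carry propagation that native code can instead
+express with a single flags-based instruction such as `adc`.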
+
+For a comparison, [this is the generated output of
+Wasmtime](https://godbolt.org/z/46dcajxWa) for add/sub given the WebAssembly
+that LLVM emits today (edited to produce a multivalue result instead of storing
+it into memory). Notably:
+
+* x86\_64 - addition/subtraction is not pattern-matched to generate `adc` or
+  `sbb`, meaning that a compare-and-set is required.
+* aarch64 - same consequences as x86\_64.
+* riscv64 - the generated code mostly matches native output modulo frame
+  pointer setup/teardown. On riscv64 it's expected that `i64.{add,sub}128`
+  won't provide much of a performance benefit over today. Multiplication,
+  however, will still be faster.
+
+Overall the main causes of the slowdowns are:
+
+* On x86\_64 and aarch64 WebAssembly doesn't provide access to the overflow
+  flags produced by `add` and `adds`, and thus it's difficult for compilers to
+  pattern-match and generate instructions such as `adc`, `sbb`, and `sbc`.
+* On all platforms the `__multi3` libcall is significantly slower than native
+  instructions because the libcall itself can't use the native instructions and
+  the libcall's results are required to travel through memory (according to its
+  ABI).
+
+This proposal's native instructions for 128-bit operations should solve all of
+these issues.

 ## Proposal

-... TODO ...
+This proposal currently adds three new instructions to WebAssembly:
+
+* `i64.add128`
+* `i64.sub128`
+* `i64.mul128`
+
+These instructions all have the type `[i64 i64 i64 i64] -> [i64 i64]` where the
+values are:
+
+* i64 argument 0 - the low 64 bits of the left-hand-side argument
+* i64 argument 1 - the high 64 bits of the left-hand-side argument
+* i64 argument 2 - the low 64 bits of the right-hand-side argument
+* i64 argument 3 - the high 64 bits of the right-hand-side argument
+* i64 result 0 - the low 64 bits of the result
+* i64 result 1 - the high 64 bits of the result
+
+Each 128-bit operand and result is split into a low/high pair of `i64` values.
+The semantics of add/sub/mul are the same as their 64-bit equivalents except
+that they work at the level of 128 bits instead of 64 bits.

 ## Example

-... TODO ...
+An example of implementing Rust's
+[`u64::overflowing_add`](https://doc.rust-lang.org/std/primitive.u64.html#method.overflowing_add)
+in WebAssembly might look like:
+
+```wasm
+(module
+  (func $"u64::overflowing_add"
+    (param i64 i64) (result i64 i64)
+    (i64.add128
+      (local.get 0) (i64.const 0) ;; lo/hi of lhs
+      (local.get 1) (i64.const 0) ;; lo/hi of rhs
+    )
+  )
+)
+```
+
+Here the two input values are zero-extended with constant 0 upper bits. The
+overflow flag, the second result, is guaranteed to be either 0 or 1 depending
+on whether overflow occurred.
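+
+Following the same pattern, below are two more hand-written sketches. The
+function names mirror Rust integer methods but are illustrative only: a full
+128-bit wrapping addition where both operands are already split into lo/hi
+pairs, and a 64x64 -> 128-bit widening multiplication built by zero-extending
+both operands into `i64.mul128`.
+
+```wasm
+(module
+  ;; u128 + u128 (wrapping): both operands and the result are lo/hi pairs.
+  (func $"u128::wrapping_add"
+    (param $a_lo i64) (param $a_hi i64)
+    (param $b_lo i64) (param $b_hi i64)
+    (result i64 i64)
+    (i64.add128
+      (local.get $a_lo) (local.get $a_hi)
+      (local.get $b_lo) (local.get $b_hi)
+    )
+  )
+
+  ;; u64 x u64 -> u128: zero-extend both operands by passing 0 as the high
+  ;; halves; the two results are the low and high halves of the full product.
+  (func $"u64::widening_mul"
+    (param $a i64) (param $b i64)
+    (result i64 i64)
+    (i64.mul128
+      (local.get $a) (i64.const 0)
+      (local.get $b) (i64.const 0)
+    )
+  )
+)
+```
+
+Both functions rely only on the typing and semantics of the new instructions as
+described in the proposal section above.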

 ## Spec Changes

-... TODO ...
+### Structure
+
+The definition for [numeric
+instructions](https://webassembly.github.io/spec/core/syntax/instructions.html#numeric-instructions)
+will be extended with:
+
+```
+instr ::= ...
+        | i64.{binop128}
+
+binop128 ::= add128
+           | sub128
+           | mul128
+```
+
+### Validation
+
+Validation of [numeric
+instructions](https://webassembly.github.io/spec/core/valid/instructions.html#numeric-instructions)
+will be updated to contain:
+
+```
+i64.{binop128}
+
+* The instruction is valid with type [i64 i64 i64 i64] -> [i64 i64]
+
+
+  ----------------------------------------------------
+  C ⊢ i64.{binop128} : [i64 i64 i64 i64] -> [i64 i64]
+
+```
+
+### Execution
+
+Execution of [numeric
+instructions](https://webassembly.github.io/spec/core/exec/instructions.html#numeric-instructions)
+will be updated with:
+
+```
+i64.{binop128}
+
+* Assert: due to validation, four values of type i64 are on the top of the stack.
+* Pop the value `i64.const c4` from the stack.
+* Pop the value `i64.const c3` from the stack.
+* Pop the value `i64.const c2` from the stack.
+* Pop the value `i64.const c1` from the stack.
+* Create 128-bit value `v1` by concatenating `c1` and `c2` where `c1` is the
+  low 64 bits and `c2` is the upper 64 bits.
+* Create 128-bit value `v2` by concatenating `c3` and `c4` where `c3` is the
+  low 64 bits and `c4` is the upper 64 bits.
+* Let `r` be the result of computing `{binop128}(v1, v2)`.
+* Let `r1` be the low 64 bits of `r`.
+* Let `r2` be the high 64 bits of `r`.
+* Push the value `i64.const r1` to the stack.
+* Push the value `i64.const r2` to the stack.
+
+
+  (i64.const c1) (i64.const c2) (i64.const c3) (i64.const c4) i64.{binop128}
+    ↪ (i64.const r1) (i64.const r2)
+        (if r1:r2 = {binop128}(c1:c2, c3:c4))
+```
+
+### Binary Format
+
+The binary format for [numeric
+instructions](https://webassembly.github.io/spec/core/binary/instructions.html#numeric-instructions)
+will be extended with:
+
+```
+instr ::= ...
+        | 0xFC 19:u32 ⇒ i64.add128
+        | 0xFC 20:u32 ⇒ i64.sub128
+        | 0xFC 21:u32 ⇒ i64.mul128
+```
+
+> **Note**: opcodes 0-7 are the `*.trunc_sat_*` instructions and opcodes 8-17
+> are the bulk-memory and reference-types `{table,memory}.{copy,fill,init}`,
+> `{elem,data}.drop`, and `table.{grow,size}` instructions. Opcode 18 is
+> proposed to be `memory.discard`.
+
+### Text Format
+
+The text format for [numeric
+instructions](https://webassembly.github.io/spec/core/text/instructions.html#numeric-instructions)
+will be extended with:
+
+```
+plaininstr_l ::= ...
+              | 'i64.add128' ⇒ i64.add128
+              | 'i64.sub128' ⇒ i64.sub128
+              | 'i64.mul128' ⇒ i64.mul128
+```

 ## Implementation Status

@@ -22,4 +224,71 @@

 ## Alternatives

+### Alternative: Overflow Flags
+
 ... TODO ...
+
+### Alternative: Widening multiplication
+
+Instead of `i64.mul128` it would be sufficient to add instructions such as
+`i64.mul_wide_{u,s}` which are typed as `[i64 i64] -> [i64 i64]` and are
+defined as producing a 128-bit result by multiplying the two provided 64-bit
+values. This corresponds to `mul` and `imul` on x86\_64 and has [equivalents on
+other platforms as well](https://godbolt.org/z/eojr3MdWz). This is a
+lower-level primitive than `i64.mul128` and is well-supported across
+architectures. Given the current shape of the proposal, however, it "feels
+cleaner" to have `i64.mul128`. The `i64.mul_wide_u` instruction can be encoded
+as `i64.mul128` with constant 0 values for the upper bits, and `i64.mul_wide_s`
+can be encoded similarly by instead computing each upper half as an arithmetic
+right-shift of the low half by 63 (a sign extension), as sketched below. This
+means that `i64.mul128` should be sufficient for expressing the use cases of
+`i64.mul_wide_{s,u}`.
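+
+As a sketch of that encoding, a hypothetical `i64.mul_wide_s` could be
+expressed with `i64.mul128` by sign-extending each operand, i.e. computing its
+upper half as an arithmetic right-shift by 63 (the function name below is
+illustrative only):
+
+```wasm
+(module
+  ;; Signed 64x64 -> 128-bit multiplication expressed with `i64.mul128`: the
+  ;; upper half of each operand is all zeros or all ones depending on its
+  ;; sign, i.e. an arithmetic shift right by 63.
+  (func $mul_wide_s (param $a i64) (param $b i64) (result i64 i64)
+    (i64.mul128
+      (local.get $a) (i64.shr_s (local.get $a) (i64.const 63))
+      (local.get $b) (i64.shr_s (local.get $b) (i64.const 63))
+    )
+  )
+)
+```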
+
+### Alternative: Why not add an `i128` type to WebAssembly?
+
+Frontends compiling to WebAssembly are currently required to lower
+source-language-level `i128` types into two 64-bit halves. This is done by
+LLVM, for example, when lowering its internal `i128` type to WebAssembly.
+Adding `i128` to WebAssembly would remove the need for this lowering and for
+instructions such as `i64.add128`, which would instead be `i128.add`.
+
+This alternative, though, is a major change to WebAssembly and can be a very
+large increase in complexity for engines. Given the relatively niche use cases
+for 128-bit integers, this is seen as an imbalance in responsibilities where a
+rarely used feature would require a significant amount of investment in
+engines to support.
+
+Native ISAs also typically do not have a 128-bit integer type. This means that
+most operations, such as bit operations or loads/stores, would need to be
+emulated with 64-bit values anyway. Loads/stores of 128-bit values in
+WebAssembly can also raise questions of tearing in threaded settings in
+addition to partially-out-of-bounds loads/stores.
+
+This leads to the conclusion to not add `i128` to WebAssembly and instead use
+the other types already present in WebAssembly.
+
+### Alternative: Why not use `v128` as an operand type?
+
+WebAssembly already has a 128-bit value type, `v128`, from the SIMD proposal.
+Compilers, however, typically keep this value in vector registers such as
+`%xmmN`. An operation like `i64.add128` would then have to move `%xmmN` into
+general purpose registers, perform the operation, and then move the result back
+into an `%xmmN` register. This is hypothesized to pessimize performance.
+
+Alternatively, compilers could keep track of whether the value is in an `%xmmN`
+vector register or in a general purpose register, but this is seen as a
+significant increase in complexity for code translators.
+
+Overall it seemed best to use `i64` operands instead of `v128` as it more
+closely maps to what native platforms do by operating on values in
+general-purpose registers.
+
+### Alternative: Why not add `i64.div128_{u,s}`?
+
+Native ISAs generally do not have support for full 128-bit division. The
+x86\_64 ISA has the ability to divide a 128-bit value by a 64-bit value,
+producing a 64-bit result, but this doesn't match the desired semantics of
+`i64.div128_{u,s}`, namely being the 128-bit equivalent of `i64.div_{u,s}`.
+
+Additionally, even for native platforms LLVM [unconditionally lowers 128-bit
+division](https://godbolt.org/z/4xbGvbxja) to a libcall to the `__udivti3`
+function. It's expected that a host-provided implementation of `__udivti3` is
+unlikely to be significantly faster than `__udivti3`-compiled-to-WebAssembly.