From a6a212a891c38bb3623d0b40ef1e76a98339b0a7 Mon Sep 17 00:00:00 2001 From: Joe Caulfield Date: Tue, 8 Jul 2025 03:37:38 +0000 Subject: [PATCH 1/3] vm register 2 instruction data pointer --- .../0321-vm-r2-instruction-data-pointer.md | 115 ++++++++++++++++++ 1 file changed, 115 insertions(+) create mode 100644 proposals/0321-vm-r2-instruction-data-pointer.md diff --git a/proposals/0321-vm-r2-instruction-data-pointer.md b/proposals/0321-vm-r2-instruction-data-pointer.md new file mode 100644 index 000000000..e647a83a3 --- /dev/null +++ b/proposals/0321-vm-r2-instruction-data-pointer.md @@ -0,0 +1,115 @@ +--- +simd: '0321' +title: VM Register 2 Instruction Data Pointer +authors: + - Joe Caulfield (Anza) +category: Standard +type: Core +status: Review +created: 2025-07-11 +feature: (fill in with feature key and github tracking issues once accepted) +--- + +## Summary + +Provide a pointer to instruction data in VM register 2 (`r2`) at program +entrypoint, enabling direct access to instruction data without parsing the +serialized input region. + +## Motivation + +Currently, sBPF programs must parse the entire serialized input region to +locate instruction data. The serialization layout places accounts before +instruction data, requiring programs to iterate through all accounts before +reaching the instruction data section. This is inefficient for programs that +primarily or exclusively need to access instruction data. + +By providing a direct pointer to instruction data in `r2`, programs can +immediately access this data without any parsing overhead, resulting in +improved performance and reduced compute unit consumption. + +## New Terminology + +* **Instruction data pointer**: An 8-byte pointer stored in VM register 2 that + points directly to the start of the instruction data section in the input + region. 
+ +## Detailed Design + +When the feature is activated, the VM shall set register 2 (`r2`) to contain a +pointer to the beginning of the instruction data section within the input +region. The instruction data format remains unchanged: + +``` +[8 bytes: data length (little-endian)][N bytes: instruction data] +``` + +This pointer in `r2` is made available to all programs, under all loaders, +regardless of whether or not the value is read. Prior to this feature, `r2` +contains uninitialized data at program entrypoint. This change assumes no +existing programs depend on the garbage value in `r2`. + +**Register Assignment:** + +* `r1`: Input region pointer (existing behavior) +* `r2`: Pointer to instruction data section (new) + +**Pointer Details:** + +* The pointer in `r2` points to the first byte of the actual instruction data, + NOT the length field. +* The pointer value in `r2` is stored as a native 64-bit pointer (8 bytes) in + little-endian format (x86_64). +* When there is no instruction data (length = 0), `r2` still points to where + the instruction data would be, immediately after the 8-byte length field. +* The pointer must always point to valid memory within the input region bounds. + +## Alternatives Considered + +1. **Provide a pointer to instruction data length**: Store a pointer to the + instruction data length field in `r2`. However, providing a direct pointer to + the start of instruction data is more ergonomic. + +2. **Provide optional entrypoint parameter**: Allow programs to opt-in via a + different entrypoint signature. The current approach is simpler as it avoids + supporting multiple entrypoint signatures and makes the pointer universally + available. This relies on the assumption that no programs depend on the + garbage value previously in `r2`. + +3. **Modify serialization layout**: The serialization layout will eventually be + overhauled with ABI v2, a comprehensive upgrade that could resolve this issue + among many others. 
Given the significant scope of ABI v2 and potential for + delays, this targeted optimization provides immediate value. + +## Impact + +On-chain programs are positively impacted by this change. The new `r2` pointer +gives programs the ability to efficiently read instruction data, further +customize their program's control flow and maximize compute unit efficiency. +However, any programs that currently depend on the uninitialized/garbage value +in `r2` at entrypoint will break when this feature is activated. + +Validators are almost completely unaffected as the instruction data pointer is +already available during serialization, and setting a register is a negligible +CPU operation. + +Core contributors must implement this feature, which should be extremely +minimally invasive, depending on the VM implementation. + +## Security Considerations + +Programs should read and validate the instruction data length (stored at `r2 - 8`) +before accessing data via the `r2` pointer. Failing to check the length could +result in reading unintended memory contents or out-of-bounds access attempts. + +Additionally, programs that currently rely on `r2` containing uninitialized or +garbage data at entrypoint will experience breaking changes when this feature +is activated. + +## Backwards Compatibility + +This feature is only backwards compatible for programs that currently do not +read from `r2` at program entrypoint. + +This feature is NOT backwards compatible for any programs that depend on the +uninitialized/garbage data previously in `r2`. 
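The Security Considerations guidance in the patch above (read the length at `r2 - 8` before touching the data at `r2`) can be sketched roughly as follows. This is an illustration only, not the proposal's or any VM's implementation: it models the `r2` pointer as a byte offset into a toy buffer, and the names `instruction_data` and `build_region`, along with the `0xAA` placeholder program ID, are hypothetical. It assumes the layout `[8-byte little-endian length][N bytes of data][32-byte program ID]`.

```rust
/// Hypothetical helper: given the input region and the offset that `r2`
/// would point to, return the instruction data, or `None` if the length
/// field is missing or claims more bytes than the region holds.
fn instruction_data(input: &[u8], r2_offset: usize) -> Option<&[u8]> {
    // The 8-byte little-endian length sits immediately before the data.
    let len_start = r2_offset.checked_sub(8)?;
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(input.get(len_start..r2_offset)?);
    // Assumes a 64-bit target, so the cast does not truncate.
    let len = u64::from_le_bytes(len_bytes) as usize;
    // Bounds-check the claimed length before reading any data.
    let end = r2_offset.checked_add(len)?;
    input.get(r2_offset..end)
}

/// Hypothetical helper: build a toy tail of the input region and return it
/// together with the offset that models the `r2` pointer.
fn build_region(data: &[u8]) -> (Vec<u8>, usize) {
    let mut region = Vec::new();
    region.extend_from_slice(&(data.len() as u64).to_le_bytes());
    let r2_offset = region.len(); // first byte after the length field
    region.extend_from_slice(data);
    region.extend_from_slice(&[0xAA; 32]); // placeholder program ID
    (region, r2_offset)
}

fn main() {
    let (region, r2) = build_region(&[1, 2, 3]);
    println!("{:?}", instruction_data(&region, r2));

    // With no instruction data, the offset still lands on readable memory.
    let (empty, r2) = build_region(&[]);
    println!("{:?} {:#x}", instruction_data(&empty, r2), empty[r2]);
}
```

In the empty case the helper returns an empty slice, and the byte at the modeled `r2` offset is whatever follows the length field in this toy layout, so a program that skips the length check would read unrelated bytes rather than instruction data.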
From 7b1061431f8e4f7cde21d97e7fbb0a041b22c327 Mon Sep 17 00:00:00 2001 From: Joe Caulfield Date: Sat, 12 Jul 2025 05:26:07 +0000 Subject: [PATCH 2/3] deanmlittle feedback --- .../0321-vm-r2-instruction-data-pointer.md | 45 +++++++++++-------- 1 file changed, 27 insertions(+), 18 deletions(-) diff --git a/proposals/0321-vm-r2-instruction-data-pointer.md b/proposals/0321-vm-r2-instruction-data-pointer.md index e647a83a3..bf27bc7e1 100644 --- a/proposals/0321-vm-r2-instruction-data-pointer.md +++ b/proposals/0321-vm-r2-instruction-data-pointer.md @@ -13,14 +13,15 @@ feature: (fill in with feature key and github tracking issues once accepted) ## Summary Provide a pointer to instruction data in VM register 2 (`r2`) at program -entrypoint, enabling direct access to instruction data without parsing the -serialized input region. +entrypoint, enabling direct access to instruction data without having to parse +the accounts section of the serialized input region. ## Motivation -Currently, sBPF programs must parse the entire serialized input region to -locate instruction data. The serialization layout places accounts before -instruction data, requiring programs to iterate through all accounts before +Currently, sBPF programs must parse the accounts section of the serialized +input region to locate instruction data. The serialization layout places +accounts before instruction data, requiring programs to iterate through all +accounts before reaching the instruction data section. This is inefficient for programs that primarily or exclusively need to access instruction data. @@ -30,9 +31,9 @@ improved performance and reduced compute unit consumption. ## New Terminology -* **Instruction data pointer**: An 8-byte pointer stored in VM register 2 that - points directly to the start of the instruction data section in the input - region. 
+* **Instruction data pointer**: A 64-bit pointer (8 bytes) stored in VM + register 2 that points directly to the start of the instruction data + section in the input region. ## Detailed Design @@ -46,8 +47,14 @@ region. The instruction data format remains unchanged: This pointer in `r2` is made available to all programs, under all loaders, regardless of whether or not the value is read. Prior to this feature, `r2` -contains uninitialized data at program entrypoint. This change assumes no -existing programs depend on the garbage value in `r2`. +contains uninitialized data at program entrypoint. + +Despite technically being a breaking change, mainnet-beta testing with a modified +Agave validator confirms no divergence in execution or consensus. This is +because `r2` can typically only be accessed uninitialized through contrived +examples such as assembly manipulation or compiler bugs. The performance +benefits are considered a reasonable tradeoff. See security section for more +details. **Register Assignment:** @@ -60,8 +67,10 @@ existing programs depend on the garbage value in `r2`. NOT the length field. * The pointer value in `r2` is stored as a native 64-bit pointer (8 bytes) in little-endian format (x86_64). -* When there is no instruction data (length = 0), `r2` still points to where - the instruction data would be, immediately after the 8-byte length field. +* When there is no instruction data (length = 0), `r2` still points to the + offset immediately following the instruction length counter; in this case, + the first byte of the program ID, ensuring it will always point to valid, + readable memory within the bounds of the input region. * The pointer must always point to valid memory within the input region bounds. ## Alternatives Considered @@ -79,7 +88,8 @@ existing programs depend on the garbage value in `r2`. 3. 
**Modify serialization layout**: The serialization layout will eventually be overhauled with ABI v2, a comprehensive upgrade that could resolve this issue among many others. Given the significant scope of ABI v2 and potential for - delays, this targeted optimization provides immediate value. + delays, this targeted optimization provides immediate value and remains + compatible with ABI v2. ## Impact @@ -89,10 +99,6 @@ customize their program's control flow and maximize compute unit efficiency. However, any programs that currently depend on the uninitialized/garbage value in `r2` at entrypoint will break when this feature is activated. -Validators are almost completely unaffected as the instruction data pointer is -already available during serialization, and setting a register is a negligible -CPU operation. - Core contributors must implement this feature, which should be extremely minimally invasive, depending on the VM implementation. @@ -104,7 +110,10 @@ result in reading unintended memory contents or out-of-bounds access attempts. Additionally, programs that currently rely on `r2` containing uninitialized or garbage data at entrypoint will experience breaking changes when this feature -is activated. +is activated. While it is technically possible with assembly manipulations, no +compiled code uses `r2` with an uninitialized value except in the case of +`sol_log_64_` which is not a direct security concern as logs are not enshrined +by consensus. 
## Backwards Compatibility From 18bfb35bf127412deb376536458371c703e347f5 Mon Sep 17 00:00:00 2001 From: Joe Caulfield Date: Thu, 17 Jul 2025 02:04:06 +0000 Subject: [PATCH 3/3] lucasste feedback --- proposals/0321-vm-r2-instruction-data-pointer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/0321-vm-r2-instruction-data-pointer.md b/proposals/0321-vm-r2-instruction-data-pointer.md index bf27bc7e1..1870db728 100644 --- a/proposals/0321-vm-r2-instruction-data-pointer.md +++ b/proposals/0321-vm-r2-instruction-data-pointer.md @@ -66,7 +66,7 @@ details. * The pointer in `r2` points to the first byte of the actual instruction data, NOT the length field. * The pointer value in `r2` is stored as a native 64-bit pointer (8 bytes) in - little-endian format (x86_64). + little-endian format. * When there is no instruction data (length = 0), `r2` still points to the offset immediately following the instruction length counter; in this case, the first byte of the program ID, ensuring it will always point to valid,