@benaadams
Member

@benaadams benaadams commented Jan 6, 2026

Changes

Source Generator Implementation

  • Added StackPushBytesGenerator and GenerateStackOpcodeGenerator to auto-generate optimized push methods for byte sizes 1-32
  • Eliminated runtime size checks and branching through compile-time specialization
  • Generated methods use [GenerateStackPushBytes(size, PadDirection)] attribute

SIMD Stack Operations

  • Replaced generic Span.CopyTo with direct Vector256<byte>/Vector128<byte> construction
  • Added CopyUpTo32 helper using unaligned SIMD reads for optimal byte copying
  • Single 32-byte stores replace multiple smaller writes
  • Specialized paths for common sizes (1, 2, 4, 8, 16 bytes)
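As a rough illustration of the idea in the list above (not the PR's generated code), a left-padded 32-byte word can be built from four 64-bit lanes and written with a single vector store. The class and method names here are hypothetical, and the sketch assumes a little-endian machine.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only; not part of the PR.
public static class WordPackingSketch
{
    // Writes the 2 bytes at `source` into the last 2 bytes of a 32-byte word
    // and zeroes the 30 bytes before them (left padding).
    public static void PushLeftPadded2(ref byte source, Span<byte> word)
    {
        // Little-endian assumption: source[0] becomes the low byte of the ushort.
        // Shifting left by 48 bits moves the two bytes to positions 6 and 7
        // of the last 64-bit lane, i.e. bytes 30 and 31 of the word.
        ulong lastLane = (ulong)Unsafe.ReadUnaligned<ushort>(ref source) << 48;

        // One 32-byte value: three zero lanes followed by the data lane.
        Vector256<byte> value = Vector256.Create(0UL, 0UL, 0UL, lastLane).AsByte();

        // Single 32-byte copy; throws if `word` is shorter than 32 bytes.
        value.CopyTo(word);
    }
}
```

Because the size is fixed in the method itself, there is no length check or loop on the hot path; a separate method per size is what the source generator would be producing in bulk.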

VM Execution Loop

  • Simplified opcode dispatch using nuint indexing with function pointers
  • Removed special-case POP inlining (now uniform dispatch)
  • Moved opcode count tracking to exception paths only

API Changes

  • Push methods now return EvmExceptionType instead of void for unified error handling
  • Added PushBytesNullableRef for stack overflow detection without exceptions
  • Explicit PushZero/PushOne methods replace conditional pushes
image

Before:

stack.PushBytes<TTracingInst>(immediateData);  // Generic copy

After:

stack.Push8Bytes<TTracingInst>(ref bytes);  // Generated SIMD implementation
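The status-returning push style can be pictured with a toy stack like the one below. This is a simplified sketch under stated assumptions: `PushStatus` and `TinyWordStack` are hypothetical stand-ins, not the PR's `EvmExceptionType` or `EvmStack`, and the overflow check is simplified.

```csharp
using System;

// Hypothetical stand-in for a status enum returned by push methods.
public enum PushStatus { None, StackOverflow }

// Toy word stack, for illustration only.
public sealed class TinyWordStack
{
    public const int WordSize = 32;
    private readonly byte[] _bytes;
    private readonly int _maxDepth;

    public int Head { get; private set; }

    public TinyWordStack(int maxDepth)
    {
        _maxDepth = maxDepth;
        _bytes = new byte[maxDepth * WordSize];
    }

    // Pushes the 32-byte big-endian value 1 and reports overflow
    // through the return value instead of throwing.
    public PushStatus PushOne()
    {
        if (Head >= _maxDepth) return PushStatus.StackOverflow;

        Span<byte> word = _bytes.AsSpan(Head * WordSize, WordSize);
        word.Clear();
        word[WordSize - 1] = 1;
        Head++;
        return PushStatus.None;
    }

    // Returns the word on top of the stack; the caller must ensure Head > 0.
    public ReadOnlySpan<byte> Peek() => _bytes.AsSpan((Head - 1) * WordSize, WordSize);
}
```

A caller can then propagate the status directly, for example `if (stack.PushOne() != PushStatus.None) return ...;`, keeping the error path free of exceptions.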

Types of changes

What types of changes does your code introduce?

  • Optimization

Testing

Requires testing

  • No

Documentation

Requires documentation update

  • No

Requires explanation in Release Notes

  • No

Copilot AI review requested due to automatic review settings January 6, 2026 13:34
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes EVM stack push operations by replacing generic Span.CopyTo calls with specialized, size-specific copy implementations that leverage SIMD instructions (Vector256/Vector128) for better performance.

Key changes:

  • Introduced specialized push methods (PushRightPaddedBytes, PushBothPaddedBytes) with optimized byte packing logic
  • Added helper methods (PackHiU64, PackLoU64, CopyUpTo32) for efficient small-size data copying
  • Replaced ternary conditional pushes with explicit method calls (PushZero, PushOne) for clearer semantics
  • Ensured proper memory alignment via AsAlignedSpan for warmup scenarios

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file (file: description):

  • EvmStack.cs: Core optimization: refactored PushBytes and introduced specialized push methods with SIMD-optimized byte packing for both left-padded and right-padded scenarios
  • EvmInstructions.Storage.cs: Optimized CALLDATALOAD to use specialized PushRightPaddedBytes instead of generic zero-padding, improving performance for common call data operations
  • EvmInstructions.Stack.cs: Updated PUSH operations to use renamed PushBothPaddedBytes method, maintaining correctness for edge cases where immediate data is truncated
  • EvmInstructions.Environment.cs: Simplified BLOCKHASH to use explicit PushZero instead of conditional with BytesZero32, improving code clarity
  • EvmInstructions.Call.cs: Optimized successful empty call path by using PushOne instead of pushing StatusCode bytes
  • VirtualMachine.Warmup.cs: Added alignment guarantees via AsAlignedSpan to ensure stack operations can safely use SIMD instructions

@benaadams benaadams requested review from Marchhill and flcl42 January 6, 2026 13:53
@LukaszRozmej
Member

Any benchmarks?

@benaadams
Member Author

> Any benchmarks?

Going deeper

@benaadams benaadams marked this pull request as draft January 7, 2026 08:50
@benaadams benaadams requested a review from Copilot January 7, 2026 13:12
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

});
}

private static string? EmitForType(SourceProductionContext spc, INamedTypeSymbol? type, Candidate[] methods)
Contributor

Isn't there a way to use a template? It would give a clearer idea of the structure of the emitted type.

Member Author

Changed it to use more interpolated strings, though you can also view the output as it is generated for the compilation.

So 10 bytes

image

And 30 bytes

image

@benaadams benaadams marked this pull request as ready for review January 7, 2026 17:22
@benaadams
Member Author

@copilot summarise the changes for PR body text

Contributor

Copilot AI commented Jan 7, 2026

@benaadams I've opened a new pull request, #10137, to work on those changes. Once the pull request is ready, I'll request review from you.

@benaadams benaadams changed the title from "Optimize Stack Pushes" to "Optimize EVM stack push operations with source generation" Jan 7, 2026
@benaadams
Member Author

> Any benchmarks?

image

@benaadams
Member Author

> Any benchmarks?

The top ones that didn't move much already had hand-coded implementations.
image

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 28 changed files in this pull request and generated no new comments.

@benaadams benaadams force-pushed the optimize-stack-push branch from 87516fc to f9ab2eb January 8, 2026 20:13
Member

@LukaszRozmej LukaszRozmej left a comment

The source generation seems overcomplicated to me. Can't we just do it by hand?

Looking at the generated methods, they all follow a similar pattern.

The top part differs by only one parameter, the count:

        if (TTracingInst.IsActive)
        {
            _tracer.TraceBytes(in value, 16); // <- count goes here
        }

        uint headOffset = (uint)Head;
        uint newOffset = headOffset + 1;
        ref Vector256<byte> head = ref Unsafe.As<byte, Vector256<byte>>(ref Unsafe.Add(ref MemoryMarshal.GetReference(_bytes), (nint)(headOffset * WordSize)));
        if (newOffset >= MaxStackSize)
        {
            return EvmExceptionType.StackOverflow;
        }

        Head = (int)newOffset;

while the bottom part is more complicated and has more variations, for example:

        if (Vector256.IsHardwareAccelerated)
        {
            head = Vector256.Create(
                0UL,
                (ulong)Unsafe.ReadUnaligned<ushort>(ref value) << 48,
                Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref value, 2)),
                Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref value, 10))
            ).AsByte();
        }
        else
        {
            ref Vector128<ulong> head128 = ref Unsafe.As<Vector256<byte>, Vector128<ulong>>(ref head);

            head128 = Vector128.Create(
                0UL,
                (ulong)Unsafe.ReadUnaligned<ushort>(ref value) << 48
            );

            Unsafe.Add(ref head128, 1) = Vector128.Create(
                Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref value, 2)),
                Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref value, 10))
            );
        }

but it can be broken down into simple pieces:

  • for V256, we create one vector and that's it
  • for V128, we basically reinterpret the head as a Vector128 and then create two Vector128 values

This code could be turned into generic static methods built from those pieces, and we could extract and inline all of it. Something like:

    public partial EvmExceptionType PushBytes<TOp, TTracingInst>(ref byte value)
        where TTracingInst : struct, global::Nethermind.Core.IFlag
        where TOp : IOpCount // or something else
    {
        if (TTracingInst.IsActive)
        {
            _tracer.TraceBytes(in value, TOp.Count);
        }

        uint headOffset = (uint)Head;
        uint newOffset = headOffset + 1;
        ref Vector256<byte> head = ref Unsafe.As<byte, Vector256<byte>>(ref Unsafe.Add(ref MemoryMarshal.GetReference(_bytes), (nint)(headOffset * WordSize)));
        if (newOffset >= MaxStackSize)
        {
            return EvmExceptionType.StackOverflow;
        }

        Head = (int)newOffset;

        if (Vector256.IsHardwareAccelerated)
        {
            // One 32-byte store; TOp builds the vector from the input bytes
            head = TOp.Create256Vector(ref value);
        }
        else
        {
            // Reinterpret the 32-byte slot as two 16-byte halves
            ref Vector128<ulong> head128 = ref Unsafe.As<Vector256<byte>, Vector128<ulong>>(ref head);
            // Upper half shown; the lower half (head128 itself) is assigned the same way
            Unsafe.Add(ref head128, 1) = TOp.Create128Vector(ref value);
        }

        return EvmExceptionType.None;
    }

Pass the correct type parameters, aggressively inline those, and you are done without all the obfuscation of code generation.
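For readers weighing this suggestion, here is a minimal, self-contained sketch of the generic approach using static abstract interface members (C# 11 / .NET 7 or later). It is an illustration only, not code from the PR: `IPushOp`, `Push4Op`, `Push8Op` and `GenericPushSketch` are hypothetical names, and a little-endian machine is assumed.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// Hypothetical per-size operation: the byte count plus how to build the word.
public interface IPushOp
{
    static abstract int Count { get; }
    static abstract Vector256<byte> CreateWord(ref byte value);
}

// 4 input bytes, left-padded: they land in bytes 28..31 of the word.
public readonly struct Push4Op : IPushOp
{
    public static int Count => 4;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<byte> CreateWord(ref byte value) =>
        Vector256.Create(0UL, 0UL, 0UL, (ulong)Unsafe.ReadUnaligned<uint>(ref value) << 32).AsByte();
}

// 8 input bytes, left-padded: they fill the last 64-bit lane (bytes 24..31).
public readonly struct Push8Op : IPushOp
{
    public static int Count => 8;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<byte> CreateWord(ref byte value) =>
        Vector256.Create(0UL, 0UL, 0UL, Unsafe.ReadUnaligned<ulong>(ref value)).AsByte();
}

public static class GenericPushSketch
{
    // One shared body; the size-specific part comes from the TOp type argument.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void WriteWord<TOp>(ref byte value, Span<byte> word)
        where TOp : struct, IPushOp
        => TOp.CreateWord(ref value).CopyTo(word);
}
```

Since `TOp` is constrained to a struct, the JIT compiles a separate specialization of `WriteWord` for each operation type, which is what would let the per-size code be inlined without a source generator. The cost is one hand-written struct per size and padding direction.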

public static EvmExceptionType Push<TTracingInst>(int length, ref EvmStack stack, int programCounter, ReadOnlySpan<byte> code)
    where TTracingInst : struct, IFlag
{
    throw new NotSupportedException($"Use the {nameof(InstructionPush2)} opcode instead");
Member

?

@benaadams benaadams force-pushed the optimize-stack-push branch from 8255d30 to ceec4f8 January 10, 2026 10:14