Improve program entrypoint#176
Conversation
c235845 to
aa4201c
Compare
|
i just found two more 1 CU/account optimizations, so hold your horses here. One of them might be unusable unfortunately but we will see. |
Nice – this one will go after #166 goes in. |
joncinque
left a comment
There was a problem hiding this comment.
Looks good overall! Just some small comments
| /// Align a pointer to the BPF alignment of `u128`. | ||
| macro_rules! align_pointer { | ||
| ($ptr:ident) => { | ||
| (($ptr as usize + (BPF_ALIGN_OF_U128 - 1)) & !(BPF_ALIGN_OF_U128 - 1)) as *mut u8 | ||
| }; | ||
| } |
There was a problem hiding this comment.
Why does this align to u128? Shouldn't it align to u64 per the serialization spec?
There was a problem hiding this comment.
That is the name of the constant on the SDK. 😊 I think it is meant to represent the alignment of an u128 in BPF, which is 8. We could rename it if that makes it more clear – agree that the name is a bit confusing.
There was a problem hiding this comment.
I mainly want to make sure I'm not missing anything, but I guess you could use any usize whose value is equal to 8 😅
9361601 to
fad2c71
Compare
|
@cavemanloverboy Found a few more savings with a small tweak in the code. |
f6f4596 to
e8b191a
Compare
ad37c78 to
a9285b4
Compare
|
@joncinque Put the PR back to draft to test a suggestion from @cavemanloverboy |
|
Current benchmark:
|
0f21920 to
01a09f4
Compare
|
let him cook |
|
@joncinque PR updated and ready for another review. 😊 |
| // There might be remininag accounts to process. | ||
| if to_process > 3 { | ||
| // 4..3 accounts left to process. | ||
| if to_process > 4 { | ||
| process_accounts!(4 => (input, accounts, accounts_slice)); | ||
| } else { | ||
| process_accounts!(3 => (input, accounts, accounts_slice)); | ||
| } | ||
| } else { | ||
| // 2..1 accounts left to process. | ||
| if to_process > 2 { | ||
| process_accounts!(2 => (input, accounts, accounts_slice)); | ||
| } else if to_process > 1 { | ||
| process_accounts!(1 => (input, accounts, accounts_slice)); | ||
| } | ||
| } |
There was a problem hiding this comment.
minor
Have you considered a match for this?
| // There might be remininag accounts to process. | |
| if to_process > 3 { | |
| // 4..3 accounts left to process. | |
| if to_process > 4 { | |
| process_accounts!(4 => (input, accounts, accounts_slice)); | |
| } else { | |
| process_accounts!(3 => (input, accounts, accounts_slice)); | |
| } | |
| } else { | |
| // 2..1 accounts left to process. | |
| if to_process > 2 { | |
| process_accounts!(2 => (input, accounts, accounts_slice)); | |
| } else if to_process > 1 { | |
| process_accounts!(1 => (input, accounts, accounts_slice)); | |
| } | |
| } | |
| // There might be remaining accounts to process. | |
| match to_process { | |
| 5 => process_accounts!(4 => (input, accounts, accounts_slice)), | |
| 4 => process_accounts!(3 => (input, accounts, accounts_slice)), | |
| 3 => process_accounts!(2 => (input, accounts, accounts_slice)), | |
| 2 => process_accounts!(1 => (input, accounts, accounts_slice)), | |
| 1 => (), | |
| _ => { | |
| // SAFETY: `while` loop above makes sure that `to_process` has 1 to 5 | |
| // entries left. | |
| unsafe { core::hint::unreachable_unchecked() } | |
| } | |
| }; |
There was a problem hiding this comment.
the point of this was to manually unroll the binary search. did you verify that this match statement does not increase cus? if it doesn't increase it, i'm in favor of this so long as we add a comment that the compiler can figure out the binary search (in case someone changes in the future)
There was a problem hiding this comment.
I must admit, I didn't realize it.
Maybe a short note explaining the optimization could help the future readers.
There was a problem hiding this comment.
A match statement increases CUs – in the end it generates a "standard" jump table, which in the worse case will do 4 comparisons. The "manual" one uses 3 comparisons at most, and 2 for most of the values.
I will add a comment explaining the rationale of the nested if statements.
There was a problem hiding this comment.
This is probably out of scope for the PR by now.
I started wondering if there are some alternatives to the "binary search" that have similarly good properties.
In particular, I noticed that as written we produce 16 identical code blocks for the account processing, in total.
Not sure if any of them gets optimized away, but if you are saying that the binary search tree makes it into the final code, then all the blocks are probably there as well.
The very first code block is an optimization for the case when there is only one account.
But the rest 15 are used to process accounts.
And the longest "uninterrupted" sequence of accounts we process is 5 accounts at a time.
I guess, the size of the program is less important, but it still costs something to deploy it.
On x86, jump table is a single jump: https://godbolt.org/z/f7E5P3YKs
This:
match iterations {
3 => { process(); process(); process(); }
2 => { process(); process(); }
1 => { process(); }
0 => {}
_ => unreachable!(),
}Turns into this:
.LBB1_1:
lea rax, [rip + .LJTI1_0]
movsxd rcx, dword ptr [rax + 4*rbx]
add rcx, rax
jmp rcx
.LBB1_3:
call example::process::h15029326abde9722
.LBB1_4:
call example::process::h15029326abde9722
.LBB1_5:
call example::process::h15029326abde9722
.LBB1_6:
add rsp, 16
pop rbx
ret
.LJTI1_0:
.long .LBB1_6-.LJTI1_0
.long .LBB1_5-.LJTI1_0
.long .LBB1_4-.LJTI1_0
.long .LBB1_3-.LJTI1_0There are no comparisons.
But maybe with the SBF backend can not produce computed jumps here?
The instruction set seems to have the necessary instruction.
If I pass the index into process(), then it gets a bit more confusing.
Though, process_n_accounts!(@process_account => (input, accounts, accounts_slice)) calls are identical. All the state change is a side effect of the call.
Though, there is a lot of code that is inlined, so the compiler might miss the fact that they are indeed identical.
There was a problem hiding this comment.
What version of platform tools are you using? I merged the switch simplify pass recently anza-xyz/llvm-project#153. It is available on platform tools v1.49 onwards.
There was a problem hiding this comment.
I'd encourage you to test with v1.50, because it enables a pass to simplify branches, so you'll see even different results.
There was a problem hiding this comment.
Using platform-tools v1.50 we get even more improvements:
| Name | CUs | Delta |
|--------------|-----|-------|
| Account (0) | 9 | -- |
| Account (1) | 13 | -- |
| Account (2) | 21 | -1 |
| Account (3) | 34 | -- |
| Account (4) | 42 | -1 |
| Account (5) | 49 | -1 |
| Account (6) | 64 | -6 |
| Account (7) | 68 | -5 |
| Account (8) | 75 | -3 |
| Account (16) | 140 | -12 |
| Account (32) | 258 | -20 |
| Account (64) | 501 | -37 |
There was a problem hiding this comment.
Yeah, the compiler is a bit unpredictable – sometimes you change a single line and things are significantly different. 😅
Fernando The Compiler Whisperer.
There was a problem hiding this comment.
@cavemanloverboy example generates different code in v1.50:
This is for the case with 16 match arms. The compiler builds a lookup table:
entrypoint:
ldxdw r1, [r1 + 0]
jgt r1, 16, LBB0_3
mov64 r2, r1
lsh64 r2, 32
rsh64 r2, 32
mov64 r3, 129023
rsh64 r3, r2
and64 r3, 1
jeq r3, 0, LBB0_3
lsh64 r1, 3
lddw r2, .Lswitch.table.entrypoint
add64 r2, r1
ldxdw r1, [r2 + 0]
mov64 r2, 1
call sol_log
LBB0_3:
mov64 r0, 0
exit
illia-bobyr
left a comment
There was a problem hiding this comment.
Out of scope for this PR
I find it a bit hard to quickly spot all the differences between parse() and parse_into<MAX_TX_ACCOUNTS>().
I can see that the skip extra accounts logic is only present in the parse_into() version.
Is this the only difference?
I think if this is the case, there is probably a way to remove the skip logic from the parse_into<MAX_TX_ACCOUNTS>() case.
Specifically, only when MAX_ACCOUNTS equals MAX_TX_ACCOUNTS.
As MAX_ACCOUNTS is a generic constant, the compiler will create a new instance of this function for every specific value of MAX_ACCOUNTS.
Plus, it is marked as #[inline(always)] allowing further optimizations.
I think if
while to_skip > 0 {is augmented with a compile time check, maybe like this:
if MAX_ACCOUNTS < MAX_TX_ACCOUNTS {
while to_skip > 0 {the compiler should remove this code completely when MAX_ACCOUNTS equals MAX_TX_ACCOUNTS.
And it should also remove the to_skip calculation, as it becomes dead code.
Also, would it make sense to add a compile time assertion, to make sure that MAX_ACCOUNTS is not above MAX_TX_ACCOUNTS?
I wonder if a change like that would be enough to end up with just a single version of the parse() function.
It looks like it has a considerable overlap with parse_into() and you are making changes in both functions in parallel.
illia-bobyr
left a comment
There was a problem hiding this comment.
Did one more pass and found a few minor things.
But overall it is good :)
joncinque
left a comment
There was a problem hiding this comment.
Great work! This might be one of the few cases where a macro is uniquely suited to do the job, and it's really well factored.
Problem
The current program entrypoint does not translate to very efficient bytecode, as can be seen from an assembly implementation of entrypoint – e.g., cavey's
asmr.Solution
Tweak the implementation to improve its efficiency, borrowing ideas from the assembly implementation (credits to @cavemanloverboy).
One key difference is that the entrypoint includes inlined code to parse accounts, which reduces the number of jumps required and therefore reduces CUs.
Results