Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack Size increased with 1.18.x (platform-tools) #1186

Open
adrena-orex opened this issue May 4, 2024 · 14 comments
Open

Stack Size increased with 1.18.x (platform-tools) #1186

adrena-orex opened this issue May 4, 2024 · 14 comments

Comments

@adrena-orex
Copy link

Problem

Programs built with solana 1.18.x experience new stack size related issues.

Details

We started having issues after migrating from solana 1.16.16 and anchor 0.28.0 to solana 1.18.11 and anchor 0.29.0.

Started getting messages when building informing us of our ix using too much stack, i.e:

Error: Function _ZN145_$LT$adrena..instructions..add_custody..AddCustody$u20$as$u20$anchor_lang..Accounts$LT$adrena..instructions..add_custody..AddCustodyBumps$GT$$GT$12try_accounts17h13b564b67762dd31E Stack offset of 4232 exceeded max offset of 4096 by 136 bytes, please minimize large stack variables

Then at runtime:

'Program 76FLzSMB8xPPt2y4ZKWeMiQBnXnqvU2xyJtc4PogsxMa consumed 2523 of 200000 compute units',
'Program 76FLzSMB8xPPt2y4ZKWeMiQBnXnqvU2xyJtc4PogsxMa failed: Access violation in stack frame 5 at address 0x200005ff0 of size 8'

Hotfixed the issue by splitting IXs and migrating accounts to use zero_copy.

We have noticed that the init of accounts takes a lot of stack, 800 bytes+. Can't have more than 3 accounts to init within the same instruction atm.

Note: Even when testnet tests works, it may still fail on devnet.

Setup:

orex> solana --version
solana-cli 1.18.11 (src:c78a9720; feat:3469865029, client:SolanaLabs)
orex> anchor --version
anchor-cli 0.29.0
orex> solana cluster-version                                                                                             
1.18.12
orex> solana config get
Config File: /Users/orex/.config/solana/cli/config.yml
RPC URL: https://api.devnet.solana.com 
WebSocket URL: wss://api.devnet.solana.com/ (computed)
Keypair Path: /Users/orex/.config/solana/id.json 
Commitment: processed

Using https://github.com/anza-xyz/platform-tools/releases v1.41 to compile.

Related

Found an issue here talking about platform tools growing stack usage.

Related to #1158

Proposed Solution

@LucasSte
Copy link

I believe the problem is fixed is platform tools version v1.41. You can check the version of your tools using running cargo-build-sbf --version.

Either way, can you provide an example code that triggers the problem?

@etodanik
Copy link

@LucasSte we have an existing program that got into this problem when moving from 1.17 to 1.18.15, even though platform tools is 1.41, there's still significantly more stack usage compared to 1.17.

In our case anchor's init constraints were the culprit, so we fixed it by moving some ATA account inits to separate IX, but I believe 1.18 is still vastly overusing stack.

@LucasSte
Copy link

@LucasSte we have an existing program that got into this problem when moving from 1.17 to 1.18.15, even though platform tools is 1.41, there's still significantly more stack usage compared to 1.17.

In our case anchor's init constraints were the culprit, so we fixed it by moving some ATA account inits to separate IX, but I believe 1.18 is still vastly overusing stack.

Did you run into the stack access violation error as well? If so, would mind providing an example of the offending code?

@blockiosaurus
Copy link

We're seeing this all over the Metaplex Program Library as users and ourselves migrate to 1.18

@Woody4618
Copy link

Here is a simple reproduction of the stack frame problem, or at least of one possible occasion. When running the tests in this repository against a local validator that has feature: EenyoWx9UMXYKpR8mW5Jmfmy2fRjzUtM7NduYMY8bx33 diesabled the stack frame error appears, when running them locally with the feature enabled the error does not appear. You can find the code here: https://github.com/Woody4618/stackframe_repro/blob/main/README.md

@LucasSte
Copy link

Here is a simple reproduction of the stack frame problem, or at least of one possible occasion. When running the tests in this repository against a local validator that has feature: EenyoWx9UMXYKpR8mW5Jmfmy2fRjzUtM7NduYMY8bx33 diesabled the stack frame error appears, when running them locally with the feature enabled the error does not appear. You can find the code here: https://github.com/Woody4618/stackframe_repro/blob/main/README.md

The feature disables stack frame gaps, amongst other changes, to enhance the network performance. The gaps serve to detect invalid memory accesses by programs. If you enable the feature, the protection goes away, but it does not mean your contract is behaving correctly.

The Rust compiler does not guarantee a stable ABI, so default sizes may change between releases. Tempering with the compiler to change allocation would overshadow a language feature: Rust allows you to choose where to allocate items (e.g. by using Box).

The main issue I've found was the try_accounts from Anchor, which needs to be split into smaller methods to decrease stack usage.

coral-xyz/anchor#3060

@etodanik
Copy link

@LucasSte I agree that try_accounts is stack heavy, but we should also not regress like that on an already limited stack memory usage efficiency. We'll post our repro shortly as well, but it also has to do with Anchor accounts.

@Woody4618
Copy link

Woody4618 commented Jun 28, 2024

Will the feature be enabled on mainet soon? Otherwise it would help maybe disable it on localhost and devnet, so people dont get surprised when deploying to mainnet? Then they can fix the problems, by splitting instructions and boxing accounts savely.

@LucasSte
Copy link

Will the feature be enabled on mainet soon? Otherwise it would help maybe disable it on localhost and devnet, so people dont get surprised when deploying to mainnet? Then they can fix the problems, by splitting instructions and boxing accounts savely.

I've found out that the compiler has been missing stack errors. This is why you may not see a compiler error, but your contract may error out during execution. The next release will solve this. If you see the error there, your contract will hit the same error on main net.

@LucasSte
Copy link

LucasSte commented Jul 8, 2024

I discussed this issue with @acheroncrypto it looks like PR coral-xyz/anchor#2939 fixes the stack problems with try_accounts

@Woody4618
Copy link

Woody4618 commented Sep 9, 2024

Hey @LucasSte there is another ecosystem team that has stack frame issues when upgrading from 1.17 to 1.18. They created a great reproduction:
https://github.com/oeble/access_violation_repro

It has warnings like these: Error: Function ZN105$LT$vprog..InitializeStrategy$u20$as$u20$anchor_lang..Accounts$LT$vprog..InitializeStrategyBumps$GT$$GT$12try_accounts17hb64a220c71e28f36E Stack offset of 5168 exceeded max offset of 4096 by 1072 bytes, please minimize large stack variables

Are these warnings now always turn into runtime errors with version 1.18? Whats the best way to fix these? Is the only option to decrease size of the accounts?

@LucasSte
Copy link

LucasSte commented Sep 9, 2024

Are these warnings now always turn into runtime errors with version 1.18?

These warnings are an indication that your program will misbehave during execution. You do not always see the runtime error in a program with such warning, because the offending case depends on which code path you execute. In general, all stack warnings emitted will lead to a runtime error.

Whats the best way to fix these? Is the only option to decrease size of the accounts?

I identified that the compiler was missing stack warning for many cases. This rendered any fix attempt by Anchor or other developers difficult. In Agave v2.0, we shipped platform-tools v1.42, with fixes to warn about all possible scenarios the contract may cause a runtime error. This allowed the Anchor team to merge coral-xyz/anchor#2939, which decreases the stack usage of function try_accounts (the warning @Woody4618 mentioned points to this function).

I am not aware of Anchor's release schedule, but I can say that upgrading to platform tools v1.42 and using an Anchor version with such a fix will solve most problems.

@acheroncrypto
Copy link

As @LucasSte mentioned, this problem is mostly fixed from Anchor's side, but the next version is currently blocked by mpl-token-metadata being incompatible with the Solana v2 crates coral-xyz/anchor#3219 (comment). Fortunately, there is already a PR to resolve this (metaplex-foundation/mpl-token-metadata#141).

@blockiosaurus
Copy link

Yup, we're on it. As long as there isn't a blocker updating our client crate and leaving the program on 1.x then it should be done soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants