snapshot: Airgap TransactionError in the status cache #7559

Merged: joncinque merged 5 commits into anza-xyz:master from joncinque:snapshot-err on Aug 19, 2025

Conversation

@joncinque joncinque commented Aug 15, 2025

Problem

The current tip of master fails to deserialize the status cache because InstructionError::BorshIoError no longer contains a string value.
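To illustrate why dropping the string payload breaks older data, here is a toy sketch. It assumes a bincode-style layout (a little-endian u32 variant index, followed by a u64 length and bytes for a string); the `OldError` and `NewError` enums and the hand-rolled encoder are hypothetical stand-ins, not the SDK types or the real serializer.

```rust
// Toy stand-ins for the enum before and after the payload was removed.
#[derive(Debug)]
enum OldError {
    Custom(u32),
    BorshIoError(String),
}

#[derive(Debug, PartialEq)]
enum NewError {
    Custom(u32),
    BorshIoError, // assumed new shape: no String payload
}

// Assumed layout: u32 LE variant index, then the variant's fields.
fn encode_old(e: &OldError, out: &mut Vec<u8>) {
    match e {
        OldError::Custom(code) => {
            out.extend_from_slice(&0u32.to_le_bytes());
            out.extend_from_slice(&code.to_le_bytes());
        }
        OldError::BorshIoError(msg) => {
            out.extend_from_slice(&1u32.to_le_bytes());
            out.extend_from_slice(&(msg.len() as u64).to_le_bytes());
            out.extend_from_slice(msg.as_bytes());
        }
    }
}

fn read_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let chunk = bytes.get(*pos..*pos + 4)?;
    let mut buf = [0u8; 4];
    buf.copy_from_slice(chunk);
    *pos += 4;
    Some(u32::from_le_bytes(buf))
}

// Decoder for the new shape: it never consumes a string payload.
fn decode_new(bytes: &[u8], pos: &mut usize) -> Option<NewError> {
    match read_u32(bytes, pos)? {
        0 => Some(NewError::Custom(read_u32(bytes, pos)?)),
        1 => Some(NewError::BorshIoError),
        _ => None, // unknown variant index
    }
}

fn main() {
    // Two entries written with the old shape.
    let mut bytes = Vec::new();
    encode_old(&OldError::BorshIoError("oops".to_string()), &mut bytes);
    encode_old(&OldError::Custom(7), &mut bytes);

    // The first entry decodes, but the leftover string bytes are then
    // misread as the next variant index, so the second entry fails.
    let mut pos = 0;
    let first = decode_new(&bytes, &mut pos);
    let second = decode_new(&bytes, &mut pos);
    println!("first = {first:?}, second = {second:?}");
}
```

In this toy format the reader goes out of sync right after the first `BorshIoError`, which is the general failure mode when the on-disk type and the in-memory type drift apart.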

Summary of Changes

Gap the SDK's types from the snapshot's types by introducing SnapshotTransactionError and SnapshotInstructionError. Note that the conversion requires some clones, so I'm not sure if this will be a bottleneck during snapshot generation.

Long term, the plan should be to remove the TransactionError from snapshots in the status cache, but we can do that in future work.

Addresses #6457.
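For readers unfamiliar with the airgap idea, a minimal sketch follows. The `sdk` module is a hypothetical, heavily trimmed stand-in for the upstream types, and filling the missing message with an empty string is an assumption made for illustration, not necessarily what this PR does.

```rust
// Hypothetical stand-in for the upstream SDK, whose enum shape may change.
mod sdk {
    #[derive(Debug, Clone, PartialEq)]
    pub enum InstructionError {
        Custom(u32),
        BorshIoError, // assumed new shape: no String payload
    }

    #[derive(Debug, Clone, PartialEq)]
    pub enum TransactionError {
        AccountNotFound,
        InstructionError(u8, InstructionError),
    }
}

// Snapshot-local mirrors that keep the on-disk shape frozen.
#[derive(Debug, Clone, PartialEq)]
enum SnapshotInstructionError {
    Custom(u32),
    BorshIoError(String), // old shape is preserved for the snapshot format
}

#[derive(Debug, Clone, PartialEq)]
enum SnapshotTransactionError {
    AccountNotFound,
    InstructionError(u8, SnapshotInstructionError),
}

impl From<sdk::InstructionError> for SnapshotInstructionError {
    fn from(err: sdk::InstructionError) -> Self {
        match err {
            sdk::InstructionError::Custom(code) => Self::Custom(code),
            // Assumption: an empty placeholder stands in for the lost message.
            sdk::InstructionError::BorshIoError => Self::BorshIoError(String::new()),
        }
    }
}

impl From<SnapshotInstructionError> for sdk::InstructionError {
    fn from(err: SnapshotInstructionError) -> Self {
        match err {
            SnapshotInstructionError::Custom(code) => Self::Custom(code),
            // Any stored message is dropped on the way back in.
            SnapshotInstructionError::BorshIoError(_) => Self::BorshIoError,
        }
    }
}

impl From<sdk::TransactionError> for SnapshotTransactionError {
    fn from(err: sdk::TransactionError) -> Self {
        match err {
            sdk::TransactionError::AccountNotFound => Self::AccountNotFound,
            sdk::TransactionError::InstructionError(index, inner) => {
                Self::InstructionError(index, inner.into())
            }
        }
    }
}

impl From<SnapshotTransactionError> for sdk::TransactionError {
    fn from(err: SnapshotTransactionError) -> Self {
        match err {
            SnapshotTransactionError::AccountNotFound => Self::AccountNotFound,
            SnapshotTransactionError::InstructionError(index, inner) => {
                Self::InstructionError(index, inner.into())
            }
        }
    }
}

fn main() {
    let original: Result<(), sdk::TransactionError> = Err(
        sdk::TransactionError::InstructionError(3, sdk::InstructionError::BorshIoError),
    );
    // The clone mirrors the cost mentioned above: the cache is only borrowed.
    let on_disk = original.clone().map_err(SnapshotTransactionError::from);
    let restored = on_disk.map_err(sdk::TransactionError::from);
    assert_eq!(restored, original);
}
```

The point of the pattern is that only the `From` impls need to change when the SDK enum changes again; the serialized layout stays tied to the local mirror types.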

codecov-commenter commented Aug 15, 2025

Codecov Report

❌ Patch coverage is 21.53846% with 204 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.4%. Comparing base (8412666) to head (4c09818).
⚠️ Report is 2502 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #7559     +/-   ##
=========================================
- Coverage    83.4%    83.4%   -0.1%     
=========================================
  Files         811      811             
  Lines      365014   365271    +257     
=========================================
+ Hits       304586   304661     +75     
- Misses      60428    60610    +182     

steviez commented Aug 16, 2025

I merged #7556 to get master functional again & so we can take our time on proper review for this one without more canaries / teammates hitting the snapshot incompatibility

@brooksprumo brooksprumo left a comment

With these copies/clones and collects, how's the perf of the changes? Mainly the times for doing the status cache serialization. I wouldn't expect a big change, but I do think it'd be good to know.

Comment on lines +62 to +68
type SnapshotStatus<T> = HashMap<Hash, (usize, Vec<(status_cache::KeySlice, T)>)>;
type SnapshotSlotDelta<T> = (Slot, bool, SnapshotStatus<T>);
type SnapshotBankSlotDelta = SnapshotSlotDelta<Result<(), SnapshotTransactionError>>;

For these airgap types, we use the name Serde (vs Snapshot here). Not a blocker for this PR, as we can change it after the fact. More just making a note for myself.

Author

Ah darn sorry, I tried to follow the conventions that I saw

derive(AbiExample, AbiEnumVisitor)
)]
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)]
enum SnapshotTransactionError {

nit: We use the prefix "Serde" for all these snapshot-airgap-like types. I'd recommend we do that here too.

We could also merge this PR as-is and then do the rename separately if there are no other changes needed, thus avoiding a CI run.

Comment on lines +77 to +101
.map(|slot_delta| {
    let status_map = slot_delta.2.lock().unwrap();
    let snapshot_status_map = status_map
        .iter()
        .map(|(key, value)| {
            (
                *key,
                (
                    value.0,
                    value
                        .1
                        .iter()
                        .map(|(key_slice, result)| {
                            (
                                *key_slice,
                                result.clone().map_err(SnapshotTransactionError::from),
                            )
                        })
                        .collect::<Vec<_>>(),
                ),
            )
        })
        .collect::<HashMap<_, _>>();
    (slot_delta.0, slot_delta.1, snapshot_status_map)
})

I pulled down the code so that I could actually understand all this. It works! IMO it's not particularly easy to read. Not something that requires a change, but something to note. Maybe we simplify in a subsequent PR.

Author

Yeah, these types are awful to work with. If you have any suggestions to make it simpler, I'm all for it!
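One possible simplification, sketched here with stand-in types rather than the real agave ones: pull the nested tuple shuffling into a small generic helper so the call site only supplies the per-entry conversion. The aliases below (`Slot`, `Hash`, `KeySlice`, `Status`, `SlotDelta`) and the helpers `map_status` and `to_snapshot_delta` are all hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Assumed stand-ins for the real types; sizes are illustrative only.
type Slot = u64;
type Hash = [u8; 32];
type KeySlice = [u8; 20];

type Status<T> = HashMap<Hash, (usize, Vec<(KeySlice, T)>)>;
type SlotDelta<T> = (Slot, bool, Arc<Mutex<Status<T>>>);
type SnapshotSlotDelta<T> = (Slot, bool, Status<T>);

/// Hypothetical helper: rebuild a status map, converting each stored value.
fn map_status<T, U>(status: &Status<T>, f: impl Fn(&T) -> U) -> Status<U> {
    status
        .iter()
        .map(|(blockhash, (index, entries))| {
            let entries = entries
                .iter()
                .map(|(key_slice, value)| (*key_slice, f(value)))
                .collect::<Vec<_>>();
            (*blockhash, (*index, entries))
        })
        .collect()
}

/// Hypothetical helper: lock one slot delta and convert it for serialization.
fn to_snapshot_delta<T, U>(delta: &SlotDelta<T>, f: impl Fn(&T) -> U) -> SnapshotSlotDelta<U> {
    let (slot, is_root, status) = delta;
    let status = status.lock().unwrap();
    (*slot, *is_root, map_status(&*status, f))
}

fn main() {
    // String plays the role of the SDK error, usize the snapshot error.
    let mut status: Status<Result<(), String>> = HashMap::new();
    status.insert(
        [1; 32],
        (0, vec![([2; 20], Ok(())), ([3; 20], Err("boom".to_string()))]),
    );
    let delta: SlotDelta<Result<(), String>> = (42, true, Arc::new(Mutex::new(status)));

    let converted = to_snapshot_delta(&delta, |r| r.clone().map_err(|e| e.len()));
    assert_eq!(converted.0, 42);
    assert_eq!(converted.2[&[1u8; 32]].1[1], ([3u8; 20], Err(4)));
}
```

With something like this, the closure in the PR would reduce to a single call that passes `|r| r.clone().map_err(SnapshotTransactionError::from)`; the clones and collects are unchanged, so it should not affect the perf question.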

@brooksprumo

> With these copies/clones and collects, how's the perf of the changes? Mainly the times for doing the status cache serialization. I wouldn't expect a big change, but I do think it'd be good to know.

Are there any nodes running this change already? If yes, I can check the metrics.

steviez commented Aug 19, 2025

> Are there any nodes running this change already? If yes, I can check the metrics.

Given that it is unmerged, none of the "managed fleet" is. I do have this running on a node right now tho; identity is 4ELyKNZ5k6udAxaZCkZNWV6G22dpPrtf5PF7y3eWVV5s

@brooksprumo

Looks like the time to serialize the status cache is pretty similar in comparison to one of my dev boxes. Sufficiently similar IMO.

[Screenshot taken 2025-08-19 at 12:24:29 PM]

@brooksprumo brooksprumo left a comment

:shipit:

Approving. This unblocks updating the SDK to v3 within agave. We can improve this code as needed in subsequent PRs.

@joncinque
Author

> With these copies/clones and collects, how's the perf of the changes? Mainly the times for doing the status cache serialization. I wouldn't expect a big change, but I do think it'd be good to know.

I added a variant of bench_status_cache_serialize that uses serialize_status_cache with a bigger status cache, which seemed to be the technique used in other benches:

#[bench]
fn bench_status_cache_serialize_max(bencher: &mut Bencher) {
    // Fill up the status cache to better match what intense runtime usage would
    // look like.
    let max_cache_entries = MAX_CACHE_ENTRIES as u64;
    let mut status_cache = BankStatusCache::default();
    fill_status_cache(&mut status_cache, max_cache_entries, 100_000);

    assert!(status_cache.roots().contains(&0));
    let path = tempfile::NamedTempFile::new().unwrap().into_temp_path();
    bencher.iter(|| {
        let _ = serialize_status_cache(&status_cache.root_slot_deltas(), &path).unwrap();
    });
}

I also updated the original bench to:


#[bench]
fn bench_status_cache_serialize(bencher: &mut Bencher) {
    let mut status_cache = BankStatusCache::default();
    status_cache.add_root(0);
    status_cache.clear();
    for hash_index in 0..100 {
        let blockhash = Hash::new_from_array([hash_index; HASH_BYTES]);
        let mut id = blockhash;
        for _ in 0..100 {
            id = hash(id.as_ref());
            let mut sigbytes = Vec::from(id.as_ref());
            id = hash(id.as_ref());
            sigbytes.extend(id.as_ref());
            let sig = Signature::try_from(sigbytes).unwrap();
            status_cache.insert(&blockhash, sig, 0, Ok(()));
        }
    }
    assert!(status_cache.roots().contains(&0));
    let path = tempfile::NamedTempFile::new().unwrap().into_temp_path();
    bencher.iter(|| {
        let _ = serialize_status_cache(&status_cache.root_slot_deltas(), &path).unwrap();
    });
}

On my dev machine, here's master:

test bench_status_cache_serialize        ... bench:   1,126,136.25 ns/iter (+/- 179,630.11)
test bench_status_cache_serialize_max    ... bench:  12,492,705.60 ns/iter (+/- 1,725,556.09)

Here's this PR:

test bench_status_cache_serialize        ... bench:   1,324,236.80 ns/iter (+/- 14,871.04)
test bench_status_cache_serialize_max    ... bench:  13,125,669.80 ns/iter (+/- 1,459,285.

So looks like an extra 5%-10% for just writing the status cache.
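For reference, the relative change can be computed directly from the four ns/iter figures quoted above. This is just arithmetic on those numbers; `percent_change` is a hypothetical helper name.

```rust
/// Relative change of `new` versus `old`, in percent.
fn percent_change(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

fn main() {
    // ns/iter figures copied from the bench output above (master, then this PR).
    let small = percent_change(1_126_136.25, 1_324_236.80);
    let max = percent_change(12_492_705.60, 13_125_669.80);

    println!("bench_status_cache_serialize:     {small:+.1}%"); // about +17.6%
    println!("bench_status_cache_serialize_max: {max:+.1}%"); // about +5.1%
}
```

The max bench lands at roughly 5%, while the small bench works out closer to 18%. The reported spread on master's small run (+/- 179,630 ns) is about 16% of its mean, so that larger figure should be read with the noise in mind.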

@joncinque
Author

Well looks like I got an approval in the interim, so I'll merge this and put in a PR for the changes you suggested. Do you want me to include the updated benches?

@joncinque joncinque merged commit fb50a4b into anza-xyz:master Aug 19, 2025
40 checks passed
@joncinque joncinque deleted the snapshot-err branch August 19, 2025 16:37
@brooksprumo

> Do you want me to include the updated benches?

I didn't even know we had a bench for serializing the status cache, hah! The "max" version seems useful. Switching to use serialize_status_cache(), I'm indifferent, I can see a case for either/both.

joncinque added a commit that referenced this pull request Aug 19, 2025
* snapshot: Rename status cache types, add bench

#### Problem

As pointed out in #7559, the snapshot types were incorrectly named, and
there's a bench missing for the specific serialize logic.

#### Summary of changes

Rename everything from `Snapshot*` to `Serde*`.

Add a bench for serializing a max status cache to a file.

* Update frozen abi
4 participants