Track transaction performance track #202

Closed
lijunwangs wants to merge 42 commits into anza-xyz:master from
lijunwangs:track_transaction_performance_track_txn_in_banking_stage_deep-agave

Conversation

@lijunwangs

@lijunwangs lijunwangs commented Mar 12, 2024

Problem

Enable the system to track transaction processing performance through the various stages, sampled based on probability. This is a resubmission of PR solana-labs#34789 against Agave.

Summary of Changes

Based on the design document at https://docs.google.com/document/d/1ig1rC0dk-ddi33JIqG9EZ4ZSq9xAJpT9fQTPaZFi_vw/edit.
We use a randomly generated 12-bit integer as a mask to match against the transaction's signature. If it matches, we mark the packet for performance tracking in the packet's Meta. This mark is checked efficiently downstream without redoing the mask matching. For these matched packets we report processing time for the fetch, sigverify, and banking stages.
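For illustration, a minimal sketch of that sampling check; the byte offsets and mask constant are assumptions, not the exact code in this PR:

// Hypothetical sketch of the signature-sampling check described above.
// The byte offsets and mask constant are illustrative only.
const TXN_MASK_BITS: u16 = 0x0FFF; // low 12 bits

fn should_track(signature: &[u8; 64], txn_mask: u16) -> bool {
    // Compare 12 bits taken from the signature against the randomly
    // generated mask; matching packets get marked for tracking.
    (u16::from_le_bytes([signature[0], signature[1]]) & TXN_MASK_BITS) == txn_mask
}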

At the banking stage, the start time is measured when the packet is received from the queue and is stored in ImmutableDeserializedPacket. To track execution through the banking stage, ExtendedSanitizedTransaction, a wrapper over SanitizedTransaction, is used.
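For illustration, a rough sketch of the wrapper's shape; the field names are assumptions rather than the exact definition in this PR:

use std::time::Instant;

use solana_sdk::transaction::SanitizedTransaction;

// Hypothetical shape of the wrapper described above; field names are illustrative.
pub struct ExtendedSanitizedTransaction {
    pub transaction: SanitizedTransaction,
    /// Set only for packets marked for performance tracking; None otherwise.
    pub start_time: Option<Instant>,
}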

trying txn mask matching

output txn to figure out why txn is not exactly matched

Use 62 and 61 portion

track fetch performance using random txn mask

track sigverify performance using random txn mask

track banking stage performance using random txn mask

adding missing cargo lock file

add debug messages

Revert "add debug messages"

This reverts commit 96aead5.

fixed some clippy issues

check-crate issue

Fix a clippy issue

Fix a clippy issue

debug why txns in banking stage shows fewer performance tracking points

debug why txns in banking stage shows fewer performance tracking points

debug why txns in banking stage shows fewer performance tracking points

debug why txns in banking stage shows fewer performance tracking points

get higher PPS for testing purpose

more debug messages on why txn is skipped

display if tracer packet in log

add debug before calling processing_function

debug at the initial of banking stage

track if a txn is forwarded

dependency order

missing cargo file

clean up debug messages

Do not use TRACER_PACKET, use its own bit

rename some functions

addressed some comments from Trent

Update core/src/banking_stage/immutable_deserialized_packet.rs

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

addressed some comments from Trent

Do not use binary_search, do simple compare in one loop
@mergify mergify Bot requested a review from a team March 12, 2024 05:42
@apfitzge

Changing transaction type just for this seems like a non-starter to me, especially with 99.999% of txs not actually using the field.

Why did we not consider a less invasive option like some sort of shared struct? Seems like we should be able to just cache most-recent timestamp for signatures at specific points in code, measure since last point and update the timestamp.

@lijunwangs
Author

lijunwangs commented Mar 14, 2024

Changing transaction type just for this seems like a non-starter to me, especially with 99.999% of txs not actually using the field.

Why did we not consider a less invasive option like some sort of shared struct? Seems like we should be able to just cache most-recent timestamp for signatures at specific points in code, measure since last point and update the timestamp.

I am open to concrete suggestions. What is the alternative for the problem I am solving (measuring the elapsed time between two points in very different code locations)? And does it really make a difference in terms of overhead?

@codecov-commenter

Codecov Report

Attention: Patch coverage is 88.88889%, with 44 lines in your changes missing coverage. Please review.

Project coverage is 81.9%. Comparing base (6b3d35e) to head (53dee5b).
Report is 26 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #202     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         837      839      +2     
  Lines      226496   226761    +265     
=========================================
+ Hits       185544   185753    +209     
- Misses      40952    41008     +56     

@apfitzge

apfitzge commented Mar 14, 2024

Overall I still feel this is too invasive a change for simply adding a metric: 1.2k lines + 61 files. It seems to me like we're taking the wrong approach here.

WRT banking stage performance, I have privately raised concerns that the metric collected is not useful. Execution performance of random transactions will effectively be random under some bound that is dependent on the CPU. If it is only useful when collected in aggregate, then I think we may already be able to calculate such a time from current metrics; we know the number of transactions received, dropped, and processed per second => we can get the mean time a transaction spent in banking, if that's all we're after.

Additionally, as commented above, if we land on a per-tx Option in the transaction type, then we should bite the bullet and move to the RuntimeTransaction we are already planning to use as a "SanitizedTransaction with metadata". That was something @tao-stones, @t-nelson, and I have looked at, but it has largely been held back by higher-priority work.

Memory:
We have buffers of 700k transactions in the banking stage.
Arguably these buffers are too big, but with this change we're adding on the order of 10MB of memory to the buffer, where all but around 3200 bytes are useless. 10MB isn't a lot, but it seems like we can definitely come up with an alternative that accomplishes the same thing and is not as invasive as changing the transaction type.

A concrete alternative to the per-tx Option:
We can create a relatively simple struct to cache timestamps for the marked transactions; see the pseudo-Rust below.
Insert at the beginning of the stage, remove when the transaction is done or dropped.
That way we don't need to change the transaction type everywhere; the index calculation should be simple and fast, and locks are only contended on index collisions, which should be very rare.
We could even make this shared between the stages so we can measure how long packets sit in the channels between stages.

Pseudo-code for the cache
use std::sync::{Arc, Mutex};
use std::time::Instant;

use solana_sdk::signature::Signature;

pub struct SignatureTimestampCache {
    // One slot per index; each slot is locked independently, so contention
    // only occurs on index collisions.
    data: Arc<[Mutex<Option<(Signature, Instant)>>]>,
}

impl SignatureTimestampCache {
    /// Initialize a new cache with fixed capacity of 1024.
    pub fn new() -> Self {
        let mut data = Vec::with_capacity(1024);
        for _ in 0..1024 {
            data.push(Mutex::new(None));
        }
        SignatureTimestampCache {
            data: data.into()
        }
    }

    /// Insert a signature and timestamp into the cache.
    /// If the signature is already in the cache, ignore.
    pub fn insert(&self, signature: Signature, timestamp: Instant) {
        // check should_track_transaction(signature) here, not sure
        // if we have that flag already cached somewhere
        let index = self.index(&signature);
        let mut lock = self.data[index].lock().unwrap();
        if lock.is_none() {
            *lock = Some((signature, timestamp));
        }
    }

    /// Get the timestamp for a given signature.
    /// If the signature is not in the cache, return None.
    /// Should be called when we are done with processing transaction.
    pub fn take(&self, signature: &Signature) -> Option<Instant> {
        // check should_track_transaction(signature) here, not sure
        // if we have that flag already cached somewhere
        let index = self.index(signature);
        let mut lock = self.data[index].lock().unwrap();
        if let Some((cached_signature, timestamp)) = *lock {
            if cached_signature == *signature {
                *lock = None;
                return Some(timestamp);
            }
        }
        None
    }


    /// Calculate the index of the cache for a given signature.
    /// This is a simple modulo operation.
    fn index(&self, signature: &Signature) -> usize {
        // TODO: Can do something in here to make this not overlap
        //       with the mask bytes.
        let calculated_u64 = u64::from_le_bytes(signature.as_ref()[8..16].try_into().unwrap());
        (calculated_u64 as usize) % 1024
    }
}
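
For illustration, a hypothetical usage sketch at a stage boundary, reusing the types and imports from the sketch above; the function names and the metrics reporting are assumptions:

// Hypothetical call sites; a shared cache handle is assumed to be plumbed into the stage.
fn on_packet_received(cache: &SignatureTimestampCache, signature: Signature) {
    // Record when the marked transaction entered the stage.
    cache.insert(signature, Instant::now());
}

fn on_transaction_done(cache: &SignatureTimestampCache, signature: &Signature) {
    // Measure how long the marked transaction spent in the stage.
    if let Some(entry_time) = cache.take(signature) {
        let _elapsed_us = entry_time.elapsed().as_micros();
        // report _elapsed_us to metrics here
    }
}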

EDIT: @tao-stones @t-nelson We should discuss here what we think the best approach for adding these metrics is - I think the discussions have been disjointed so far. Please add your thoughts so we can all come to an agreement on path forward.

@lijunwangs
Author

The major problem with your proposed approach is the performance concern: it requires each transaction to do this lookup while taking a lock, as opposed to the simple if let Some(start_time) = tx.start_time() in my approach. We would like to keep the performance overhead as low as possible. Secondly, handling the potential collisions is also problematic. The changes look big, but the ExtendedSanitizedTransaction changes are rather mechanical -- mostly just to appease the compiler due to the signature change. @t-nelson @tao-stones, please chime in as we really need to get on the same page. Also, I am planning to split this PR into 3 parts to reduce the PR change size: 1st for the streamer, 2nd for the verifier, and 3rd for the banking stage.

@apfitzge

apfitzge commented Mar 14, 2024

In the cache I gave pseudo-code for, locks are per element; the only contention is on collision. Again, there should be next to none of those - we have ~700k transactions with a 1/4096 chance of being a marked transaction; this means we have roughly under 200 marked transactions at a given time. The cache size is 1024.
The mask check and the index "calculation" can be done extremely cheaply; it's not going to be much more expensive than the single if.

@lijunwangs
Author

In the cache I gave pseudo-code for, locks are per element; the only contention is on collision. Again, there should be next to none of those - we have ~700k transactions with a 1/4096 chance of being a marked transaction; this means we have roughly under 200 marked transactions at a given time. The cache size is 1024.
The mask check and the index "calculation" can be done extremely cheaply; it's not going to be much more expensive than the single if.

Given a transaction, to determine its start time, don't you have to load the signature, get the relevant portion of the signature, calculate the modulo, and then take an exclusive lock into SignatureTimestampCache to look it up? And then you still have to compare the signature to check that it matches? How many more instructions is that?

@apfitzge

Given a transaction, to determine its start time, don't you have to load the signature, get the relevant portion of the signature, calculate the modulo, and then take an exclusive lock into SignatureTimestampCache to look it up? And then you still have to compare the signature to check that it matches? How many more instructions is that?

The mask check is extremely cheap: it's literally a pointer-offset load into a register, an & against the mask, and a cmp against TXN_MASK. This is not going to have a material effect; it's like 2 additional cycles per tx - and it would allow tracking the transaction throughout the entire pipeline instead of only within stages individually, without needing to change types everywhere.

@steviez

steviez commented Mar 22, 2024

What about just sticking a timestamp in the Meta:

agave/sdk/src/packet.rs, lines 39 to 46 in 24fe473:

#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, AbiExample)]
#[repr(C)]
pub struct Meta {
    pub size: usize,
    pub addr: IpAddr,
    pub port: u16,
    pub flags: PacketFlags,
}

Meta isn't actually sent over the wire so we should be able to change the type unless I'm forgetting something. For each packet, we simply set the timestamp when we pull it out. For QUIC, here:

if packet_accum.is_none() {
    let mut meta = Meta::default();
    meta.set_socket_addr(remote_addr);
    *packet_accum = Some(PacketAccumulator {
        meta,
        chunks: Vec::new(),
    });
}

For UDP, here:

for (addr, hdr, pkt) in izip!(addrs, hdrs, packets.iter_mut()).take(nrecv) {
    pkt.meta_mut().size = hdr.msg_len as usize;
    if let Some(addr) = cast_socket_addr(&addr, &hdr) {
        pkt.meta_mut().set_socket_addr(&addr.to_std());
    }
}

Then, when we're already looping through packets in each stage, we can just compute the deltas between stored timestamps and now().

On a 64-bit platform, we technically have 4 idle bytes within Meta right now due to the alignment requirement for the usize:

8 (usize) + 17 (IpAddr) + 2 (u16) + 1 (PacketFlags) = 28 bytes

The Instant type requires 16 bytes, but we could probably get this down to 12 bytes (8 for seconds, 4 for ns). With the reordering shown below, this would mean we'd only be growing Meta by 8 bytes.

pub struct Timestamp {
    sec: u64,
    ns: u32,
}

pub struct Meta {
    pub size: usize,
    pub addr: IpAddr,
    pub flags: PacketFlags,
    pub port: u16,
    pub timestamp: Timestamp,
}

This solution would be much more general and require no extra data structures; just accumulate the stats per packet. Also, the generality of the solution would mean we could easily get this kind of metric for the other network protocols like gossip and turbine. I'm going to take a stab at implementing this today; I will probably start by using Instant for simplicity, but we should be able to drop in a custom Timestamp struct if we want to save those extra 4 bytes per packet.
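
As a rough sketch of that per-stage delta computation, here are illustrative helpers for the Timestamp struct above; these are assumptions, not existing packet APIs:

use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Illustrative helpers for the Timestamp sketched above; not existing APIs.
impl Timestamp {
    /// Capture the current wall-clock time as seconds + nanoseconds.
    fn now() -> Self {
        let since_epoch = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or_default();
        Timestamp {
            sec: since_epoch.as_secs(),
            ns: since_epoch.subsec_nanos(),
        }
    }

    /// Elapsed wall-clock time since this timestamp was recorded,
    /// saturating to zero if the clock moved backwards.
    fn elapsed(&self) -> Duration {
        let recorded = Duration::new(self.sec, self.ns);
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or_default();
        now.saturating_sub(recorded)
    }
}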

@mergify

mergify Bot commented Dec 30, 2024

If this PR represents a change to the public RPC API:

  1. Make sure it includes a complementary update to rpc-client/ (example)
  2. Open a follow-up PR to update the JavaScript client @solana/web3.js (example)

Thank you for keeping the RPC clients in sync with the server API @lijunwangs.

@joncinque

This PR contains changes to the solana sdk, which will be moved to a new repo within the next week, when v2.2 is branched from master.

Please merge or close this PR as soon as possible, or re-create the sdk changes when the new repository is ready at https://github.com/anza-xyz/solana-sdk.

@t-nelson

t-nelson commented Feb 4, 2025

pretty sure this should just be closed in any case

@KirillLykov

KirillLykov commented Jun 17, 2025

Is this PR still active? I'm confused -- from the description, it looks like #257 is doing a similar thing. If we have a system for tracking the performance of transactions based on signature, how do we use it? Am I right that all the data is stored in InfluxDB?

@lijunwangs
Author

Is this PR still active? I'm confused -- from the description, it looks like #257 is doing a similar thing. If we have a system for tracking the performance of transactions based on signature, how do we use it? Am I right that all the data is stored in InfluxDB?

I got a ping from Max about its importance -- I may resurrect the PR. #257 only does it at the streamer. It is not exactly tracking a particular txn; rather, it randomly samples some transactions and measures and reports the performance at various stages. Only the QUIC TPU has it for now.

@KirillLykov

@lijunwangs yeah, I'm writing for the same reason. Since we are in v3.0 we are free to break the public interface of some structures (Meta, for example), which seems to be a good opportunity for this change. The hard part is the cost of tracking all the txs (signature -> timestamps) and the pressure on InfluxDB from all this data.

@MaxResnick

To me this seems extremely useful. This would be great to have, and having read the thread I think the performance cost is relatively light compared to how much we will likely be able to improve performance with more granular data. Tracking an individual transaction throughout the whole pipeline is much more useful than sampling. I also think it would be useful if this were reported in metrics so we don't have to wait for a canary slot to test.

@t-nelson

this had concept ack, implementation nack. it's never going in as done here

@steviez

steviez commented Jun 19, 2025

I had an implementation for this where I stuck a timestamp into some free bytes in PacketMeta for when the packet was first observed by QUIC receive. We could then do now() - packet.timestamp() to figure out how long it took for the packet to reach whatever stage was making that call.

My solution here was nice since there was no central data structure that created contention between all the threads in the different stages. I hooked this up for TVU and tested it out with some success; I did see some weird cases (i.e. timestamp=0, since I used saturating math) that I believe were inherent to reading the system clock from different threads. Or I might have been doing something dumb; it has been at least 12 months so I'm not really sure. I can see if I still have the branch somewhere.

@KirillLykov

I had an implementation for this where I stuck a timestamp into some free bytes in PacketMeta for when the packet was first observed by QUIC receive. We could then do now() - packet.timestamp() to figure out how long it took for the packet to reach whatever stage was making that call.

My solution here was nice since there was no central data structure that created contention between all the threads in the different stages. I hooked this up for TVU and tested it out with some success; I did see some weird cases (i.e. timestamp=0, since I used saturating math) that I believe were inherent to reading the system clock from different threads. Or I might have been doing something dumb; it has been at least 12 months so I'm not really sure. I can see if I still have the branch somewhere.

Sounds good! I wonder whether, back then, you measured the added overhead.

@apfitzge

I had an implementation for this where I stuck a timestamp into some free bytes in PacketMeta for when the packet was first observed by QUIC receive.

My solution here was nice since there was no central data structure that created contention between all the threads in the different stages.

IIRC I had more concern about timestamping (even optionally) every packet than about using a central data structure - the central data structure should have very little, if any, contention if we've done the mask & cache in a reasonable way. In particular, this avoids us having to plumb the Meta everywhere through the code - right now we toss it when we get to ForwardStage or BankingStage and lose it everywhere downstream of there.

I had some pseudo-code in #202 (comment), but I think realistically we could even make it simpler if we don't need 100% accuracy on conflicts (again, conflicts are likely to be rare):

pub struct SignatureTimestampCache {
    data: Arc<[AtomicU64]>
}

where we index into the data based on some random bytes of the sig (not overlapping with the mask) + modulo. The data size can be something like 1024, 4096, or 8192 (reasonably small but not too small) to keep conflicts rare. We store u64 timestamps (we're good for another 500 years in terms of bits... no need to use 16 bytes like Rust's Instant) of the time the pkt came into the quic stage. Every stage past quic can just read the timestamp and calculate the difference from the cached entry time.
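
A minimal sketch of that simpler variant, building on the struct above; the byte range used for indexing, the microsecond units, and the method names are assumptions:

use std::sync::atomic::{AtomicU64, Ordering};

impl SignatureTimestampCache {
    /// Build a cache with `size` empty slots; 0 means "no entry recorded".
    pub fn new(size: usize) -> Self {
        let data: Vec<AtomicU64> = (0..size).map(|_| AtomicU64::new(0)).collect();
        Self { data: data.into() }
    }

    /// Index on signature bytes that do not overlap with the sampling mask
    /// (the exact byte range here is illustrative).
    fn index(&self, signature: &[u8; 64]) -> usize {
        let bytes: [u8; 8] = signature[16..24].try_into().unwrap();
        (u64::from_le_bytes(bytes) as usize) % self.data.len()
    }

    /// Record the time (micros since some fixed epoch) the packet entered the quic stage.
    pub fn record(&self, signature: &[u8; 64], now_micros: u64) {
        self.data[self.index(signature)].store(now_micros, Ordering::Relaxed);
    }

    /// Micros elapsed since the recorded entry time, if one exists.
    pub fn elapsed(&self, signature: &[u8; 64], now_micros: u64) -> Option<u64> {
        let entry = self.data[self.index(signature)].load(Ordering::Relaxed);
        (entry != 0).then(|| now_micros.saturating_sub(entry))
    }
}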

@github-actions

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the stale label Jan 12, 2026
@mergify mergify Bot added the community label Jan 12, 2026
@github-actions github-actions Bot removed the stale label Jan 19, 2026
@steviez steviez added the stale label Jan 20, 2026
@github-actions

This pull request was closed because it has been stale for 7 days with no activity.

@github-actions github-actions Bot closed this Jan 27, 2026