tools: pingpong total latency #4757
Conversation
```go
case st := <-pps.sentTxid:
	if len(txidList) < txidLatencySampleSize {
		index := len(txidList)
		txidList = append(txidList, st.txid)
		byTxid[st.txid] = txidSendTimeIndexed{
			st,
			index,
		}
	} else {
		// random replacement
		evict := rand.Intn(len(txidList))
		delete(byTxid, txidList[evict])
		txidList[evict] = st.txid
		byTxid[st.txid] = txidSendTimeIndexed{
			st,
			evict,
		}
	}
```
I really like this random replacement scheme; just thinking out loud -- if your sample size is smaller than the number of data points, why not just do a circular buffer? Advantages I see would be that the datapoints would still be well-ordered and you would not be missing any data for the range of time the sample was collected. The way it works now makes it so that the most recent datapoints are most likely to be included and the least recent datapoints are least likely to be included, which would also be the case with a circular buffer.
if the rate is larger than the buffer then a circular buffer could lose almost all the data. With a buffer of 10_000 but 26_000 transactions in a block it would only know about the most recent transactions and only measure their latency. Better to measure over a longer duration.
sorry why not make it 26000 then?
Old habit from working in RAM-scarce environments. And to make up some more justification: maybe I don't even want to log all of the txns, but just a sample, because we also don't need to process a full 6000 TPS of this data.
```go
func (pps *WorkerState) txidLatencyBlockWaiter(ctx context.Context, ac *libgoal.Client) {
	done := ctx.Done()
restart:
	// I wish Go had macros
```
// something something vim your way to Go Macros :)
```go
	fmt.Fprintf(os.Stderr, "block waiter w: %v", err)
	time.Sleep(5 * time.Second)
	goto restart
```
Looks like this loop feeds blocks to the latencyBlocks handling, which in turn calls time.Now to figure out the latency from the recorded time to block time.
But, looking at this bit here, is it possible that this loop will be sleeping when goal publishes a new block for consumption? If that happens, the time.Now used for the calculation would include that delay, right?
Since this is error handling, I suspect that we don't really expect small temporary errors like that, but wanted to check anyhow.
reduced the err restart time to 1 second; maybe the error condition will go away and we'll restart the API calls and not oversleep the round change.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #4757      +/-   ##
==========================================
- Coverage   53.63%   52.88%   -0.76%
==========================================
  Files         432      432
  Lines       54058    54166     +108
==========================================
- Hits        28996    28647     -349
- Misses      22813    23243     +430
- Partials     2249     2276      +27
```
--aout append path
after a bunch of updates I think this is again ready for review and it would be good to get this extra measurement into any new tests
```go
out := pps.latencyOuts[len(pps.latencyOuts)-1]
for {
	select {
	case st := <-pps.sentTxid:
```
True sampling should be done on the pps.sentTxid writer side. Otherwise 10k samples will be fully overwritten in a few rounds under full TPS.
circleci is dumb
circleci is still dumb
Summary
Measure the total latency of a transaction: from the moment the txn send API returns to the moment we see the txn in a committed block.
Test Plan
This is a test. Run on a local private cluster and maybe an AWS test cluster.