This repository has been archived by the owner on May 15, 2024. It is now read-only.

Taking Vote Latency Analysis Out of Beta #14

bryanwag opened this issue Nov 1, 2019 · 2 comments

Comments

@bryanwag

bryanwag commented Nov 1, 2019

It is crucial to have a reliable and objective way to measure node performance, because the scalability of the network depends heavily on node hardware, yet we currently rely on reps' self-reports and stress tests to get even a basic sense of that metric. This method is neither reliable nor quantitative, so it is easy to game and hard to factor into the Ninja Score to quickly inform average users of their rep's quality.

Given that vote latency analysis is a promising proxy for node performance, we as a community should investigate how much we can improve the current implementation so that it can be production-ready. In fact, many reps that struggled quite a bit during the last 40 CPS stress tests still have high Ninja Scores and are continuously gaining weight as a result.

People have suggested setting up dedicated nodes across the world to measure the vote latency of peers, which would improve reliability and reduce bias from Internet latency. There should also be ways to aggregate data from existing peers and model out the confounding factors. It sounds like a fairly standard data science problem to me, but I could be wrong.

Unfortunately I'm neither a dev nor a data scientist, so I can't offer much in the way of implementation details. But I believe we need to get this ready before the next wave of newcomers arrives, to avoid a large group of people choosing the wrong reps. Therefore, I invite the devs and data analysts to join this discussion and help solve this glaring issue in the ecosystem right now.

@bryanwag
Author

bryanwag commented Nov 20, 2019

To further improve the usefulness of average vote latency, I suggest the following:

Each vote latency data point is paired with the concurrent median BPS from NanoTicker or some other reliable source (or would using each rep's individual BPS more accurately reflect the load it experiences? Is it possible to retrieve such data trustlessly from each rep?). Each data point then forms a tuple: (vote latency, BPS). When calculating the average, each vote latency data point is weighted by BPS^2 (or any reasonable scheme that amplifies high-load data and reduces the impact of <1 BPS data), so the average vote latency is dominated by data collected during high network load. IMO this is much more meaningful for differentiating rep hardware. Otherwise, a subpar node with years of vote latency data collected at a network usage of 0.1 BPS might still show up as "very fast", because high load occurs so infrequently that it barely affects the average (just as a two-year-old node that went down during a few hours of high load barely affects its uptime score).
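To make the weighting concrete, here is a minimal sketch of the proposed BPS-weighted average. The function name, the sample data, and the exponent parameter are all hypothetical illustrations, not part of any existing implementation:

```python
# Hypothetical sketch of the BPS-weighted average latency proposed above.
# Each sample is a (vote_latency_ms, bps) tuple; weighting by bps**2
# makes observations taken under high network load dominate the result.

def weighted_avg_latency(samples, power=2):
    """Average vote latency weighted by BPS**power."""
    total_weight = sum(bps ** power for _, bps in samples)
    if total_weight == 0:
        return None  # no meaningful load observed
    weighted_sum = sum(lat * bps ** power for lat, bps in samples)
    return weighted_sum / total_weight

# Example: a node that looks fast at idle (50 ms at 0.1 BPS) but is slow
# under load (900 ms at 40 BPS). The plain mean would hide the slowdown;
# the weighted mean is dominated by the high-load samples.
samples = [(50.0, 0.1)] * 100 + [(900.0, 40.0)] * 5
print(weighted_avg_latency(samples))
```

With these numbers the unweighted mean is about 90 ms, while the weighted mean lands near 900 ms, which is exactly the "amplify high load" behavior the proposal describes.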

One problem this method might cause: if we use some absolute threshold to determine vote latency speed, most nodes' average vote latency will most likely be dominated by slow speeds during high load. But since our objective is to rank the reps relative to each other rather than measure absolute speed, it makes sense to group them based on their relative performance. So after gathering the weighted average data for all reps, you can use a straightforward 1D clustering algorithm (k-means, or perhaps a method better optimized for 1D) to cluster them into the groups you have now (e.g. very fast, etc.). This way, even if a rep with beefy hardware has longer vote latency during high load, as long as it's faster than most other reps we should still rank it above them and label it "very fast".
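The relative-ranking idea above can be sketched with a simple 1D k-means. This is an illustrative toy, not an endorsement of k-means over exact 1D methods (dynamic-programming approaches such as Ckmeans are optimal for 1D); all names and sample latencies here are made up:

```python
# Hypothetical sketch: group reps into speed tiers by clustering their
# weighted-average latencies (ms) with a plain 1-D k-means.

def kmeans_1d(values, k, iters=100):
    """Cluster 1-D values into k groups; returns (centroids, labels)."""
    vals = sorted(values)
    # Seed centroids evenly across the sorted range.
    step = (len(vals) - 1) / max(k - 1, 1)
    centroids = [vals[int(i * step)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Recompute each centroid as the mean of its members.
        new = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            new.append(sum(members) / len(members) if members else centroids[c])
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, labels

# Made-up weighted-average latencies for six reps; three natural tiers
# emerge regardless of the absolute values.
latencies = [10.0, 12.0, 11.0, 205.0, 200.0, 900.0]
centroids, labels = kmeans_1d(latencies, k=3)
```

The point of clustering rather than thresholding is that the tier boundaries move with the data: if every rep slows down under load, the "very fast" group is simply whichever cluster has the lowest centroid.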

Another problem is that this method is very sensitive to data accuracy during high load, which we know is not reliable right now because the node that collects these data can fall behind too. So it again circles back to how we can make vote latency reliable even during high load.

@My1

My1 commented Apr 6, 2022

In fact, many reps that struggled quite a bit during the last 40 CPS stress tests still have high Ninja Scores and are continuously gaining weight as a result

Is there data on which nodes struggled, and can one check whether their rep of choice has been affected?
