[WIP] Using scheduled start-time instead of actual start-time #56
Currently the FailoverTestRug uses the actual start time of a request to determine its latency.
The consequence is that the system can still suffer from a form of coordinated omission, and therefore the latency numbers will look better than they actually are.
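For illustration, a minimal sketch of the flawed pattern, assuming the timestamp is read from the wall clock at send time (the class and helper names are hypothetical stand-ins, not the actual FailoverTestRug code):

```java
import java.util.function.LongConsumer;

// Hypothetical sketch of the current (flawed) measurement pattern.
final class ActualStartTimeSketch {
    static void send(final LongConsumer echoTransport) {
        // The timestamp is read from the clock at send time, *after* any
        // local stall, so a stall shifts the latency baseline forward.
        final long actualStartMs = System.currentTimeMillis();
        echoTransport.accept(actualStartMs); // timestamp travels with the message
    }

    static long onEcho(final long echoedTimestampMs) {
        // Latency measured against the actual start time: a pre-send stall
        // is silently excluded from this number.
        return System.currentTimeMillis() - echoedTimestampMs;
    }
}
```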
Explanation:
The load generator issues 100 requests/second, so a request is scheduled every 10 ms.
Every request to the remote system is handled in exactly 3 ms.
Imagine the clock is in ms and currently reads 50000, and the next request is scheduled at 50010. If there is a stall of 100 ms just before calling now, then now will return 50100, so the value 50100 is stored in the generationTimestamps and that is the value passed along in the echo message. Since a request takes 3 ms, the latency of that request will be 50103-50100 = 3 ms, because it is based on the actual start time of that request.
But the scheduled start time of the request was 50010, so the actual latency is 50103-50010 = 93 ms. That is a 31x difference.
It doesn't only affect this call, but all calls that should have been made during the 100 ms pause. These calls still get made (which is good), but assuming there are no further stalls, the current code will measure a latency of 3 ms for each of the 9 other calls: 3, 3, 3, ..., 3. In reality, those calls also fire at 50100 and complete at 50103, so their latencies should be 83, 73, 63, ..., 13, 3. The measured latencies of calls that should have happened during the stall are therefore incorrect.
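A back-of-the-envelope check of those numbers, assuming the 10 delayed requests all fire at 50100 when the stall ends and each completes 3 ms later (the class name is illustrative):

```java
public final class StallArithmetic {
    public static void main(final String[] args) {
        final long serviceMs = 3;
        final long stallEndMs = 50_100;                          // stall: 50000..50100
        for (long scheduledMs = 50_010; scheduledMs <= 50_100; scheduledMs += 10) {
            final long completionMs = stallEndMs + serviceMs;    // all fire at 50100
            final long measuredMs = completionMs - stallEndMs;   // actual start: always 3
            final long trueMs = completionMs - scheduledMs;      // scheduled start: 93..3
            System.out.printf("scheduled=%d measured=%dms true=%dms%n",
                    scheduledMs, measuredMs, trueMs);
        }
    }
}
```

It prints a measured latency of 3 ms for every request, while the latency against the scheduled start time decreases from 93 ms down to 3 ms.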
This can be seen as a form of coordinated omission. The main difference is that it isn't caused by the remote system but by the local system, and stalls in the local system do happen.
What should be done is to use the scheduled start time of a request to determine the latency, not the actual start time.
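A minimal sketch of that fix, assuming a single-threaded send loop; busySpinUntil and sendEcho are hypothetical stand-ins for the rig's own code:

```java
import java.util.concurrent.TimeUnit;

public final class ScheduledStartLoop {
    public static void main(final String[] args) {
        final long startNs = System.nanoTime();
        final long intervalNs = TimeUnit.MILLISECONDS.toNanos(10); // 100 requests/s
        for (long sequence = 0; sequence < 1_000; sequence++) {
            // The timestamp is derived from the schedule, never from the clock,
            // so time lost to a local stall is charged to every delayed request.
            final long scheduledNs = startNs + sequence * intervalNs;
            busySpinUntil(scheduledNs); // may return late after a stall; send anyway
            sendEcho(scheduledNs);      // the echo carries the *scheduled* time
        }
    }

    private static void busySpinUntil(final long deadlineNs) {
        while (System.nanoTime() - deadlineNs < 0) { } // spin until the request is due
    }

    private static void sendEcho(final long scheduledNs) {
        // Placeholder: the real rig would send scheduledNs to the remote system
        // and, on the echo's return, compute latency = now - scheduledNs.
    }
}
```

With this scheme, the 100 ms stall in the example above shows up in the latency of all 10 affected requests instead of disappearing.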