major redesign, reprising connection as DuplexConnection #102
Conversation
I'll take a deeper look at this in a few days. In the meantime--all the srt-protocol integration tests are meant to run much faster than realtime; they never check the wall clock, they just simulate time passing--so are you saying that the algorithm now takes a full CPU core to run? Also, I've been using this test to see how well it can deal with higher-bandwidth connections--I should check it in and just set it to be ignored, but for now you can grab it with. I suggest you run it with.
Also, nice: apparently proptest ICEs the nightly compiler! 🥳 rust-lang/rust#85128
The problem is that sending data should be limited by the SND timer; this is what enables congestion control to function. The current implementation can potentially run the sender algorithm every time the next "action" is pulled, regardless of the send timer. Sending should only happen when the clock ticks forward far enough to trigger SND, but enforcing this added some computational cost to the faster-than-real-time simulation tests. The algorithm shouldn't take a full CPU core to run; the tests just run noticeably slower. This could also be due to inefficiencies in the test harnesses/simulation code. I haven't spent much time optimizing yet.
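For illustration, here is a minimal sketch, with hypothetical names (not the crate's actual API), of gating data sends on the SND timer so the sender only releases a packet when simulated time has advanced far enough to trigger SND:

```rust
use std::time::{Duration, Instant};

struct SndTimer {
    period: Duration,   // chosen by congestion control
    next_fire: Instant, // earliest time the next data packet may be sent
}

impl SndTimer {
    fn new(period: Duration, now: Instant) -> Self {
        SndTimer { period, next_fire: now + period }
    }

    /// Returns true at most once per period, no matter how often it is polled.
    fn fired(&mut self, now: Instant) -> bool {
        if now >= self.next_fire {
            self.next_fire += self.period;
            true
        } else {
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut snd = SndTimer::new(Duration::from_millis(10), start);
    let mut sent = 0;

    // Pull the next "action" every millisecond of simulated time; only one
    // packet per 10 ms SND period should actually be sent.
    for ms in 1..=100u64 {
        let now = start + Duration::from_millis(ms);
        if snd.fired(now) {
            sent += 1; // here the real sender would pop one packet
        }
    }
    assert_eq!(sent, 10);
    println!("sent {} packets over 100 ms of simulated time", sent);
}
```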
Thanks, that is helpful.
Overall this is much cleaner, and I would be happy to merge it with some more work.
@@ -118,7 +108,7 @@ pub struct Receiver {
     receive_buffer: RecvBuffer,

     /// Shutdown flag. This is set so when the buffer is flushed, it returns Async::Ready(None)
-    shutdown_flag: bool,
+    shutdown_flag: Option<Instant>,
Update the comment here--it's no longer a flag--and I'm not sure which particular instant this refers to.
Right, this is a shutdown with a timeout. Many tests would fail intermittently due to injected packet loss or race conditions; this was particularly true of the tokio tests. The connect and rendezvous algorithm tests were failing intermittently for the same reason, so we need to implement timeouts for them as well.
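A minimal sketch of the idea, assuming the Option<Instant> stores a shutdown deadline rather than a simple boolean flag (field and method names here are illustrative, not the crate's actual ones):

```rust
use std::time::{Duration, Instant};

struct Receiver {
    shutdown_deadline: Option<Instant>,
}

impl Receiver {
    fn begin_shutdown(&mut self, now: Instant, timeout: Duration) {
        // Record when the drain must give up, instead of just setting a flag.
        self.shutdown_deadline = Some(now + timeout);
    }

    fn should_close(&self, now: Instant, buffer_empty: bool) -> bool {
        match self.shutdown_deadline {
            // Close once the buffer is flushed or the timeout expires,
            // so lossy tests cannot stall forever waiting for a flush.
            Some(deadline) => buffer_empty || now >= deadline,
            None => false,
        }
    }
}

fn main() {
    let now = Instant::now();
    let mut rx = Receiver { shutdown_deadline: None };
    rx.begin_shutdown(now, Duration::from_secs(3));
    assert!(!rx.should_close(now, false));
    assert!(rx.should_close(now + Duration::from_secs(4), false));
    println!("shutdown-with-timeout sketch OK");
}
```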
// list before progressing further through the sender algorithm. This
// appears to be inconsistent with the UDT spec. Is it consistent
// with the reference implementation?
return;
I don't think this is right, as it doesn't set the SND timer period--I think it is more correct to not return here. Not sure, though.
I agree, and this is something we ought to cover with a focused unit test. But I also think the "every 16th packet rapid send" part of the algorithm should only run when we send a new data packet; I believe that is why I had it returning here, and I intended to go back and fix it. This is how the reference implementation works.
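A hypothetical sketch (not the crate's real sender) of running the "every 16th packet" rapid-send check only when a new data packet actually goes out, rather than on every pass through the sender algorithm:

```rust
struct RapidSendCounter {
    packets_sent: u64,
}

enum NextSend {
    AfterFullSndPeriod, // normal case: wait out the SND period
    Immediately,        // every 16th packet: skip the SND wait
}

impl RapidSendCounter {
    fn on_data_packet_sent(&mut self) -> NextSend {
        self.packets_sent += 1;
        // Only reached when a new data packet was sent, so polling the sender
        // without sending can never trigger the rapid-send path.
        if self.packets_sent % 16 == 0 {
            NextSend::Immediately
        } else {
            NextSend::AfterFullSndPeriod
        }
    }
}

fn main() {
    let mut counter = RapidSendCounter { packets_sent: 0 };
    let rapid = (0..64)
        .filter(|_| matches!(counter.on_data_packet_sent(), NextSend::Immediately))
        .count();
    assert_eq!(rapid, 4); // packets 16, 32, 48, and 64
    println!("rapid sends: {}", rapid);
}
```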
-fn handle_handshake_packet(&mut self, handshake: HandshakeControlInfo, now: Instant) {
-    if let Some(control_type) = self.handshake.handle_handshake(&handshake) {
+pub fn handle_handshake_packet(&mut self, handshake: HandshakeControlInfo, now: Instant) {
+    if let Some(control_type) = self.handshake.handle_handshake(handshake) {
Consider moving this logic into DuplexConnection? Thoughts?
Agreed. I was incrementally refactoring and just hadn't gotten to it yet. I also intend to move all the connection Drain/Shutdown/Close logic to DuplexConnection, along with the keepalive timer and a few other things.
OK, I want to leave this on the Sender for now, because DuplexConnection doesn't have a means of buffering packets to send yet, and maybe it never will. For this PR I'm going to leave it the way it is. Eventually I will probably explore managing buffers externally using some form of context or dependency injection, perhaps via a "send" Fn, an arena allocator, etc.
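A purely illustrative sketch of that idea: managing send buffers externally by injecting a "send" closure into the sender, instead of having the sender own its output buffer (names here are hypothetical, not code from the PR):

```rust
struct BufferlessSender<F: FnMut(&[u8])> {
    send: F, // injected sink; could also be a trait object or an arena-backed writer
}

impl<F: FnMut(&[u8])> BufferlessSender<F> {
    fn send_data(&mut self, packet: &[u8]) {
        // Hand the packet to the injected sink and let the caller decide how
        // to buffer, batch, or transmit it.
        (self.send)(packet);
    }
}

fn main() {
    let mut out: Vec<Vec<u8>> = Vec::new();
    {
        let mut sender = BufferlessSender {
            send: |p: &[u8]| out.push(p.to_vec()),
        };
        sender.send_data(b"hello");
        sender.send_data(b"world");
    }
    println!("caller buffered {} packets externally", out.len());
}
```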
fn on_snd(&mut self, now: Instant) {
    let now_ts = self.transmit_buffer.timestamp_from(now);
    let window_length = self.metrics.rtt + self.settings.send_tsbpd_latency;
    self.send_buffer.release_late_packets(now_ts, window_length);
Does releasing late packets like this require sending a DropRequest packet?
Yes, I believe so. I implemented this as a fix for a stall condition while draining on close/shutdown, but that case should already be handled by the send_buffer.pop() logic further down in on_snd, so I should either properly implement DropRequest or remove this for now.
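For what "properly implement DropRequest" could look like, here is a hypothetical sketch: when late packets are released from the send buffer, return the dropped sequence range so the caller can emit a drop-request control packet. Types and signatures are illustrative, not the crate's API.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq)]
struct SeqNumber(u32);

struct DropRequest {
    first: SeqNumber, // first sequence number of the dropped range
    last: SeqNumber,  // last sequence number of the dropped range
}

struct SendBuffer {
    // (sequence number, send timestamp in ms, payload)
    packets: VecDeque<(SeqNumber, u32, Vec<u8>)>,
}

impl SendBuffer {
    /// Drop packets that are too late to be delivered within the latency
    /// window, and describe the dropped range so the peer can skip it too.
    fn release_late_packets(&mut self, now_ts: u32, window: u32) -> Option<DropRequest> {
        let mut range: Option<(SeqNumber, SeqNumber)> = None;
        while self
            .packets
            .front()
            .map_or(false, |&(_, sent_ts, _)| now_ts.saturating_sub(sent_ts) > window)
        {
            if let Some((seq, _, _)) = self.packets.pop_front() {
                range = Some(match range {
                    None => (seq, seq),
                    Some((first, _)) => (first, seq),
                });
            }
        }
        range.map(|(first, last)| DropRequest { first, last })
    }
}

fn main() {
    let mut buf = SendBuffer {
        packets: VecDeque::from(vec![
            (SeqNumber(1), 0, vec![0u8]),
            (SeqNumber(2), 10, vec![0u8]),
            (SeqNumber(3), 500, vec![0u8]),
        ]),
    };
    // With a 120 ms window at t = 600 ms, the first two packets are late.
    let req = buf.release_late_packets(600, 120).expect("late packets dropped");
    assert_eq!(req.first, SeqNumber(1));
    assert_eq!(req.last, SeqNumber(2));
    println!("would send DropRequest for {:?}..{:?}", req.first, req.last);
}
```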
I think we should not have that as part of this PR. Either leave it with a TODO comment and a linked issue, or remove it and leave a TODO with a linked issue.
} else if self.status.should_drain() && self.send_buffer.len() == 1 {
    if let Some(packet) = self.send_buffer.pop() {
        self.send_data(packet, now);
        self.lr_acked_packet = self.transmit_buffer.next_sequence_number;
Again here
I intended to at least remove this redundancy.
So things that need to happen before I merge this:
As well as any other cleanup you want to do. Things that don't need to be part of this PR, but could be, or could become new issues:
Codecov Report
@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
+ Coverage   81.37%   82.03%   +0.66%
==========================================
  Files          42       42
  Lines        6373     6764     +391
==========================================
+ Hits         5186     5549     +363
- Misses       1187     1215      +28
Continue to review full report at Codecov.
I'll fix up the not_enough_latency test.
Waiting for CI to run; apparently there's an issue with GitHub Actions right now. Will merge if everything looks good.
@russelltg
It appears that we have started to step on each other's toes, so I'm submitting this mostly finished draft pull request where I have been exploring how to clean up the tokio I/O loop, which led me down the path of reintroducing a connection struct that this time encapsulates the Sender and Receiver. I know this is a very large change set, but please consider the benefits of the design direction I am exploring here. I am contradicting my prior advocacy for keeping the sender and receiver algorithms free of communication and capable of running concurrently; since we now run the whole I/O loop from one thread, concurrency is less important anyhow. I hope you agree that this has significantly simplified the code. I have two different versions of the I/O loop that are alternated somewhat randomly at runtime. One approach uses an Action/Input enum pair and a single function to drive the loop, similar to the prior design of the Sender/Receiver enums. The other approach uses a set of command and query methods to drive the loop. I prefer the ergonomics of the enum approach, and it made debug tracing the code trivial too.
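To show the shape of the Action/Input style loop, here is a minimal sketch; the enum variants and the DuplexConnection method are hypothetical stand-ins, not the PR's exact definitions:

```rust
use std::time::Instant;

enum Input {
    PacketReceived(Vec<u8>), // datagram read from the socket
    DataToSend(Vec<u8>),     // payload handed in by the application
    TimerElapsed,            // the previously requested wait has elapsed
}

enum Action {
    SendPacket(Vec<u8>),  // hand a serialized packet to the socket
    ReleaseData(Vec<u8>), // deliver received payload to the application
    WaitUntil(Instant),   // nothing to do until the next timer (SND, ACK, ...)
}

struct DuplexConnection {
    next_timer: Instant,
}

impl DuplexConnection {
    fn handle_input(&mut self, _now: Instant, input: Input) -> Action {
        // The real connection would update sender/receiver state here; this
        // sketch only echoes inputs back as the corresponding actions.
        match input {
            Input::PacketReceived(data) => Action::ReleaseData(data),
            Input::DataToSend(data) => Action::SendPacket(data),
            Input::TimerElapsed => Action::WaitUntil(self.next_timer),
        }
    }
}

fn main() {
    let now = Instant::now();
    let mut conn = DuplexConnection { next_timer: now };

    // Debug tracing the loop is trivial: each step is one Input in, one Action out.
    for input in [
        Input::DataToSend(b"payload".to_vec()),
        Input::PacketReceived(vec![0x80, 0x00]),
        Input::TimerElapsed,
    ] {
        match conn.handle_input(now, input) {
            Action::SendPacket(p) => println!("send {} bytes to the socket", p.len()),
            Action::ReleaseData(d) => println!("release {} bytes to the app", d.len()),
            Action::WaitUntil(_) => println!("wait for the next timer"),
        }
    }
}
```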
One of the larger tests you added recently doesn't compile with all the changes I made, so I commented it out. I don't have time to go back and fix it right now, but I can later; I just felt it was important that you take a look at the changes I've made before our paths diverge too far.
I also fixed several bugs that I ran into while testing. For instance, we weren't respecting SND in the sender algorithm, which broke congestion control. The timestamp_rollover test ran much faster than real time with this bug, but with the fix it no longer does, so I have set it to ignored for now. We should replace this long-running test with a few more focused unit tests; in fact, it might already be mostly covered by the existing TimeStamp unit tests.
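As a sketch of the kind of focused check that could replace the long-running rollover test: the TimeStamp type below is a hypothetical wrapping 32-bit microsecond counter defined inline, not the crate's actual type.

```rust
use std::num::Wrapping;

#[derive(Clone, Copy, Debug, PartialEq)]
struct TimeStamp(Wrapping<u32>);

impl TimeStamp {
    fn from_micros(us: u32) -> Self {
        TimeStamp(Wrapping(us))
    }

    /// Microseconds elapsed since `earlier`, computed with wrapping
    /// arithmetic so it stays correct across the 32-bit rollover.
    fn wrapping_diff(self, earlier: TimeStamp) -> u32 {
        (self.0 - earlier.0).0
    }
}

fn main() {
    // Near the 32-bit boundary the raw values wrap, but the elapsed time
    // stays small and positive.
    let before = TimeStamp::from_micros(u32::MAX - 5);
    let after = TimeStamp::from_micros(10); // 16 microseconds later, post-rollover
    assert_eq!(after.wrapping_diff(before), 16);
    println!("rollover-safe elapsed: {} us", after.wrapping_diff(before));
}
```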
I'd like to finish this up and move on to a first pass at metrics, so quick feedback would be appreciated.