Fix udp port check retry and check all udp ports #10385
ryoqun merged 8 commits into solana-labs:master
Conversation
Odd, a test started to fail exactly because of this change...
mvines
left a comment
Thanks! Would you mind adding a new test for this around test_get_public_ip_addr() as well please? Clearly there's not enough test coverage here!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
fdd0d05 to 8f30ba5
    if let Some(ref cluster_entrypoint) = cluster_entrypoint {
        let udp_sockets = [
            node.sockets.tpu.first(),
@mvines Well, this was broken and caused actual CI failures in this PR...
That's because we open many sockets on the same port, and there is no guarantee that the OS delivers the echoed-back packets to the first socket. We must recv from all of them.
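A minimal sketch of "recv from all of them", using only the standard library. `recv_from_any` is a hypothetical helper, not the function changed in this PR, and the 50 ms per-socket timeout is an arbitrary assumption. `std` cannot bind several sockets to one port, so any sockets used to try it out sit on different ports; the polling logic would be the same for port-sharing sockets.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::{Duration, Instant};

/// Polls every socket in turn until one of them receives a datagram or the
/// overall deadline passes. Returns the index of the socket that got data.
/// Hypothetical helper for illustration only.
pub fn recv_from_any(
    sockets: &[UdpSocket],
    overall: Duration,
) -> std::io::Result<Option<usize>> {
    // A short per-socket timeout keeps one silent socket from starving the others.
    for socket in sockets {
        socket.set_read_timeout(Some(Duration::from_millis(50)))?;
    }
    let deadline = Instant::now() + overall;
    let mut buf = [0u8; 64];
    while Instant::now() < deadline {
        for (index, socket) in sockets.iter().enumerate() {
            match socket.recv_from(&mut buf) {
                Ok(_) => return Ok(Some(index)),
                // A read timeout surfaces as WouldBlock or TimedOut depending on the platform.
                Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {}
                Err(e) => return Err(e),
            }
        }
    }
    Ok(None)
}
```

The caller learns which socket received the packet, but for a reachability check it only matters that one of them did.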
        let udp_sockets = [
            node.sockets.tpu.first(),
            /*
              Enable these ports when `IpEchoServerMessage` supports more than 4 UDP ports:
I've started to support these as well, as a side effect of handling sockets that share the same port.
For this, I only needed chunks().
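A rough sketch of how `chunks()` can fit any number of ports into a fixed four-slot message. `PortRequest`, its `udp_ports` field, and `build_requests` are hypothetical stand-ins; the real `IpEchoServerMessage` has more fields and is serialized with serde. Treating 0 as "no port" is also an assumption of this sketch.

```rust
/// Hypothetical stand-in for the fixed-size request message.
#[derive(Debug, Default, PartialEq)]
pub struct PortRequest {
    pub udp_ports: [u16; 4],
}

/// Splits any number of ports into requests of at most four ports each.
/// Unused slots stay 0, which this sketch treats as "no port".
pub fn build_requests(ports: &[u16]) -> Vec<PortRequest> {
    ports
        .chunks(4)
        .map(|chunk| {
            let mut request = PortRequest::default();
            // The last chunk may be shorter than four, so copy only what is there.
            request.udp_ports[..chunk.len()].copy_from_slice(chunk);
            request
        })
        .collect()
}
```

Sending one request per chunk keeps the wire format unchanged while lifting the four-port limit.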
                Err(err) => warn!("udp recv failure: {}", err),
            }
        });
        match receiver.recv_timeout(Duration::from_secs(5)) {
Well, I dunno why channel() is needed... I've simplified it by just using socket.set_read_timeout().
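For illustration, a bounded wait can be done on the calling thread with a read timeout, without a helper thread or channel. `wait_for_datagram` is a hypothetical name, and resetting the timeout afterwards is a choice made for this sketch, not something the PR is known to do.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Duration;

/// Waits up to `timeout` for a single datagram on `socket`.
/// Returns Ok(true) if one arrived and Ok(false) if the wait timed out.
/// Hypothetical helper for illustration only.
pub fn wait_for_datagram(socket: &UdpSocket, timeout: Duration) -> std::io::Result<bool> {
    // set_read_timeout rejects a zero duration, so callers must pass a non-zero one.
    socket.set_read_timeout(Some(timeout))?;
    let mut buf = [0u8; 64];
    let outcome = match socket.recv_from(&mut buf) {
        Ok(_) => Ok(true),
        // A read timeout surfaces as WouldBlock or TimedOut depending on the platform.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Ok(false),
        Err(e) => Err(e),
    };
    // Restore blocking reads so later users of the socket are unaffected.
    socket.set_read_timeout(None)?;
    outcome
}
```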
        let port = udp_socket.local_addr().unwrap().port();
        let udp_socket = udp_socket.try_clone().expect("Unable to clone udp socket");
        let (sender, receiver) = channel();
        std::thread::spawn(move || {
Well, not join()-ing is in bad taste. I'm pretty sure this either leaks a thread or reads actual data from the socket at some arbitrary time after the validator has really started to boot...
Oops, I'll fix this for TCP as well...
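If threads are used at all, one way to avoid the leak described above is to join every receiver thread before returning. The following is only a sketch; `unreachable_ports` is a hypothetical helper, and counting every receive error as "unreachable" is a simplifying assumption.

```rust
use std::net::UdpSocket;
use std::thread;
use std::time::Duration;

/// Spawns one receiver thread per socket and joins every one of them before
/// returning, so no thread is left reading from a socket afterwards.
/// Returns the local ports on which nothing arrived within `timeout`.
/// Hypothetical helper for illustration only.
pub fn unreachable_ports(sockets: &[UdpSocket], timeout: Duration) -> Vec<u16> {
    let handles: Vec<_> = sockets
        .iter()
        .map(|socket| {
            let port = socket.local_addr().expect("local_addr").port();
            let socket = socket.try_clone().expect("Unable to clone udp socket");
            thread::spawn(move || {
                socket
                    .set_read_timeout(Some(timeout))
                    .expect("set_read_timeout");
                let mut buf = [0u8; 64];
                // Any error, including a timeout, counts as "not reachable" here.
                (port, socket.recv_from(&mut buf).is_ok())
            })
        })
        .collect();

    // Joining guarantees that every receiver has finished before we return.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("receiver thread panicked"))
        .filter(|&(_, received)| !received)
        .map(|(port, _)| port)
        .collect()
}
```

Because each thread has a read timeout, the joins are bounded by `timeout` rather than blocking forever.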
    #[derive(Serialize, Deserialize, Default)]
    pub(crate) struct IpEchoServerMessage {
        tcp_ports: [u16; 4], // Fixed size list of ports to avoid vec serde
Let's strive not to break the ABI. ;)
mvines
left a comment
looking much better than my previous code :)
    if let Some(ref cluster_entrypoint) = cluster_entrypoint {
        let udp_sockets = [
            node.sockets.tpu.first(),

        let udp_sockets = [
            node.sockets.tpu.first(),
            /*
              Enable these ports when `IpEchoServerMessage` supports more than 4 UDP ports:

            &node.sockets.repair,
            &node.sockets.serve_repair,
        ];
        udp_sockets.extend(node.sockets.tpu.iter().take(3));
Codecov Report
@@            Coverage Diff            @@
##           master   #10385     +/-   ##
=========================================
- Coverage    81.7%    81.7%    -0.1%
=========================================
  Files         297      297
  Lines       69981    70045      +64
=========================================
+ Hits        57210    57261      +51
- Misses      12771    12784      +13
            &node.sockets.repair,
            &node.sockets.serve_repair,
        ];
        udp_sockets.extend(node.sockets.tpu.iter());
I'm pretty sure there is a more elegant way to write this?
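One possible alternative, sketched under assumptions, is a single iterator chain instead of an array followed by `extend`. The `Sockets` struct here is a hypothetical, trimmed-down stand-in that contains only the fields visible in the snippet above; the real socket set has more members.

```rust
use std::iter;
use std::net::UdpSocket;

/// Hypothetical, trimmed-down stand-in for the validator's socket set.
pub struct Sockets {
    pub repair: UdpSocket,
    pub serve_repair: UdpSocket,
    pub tpu: Vec<UdpSocket>,
}

/// Collects every UDP socket into one list with a single iterator chain
/// instead of building an array and then extending a Vec.
pub fn all_udp_sockets(sockets: &Sockets) -> Vec<&UdpSocket> {
    iter::once(&sockets.repair)
        .chain(iter::once(&sockets.serve_repair))
        .chain(sockets.tpu.iter())
        .collect()
}
```

Whether this reads better than the array plus `extend` is a matter of taste; it does avoid the mutable intermediate.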
@mvines I've added tests and polished the impl a bit!!
mvines
left a comment
Thanks for all the effort here, this is so much better now
* Don't start if udp port is really closed
* Fully check all udp ports
* Remove test code.......
* Add tests and adjust impl a bit
* Add comment
* Move comment a bit
* Move a bit
* clean ups

(cherry picked from commit a39df7e)

# Conflicts:
#	validator/src/main.rs

* Don't start if udp port is really closed
* Fully check all udp ports
* Remove test code.......
* Add tests and adjust impl a bit
* Add comment
* Move comment a bit
* Move a bit
* clean ups

(cherry picked from commit a39df7e)

Problem
The validator starts even though some of its UDP ports are closed.
Also, it doesn't test all of its listening ports.
Summary of Changes
Really abort with a failure after the maximum number of retries.
Also, test all the ports.
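The retry behavior described in this summary could look roughly like the sketch below. `check_with_retries` and its signature are hypothetical; the point is only that the function returns an error once the attempts are exhausted, so the caller has to abort instead of continuing to boot.

```rust
/// Runs `check` up to `max_retries` times and reports failure if it never
/// succeeds. On success, returns the attempt number that passed.
/// Hypothetical helper for illustration only.
pub fn check_with_retries<F>(max_retries: usize, mut check: F) -> Result<usize, String>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=max_retries {
        if check() {
            return Ok(attempt);
        }
    }
    // Returning Err here is the point: the caller must abort startup
    // rather than carry on after the last failed attempt.
    Err(format!("port check failed after {} attempts", max_retries))
}
```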
Context
Found via last-minute checking of #10209
Follow-up to #10181 and #10291.