background task for service zone nat #4857
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,18 @@ use std::net::{IpAddr, SocketAddr}; | |
| use std::num::NonZeroU32; | ||
| use std::sync::Arc; | ||
|
|
||
| // Minimum number of boundary NTP zones that should be present in a valid | ||
| // set of service zone nat configurations. | ||
| const MIN_NTP_COUNT: usize = 1; | ||
|
|
||
| // Minimum number of nexus zones that should be present in a valid | ||
| // set of service zone nat configurations. | ||
| const MIN_NEXUS_COUNT: usize = 3; | ||
|
|
||
| // Minimum number of external DNS zones that should be present in a valid | ||
| // set of service zone nat configurations. | ||
| const MIN_EXTERNAL_DNS_COUNT: usize = 1; | ||
|
|
||
| /// Background task that ensures service zones have nat entries | ||
| /// persisted in the NAT RPW table | ||
| pub struct ServiceZoneNatTracker { | ||
|
|
@@ -93,6 +105,9 @@ impl BackgroundTask for ServiceZoneNatTracker { | |
| }; | ||
|
|
||
| let mut ipv4_nat_values: Vec<Ipv4NatValues> = vec![]; | ||
| let mut ntp_count = 0; | ||
| let mut nexus_count = 0; | ||
| let mut dns_count = 0; | ||
|
|
||
| for (sled_id, zones_found) in collection.omicron_zones { | ||
| let (_, sled) = match LookupPath::new(opctx, &self.datastore) | ||
|
|
@@ -121,6 +136,7 @@ impl BackgroundTask for ServiceZoneNatTracker { | |
| zones_found.zones; | ||
| let zones: Vec<sled_agent_client::types::OmicronZoneConfig> = | ||
| zones_config.zones; | ||
|
|
||
| for zone in zones { | ||
| let zone_type: OmicronZoneType = zone.zone_type; | ||
| match zone_type { | ||
|
||
|
|
@@ -157,6 +173,7 @@ impl BackgroundTask for ServiceZoneNatTracker { | |
|
|
||
| // Append ipv4 nat entry | ||
| ipv4_nat_values.push(nat_value); | ||
| ntp_count += 1; | ||
| } | ||
| OmicronZoneType::Nexus { nic, external_ip, .. } => { | ||
| let external_ip = match external_ip { | ||
|
|
@@ -189,6 +206,7 @@ impl BackgroundTask for ServiceZoneNatTracker { | |
|
|
||
| // Append ipv4 nat entry | ||
| ipv4_nat_values.push(nat_value); | ||
| nexus_count += 1; | ||
| }, | ||
| OmicronZoneType::ExternalDns { nic, dns_address, .. } => { | ||
| let socket_addr: SocketAddr = match dns_address.parse() { | ||
|
|
@@ -235,6 +253,7 @@ impl BackgroundTask for ServiceZoneNatTracker { | |
|
|
||
| // Append ipv4 nat entry | ||
| ipv4_nat_values.push(nat_value); | ||
| dns_count += 1; | ||
| }, | ||
| // we explicitly list all cases instead of using a wildcard, | ||
| // that way if someone adds a new type to OmicronZoneType that | ||
|
|
@@ -265,6 +284,38 @@ impl BackgroundTask for ServiceZoneNatTracker { | |
| }); | ||
| } | ||
|
|
||
| if dns_count < MIN_EXTERNAL_DNS_COUNT { | ||
| error!( | ||
| &log, | ||
| "generated config for fewer than the minimum allowed number of dns zones"; | ||
| ); | ||
| return json!({ | ||
| "error": "generated config for fewer than the minimum allowed number of dns zones" | ||
| }); | ||
| } | ||
|
|
||
| if ntp_count < MIN_NTP_COUNT { | ||
| error!( | ||
| &log, | ||
| "generated config for fewer than the minimum allowed number of ntp zones"; | ||
| ); | ||
| return json!({ | ||
| "error": "generated config for fewer than the minimum allowed number of ntp zones" | ||
|
|
||
| }); | ||
| } | ||
|
|
||
| if nexus_count < MIN_NEXUS_COUNT { | ||
|
Collaborator
Example where this could go awry:
Contributor
Author
Yes, that is correct. Will we not attempt to move the service zone to a new sled in this situation?
Collaborator
As soon as we implement service re-provisioning, this would happen, but I think the scope of "new service provisions" is going to start with only Crucible and NTP. Until that is fully implemented, this check would just stop NAT propagation.
||
| error!( | ||
| &log, | ||
| "generated config for fewer than the minimum allowed number of nexus zones"; | ||
| ); | ||
| return json!({ | ||
| "error": "generated config for fewer than the minimum allowed number of nexus zones" | ||
|
|
||
| }); | ||
| } | ||
|
|
||
| // reconcile service zone nat entries | ||
| let result = match self.datastore.ipv4_nat_sync_service_zones(opctx, &ipv4_nat_values).await { | ||
|
Collaborator
We discussed this in chat, but this whole flow has some implications for our calls to
I think that's okay, but we should add documentation around the
Collaborator
Also, for posterity, I created this drawing to map out the data flow here. My "TLDR" of the above is that I wanted to ensure we avoided having loops in this data flow graph: (Source: https://docs.google.com/drawings/d/19MkoKsgZ8vuPng6uKaCF1hG2hiHI9735jv9ThLgajVM/edit?usp=sharing )
Contributor
Author
I think this is great feedback, adding some documentation to it now. I think one day we may get to a place where we can lift the NAT logic out of the service zone creation functionality in sled-agent, which could help us move towards a more consistent pattern of interacting with this table.
||
| Ok(num) => num, | ||
|
|
||

I'm not sure we should do this -- it's possible that one of the three Nexus instances has gone away, and we'd still want our NAT entries to be up-to-date.
This is a reasonable goal to try to achieve in the "graceful shutdown" case -- if we want three Nexus instances to run, we should provision a 4th one before removing one of the original 3 -- but if a sled is yanked from the rack, we do have two Nexus instances. That's just a truth! We should aspire to have the blueprint creator create a new Nexus service, provision it, and get the NAT entries populated, but it's very possible to run under our redundancy expectations. That's why we use redundancy!
Would setting all of the minimums to 1 be a reasonable compromise? If we don't have at least 1 Nexus, NTP, and ExternalDns zone that can be found in inventory, I would think we have more serious problems, no?
Chatting with some folks in the update sync today, it sounds like the inventory system -- which gathers the inventory in a "collection", which contains a lot of objects -- may, at any time, simply "not report a sled" within that collection. Theoretically, that means we could see an "inventory collection" that doesn't contain any sleds which contain Nexus, NTP, external DNS, etc.
This has some weird implications for depending on the inventory system as the source-of-truth here, but without a full implementation of blueprints, I acknowledge that there isn't a great alternative yet.
So: "Could we set all the minimums to 1?" That would stop us from propagating the state of the inventory system if we saw a blip that eliminated all of these critical zones. So in that sense, it's arguably better than not doing the check! I also think it avoids "breaking" NAT when we're under-provisioned, as I mentioned in the case below.
If we're on the same page that, eventually, the right source-of-truth is "info from the blueprint, somehow", I think this is a reasonable intermediate step.
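For illustration only, the compromise discussed above (all minimums set to 1) might be sketched as follows. `ZoneCounts` and `check_minimums` are hypothetical names that do not appear in the diff, and the constant values reflect the proposal in this thread rather than the code as submitted. This is a sketch under those assumptions, not the PR's implementation.

```rust
// Hypothetical sketch: `ZoneCounts` and `check_minimums` are illustrative
// names, and the minimums below are the values proposed in this thread.
const MIN_NTP_COUNT: usize = 1;
const MIN_NEXUS_COUNT: usize = 1;
const MIN_EXTERNAL_DNS_COUNT: usize = 1;

/// Number of zones of each kind found in one inventory collection.
#[derive(Debug, Default, Clone, Copy)]
struct ZoneCounts {
    ntp: usize,
    nexus: usize,
    external_dns: usize,
}

/// Returns an error message if any zone kind is below its minimum, so the
/// caller can skip NAT reconciliation for this collection.
fn check_minimums(counts: ZoneCounts) -> Result<(), String> {
    let checks = [
        ("ntp", counts.ntp, MIN_NTP_COUNT),
        ("nexus", counts.nexus, MIN_NEXUS_COUNT),
        ("dns", counts.external_dns, MIN_EXTERNAL_DNS_COUNT),
    ];
    for (name, found, min) in checks {
        if found < min {
            return Err(format!(
                "generated config for fewer than the minimum allowed number of {} zones (found {}, need {})",
                name, found, min
            ));
        }
    }
    Ok(())
}
```

With these values, a collection reporting two Nexus zones (for example, after a sled is pulled) would still pass the check, while a collection that reports no Nexus, NTP, or external DNS zones at all would be rejected before any NAT entries are reconciled.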
Yeah, it's my understanding that blueprints are the future, inventory is what we're using for now.
(also, I appreciate you tolerating all this churn. Getting RPWs + NAT propagation right is tricky, and having this portion of the system be "not-totally-ready" does make this extra hairy. Thank you for pushing through regardless)
No worries! I know this is critical so I appreciate the extra eyes and feedback!