Merged
58 commits
b78ff98
[nexus] Split Nexus configuration (package vs runtime)
smklein Jun 8, 2022
cca5795
Merge branch 'main' into nexus-argsplit
smklein Jun 8, 2022
fccc15c
Ensure postgres config was just a rename
smklein Jun 8, 2022
2443215
Merge branch 'main' into nexus-argsplit
smklein Jun 8, 2022
a077bd4
review feedback
smklein Jun 8, 2022
f91cea1
Merge branch 'main' into nexus-argsplit
smklein Jun 8, 2022
d16eda2
DNS client
smklein Jun 8, 2022
8db30b7
Add concurrency
smklein Jun 8, 2022
3a0c6ba
comment
smklein Jun 8, 2022
33b3e02
fmt
smklein Jun 8, 2022
3eb57dc
lockfile
smklein Jun 8, 2022
39aa9ff
Merge branch 'main' into nexus-argsplit
smklein Jun 15, 2022
dd04a67
s/runtime/deployment
smklein Jun 15, 2022
63b6379
Merge branch 'nexus-argsplit' into dns-client
smklein Jun 15, 2022
e1dc941
[nexus][sled-agent] Generate rack ID in RSS, plumb it through Nexus
smklein Jun 15, 2022
a4309ac
need rack_id in the test config too
smklein Jun 15, 2022
02f592d
Merge branch 'main' into nexus-argsplit
smklein Jun 20, 2022
ff2d7b9
[internal-dns] Avoid 'picking ports'
smklein Jun 20, 2022
a261155
Merge branch 'nexus-argsplit' into dns-client
smklein Jun 20, 2022
6cc7864
Merge branch 'fix-internal-dns-api' into dns-client
smklein Jun 20, 2022
2a035a5
Changes from rss-handoff
smklein Jun 20, 2022
e84faaf
Merge branch 'dns-client' into rack-id
smklein Jun 20, 2022
c3a49bb
[nexus] Add a new user for background tasks
smklein Jun 20, 2022
1e0b8fe
Merge branch 'main' into nexus-argsplit
smklein Jun 21, 2022
da4a2b8
Merge branch 'nexus-argsplit' into fix-internal-dns-api
smklein Jun 21, 2022
d7b10cf
Merge branch 'fix-internal-dns-api' into dns-client
smklein Jun 21, 2022
bb9a3af
Merge branch 'dns-client' into rack-id
smklein Jun 21, 2022
fed4a3d
Merge branch 'rack-id' into background-work-user
smklein Jun 21, 2022
4df23c2
jgallagher feedback
smklein Jun 21, 2022
71f3aac
Merge branch 'fix-internal-dns-api' into dns-client
smklein Jun 21, 2022
5556d5f
Patch tests
smklein Jun 21, 2022
226fd94
Merge branch 'fix-internal-dns-api' into dns-client
smklein Jun 21, 2022
6126e41
merge
smklein Jun 21, 2022
b01bffd
Merge branch 'dns-client' into rack-id
smklein Jun 21, 2022
d09c8d5
Merge branch 'rack-id' into background-work-user
smklein Jun 21, 2022
e4f434f
Merge branch 'main' into nexus-argsplit
smklein Jun 21, 2022
62fccb2
Merge branch 'nexus-argsplit' into fix-internal-dns-api
smklein Jun 21, 2022
1905985
Merge branch 'fix-internal-dns-api' into dns-client
smklein Jun 21, 2022
1a0b61b
Merge branch 'dns-client' into rack-id
smklein Jun 21, 2022
f5ee394
Merge branch 'rack-id' into background-work-user
smklein Jun 21, 2022
d6e3c9d
background-work -> service-balancer
smklein Jun 22, 2022
fd8286a
Merge branch 'main' into dns-client
smklein Jun 22, 2022
bed0269
Merge branch 'dns-client' into rack-id
smklein Jun 22, 2022
ef6072d
Merge branch 'rack-id' into background-work-user
smklein Jun 22, 2022
b959c39
Merge branch 'main' into dns-client
smklein Jun 23, 2022
470da8b
review feedback
smklein Jun 24, 2022
a23a036
Merge branch 'dns-client' into rack-id
smklein Jun 24, 2022
56d2e1c
Merge branch 'rack-id' into background-work-user
smklein Jun 24, 2022
13b9825
Merge branch 'main' into dns-client
smklein Jun 24, 2022
e1a912f
Merge branch 'dns-client' into rack-id
smklein Jun 24, 2022
28d87f5
Merge branch 'rack-id' into background-work-user
smklein Jun 24, 2022
5fa89fe
Merge branch 'main' into dns-client
smklein Jun 24, 2022
a5fb65a
Merge branch 'dns-client' into rack-id
smklein Jun 24, 2022
a5784c1
Merge branch 'rack-id' into background-work-user
smklein Jun 24, 2022
f7d7796
Merge branch 'main' into dns-client
smklein Jun 24, 2022
01a5fa5
Merge branch 'dns-client' into rack-id
smklein Jun 24, 2022
52357a6
Merge branch 'rack-id' into background-work-user
smklein Jun 24, 2022
71b40e7
Merge branch 'main' into background-work-user
smklein Jun 24, 2022
2 changes: 2 additions & 0 deletions common/src/nexus_config.rs
@@ -102,6 +102,8 @@ pub enum Database {
pub struct DeploymentConfig {
/// Uuid of the Nexus instance
pub id: Uuid,
/// Uuid of the Rack where Nexus is executing.
pub rack_id: Uuid,
Collaborator:

Is this really supposed to be a rack, or is this the instance of the control plane (which we currently take to be the Fleet, although that's not quite right either)? (I think it'd be a bad idea to depend on rack_id == a unique identifier for the control plane and I've been trying to avoid that.)

Collaborator Author:

It was intended to identify the rack on which this Nexus is running.

RSS initializes this value and transfers it to Nexus, as of #1217.

Note, the major change of that patch is the source of that rack ID, rather than the existence of a rack ID at all. Previously, the rack ID was randomly generated by Nexus in nexus/src/lib.rs:

let rack_id = Uuid::new_v4();

So, that change moves the source of the rack ID from "random every time nexus boots" to "set once during RSS initialization, then stored in the DB".
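
For contrast, a minimal sketch of the new flow, trimmed to the relevant field (illustrative only; the actual DeploymentConfig in this PR carries additional fields):

use uuid::Uuid;

// After this change, the rack ID is read from the deployment config, which
// RSS populates once during rack initialization, instead of being generated
// at every Nexus boot.
pub struct DeploymentConfig {
    pub id: Uuid,      // Uuid of this Nexus instance
    pub rack_id: Uuid, // Uuid of the rack, set by RSS
}

pub fn rack_id_for(config: &DeploymentConfig) -> Uuid {
    config.rack_id
}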

Collaborator:

> Note, the major change of that patch is the source of that rack ID, rather than the existence of a rack ID at all.

Got it. If I'm reading the code in "main" correctly right now, it seems like maybe this can go away entirely? We store "rack_id" in the Nexus struct but only so that we can implement racks_list and rack_get. If we're moving the rack record to the database, then I imagine we don't need these any more?

Collaborator Author:

We still need a way to instruct Nexus "which rack should we be querying from the DB" to check the RSS handoff described in RFD 278. This is currently done by passing this rack_id value, and checking it on boot.

Additionally, when performing background tasks in a subsequent PR (#1241), Nexus attempts to manipulate state which is local to the current rack - such as asking the question, "do I have enough CRDB instances on my rack?"

Some state is still larger than rack scope - for example, internal DNS servers are allocated within a subnet that is shared AZ-wide, so they are allocated without referencing the rack ID.

However, in general, I had the expectation that "each rack would be running at least one Nexus," so it could be in charge of managing rack-wide state. Is this wrong? Should a single Nexus be trying to ensure that all services are running across all racks?

Collaborator:

> We still need a way to instruct Nexus "which rack should we be querying from the DB" to check the RSS handoff described in RFD 278. This is currently done by passing this rack_id value, and checking it on boot.

Are you talking about this part of RFD 278:

Nexus "remembers" that this handoff has completed by recording the service as initialized within CRDB. This is done by marking a column in the associated row of the "Rack" table.

I think the scope of this is "the control plane instance", not "the rack". If we had two racks, whether they were set up together or not, I think we're only going to go through the RFD 57 process (and Nexus handoff) once. So I think this state should probably be global in CockroachDB, not scoped to a rack.

> Additionally, when performing background tasks in a subsequent PR (#1241), Nexus attempts to manipulate state which is local to the current rack - such as asking the question, "do I have enough CRDB instances on my rack?"
> ...
> However, in general, I had the expectation that "each rack would be running at least one Nexus," so it could be in charge of managing rack-wide state. Is this wrong? Should a single Nexus be trying to ensure that all services are running across all racks?

I'd assumed it was the latter. The original goal was that Nexus instances are fungible at least within an AZ. That's also why RFD 248 considers "nexus" a single DNS service, rather than having each group-of-Nexuses-within-a-rack have its own DNS name.

I gather you're assuming (1) there's at least one Nexus in each rack and (2) Nexus instances are only responsible for managing their own rack. I think that's got several downsides. We can have multiple Nexus instances within a rack, so we still need to build some coordination to avoid them stomping over each other. Once we've built that, it's extra work (that we otherwise wouldn't need to do) to constrain their responsibilities to only cover their own rack. It's also extra work to enforce the deployment constraints and verify them with runtime monitoring. I think it'd be a lot simpler to say that the Nexus instances need to coordinate on changes (which they need to do anyway, as I mentioned) and then all we need to ensure is that there are enough Nexus instances in enough different racks to maintain our availability goals. That leaves the deployment system with much more flexibility about where to put them.

I think eliminating constraint (2) also has a big impact on efficiency. With that constraint, we basically need two Nexus instances in each rack to survive a single failure and we need three instances to survive a double failure. So a 10-rack deployment would need 30 Nexus instances to survive 2 failures. Without this constraint (i.e., letting Nexus instances operate on all racks, with coordination) you'd only need 5 instances to survive any two failures.

Contributor:

This is called a ClusterId in RFD 238.

Collaborator Author (@smklein, Jun 24, 2022):

I don't think I understand how you want me to eliminate this constraint in the current implementation.

To fully satisfy the one-to-many relationship of Nexuses to racks, we'd need to:

  • Build a concept of a fleet, and have that be populated somewhere
  • Ensure that when RSS performs initialization, it does not format CRDB instances from scratch; instead, define the mechanism for integrating with the fleet-wide DB. This means going back to the design of RFD 278, because we'd need to distinguish the "adding a rack to an initialized fleet" case from the "adding a rack to an uninitialized fleet" case.
  • The same "integration into the rack handoff logic", but for Clickhouse
  • Rebuild the mechanism for "service balancing" to work across all racks in a fleet

To be clear, I think that is the right long-term direction, but I'm also concerned it's a lot of work for something that isn't relevant for v1.

My prototype implementation of service management within Nexus can specify ServiceRedundancy as an enum, and it explicitly specifies Per-Rack, because that's all we have right now. When we approach the multi-rack case, we could change this redundancy value to "per-fleet" or "per-az" - but I hesitate to include that now, since we won't really be able to validate that without building out more genuine end-to-end multi-rack support.
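
A minimal sketch of what such an enum might look like (hypothetical names; the actual prototype in #1241 may differ):

// Hypothetical illustration of the per-rack redundancy policy described
// above; variant and field names are assumptions, not code from #1241.
#[derive(Clone, Copy, Debug)]
pub enum ServiceRedundancy {
    // Ensure this many instances of the service exist on each rack.
    PerRack(u32),
    // Possible future variants once multi-rack support exists:
    // PerAz(u32),
    // PerFleet(u32),
}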

So, backing up: what's the action item here? Should I be deferring the RSS handoff patches, and instead favor deleting all locally-stored notions of "Rack ID"?

Collaborator:

It seems like I'm missing something important about RSS's involvement in a multi-rack world, so let's table this for now.

I remain a little worried about this:

> To fully satisfy the one-to-many relationship of Nexuses to racks, we'd need to:

I think we're absolutely going to need to support multiple Nexuses in the v1 product, even with only one rack. Otherwise, the control plane can't survive a single failure. I don't really understand the pieces you said have to happen. Maybe we can discuss at some point.

Collaborator Author (@smklein, Jun 24, 2022):

Ah, to be explicit, in the current implementation:

  • We already support "multiple Nexuses running on one rack".
  • We do not support "one Nexus controlling multiple racks", because our mechanism for populating services is based on ensuring that enough services exist within the rack identified by rack_id - that is, this Nexus's own rack. This can be changed in the future.

When RSS initializes a new rack, it executes operations like "create a new CRDB instance, and format all storage - resulting in an empty cluster".

For a single-rack use-case, this works fine - if we're initializing the rack, there is no other fleet-wide metadata storage to consider.

For a multi-rack scenario, it isn't always appropriate to initialize CRDB "from scratch" here. If other racks exist, we wouldn't want to create partitioned CRDB clusters. As you referenced in RFD 61:

> [CRDB should be] Deployed as a single fleet (across multiple racks) with enough Instances in enough racks to satisfy availability and scale requirements. Sharding will be relevant here, but we know it’s important that a rack’s last known state is available in the database even when the rack itself is completely unavailable.

I assume we'd want some way of specifying: "here is the region, here is the existing fleet, if CRDB nodes already exist, join them instead of starting a totally new cluster".
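
A minimal sketch of the distinction being asked for (names invented for illustration; nothing like this exists in this PR):

use std::net::SocketAddrV6;

// Hypothetical: RSS would need to know whether it is bootstrapping the first
// CRDB cluster in a fleet or joining nodes that already exist on other racks.
pub enum CockroachInit {
    // First rack in an uninitialized fleet: format storage and bootstrap a
    // brand-new cluster.
    BootstrapNewCluster,
    // Fleet already initialized: join the existing nodes rather than creating
    // a second, partitioned cluster.
    JoinExisting { peers: Vec<SocketAddrV6> },
}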

Collaborator Author:

Filed #1276

/// Dropshot configuration for external API server
pub dropshot_external: ConfigDropshot,
/// Dropshot configuration for internal API server
8 changes: 8 additions & 0 deletions common/src/sql/dbinit.sql
@@ -75,6 +75,9 @@ CREATE TABLE omicron.public.sled (
time_deleted TIMESTAMPTZ,
rcgen INT NOT NULL,

/* FK into the Rack table */
rack_id UUID NOT NULL,

/* The IP address and bound port of the sled agent server. */
ip INET NOT NULL,
port INT4 CHECK (port BETWEEN 0 AND 65535) NOT NULL,
@@ -83,6 +86,11 @@
last_used_address INET NOT NULL
);

/* Add an index which lets us look up sleds on a rack */
CREATE INDEX ON omicron.public.sled (
rack_id
) WHERE time_deleted IS NULL;

/*
* Services
*/
1 change: 1 addition & 0 deletions nexus/examples/config.toml
@@ -36,6 +36,7 @@ address = "[::1]:8123"
[deployment]
# Identifier for this instance of Nexus
id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
rack_id = "c19a698f-c6f9-4a17-ae30-20d711b8f7dc"

[deployment.dropshot_external]
# IP address and TCP port on which to listen for the external API
10 changes: 10 additions & 0 deletions nexus/src/app/mod.rs
@@ -222,6 +222,16 @@ impl Nexus {
&self.opctx_external_authn
}

/// Returns an [`OpContext`] used for balancing services.
pub fn opctx_for_service_balancer(&self) -> OpContext {
OpContext::for_background(
self.log.new(o!("component" => "ServiceBalancer")),
Arc::clone(&self.authz),
authn::Context::internal_service_balancer(),
Arc::clone(&self.db_datastore),
)
}

/// Used as the body of a "stub" endpoint -- one that's currently
/// unimplemented but that we eventually intend to implement
///
2 changes: 1 addition & 1 deletion nexus/src/app/sled.rs
@@ -31,7 +31,7 @@ impl super::Nexus {
address: SocketAddrV6,
) -> Result<(), Error> {
info!(self.log, "registered sled agent"; "sled_uuid" => id.to_string());
let sled = db::model::Sled::new(id, address);
let sled = db::model::Sled::new(id, address, self.rack_id);
self.db_datastore.sled_upsert(sled).await?;
Ok(())
}
11 changes: 11 additions & 0 deletions nexus/src/authn/mod.rs
@@ -35,6 +35,7 @@ pub use crate::db::fixed_data::user_builtin::USER_EXTERNAL_AUTHN;
pub use crate::db::fixed_data::user_builtin::USER_INTERNAL_API;
pub use crate::db::fixed_data::user_builtin::USER_INTERNAL_READ;
pub use crate::db::fixed_data::user_builtin::USER_SAGA_RECOVERY;
pub use crate::db::fixed_data::user_builtin::USER_SERVICE_BALANCER;
use crate::db::model::ConsoleSession;

use crate::authz;
@@ -170,6 +171,11 @@ impl Context {
Context::context_for_builtin_user(USER_DB_INIT.id)
}

/// Returns an authenticated context for Nexus-driven service balancing.
pub fn internal_service_balancer() -> Context {
Context::context_for_builtin_user(USER_SERVICE_BALANCER.id)
}

fn context_for_builtin_user(user_builtin_id: Uuid) -> Context {
Context {
kind: Kind::Authenticated(Details {
@@ -217,6 +223,7 @@ mod test {
use super::USER_INTERNAL_API;
use super::USER_INTERNAL_READ;
use super::USER_SAGA_RECOVERY;
use super::USER_SERVICE_BALANCER;
use super::USER_TEST_PRIVILEGED;
use super::USER_TEST_UNPRIVILEGED;
use crate::db::fixed_data::user_builtin::USER_EXTERNAL_AUTHN;
@@ -251,6 +258,10 @@ mod test {
let actor = authn.actor().unwrap();
assert_eq!(actor.actor_id(), USER_DB_INIT.id);

let authn = Context::internal_service_balancer();
let actor = authn.actor().unwrap();
assert_eq!(actor.actor_id(), USER_SERVICE_BALANCER.id);

let authn = Context::internal_saga_recovery();
let actor = authn.actor().unwrap();
assert_eq!(actor.actor_id(), USER_SAGA_RECOVERY.id);
7 changes: 7 additions & 0 deletions nexus/src/config.rs
@@ -329,6 +329,7 @@ mod test {
max_vpc_ipv4_subnet_prefix = 27
[deployment]
id = "28b90dc4-c22a-65ba-f49a-f051fe01208f"
rack_id = "38b90dc4-c22a-65ba-f49a-f051fe01208f"
[deployment.dropshot_external]
bind_address = "10.1.2.3:4567"
request_body_max_bytes = 1024
@@ -348,6 +349,9 @@
Config {
deployment: DeploymentConfig {
id: "28b90dc4-c22a-65ba-f49a-f051fe01208f".parse().unwrap(),
rack_id: "38b90dc4-c22a-65ba-f49a-f051fe01208f"
.parse()
.unwrap(),
dropshot_external: ConfigDropshot {
bind_address: "10.1.2.3:4567"
.parse::<SocketAddr>()
@@ -407,6 +411,7 @@ mod test {
address = "[::1]:8123"
[deployment]
id = "28b90dc4-c22a-65ba-f49a-f051fe01208f"
rack_id = "38b90dc4-c22a-65ba-f49a-f051fe01208f"
[deployment.dropshot_external]
bind_address = "10.1.2.3:4567"
request_body_max_bytes = 1024
@@ -448,6 +453,7 @@ mod test {
address = "[::1]:8123"
[deployment]
id = "28b90dc4-c22a-65ba-f49a-f051fe01208f"
rack_id = "38b90dc4-c22a-65ba-f49a-f051fe01208f"
[deployment.dropshot_external]
bind_address = "10.1.2.3:4567"
request_body_max_bytes = 1024
@@ -503,6 +509,7 @@ mod test {
max_vpc_ipv4_subnet_prefix = 100
[deployment]
id = "28b90dc4-c22a-65ba-f49a-f051fe01208f"
rack_id = "38b90dc4-c22a-65ba-f49a-f051fe01208f"
[deployment.dropshot_external]
bind_address = "10.1.2.3:4567"
request_body_max_bytes = 1024
9 changes: 6 additions & 3 deletions nexus/src/db/datastore.rs
@@ -2990,6 +2990,7 @@ impl DataStore {
let builtin_users = [
// Note: "db_init" is also a builtin user, but that one by necessity
// is created with the database.
&*authn::USER_SERVICE_BALANCER,
&*authn::USER_INTERNAL_API,
&*authn::USER_INTERNAL_READ,
&*authn::USER_EXTERNAL_AUTHN,
@@ -4034,8 +4035,9 @@ mod test {
0,
0,
);
let rack_id = Uuid::new_v4();
let sled_id = Uuid::new_v4();
let sled = Sled::new(sled_id, bogus_addr.clone());
let sled = Sled::new(sled_id, bogus_addr.clone(), rack_id);
datastore.sled_upsert(sled).await.unwrap();
sled_id
}
@@ -4391,14 +4393,15 @@ mod test {
let opctx =
OpContext::for_tests(logctx.log.new(o!()), datastore.clone());

let rack_id = Uuid::new_v4();
let addr1 = "[fd00:1de::1]:12345".parse().unwrap();
let sled1_id = "0de4b299-e0b4-46f0-d528-85de81a7095f".parse().unwrap();
let sled1 = db::model::Sled::new(sled1_id, addr1);
let sled1 = db::model::Sled::new(sled1_id, addr1, rack_id);
datastore.sled_upsert(sled1).await.unwrap();

let addr2 = "[fd00:1df::1]:12345".parse().unwrap();
let sled2_id = "66285c18-0c79-43e0-e54f-95271f271314".parse().unwrap();
let sled2 = db::model::Sled::new(sled2_id, addr2);
let sled2 = db::model::Sled::new(sled2_id, addr2, rack_id);
datastore.sled_upsert(sled2).await.unwrap();

let ip = datastore.next_ipv6_address(&opctx, sled1_id).await.unwrap();
7 changes: 7 additions & 0 deletions nexus/src/db/fixed_data/role_assignment.rs
@@ -24,6 +24,13 @@ lazy_static! {
*FLEET_ID,
role_builtin::FLEET_ADMIN.role_name,
),
RoleAssignment::new(
Collaborator:

I wonder if we can limit privileges more than this...but I imagine it's not worth much of our time right now to pick this apart. Up to you.

Collaborator Author:

I think I'm gonna defer this while we're still sorting out the fundamental operations Nexus needs to take. Right now, everything seems to be lumped into the FLEET_ADMIN role, but I'm not really sure how to split that up without a clearer idea of the other "fleet-wide" ops.

IdentityType::UserBuiltin,
user_builtin::USER_SERVICE_BALANCER.id,
role_builtin::FLEET_ADMIN.resource_type,
*FLEET_ID,
role_builtin::FLEET_ADMIN.role_name,
),

// The "internal-read" user gets the "viewer" role on the sole
// Fleet. This will grant them the ability to read various control
11 changes: 11 additions & 0 deletions nexus/src/db/fixed_data/user_builtin.rs
@@ -39,6 +39,15 @@ lazy_static! {
"used for seeding initial database data",
);

/// Internal user for performing operations to manage the
/// provisioning of services across the fleet.
pub static ref USER_SERVICE_BALANCER: UserBuiltinConfig =
UserBuiltinConfig::new_static(
"001de000-05e4-4000-8000-00000000bac3",
"service-balancer",
"used for Nexus-driven service balancing",
);

/// Internal user used by Nexus when handling internal API requests
pub static ref USER_INTERNAL_API: UserBuiltinConfig =
UserBuiltinConfig::new_static(
@@ -82,9 +91,11 @@ mod test {
use super::USER_INTERNAL_API;
use super::USER_INTERNAL_READ;
use super::USER_SAGA_RECOVERY;
use super::USER_SERVICE_BALANCER;

#[test]
fn test_builtin_user_ids_are_valid() {
assert_valid_uuid(&USER_SERVICE_BALANCER.id);
assert_valid_uuid(&USER_DB_INIT.id);
assert_valid_uuid(&USER_INTERNAL_API.id);
assert_valid_uuid(&USER_EXTERNAL_AUTHN.id);
5 changes: 4 additions & 1 deletion nexus/src/db/model/sled.rs
@@ -21,6 +21,8 @@ pub struct Sled {
time_deleted: Option<DateTime<Utc>>,
rcgen: Generation,

pub rack_id: Uuid,

// ServiceAddress (Sled Agent).
pub ip: ipv6::Ipv6Addr,
pub port: SqlU16,
@@ -30,7 +32,7 @@
}

impl Sled {
pub fn new(id: Uuid, addr: SocketAddrV6) -> Self {
pub fn new(id: Uuid, addr: SocketAddrV6, rack_id: Uuid) -> Self {
let last_used_address = {
let mut segments = addr.ip().segments();
segments[7] += omicron_common::address::RSS_RESERVED_ADDRESSES;
@@ -40,6 +42,7 @@
identity: SledIdentity::new(id),
time_deleted: None,
rcgen: Generation::new(),
rack_id,
ip: ipv6::Ipv6Addr::from(addr.ip()),
port: addr.port().into(),
last_used_address,
1 change: 1 addition & 0 deletions nexus/src/db/schema.rs
@@ -297,6 +297,7 @@
time_deleted -> Nullable<Timestamptz>,
rcgen -> Int8,

rack_id -> Uuid,
ip -> Inet,
port -> Int4,
last_used_address -> Inet,
8 changes: 3 additions & 5 deletions nexus/src/lib.rs
@@ -36,7 +36,6 @@ use external_api::http_entrypoints::external_api;
use internal_api::http_entrypoints::internal_api;
use slog::Logger;
use std::sync::Arc;
use uuid::Uuid;

#[macro_use]
extern crate slog;
@@ -82,15 +81,15 @@ impl Server {
/// Start a nexus server.
pub async fn start(
config: &Config,
rack_id: Uuid,
log: &Logger,
) -> Result<Server, String> {
let log = log.new(o!("name" => config.deployment.id.to_string()));
info!(log, "setting up nexus server");

let ctxlog = log.new(o!("component" => "ServerContext"));

let apictx = ServerContext::new(rack_id, ctxlog, &config)?;
let apictx =
ServerContext::new(config.deployment.rack_id, ctxlog, &config)?;

let http_server_starter_external = dropshot::HttpServerStarter::new(
&config.deployment.dropshot_external,
@@ -167,8 +166,7 @@ pub async fn run_server(config: &Config) -> Result<(), String> {
} else {
debug!(log, "registered DTrace probes");
}
let rack_id = Uuid::new_v4();
let server = Server::start(config, rack_id, &log).await?;
let server = Server::start(config, &log).await?;
server.register_as_producer().await;
server.wait_for_finish().await
}
6 changes: 2 additions & 4 deletions nexus/test-utils/src/lib.rs
@@ -90,7 +90,6 @@ pub async fn test_setup_with_config(
config: &mut omicron_nexus::Config,
) -> ControlPlaneTestContext {
let logctx = LogContext::new(test_name, &config.pkg.log);
let rack_id = Uuid::parse_str(RACK_UUID).unwrap();
let log = &logctx.log;

// Start up CockroachDB.
@@ -104,9 +103,8 @@
nexus_config::Database::FromUrl { url: database.pg_config().clone() };
config.pkg.timeseries_db.address.set_port(clickhouse.port());

let server = omicron_nexus::Server::start(&config, rack_id, &logctx.log)
.await
.unwrap();
let server =
omicron_nexus::Server::start(&config, &logctx.log).await.unwrap();
server
.apictx
.nexus
1 change: 1 addition & 0 deletions nexus/tests/config.test.toml
@@ -39,6 +39,7 @@ max_vpc_ipv4_subnet_prefix = 29
# Identifier for this instance of Nexus.
# NOTE: The test suite always overrides this.
id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
rack_id = "c19a698f-c6f9-4a17-ae30-20d711b8f7dc"

#
# NOTE: for the test suite, the port MUST be 0 (in order to bind to any
3 changes: 3 additions & 0 deletions nexus/tests/integration_tests/users_builtin.rs
@@ -27,6 +27,9 @@ async fn test_users_builtin(cptestctx: &ControlPlaneTestContext) {

let u = users.remove(&authn::USER_DB_INIT.name.to_string()).unwrap();
assert_eq!(u.identity.id, authn::USER_DB_INIT.id);
let u =
users.remove(&authn::USER_SERVICE_BALANCER.name.to_string()).unwrap();
assert_eq!(u.identity.id, authn::USER_SERVICE_BALANCER.id);
let u = users.remove(&authn::USER_INTERNAL_API.name.to_string()).unwrap();
assert_eq!(u.identity.id, authn::USER_INTERNAL_API.id);
let u = users.remove(&authn::USER_INTERNAL_READ.name.to_string()).unwrap();
1 change: 1 addition & 0 deletions sled-agent/src/bootstrap/agent.rs
@@ -245,6 +245,7 @@ impl Agent {
&self.sled_config,
self.parent_log.clone(),
sled_address,
request.rack_id,
)
.await
.map_err(|e| {
7 changes: 7 additions & 0 deletions sled-agent/src/bootstrap/params.rs
@@ -8,13 +8,20 @@ use super::trust_quorum::ShareDistribution;
use omicron_common::address::{Ipv6Subnet, SLED_PREFIX};
use serde::{Deserialize, Serialize};
use std::borrow::Cow;
use uuid::Uuid;

/// Configuration information for launching a Sled Agent.
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct SledAgentRequest {
/// Uuid of the Sled Agent to be created.
pub id: Uuid,

/// Portion of the IP space to be managed by the Sled Agent.
pub subnet: Ipv6Subnet<SLED_PREFIX>,

/// Uuid of the rack to which this sled agent belongs.
pub rack_id: Uuid,

/// Share of the rack secret for this Sled Agent.
// TODO-cleanup This is currently optional because we don't do trust quorum
// shares for single-node deployments (i.e., most dev/test environments),
3 changes: 3 additions & 0 deletions sled-agent/src/rack_setup/service.rs
@@ -357,6 +357,7 @@ impl ServiceInner {
(request, (idx, bootstrap_addr))
});

let rack_id = Uuid::new_v4();
let allocations = requests_and_sleds.map(|(request, sled)| {
let (idx, bootstrap_addr) = sled;
info!(
@@ -373,7 +374,9 @@
bootstrap_addr,
SledAllocation {
initialization_request: SledAgentRequest {
id: Uuid::new_v4(),
subnet,
rack_id,
trust_quorum_share: maybe_rack_secret_shares
.as_mut()
.map(|shares_iter| {