-
Notifications
You must be signed in to change notification settings - Fork 1.5k
*: implement new bootstrap process #162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be based on master index.
modules/ignition/outputs.tf
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This leaves the bootstrap s3 bucket behind ??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This leaves the bootstrap s3 bucket behind ??
Maybe we can have the bootstrap node serve these directly? But even if that's useful, we may want to punt it to follow-up work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the bootstrap node and the S3 bucket are left behind. The follow-up work will take care of that.
Documentation/dev/libvirt-howto.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be worth mentioning in the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good to link the upstream PR once you fike one for your fixup commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if these have made it into a PR or not, but #214 took us back to the upstream repo. I'm not sure what the upstream changes were, maybe dmacvicar/terraform-provider-libvirt#376 and dmacvicar/terraform-provider-libvirt#382? Although we may be reverting #214 in #219.
abhinavdahiya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing work!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return works if and only if setting up WorkerIPs is the last thing this function does. That's true as of this commit, but its brittle (this commit breaks that assumption for MasterIPs above; wheb MasterIPs is set before this function, this function will erroneously ignore WorkerIPs). I'd rather use a pattern like:
if len(l.WorkerIPs) > 0 {
if len(l.WorkerIPs) != workerCount {
return fmt.Errorf("length of WorkerIPs doesn't match worker count")
}
} else {
// build `WorkerIPs
}Also, with this logic being used for maters and workers, it may be worth spinning out into a helper function. Something like:
func generateIPs(ips []string, count int, network SomeType, offset int) errorThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this logic is very broken. I didn't put much effort into it because we are going to throw this all away pretty soon.
installer/pkg/workflow/utils.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the rest of these constants seem to be listed in alphabetical order. Can we shift this up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may have been the last awscli consumer. Can you drop it from config.tf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
steps/bootstrap/aws/main.tf
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ${a == b ? false : true } could (I think) be written more compactly as ${a != b}.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, yeah. I copied that from elsewhere. There are a lot of strange boolean expressions in this codebase....
modules/ignition/outputs.tf
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This leaves the bootstrap s3 bucket behind ??
Maybe we can have the bootstrap node serve these directly? But even if that's useful, we may want to punt it to follow-up work.
steps/etcd/aws/etcd.tf
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why aren't we dropping steps/etcd completely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comes later. In order to drop it completely, I would have to break all of our tests. I'd rather get this merged, update the config in the release repo, and then drop etcd (I have the commit queued up).
Documentation/dev/libvirt-howto.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good to link the upstream PR once you fike one for your fixup commits.
steps/topology/aws/main.tf
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You can drop this line.
steps/topology/aws/main.tf
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: No need to add new "Tectonic" namespacing. Can this be just tnc_a?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TODO stale now that there are no more etcd nodes?
images/tectonic-builder/Dockerfile
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This awscli is distinct from the awscli image you're dropping from config.tf. I think we should drop the config.tf entry, since you're dropping its rm-assets.sh consumer. This entry was added in 084ff71 (coreos/tectonic-installer#980), and the motivation was supporting role changes in smoke tests (now over here). So I think we want to leave this line alone in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Sorry, I just get so excited when I can delete things.
tests/smoke/aws/README.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We can drop jq and remove the wrapping [] with:
$ aws ec2 describe-instances --query "Reservations[].Instances[] | [?Tags[? Key == 'Name' && Value == '${CLUSTER_NAME}-master-0']].PublicIpAddress" --output text
52.15.184.15There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, this is from #88.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this "do we already have some, and if so, do we have the right number?" check could move into generateIPs to save you some repetition here.
|
/hold |
|
/hold cancel |
|
/lgtm |
|
LGTM! Massive progress! |
|
/lgtm re-applying the label. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, crawford, yifan-gu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
New changes are detected. LGTM label has been removed. |
Instead of generating the worker IP addresses in Terraform, generate them alongside the master IP addresses and pass them in as a variable.
Instead of generating the etcd IP addresses in Terraform, generate them alongside the other IP addresses and pass them in as a variable.
Instead of shelling out to `virsh net-update` repeatedly, define all of
the DNS records upfront in the network config. This has two major
advantages:
1. This removes the race between network creation and network
modification.
2. This allows for RRDNS. There is a bug in `virsh net-update` that
prevents two hosts from sharing the same hostname even though the
network config allows it.
NOTE: This requires the use of a fork of terraform-provider-libvirt.
Once this new scheme has been validated, the changes will be
submitted upstream.
This will allow users to reboot their host and have the cluster come back up automatically.
This also removes the rm-assets service since the bootstrap node itself will be removed after the cluster is bootstrapped. Because rm-assets was the only service that was actually enabled, tectonic.service instead needs to be disabled. config.tf is not bumped here because there is no corresponding change in TNC. Even if the master config was served from TNC, there would be a race condition between the etcd cluster becoming healthy and the master requesting its config. Rather than attempting to address this race, I'm ignoring it since the etcd members are going to be moved to the master soon, which will remove the race.
|
New changes are detected. LGTM label has been removed. |
|
@crawford https://github.com/coreos-inc/tectonic-operators/pull/437, moved router from tectonic-ingress to openshift-ingress, was merged before your etcd change. You are bringing those changes too here in this PR. The router never runs as ingress operator is unable to install components as #149 is still pending. maybe try to cherry pick cdda678 commit and see if that installs router.. :/ |
Makes installer consistent with 4.0 direction.
Now that all of the records are created upfront in the topology step and the bootstrap node uses a static location for its config (no more CNAMEs), there is no need to explicitly manipulate the records later.
This puts the etcd members on the master nodes instead of the dedicated etcd nodes.
Instead of failing with an extremely opaque error message, this will report which pool was unrecognized. Users are likely to hit this now that etcd has been removed.
Now that the etcd members have been moved to the master nodes, there is no need for the etcd nodes. The installer still understands the etcd config even though it isn't used. This is to prevent the tests from breaking. This code will be removed in the near future.
506ca08 (*: move etcd members to the master nodes, 2018-08-16, openshift#162) converted tectonic_ignition_master to tectonic_ignition_masters (among many other things). But one of the purposes of the bootstrap node was to allow us to avoid special-casing individual master nodes. This commit moves us back towards identical masters. I'm not clear on the purpose of the etcd_index query, which dates back to 5ce1215 (add tf support for running etcd from tnc, 2018-03-27, coreos/tectonic-installer#3079). We'll see how the CI tests do without it.
This implements the new bootstrap process, which uses a dedicated bootstrap node instead of overloading the master nodes. This also moves the etcd members to the master nodes and removes the etcd nodes.
NOTE: This requires the use of a fork of terraform-provider-libvirt.
Once this new scheme has been validated, the changes will be
submitted upstream.