Skip to content

Conversation

@crawford
Copy link
Contributor

@crawford crawford commented Aug 22, 2018

This implements the new bootstrap process, which uses a dedicated bootstrap node instead of overloading the master nodes. This also moves the etcd members to the master nodes and removes the etcd nodes.

NOTE: This requires the use of a fork of terraform-provider-libvirt.
Once this new scheme has been validated, the changes will be
submitted upstream.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 22, 2018
@abhinavdahiya
Copy link
Contributor

/ok to test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be based on master index.

Copy link
Contributor

@abhinavdahiya abhinavdahiya Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leaves the bootstrap s3 bucket behind ??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leaves the bootstrap s3 bucket behind ??

Maybe we can have the bootstrap node serve these directly? But even if that's useful, we may want to punt it to follow-up work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the bootstrap node and the S3 bucket are left behind. The follow-up work will take care of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth mentioning in the PR description.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to link the upstream PR once you fike one for your fixup commits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if these have made it into a PR or not, but #214 took us back to the upstream repo. I'm not sure what the upstream changes were, maybe dmacvicar/terraform-provider-libvirt#376 and dmacvicar/terraform-provider-libvirt#382? Although we may be reverting #214 in #219.

Copy link
Contributor

@abhinavdahiya abhinavdahiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing work!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This return works if and only if setting up WorkerIPs is the last thing this function does. That's true as of this commit, but its brittle (this commit breaks that assumption for MasterIPs above; wheb MasterIPs is set before this function, this function will erroneously ignore WorkerIPs). I'd rather use a pattern like:

if len(l.WorkerIPs) > 0 {
	if len(l.WorkerIPs) != workerCount {
		return fmt.Errorf("length of WorkerIPs doesn't match worker count")
	}
} else {
	// build `WorkerIPs
}

Also, with this logic being used for maters and workers, it may be worth spinning out into a helper function. Something like:

func generateIPs(ips []string, count int, network SomeType, offset int) error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this logic is very broken. I didn't put much effort into it because we are going to throw this all away pretty soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the rest of these constants seem to be listed in alphabetical order. Can we shift this up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may have been the last awscli consumer. Can you drop it from config.tf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ${a == b ? false : true } could (I think) be written more compactly as ${a != b}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, yeah. I copied that from elsewhere. There are a lot of strange boolean expressions in this codebase....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leaves the bootstrap s3 bucket behind ??

Maybe we can have the bootstrap node serve these directly? But even if that's useful, we may want to punt it to follow-up work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we dropping steps/etcd completely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comes later. In order to drop it completely, I would have to break all of our tests. I'd rather get this merged, update the config in the release repo, and then drop etcd (I have the commit queued up).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to link the upstream PR once you fike one for your fixup commits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can drop this line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: No need to add new "Tectonic" namespacing. Can this be just tnc_a?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO stale now that there are no more etcd nodes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This awscli is distinct from the awscli image you're dropping from config.tf. I think we should drop the config.tf entry, since you're dropping its rm-assets.sh consumer. This entry was added in 084ff71 (coreos/tectonic-installer#980), and the motivation was supporting role changes in smoke tests (now over here). So I think we want to leave this line alone in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Sorry, I just get so excited when I can delete things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We can drop jq and remove the wrapping [] with:

$ aws ec2 describe-instances --query "Reservations[].Instances[] | [?Tags[? Key == 'Name' && Value == '${CLUSTER_NAME}-master-0']].PublicIpAddress" --output text
52.15.184.15

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, this is from #88.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this "do we already have some, and if so, do we have the right number?" check could move into generateIPs to save you some repetition here.

@crawford
Copy link
Contributor Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 23, 2018
@crawford
Copy link
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 23, 2018
@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2018
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2018
@yifan-gu
Copy link
Contributor

LGTM! Massive progress!

@yifan-gu
Copy link
Contributor

yifan-gu commented Aug 24, 2018

/lgtm re-applying the label.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford, yifan-gu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford,yifan-gu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2018
@openshift-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@crawford crawford added the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2018
Instead of generating the worker IP addresses in Terraform, generate
them alongside the master IP addresses and pass them in as a variable.
Instead of generating the etcd IP addresses in Terraform, generate them
alongside the other IP addresses and pass them in as a variable.
Instead of shelling out to `virsh net-update` repeatedly, define all of
the DNS records upfront in the network config. This has two major
advantages:

  1. This removes the race between network creation and network
     modification.
  2. This allows for RRDNS. There is a bug in `virsh net-update` that
     prevents two hosts from sharing the same hostname even though the
     network config allows it.

NOTE: This requires the use of a fork of terraform-provider-libvirt.
      Once this new scheme has been validated, the changes will be
      submitted upstream.
This will allow users to reboot their host and have the cluster come
back up automatically.
This also removes the rm-assets service since the bootstrap node itself
will be removed after the cluster is bootstrapped. Because rm-assets was
the only service that was actually enabled, tectonic.service instead
needs to be disabled.

config.tf is not bumped here because there is no corresponding change in
TNC. Even if the master config was served from TNC, there would be a
race condition between the etcd cluster becoming healthy and the master
requesting its config. Rather than attempting to address this race, I'm
ignoring it since the etcd members are going to be moved to the master
soon, which will remove the race.
@openshift-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2018
@abhinavdahiya
Copy link
Contributor

abhinavdahiya commented Aug 24, 2018

@crawford
e2e-aws uses this https://github.com/openshift/release/blob/master/ci-operator/templates/cluster-launch-installer-e2e.yaml#L214-L244 to decide when cluster is booted completely.

https://github.com/coreos-inc/tectonic-operators/pull/437, moved router from tectonic-ingress to openshift-ingress, was merged before your etcd change. You are bringing those changes too here in this PR.

The router never runs as ingress operator is unable to install components as #149 is still pending. maybe try to cherry pick cdda678 commit and see if that installs router.. :/

smarterclayton and others added 5 commits August 23, 2018 22:36
Makes installer consistent with 4.0 direction.
Now that all of the records are created upfront in the topology step and
the bootstrap node uses a static location for its config (no more
CNAMEs), there is no need to explicitly manipulate the records later.
This puts the etcd members on the master nodes instead of the dedicated
etcd nodes.
Instead of failing with an extremely opaque error message, this will
report which pool was unrecognized. Users are likely to hit this now
that etcd has been removed.
Now that the etcd members have been moved to the master nodes,
there is no need for the etcd nodes. The installer still understands the
etcd config even though it isn't used. This is to prevent the tests from
breaking. This code will be removed in the near future.
@crawford crawford added the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2018
@openshift-merge-robot openshift-merge-robot merged commit baedda4 into openshift:master Aug 24, 2018
@crawford crawford deleted the bootstrap branch August 24, 2018 06:07
wking added a commit to wking/openshift-installer that referenced this pull request Aug 27, 2018
506ca08 (*: move etcd members to the master nodes, 2018-08-16, openshift#162)
converted tectonic_ignition_master to tectonic_ignition_masters (among
many other things).  But one of the purposes of the bootstrap node was
to allow us to avoid special-casing individual master nodes.  This
commit moves us back towards identical masters.

I'm not clear on the purpose of the etcd_index query, which dates back
to 5ce1215 (add tf support for running etcd from tnc, 2018-03-27,
coreos/tectonic-installer#3079).  We'll see how the CI tests do
without it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants