Use the new cluster hosted ironic pod to deploy the BMO #635

Merged
hardys merged 2 commits into openshift-metal3:master from imain:ironic_baremetal_pod on Jul 15, 2019

Conversation

@imain (Contributor) commented Jun 20, 2019

Switch to using the new ironic + baremetal operator pod.

See metal3-io/baremetal-operator#212 for some discussion re the new yaml files added here.

Comment thread 08_deploy_bmo.sh Outdated
POD_NAME=$(oc --config ocp/auth/kubeconfig get pods -n openshift-machine-api | grep metal3-baremetal-operator | cut -f 1 -d ' ')

# Make sure our pod is running.
echo "Waiting for baremetal-operator pod to become ready" | lolcat

We'll have to remove the lolcat here, it was removed in ebd39f7#diff-6f654b7476bc32edaa73bbb71bf791de
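
For reference, a minimal sketch of what that wait could look like with the lolcat pipe dropped (the oc wait form here is an assumption; the actual script may poll in a loop instead):

POD_NAME=$(oc --config ocp/auth/kubeconfig get pods -n openshift-machine-api | grep metal3-baremetal-operator | cut -f 1 -d ' ')

# Make sure our pod is running (plain echo, no lolcat).
echo "Waiting for baremetal-operator pod to become ready"
oc --config ocp/auth/kubeconfig wait --for=condition=Ready "pod/${POD_NAME}" -n openshift-machine-api --timeout=300s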

@hardys commented Jul 1, 2019

Ok, this isn't working for me. When the baremetalhost objects get registered, it seems the masters aren't marked externally provisioned anymore, so they get reprovisioned, and I think this explains why you see the control plane go down at that point ;)

$ oc get baremetalhost --namespace openshift-machine-api
NAME                 STATUS   PROVISIONING STATUS   MACHINE                 BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       provisioning          ostest-master-0         ipmi://192.168.111.1:6230   unknown            true     
openshift-master-1   OK       provisioning          ostest-master-1         ipmi://192.168.111.1:6231   unknown            true     
openshift-master-2   OK       provisioning          ostest-master-2         ipmi://192.168.111.1:6232   unknown            true     
openshift-worker-0   OK       registering           ostest-worker-0-c5254   ipmi://192.168.111.1:6233            

@hardys commented Jul 1, 2019

Note I'm testing with https://github.com/hardys/dev-scripts/tree/pr635 which is rebased against master, and applies the related BMO PR

@hardys commented Jul 1, 2019

@dhellmann raised metal3-io/cluster-api-provider-baremetal#93 which may explain what I'm seeing, but I'm still not entirely clear why that would cause this only when running Ironic via the BMO and not in the case where we're using the virthost ironic via the proxy container?

@hardys commented Jul 1, 2019

Ah it's perhaps working because we maintain the state from the terraform driven master deployment, then the BMO sees that matching nodes already exist and are provisioned, so it doesn't re-register them and drive through the provisioning workflow?

@dhellmann (Member)

Right, when the operator is being backed by ironic on the provisioning host it sees different state information and behaves differently. metal3-io/cluster-api-provider-baremetal#94 should fix the actuator to make it stop adding the image to masters, which should prevent the reprovisioning behavior you're seeing.

@hardys commented Jul 2, 2019

> Right, when the operator is being backed by ironic on the provisioning host it sees different state information and behaves differently. metal3-io/cluster-api-provider-baremetal#94 should fix the actuator to make it stop adding the image to masters, which should prevent the reprovisioning behavior you're seeing.

Thanks @dhellmann - what's the best way to test that while we work on landing the fix then rebasing to get it into the openshift fork, then rebasing kni-installer to pick that up (ugh, so many steps to pick up a bugfix..)?

The docs mention killing the MAO then running the provider manually (which I did try, but ran into issues getting things to build). Is there some way we can test fixes like this in-place? I guess we'd maybe have to create a special release payload with a custom MAO referencing a test build? @stbenjam perhaps you have some ideas here too?
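
For context, a rough sketch of that "kill the MAO, run the provider manually" approach (the deployment and namespace names here are assumptions about a typical OpenShift cluster, not taken from this PR):

$ # Stop the CVO from restoring the MAO, then scale the MAO down
$ oc scale deployment cluster-version-operator --replicas=0 -n openshift-cluster-version
$ oc scale deployment machine-api-operator --replicas=0 -n openshift-machine-api
$ # ...then run a locally built cluster-api-provider-baremetal against the cluster kubeconfig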

@hardys commented Jul 8, 2019

Ok, after rebasing and applying both metal3-io/baremetal-operator#212 and #653, this works \o/

One thing I did notice is that the worker reboots several times. I need to check the logs, but it seems like we power it on when the node is registered, then off, then on for introspection, then off, then on for deployment, then off, then finally boot into RHCOS.

I was expecting us to use the fasttrack deployment and leave the node on after introspection, and it's also questionable whether we should power on a node that's online but not yet assigned any image?

Those issues are unrelated to this PR though, I'll dig into it and raise a BMO issue with more details.

@hardys commented Jul 8, 2019

$ oc get nodes
NAME       STATUS   ROLES    AGE    VERSION
master-0   Ready    master   36m    v1.14.0+04ae0f405
master-1   Ready    master   36m    v1.14.0+04ae0f405
master-2   Ready    master   36m    v1.14.0+04ae0f405
worker-0   Ready    master   3m2s   v1.14.0+04ae0f405

$ oc get baremetalhost --namespace openshift-machine-api
NAME                 STATUS   PROVISIONING STATUS      CONSUMER          BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ostest-master-0   ipmi://192.168.111.1:6230                      true     
openshift-master-1   OK       externally provisioned   ostest-master-1   ipmi://192.168.111.1:6231                      true     
openshift-master-2   OK       externally provisioned   ostest-master-2   ipmi://192.168.111.1:6232                      true     
openshift-worker-0   OK       provisioned              ostest-master-0   ipmi://192.168.111.1:6233   unknown            true     

@hardys commented Jul 8, 2019

So next steps here, I think, are to merge both metal3-io/baremetal-operator#212 and #653, then rebase this PR and address my comment re the lolcat.

I think we'll also need to update run_ci.sh so that we collect the logs for all the containers in the BMO, and consider how we can retrieve the IPA logs from the shared volume on the cluster for CI debugging.
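
Something along these lines could work in run_ci.sh (a sketch only; the pod and container discovery are assumptions), iterating over every container in the metal3 pod rather than hardcoding one:

BMO_POD=$(oc get pods -n openshift-machine-api -o name | grep metal3-baremetal-operator)
for container in $(oc get "$BMO_POD" -n openshift-machine-api -o jsonpath='{.spec.containers[*].name}'); do
  # One log file per container so nothing in the pod is missed
  oc logs "$BMO_POD" -n openshift-machine-api -c "$container" > "logs/${container}.log"
done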

@hardys commented Jul 8, 2019

The other thing to consider is how we test changes to the images in CI. E.g. we currently rebuild the ironic-image when testing an ironic-image PR, but this will always use the one that's in the openshift release payload (when the BMO is started via the MAO; currently the upstream quay repos).

@derekhiggins any thoughts on that?

@imain force-pushed the ironic_baremetal_pod branch 2 times, most recently from df52bba to 496ad6d on July 9, 2019 01:37
@hardys commented Jul 9, 2019

This is now failing for me like:

$ oc logs metal3-baremetal-operator-58b4c4d558-q28ll -c baremetal-operator --namespace openshift-machine-api
{"level":"error","ts":1562673586.4834816,"logger":"cmd","msg":"","error":"error listening on 127.0.0.1:8080: listen tcp 127.0.0.1:8080: bind: address already in use","stacktrace":"github.com/metal3-io/baremetal-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nmain.main\n\t/go/src/github.com/metal3-io/baremetal-operator/cmd/manager/main.go:73\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:198"}

@hardys commented Jul 9, 2019

metal3-io/baremetal-operator#253 should resolve the port conflict

@imain force-pushed the ironic_baremetal_pod branch 2 times, most recently from c55ed9d to 780b788 on July 11, 2019 15:25
@imain changed the title from "WIP: Use the new baremetal ironic pod to deploy the BMO" to "Use the new cluster hosted ironic pod to deploy the BMO" on Jul 11, 2019
@russellb added the "CI check this PR with CI" label on Jul 11, 2019
@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/881/

@metal3ci

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/884/

@imain (Contributor, Author) commented Jul 11, 2019

SHIP IT! ;)

@hardys left a comment

Sorry one small issue but otherwise lgtm!

Comment thread ironic_bmo_configmap.yaml Outdated
@derekhiggins (Collaborator)

lgtm, ironic and its components are running on the master node and were used to deploy the worker:

> [notstack@host11 dev-scripts]$ export OS_URL=http://172.22.0.3:6385/                                                                                                                                               
> [notstack@host11 dev-scripts]$ openstack baremetal node list
> +--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
> | UUID                                 | Name               | Instance UUID                        | Power State | Provisioning State | Maintenance |
> +--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
> | fd35eb67-28a5-4523-a896-140b00e361a8 | openshift-master-0 | None                                 | power on    | manageable         | False       |
> | 0e4c3346-8aba-4ca5-86a6-41c2e461b044 | openshift-master-1 | None                                 | power on    | manageable         | False       |
> | bbbca4d8-0878-42f2-970c-6c3815ec4b4d | openshift-master-2 | None                                 | power on    | manageable         | False       |
> | 8509eabd-f667-4461-8553-6dbaebd035e8 | openshift-worker-0 | 8509eabd-f667-4461-8553-6dbaebd035e8 | power on    | active             | False       |
> +--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+

@imain force-pushed the ironic_baremetal_pod branch from 6174b52 to 056aa41 on July 12, 2019 20:57
@metal3ci

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/887/

@dtantsur (Member)

We should really start using adoption, otherwise it's very confusing to see running masters as manageable

@hardys commented Jul 15, 2019

> We should really start using adoption, otherwise it's very confusing to see running masters as manageable

Yeah that is a known issue, tracked via metal3-io/baremetal-operator#185

@hardys merged commit 8d22002 into openshift-metal3:master on Jul 15, 2019