Use the new cluster-hosted ironic pod to deploy the BMO (#635)

hardys merged 2 commits into openshift-metal3:master
Conversation
    POD_NAME=$(oc --config ocp/auth/kubeconfig get pods -n openshift-machine-api | grep metal3-baremetal-operator | cut -f 1 -d ' ')

    # Make sure our pod is running.
    echo "Waiting for baremetal-operator pod to become ready" | lolcat
We'll have to remove the lolcat here; it was removed in ebd39f7#diff-6f654b7476bc32edaa73bbb71bf791de
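For reference, a minimal readiness check without the lolcat dependency could look like the sketch below, reusing POD_NAME from the snippet above (the 300s timeout is an arbitrary choice):

    # Block until the baremetal-operator pod reports Ready, instead of polling
    oc --config ocp/auth/kubeconfig wait --for=condition=Ready \
        "pod/${POD_NAME}" -n openshift-machine-api --timeout=300s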
Ok, this isn't working for me: when the baremetalhost objects get registered, the masters no longer seem to be marked externally provisioned, so they get reprovisioned, and I think this explains why you see the controlplane go down at that point ;)
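For anyone debugging the same thing, here's a sketch of how to check the flag and re-apply it by hand; the host name openshift-master-0 is illustrative:

    # Show which hosts currently have spec.externallyProvisioned set
    oc --config ocp/auth/kubeconfig get baremetalhosts -n openshift-machine-api \
        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.externallyProvisioned}{"\n"}{end}'

    # Re-mark a master as externally provisioned so the BMO leaves it alone
    oc --config ocp/auth/kubeconfig patch baremetalhost openshift-master-0 \
        -n openshift-machine-api --type merge \
        -p '{"spec":{"externallyProvisioned":true}}'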
Note I'm testing with https://github.com/hardys/dev-scripts/tree/pr635, which is rebased against master and applies the related BMO PR
@dhellmann raised metal3-io/cluster-api-provider-baremetal#93, which may explain what I'm seeing, but I'm still not entirely clear why that would cause this only when running Ironic via the BMO and not in the case where we're using the virthost ironic via the proxy container?
Ah, it's perhaps working in that case because we maintain the state from the terraform-driven master deployment; the BMO then sees that matching nodes already exist and are provisioned, so it doesn't re-register them and drive them through the provisioning workflow?
Right, when the operator is backed by ironic on the provisioning host it sees different state information and behaves differently. metal3-io/cluster-api-provider-baremetal#94 should fix the actuator to make it stop adding the image to masters, which should prevent the reprovisioning behavior you're seeing.
Thanks @dhellmann. What's the best way to test that while we work on landing the fix, then rebasing to get it into the openshift fork, then rebasing kni-installer to pick that up (ugh, so many steps to pick up a bugfix)? The docs mention killing the MAO then running the provider manually (which I did try, but ran into issues getting things to build). Is there some way we can test fixes like this in-place? I guess we'd maybe have to create a special release payload with a custom MAO referencing a test build? @stbenjam perhaps you have some ideas here too?
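In the meantime, one rough way to test a provider fix in place, sketched under the assumption that KUBECONFIG points at ocp/auth/kubeconfig; the deployment, container, and image names below are illustrative, not confirmed:

    # Stop the CVO and the MAO so they don't revert the change
    oc scale deployment cluster-version-operator \
        -n openshift-cluster-version --replicas=0
    oc scale deployment machine-api-operator \
        -n openshift-machine-api --replicas=0

    # Swap in a test build of the provider (deployment/container/image names
    # are hypothetical)
    oc set image deployment/metal3-baremetal-operator \
        -n openshift-machine-api \
        machine-controller-manager=quay.io/example/cluster-api-provider-baremetal:fix-93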
Ok, after rebasing and applying both metal3-io/baremetal-operator#212 and #653, this works \o/ One thing I did notice is that the worker reboots several times. I need to check the logs, but it seems we power it on when the node is registered, then off, then on for introspection, then off, then on for deployment, then off, before it finally boots into RHCOS. I was expecting us to use the fasttrack deployment and leave the node on after introspection, and it's also questionable whether we should power on a node that's online but not yet assigned any image. Those issues are unrelated to this PR though; I'll dig into it and raise a BMO issue with more details.
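For reference, the power cycling is easy to observe by watching the host objects while the worker deploys:

    # Watch hosts move through registering/inspecting/provisioning, including
    # the online/power status, while the worker deploys
    oc --config ocp/auth/kubeconfig get baremetalhosts -n openshift-machine-api -w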
So next steps here I think are to merge both metal3-io/baremetal-operator#212 and #653, then rebase this PR and address my comment re lolcat. I think we'll also need to update run_ci.sh so that we collect the logs for all the containers in the BMO pod (sketch below), and consider how we can retrieve the IPA logs from the shared volume on the cluster for CI debugging.
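Something like this could work for the run_ci.sh change; a sketch only, and the logs/ output directory is illustrative:

    # Grab logs from every container in the BMO pod, not just the first one
    mkdir -p logs
    POD_NAME=$(oc --config ocp/auth/kubeconfig get pods -n openshift-machine-api \
        -o name | grep metal3-baremetal-operator)
    for c in $(oc --config ocp/auth/kubeconfig get "${POD_NAME}" \
        -n openshift-machine-api -o jsonpath='{.spec.containers[*].name}'); do
      oc --config ocp/auth/kubeconfig logs "${POD_NAME}" \
          -n openshift-machine-api -c "${c}" > "logs/${c}.log"
    done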
The other thing to consider is how we test changes to the images in CI. E.g. we currently rebuild the ironic-image when testing an ironic-image PR, but this will always use the one that's in the openshift release payload (when the BMO is started via the MAO; currently the upstream quay repos). @derekhiggins any thoughts on that?
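One option might be cutting a throwaway release payload that substitutes the locally built image; a sketch, where the registry names and the ironic payload tag are assumptions:

    # Build a test release payload with a substituted ironic image
    oc adm release new \
        --from-release=registry.example.com/ocp/release:4.2.0 \
        --to-image=registry.example.com/ocp/release:4.2.0-ironic-test \
        ironic=quay.io/example/ironic:pr-build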
Force-pushed df52bba to 496ad6d
This is now failing for me like:
metal3-io/baremetal-operator#253 should resolve the port conflict |
Force-pushed c55ed9d to 780b788
Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/881/
Force-pushed 780b788 to 6174b52
Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/884/
SHIP IT! ;)
hardys left a comment:
Sorry, one small issue but otherwise lgtm!
lgtm, ironic and components running on the master node and used to deploy the worker
Force-pushed 6174b52 to 056aa41
Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/887/
We should really start using adoption, otherwise it's very confusing to see running masters as |
Yeah that is a known issue, tracked via metal3-io/baremetal-operator#185 |
Switch to using the new ironic + baremetal operator pod.
See metal3-io/baremetal-operator#212 for some discussion re the new yaml files added here.