Skip to content

Conversation

@sadasu
Copy link
Contributor

@sadasu sadasu commented Oct 12, 2021

When the Provisioning Network is not being used (ProvisioningNetwork=Disabled or ProvisioningNetwork=Enabled and VirtualMediaViaExternalNetwork = true), the image customization controller deployment uses the host IP of the master on which metal3 deployment is running as the IP address of Ironic.
When the metal3 deployment moves to a different master, we need to make sure that the metal3-image-customization deployment uses this new host IP as Ironic's IP address.

CBO already watches for changes in the Deployment() resource and when the metal3 deployment moves, a reconcile loop would be triggered, Ironic's IP will once again be evaluated and the metal3-image-customization deployment will be updated with this new IP.

Co-Authored-By: Zane Bitter [email protected]

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from markmc and stbenjam October 12, 2021 22:02
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2021
container := corev1.Container{
Name: "image-customization-controller",
Image: images.ImageCustomizationController,
Command: []string{"/image-customization-controller"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently passing in the default as the images-bind-addr. How does the user get to override this value? As an input parameter to CBO (can be set once) or as a config in Provisioning CR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

images-publish-addr should be set to the URL of the Service that points at this. That's what appears in the URLs that the controller puts into the PreprovisioningImage Status.
images-bind-addr can be left at the default, as long as the Service points at the default port.

@sadasu sadasu force-pushed the start-image-customization-controller branch 2 times, most recently from 669bd25 to d6e92e5 Compare November 29, 2021 17:32
container := corev1.Container{
Name: "image-customization-controller",
Image: images.ImageCustomizationController,
Command: []string{"/image-customization-controller"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

images-publish-addr should be set to the URL of the Service that points at this. That's what appears in the URLs that the controller puts into the PreprovisioningImage Status.
images-bind-addr can be left at the default, as long as the Service points at the default port.

}
} else {
hostIP, err := getPodHostIP(info.Client.CoreV1(), info.Namespace)
if err == nil && hostIP != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we let this error bubble up so that we will retry, instead of just setting a blank URL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry will automatically happen during the next reconcile loop. We might get an error when the metal3 pod is not available temporarily. So, my understanding is that retrying immediately is not going to help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get err != nil it's because we failed to do Pods.List(), so that could just be a dropped network connection. No reason to think we'll be triggered to reconcile again.
For the empty hostIP, I agree we could wait for the reconcile. I do wonder whether it's worth the expense of doing a deploy that will definitely fail (this env var is required). I think I will change the controller to make it optional and report the errors in the CRD, rather than having the controller refuse to start and all the CRDs frozen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still to be resolved.
openshift/image-customization-controller#28 is the change to improve the error reporting, so we can continue to handle hostIP == "" like this, but we need to return err in the case where err != nil. It might be easier to pass the ironic IP in to createImageCustomizationContainer() instead of calling out from inside it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made updates. Hopefully this is what you are looking for.

@zaneb
Copy link
Member

zaneb commented Dec 3, 2021

Forgot to mention, we also need to add the -build-preprov-image option to the baremetal-operator command line (to enable the preprovisioning image feature), and remove the deployRamdiskUrl environment variable from it (to stop it from starting its own internal controller).

},
buildSSHKeyEnvVar(info.SSHKey),
pullSecret,
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in openshift/installer#5425 (comment), it looks like we'll need to mount the file /etc/containers/registries.conf (could be as a (read-only) hostPath volume) somewhere in the container, and set the env var REGISTRIES_CONF_PATH to point to it (added in openshift/image-customization-controller#20).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope what I have now is what you are looking for.

@sadasu sadasu force-pushed the start-image-customization-controller branch from 719994f to 95c6940 Compare December 8, 2021 16:31
@sadasu
Copy link
Contributor Author

sadasu commented Dec 8, 2021

Forgot to mention, we also need to add the -build-preprov-image option to the baremetal-operator command line (to enable the preprovisioning image feature), and remove the deployRamdiskUrl environment variable from it (to stop it from starting its own internal controller).

Commit 95c6940 should take care of this.

@sadasu sadasu force-pushed the start-image-customization-controller branch 3 times, most recently from aa36a1d to b08c2b1 Compare December 8, 2021 22:22
Copy link
Member

@zaneb zaneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove createInitContainerConfigureCoreOSIPA() now, as that is just adding a similar Ignition file (with no network config) as is built by the image builder. Also we have to get rid of it so that openshift/ironic-image#243 can have an effect.

}
} else {
hostIP, err := getPodHostIP(info.Client.CoreV1(), info.Namespace)
if err == nil && hostIP != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get err != nil it's because we failed to do Pods.List(), so that could just be a dropped network connection. No reason to think we'll be triggered to reconcile again.
For the empty hostIP, I agree we could wait for the reconcile. I do wonder whether it's worth the expense of doing a deploy that will definitely fail (this env var is required). I think I will change the controller to make it optional and report the errors in the CRD, rather than having the controller refuse to start and all the CRDs frozen.

@sadasu
Copy link
Contributor Author

sadasu commented Dec 10, 2021

/retest

@zaneb
Copy link
Member

zaneb commented Dec 11, 2021

The CRD fix merged but we are still waiting on a CI build.

@sadasu
Copy link
Contributor Author

sadasu commented Dec 12, 2021

/retest

@sadasu sadasu force-pushed the start-image-customization-controller branch from e9937a2 to fa099bf Compare December 13, 2021 15:04
@sadasu sadasu force-pushed the start-image-customization-controller branch from 0f9b2a3 to d4d43d7 Compare December 16, 2021 13:57
@sadasu sadasu changed the title WIP: Start image customization controller Start image customization controller Dec 16, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2021
fmt.Sprintf("http://%s.%s.svc.cluster.local/",
imageCustomizationService, info.Namespace)},
SecurityContext: &corev1.SecurityContext{
Privileged: pointer.BoolPtr(true),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was worth a comment I thought, because we will eventually want to get rid of it when we give i-c-c its own volume.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
} else {
hostIP, err := getPodHostIP(info.Client.CoreV1(), info.Namespace)
if err == nil && hostIP != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still to be resolved.
openshift/image-customization-controller#28 is the change to improve the error reporting, so we can continue to handle hostIP == "" like this, but we need to return err in the case where err != nil. It might be easier to pass the ironic IP in to createImageCustomizationContainer() instead of calling out from inside it.

The machine-os-images container supplies the latest ISO and PXE images
for pre-provisioning from within the release payload. Use this instead
of downloading preprovisioning images or using the RHEL IPA. Extract the
images to the same location (and with the same names) in the image cache
volume that we were using previously for IPA images. This change is
transparent to the tasks using the images.

Continue to run the second-level image cache service and download the
Provisioning OS QCOW (if specified). Existing clusters installed prior
to 4.10 will contain MachineSets that still rely on the QCOW for
provisioning new Machines.
@sadasu sadasu force-pushed the start-image-customization-controller branch from d4d43d7 to d0a0439 Compare December 16, 2021 17:44
},
{
Name: ironicBaseUrl,
Value: "https://" + hostForIP(ironicIP),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we said we were going to pass an empty string if the metal3 host IP couldn't be found?

}
}

func hostForIP(ipAddr string) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is now only being called once, and we probably want some logic to skip adding the "http://" if it's empty, maybe we should just inline this logic in createImageCustomizationContainer.

@sadasu sadasu force-pushed the start-image-customization-controller branch from d0a0439 to 39a83e3 Compare December 16, 2021 20:12
@zaneb
Copy link
Member

zaneb commented Dec 16, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2021
Make sure to report any errors that occur while determining Ironic's IP
address that needs to be passed to the image-customization-controller.
@sadasu sadasu force-pushed the start-image-customization-controller branch from 39a83e3 to 85621b1 Compare December 16, 2021 20:40
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2021
@zaneb
Copy link
Member

zaneb commented Dec 16, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2021
@zaneb
Copy link
Member

zaneb commented Dec 16, 2021

/retest-required

@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@zaneb
Copy link
Member

zaneb commented Dec 17, 2021

Tests are blocked by openshift/cluster-cloud-controller-manager-operator#156 - this should be able to merge once that fix is available.

@zaneb
Copy link
Member

zaneb commented Dec 17, 2021

/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-serial-ipv4

@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants