agent: introduce openshift-appliance flow #1547
Conversation
Hi @danielerez. Thanks for your PR. I'm waiting for an openshift-metal3 member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from f396b5c to afea0fb
/cc @rwsu
Force-pushed from afea0fb to 52bd20e
/ok-to-test
@danielerez can you please share the min config to try it out? I'm using: ... but got a failure when running the ...
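For reference, a minimal config along these lines should be enough to exercise the flow (a sketch only; the actual values used in this comment are not preserved, and everything except AGENT_E2E_TEST_BOOT_MODE simply follows the usual dev-scripts config_$USER.sh conventions and is an assumption):

# Hypothetical minimal config_$USER.sh for trying the appliance flow.
# Only AGENT_E2E_TEST_BOOT_MODE is confirmed by this PR; CI_TOKEN is the
# standard dev-scripts setting and is assumed here.
export CI_TOKEN='<your CI token>'
export AGENT_E2E_TEST_BOOT_MODE=DISKIMAGE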
common.sh (Outdated)
  if [[ ! -z ${AGENT_E2E_TEST_BOOT_MODE} ]]; then
-   if [[ $AGENT_E2E_TEST_BOOT_MODE != "ISO" && $AGENT_E2E_TEST_BOOT_MODE != "PXE" ]]; then
+   if [[ $AGENT_E2E_TEST_BOOT_MODE != "ISO" && $AGENT_E2E_TEST_BOOT_MODE != "PXE" && $AGENT_E2E_TEST_BOOT_MODE != "DISKIMAGE" ]]; then
      printf "Found invalid value \"$AGENT_E2E_TEST_BOOT_MODE\" for AGENT_E2E_TEST_BOOT_MODE. Supported values: ISO (default), PXE."
Suggested change:
-     printf "Found invalid value \"$AGENT_E2E_TEST_BOOT_MODE\" for AGENT_E2E_TEST_BOOT_MODE. Supported values: ISO (default), PXE."
+     printf "Found invalid value \"$AGENT_E2E_TEST_BOOT_MODE\" for AGENT_E2E_TEST_BOOT_MODE. Supported values: ISO (default), PXE, DISKIMAGE."
done
agent/06_agent_create_cluster.sh (Outdated)
  # This is required to allow qemu opening the disk image
  if [ "${OPENSHIFT_CI}" == true ]; then
    setfacl -m u:qemu:rx /root
  fi
There is a certain amount of duplicated code with attach_agent_iso(); it'd be nice to remove it where possible, i.e. like this block.
Yeah, it could be pretty cumbersome to add this logic to the 'attach_agent_iso' func; it seems to have substantially different behaviour from the appliance flow. But I'll indeed try to extract the duplicated code where possible.
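For instance, the OpenShift-CI ACL tweak used in both paths could be extracted into a small shared helper (a rough sketch; the helper name is an assumption and not part of this PR):

# Hypothetical helper shared by attach_agent_iso() and the appliance attach path.
function allow_qemu_to_read_root_home() {
    # In OpenShift CI the generated images live under /root, which qemu cannot
    # traverse by default.
    if [ "${OPENSHIFT_CI}" == true ]; then
        setfacl -m u:qemu:rx /root
    fi
}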
agent/06_agent_create_cluster.sh (Outdated)
  # Attach the appliance disk image and the config ISO
  sudo virt-xml ${name} --remove-device --disk 1
  sudo virt-xml ${name} --add-device --disk "${disk_image}",device=disk,target.dev=sda
  sudo virt-xml ${name} --add-device --disk "${config_image_dir}/agentconfig.noarch.iso",device=cdrom,target.dev=${config_image_drive}
additional spaces?
Probably.. fixed.
  # Attach the appliance disk image and the config ISO
  sudo virt-xml ${name} --remove-device --disk 1
  sudo virt-xml ${name} --add-device --disk "${disk_image}",device=disk,target.dev=sda
q: what's the reason to remove and re-add the disk?
We don't need the default disk attached to the machine, as we're using our own generated disk image, i.e. we need to replace the existing disk with our own.
  function attach_appliance_diskimage() {
It looks like this function populates $OCP_DIR with several artifacts, but make clean is broken because the cache and temp folders are created under the root user. Please add the required cleaning steps so that it's possible to remove everything when running the clean target.
Right, good catch, will add.
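Something along these lines should cover it (a sketch only; the exact directory names produced by the appliance container are assumptions based on the comment above and the logs later in this thread):

# Hypothetical cleanup hook for the clean target. The appliance container runs
# privileged, so its work directories end up root-owned and need sudo to remove.
function agent_appliance_cleanup() {
    sudo rm -rf "${OCP_DIR}/cache" "${OCP_DIR}/temp"   # assumed appliance cache/temp dirs
    sudo rm -f "${OCP_DIR}"/appliance.raw*             # base disk image and per-node clones
    sudo rm -rf "${OCP_DIR}/configimage"               # config ISO staging directory
}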
  ocpRelease:
    version: {{ version }}
    channel: candidate
    cpuArchitecture: x86_64
Does the appliance support arm64? A recent PR brought in support for arm64: #1518
Good idea. We haven't actually tested other architectures yet, but the appliance should support it. We can make it configurable later on then.
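One possible shape for making it configurable later (purely a sketch; the variable name is an assumption, not something this PR adds):

# Hypothetical knob for the appliance CPU architecture, defaulting to x86_64.
export AGENT_APPLIANCE_CPU_ARCH=${AGENT_APPLIANCE_CPU_ARCH:-x86_64}
# The appliance-config template would then render this value instead of the
# hard-coded x86_64, e.g.:  cpuArchitecture: {{ appliance_cpu_arch }}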
    version: {{ version }}
    channel: candidate
    cpuArchitecture: x86_64
    diskSizeGB: 200
Can the diskSizeGB be set to a smaller size? 200G per VM fills up the disk on my local test machine. What would be a recommended minimum size?
It's actually only the virtual size of the disk, so it shouldn't consume that much disk space. The actual size is ~40GiB.
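For example, the sparse image can be checked on the host (illustration only, not part of the PR):

# "virtual size" reflects diskSizeGB, while "disk size" is what the sparse raw
# image actually consumes on the host.
qemu-img info ocp/ostest/appliance.raw
# Same comparison with du: apparent size vs. allocated blocks.
du -h --apparent-size ocp/ostest/appliance.raw
du -h ocp/ostest/appliance.raw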
Force-pushed from 52bd20e to f0fe8f4
Looks like the unconfigured-ignition API is missing from the openshift-install binary.

For builds after 7/12 you should be able to get a valid unconfigured ignition and a config image built, for example using these values:
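(The exact values from this comment were not preserved; a hypothetical example of what such a config could look like:)

# Hypothetical values only -- any 4.14 payload built after 2023-07-12 should
# contain the unconfigured-ignition support mentioned above; the pullspec below
# is a placeholder, not a real tag.
export OPENSHIFT_RELEASE_IMAGE="registry.ci.openshift.org/ocp/release:<4.14 payload built after 2023-07-12>"
export AGENT_E2E_TEST_BOOT_MODE=DISKIMAGE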
agent/roles/manifests/vars/main.yml (Outdated)
  agent_nodes_hostnames: "{{ lookup('env', 'AGENT_NODES_HOSTNAMES_STR') }}"
  agent_use_ztp_manifests: "{{ lookup('env', 'AGENT_USE_ZTP_MANIFESTS') }}"
  agent_test_cases: "{{ lookup('env', 'AGENT_TEST_CASES') }}"
  agent_use_appliance_manifest: "{{ lookup('env', 'AGENT_USE_APPLIANCE_MANIFEST') }}"
Can AGENT_E2E_TEST_BOOT_MODE be used here instead of adding an additional AGENT_USE_APPLIANCE_MANIFEST export? If it's really necessary to add an additional export, it will need to be added to config_example.sh.
Actually I just wanted to be explicit here, to keep the convention of the other manifests (e.g. AGENT_USE_ZTP_MANIFESTS). We can indeed use AGENT_E2E_TEST_BOOT_MODE if you think it'd be clearer(?)
I think you can use AGENT_E2E_TEST_BOOT_MODE to reduce the number of exports.
Sure, I'll add it to vars/main.yml then.
  ipxeBaseURL: {{ pxe_server_url }}
  {% endif %}
- {% if networking_mode != "dhcp" %}
+ {% if networking_mode != "DHCP" %}
Should be able to rebase and pick up this fix.
Force-pushed from 9a2bd28 to 2c68bfa
- if [[ $AGENT_E2E_TEST_BOOT_MODE != "ISO" && $AGENT_E2E_TEST_BOOT_MODE != "PXE" ]]; then
-   printf "Found invalid value \"$AGENT_E2E_TEST_BOOT_MODE\" for AGENT_E2E_TEST_BOOT_MODE. Supported values: ISO (default), PXE."
+ if [[ $AGENT_E2E_TEST_BOOT_MODE != "ISO" && $AGENT_E2E_TEST_BOOT_MODE != "PXE" && $AGENT_E2E_TEST_BOOT_MODE != "DISKIMAGE" ]]; then
+   printf "Found invalid value \"$AGENT_E2E_TEST_BOOT_MODE\" for AGENT_E2E_TEST_BOOT_MODE. Supported values: ISO (default), PXE, DISKIMAGE."
nit: I feel a case statement may be a little more readable with a longer list, i.e. something like
case "$AGENT_E2E_TEST_BOOT_MODE" in
"ISO" | "PXE" | "DISKIMAGE")
# Valid value
;;
*)
printf "Found invalid value \"$AGENT_E2E_TEST_BOOT_MODE\" for AGENT_E2E_TEST_BOOT_MODE. Supported values: ISO (default), PXE, DISKIMAGE."
exit 1
;;
esac
Good idea:) Added.
Force-pushed from 2c68bfa to fcdbb8a
/lgtm
/retest-required
Hi @danielerez, I'm having trouble testing this PR. It is not clear to me how to set the release image version.
When I have export OPENSHIFT_RELEASE_IMAGE=quay.io/openshift-release-dev/ocp-release:4.14.0-ec.3-x86_64 in config_x.sh I see:
Jul 26 21:44:38 master-0 load-config-iso.sh[5429]: < releaseImage: quay.io/openshift-release-dev/ocp-release@sha256:3c050cb52fdd3e65c518d4999d238ec026ef724503f275377fee6bf0d33093ab
Jul 26 21:44:38 master-0 load-config-iso.sh[5429]: ---
Jul 26 21:44:38 master-0 load-config-iso.sh[5429]: > releaseImage: quay.io/openshift-release-dev/ocp-release:4.14.0-ec.3-x86_64
Jul 26 21:44:38 master-0 load-config-iso.sh[1535]: The cluster-image-set in archive does not match current release quay.io/openshift-release-dev/ocp-release@sha256:3c050cb52fdd3e65c518d4999d238ec026ef724503f275377fee6bf0d33093ab
Not setting any release image version yields similar results.
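One way to see where the mismatch comes from (an illustration, not something this PR adds): the appliance pins the release by digest, while the config image carries the tag-based pullspec from the environment, so resolving the tag to its digest shows whether the two refer to the same payload.

# Resolve the tag-based pullspec to its digest; if this differs from the
# releaseImage embedded in the appliance, the config ISO is rejected with the
# "cluster-image-set in archive does not match" error shown above.
oc adm release info quay.io/openshift-release-dev/ocp-release:4.14.0-ec.3-x86_64 -o json | jq -r '.digest'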
agent/06_agent_create_cluster.sh (Outdated)
  for (( n=0; n<${2}; n++ ))
  do
    name=${CLUSTER_NAME}_${1}_${n}
    disk_image=${appliance_disk_image}_${n}
Suggested change:
-    disk_image=${appliance_disk_image}_${n}
+    disk_image=${appliance_disk_image}_${1}_${n}
When there are workers defined in the cluster, there is an issue where master-0's disk image name and worker-0's disk image name are the same, causing a write lock error:
+(./agent/06_agent_create_cluster.sh:73): create_config_image(): cp -r ocp/ostest/configimage/auth ocp/ostest
+(./agent/06_agent_create_cluster.sh:130): attach_appliance_diskimage(): (( n=0 ))
+(./agent/06_agent_create_cluster.sh:130): attach_appliance_diskimage(): (( n<2 ))
+(./agent/06_agent_create_cluster.sh:132): attach_appliance_diskimage(): name=ostest_worker_0
+(./agent/06_agent_create_cluster.sh:133): attach_appliance_diskimage(): disk_image=ocp/ostest/appliance.raw_0
+(./agent/06_agent_create_cluster.sh:136): attach_appliance_diskimage(): sudo cp ocp/ostest/appliance.raw ocp/ostest/appliance.raw_0
+(./agent/06_agent_create_cluster.sh:139): attach_appliance_diskimage(): sudo virt-xml ostest_worker_0 --remove-device --disk 1
Domain 'ostest_worker_0' defined successfully.
+(./agent/06_agent_create_cluster.sh:140): attach_appliance_diskimage(): sudo virt-xml ostest_worker_0 --add-device --disk ocp/ostest/appliance.raw_0,device=disk,target.dev=sda
Domain 'ostest_worker_0' defined successfully.
+(./agent/06_agent_create_cluster.sh:141): attach_appliance_diskimage(): sudo virt-xml ostest_worker_0 --add-device --disk ocp/ostest/configimage/agentconfig.noarch.iso,device=cdrom,target.dev=sdd
Domain 'ostest_worker_0' defined successfully.
+(./agent/06_agent_create_cluster.sh:144): attach_appliance_diskimage(): sudo virt-xml ostest_worker_0 --edit target=sda --disk=boot_order=1 --start
ERROR Failed starting domain 'ostest_worker_0': internal error: qemu unexpectedly closed the monitor: 2023-07-26T21:10:15.312697Z qemu-kvm: -device ide-hd,bus=ide.0,drive=libvirt-2-format,id=sata0-0-0,bootindex=1: Failed to get "write" lock
Is another process using the image [/home/rwsu/go/src/github.com/openshift-metal3/dev-scripts/ocp/ostest/appliance.raw_0]?
Domain 'ostest_worker_0' defined successfully.
I think the disk_image name should include the "master" or "worker" substring.
Oh right, missed it, good catch. Fixed.
This issue was actually fixed about a week ago. You might have a stale image of the appliance. I've just used the following configuration for running the flow:
Force-pushed from f60e7ff to 21ab07c
agent/06_agent_create_cluster.sh (Outdated)
  # Build appliance with `debug-base-ignition` flag for using the custom openshift-install
  # binary from assets directory.
  sudo podman run -it --rm --privileged --net=host -v ${asset_dir}:/assets:Z ${APPLIANCE_IMAGE} build --debug-base-ignition
Suggested change:
- sudo podman run -it --rm --privileged --net=host -v ${asset_dir}:/assets:Z ${APPLIANCE_IMAGE} build --debug-base-ignition
+ sudo podman run -it --rm --pull newer --privileged --net=host -v ${asset_dir}:/assets:Z ${APPLIANCE_IMAGE} build --debug-base-ignition
Can we have podman pull a newer version if one is available?
Yeah, good idea, added.
This didn't work for me after I pulled a newer appliance image. I had to manually set ...
Added support for running the agent flow using the openshift-appliance [1] disk image.

To invoke the flow, use the following env var in config:
export AGENT_E2E_TEST_BOOT_MODE=DISKIMAGE

The appliance flow is as follows:
* Generate appliance-config.yaml [2]
  * Includes the config required for building the appliance
* Build the appliance disk image using the container image defined in the APPLIANCE_IMAGE env var
* Create the config-image ISO
* Clone the appliance disk image for every node
  * As each node needs a separate disk for exclusive I/O
* Attach a cloned disk image to each node
* Attach the config-image ISO to all nodes
* Boot all machines from the disk image

Note: based on config-image support PR [3]

[1] https://github.com/openshift/appliance
[2] https://github.com/openshift/appliance/blob/master/docs/user-guide.md#set-appliance-config
[3] openshift-metal3#1533
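A minimal end-to-end usage sketch under those assumptions (the make target and the appliance image pullspec are assumptions, not part of the commit message):

# Hypothetical invocation of the new flow on a dev-scripts host.
export AGENT_E2E_TEST_BOOT_MODE=DISKIMAGE                                    # confirmed by this PR
# export APPLIANCE_IMAGE=quay.io/edge-infrastructure/openshift-appliance     # assumed builder image
make agent   # assumed dev-scripts target driving the agent/0*.sh scripts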
Force-pushed from 21ab07c to 7e3d43e
I wonder what the difference in the envs is then :/ Anyway, we'll have to test it on the actual CI env...
Hey @rwsu @andfasano @bfournie, can you please take another look? I think this change is ready for approval now.
rwsu left a comment:
/lgtm
Not sure how CI will be configured to avoid the release image mismatch problem. Other than that, it looks ok to me.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bfournie. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/test e2e-metal-ipi-serial-ipv4
/lgtm
/test e2e-metal-ipi-serial-ipv4
3 similar comments
/test e2e-metal-ipi-serial-ipv4

/test e2e-metal-ipi-serial-ipv4

/test e2e-metal-ipi-serial-ipv4
/retest
@danielerez: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests. Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest-required