Skip to content

Conversation

@dustymabe
Copy link
Member

We recently did some golang migration [1] and happened to add the
cosa remote-session pieces [2] after that migration. This makes
it not easy to backport to other branches. For a period of time
let's use the latest COSA for our COSA pod for multi-arch builds
instead of the versioned COSA. We will still use the versioned COSA
on the remote so the actual building will happened with the correctly
versioned COSA (substituted below and passed as an argument to
cosa remote-session create --image=). However we still need to support
some of the new command line flags that were added to aid in the transition
to cosa remote-session. This PR backports various pieces needed.

[1] #2919
[2] #2979

dustymabe and others added 12 commits September 15, 2022 14:47
The command previously wouldn't allow us to get/set fedora-coreos.parent-commit
and fedora-coreos.parent-version because it would try to traverse the
`fedora-coreos.` path. Add an exception here similar to `coreos-assembler.`.

Now we can do something like this in the pipeline:

```
cosa meta \
    --set fedora-coreos.parent-commit=${parent_commit} \
    --set fedora-coreos.parent-version=${parent_version}
```

(cherry picked from commit 433a68d)
This will allow us to specify where the AWS_CONFIG_FILE is from
the CLI without having to set it in the environment.

(cherry picked from commit db00529)
This will allow us to specify the AWS_CONFIG_FILE from the command
line without having to set it in the environment.

(cherry picked from commit 9214b9b)
This will allow us to specify the AWS_CONFIG_FILE from the command
line without having to set it in the environment.

This required reworking the cosalib/s3 library into an S3() class
so that we could call boto3.client('s3') after setting the
AWS_CONFIG_FILE environment variable in cmd-buildfetch. Otherwise
it would get called as soon as the library was imported and setting
the environment variable would have no effect (too late).

(cherry picked from commit 9d55a20)
This will allow us to specify the AWS_CONFIG_FILE from the command
line without having to set it in the environment.

(cherry picked from commit 2b0b36d)
Right now cmd-build will just pick up whatever the latest symlink points
to and try to get previous build information from that, but what if
the absolute latest build doesn't have a build for the architecture we
are targetting? Currently cmd-build will fail with:

```
/usr/lib/coreos-assembler/cmd-build: line 163: /srv/builds/36.20220720.20.5/aarch64/meta.json: No such file or directory
error: failed to execute cmd-build: exit status 1
```

What we really need is to get the latest build for the given
architecture and use that in cmd-build instead.

(cherry picked from commit 2ca610c)
If we are starting out a brand new stream and the x86_64 build
has completed, but there was no build yet for $arch then the detected
previous buildid:

```
previous_buildid = parent_build or self.get_latest()
```

Will just be the build ID of the x86_64 build from `get_latest()`,
but this code tries to open a meta.json for it from $arch, which
won't exist. Here's the error we see:

```
Traceback (most recent call last):
  File "<string>", line 5, in <module>
  File "/usr/lib/coreos-assembler/cosalib/builds.py", line 134, in init_build_meta_json
    with open(metapath) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/srv/builds/36.20220714.dev.0/aarch64/meta.json'
error: failed to execute cmd-build: exit status 1
```

This should fix that error by verifying the detected buildid actually
has a build for that architecture before continuing.

(cherry picked from commit 786e87b)
CI hit a race where parallel `buildextend-*` would race to generate
`tmp/image.json`.

We need to lock this for efficiency (avoid duplicate work) and
correctness.

(cherry picked from commit 6222e3f)
Linux `tmpfs` still doesn't support `user.` xattrs, and so for
toolbox-like containers that use a (proper!) `tmpfs` for `/tmp`
instead of having it be `overlayfs`, we need to ensure that
we create our tempdir in `$workdir/tmp` - the workdir must have
`user.` xattr support for the `cache/repo-build` anyways.

Closes: coreos#2904
(cherry picked from commit e024805)
This was originally needed for `cosa sign` to force the re-import from
the tarball to ensure that the detached metadata was present in the tmp
repo after signing (3b192cc).

But now with the switch to `ostree container`, we operate directly on
the tmp repo and re-export the signed commit, so this is no longer
needed.

This also fixes `cosa generate-hashlist` also setting `force`, which I
don't think it actually needed.

(cherry picked from commit ccb4b73)
Prep for next patch.

(cherry picked from commit 68f31d5)
`flufl` requires specifying lock lifetimes, i.e. how long the lock
should be considered valid before others can also lock it. The default
is 15s. This surprisingly means that a critical section could in fact be
run concurrently if one of the locks takes more than 15s.

I don't think we want to use lock lifetimes here, so let's make sure we
always pass in a large number. The intended use case is that your
program doesn't deadlock, but I'd prefer we deadlock and have pipelines
fail instead after timing out at the Jenkins level.

This should fix the `NotLockedError: Already unlocked` error messages
we've had in the FCOS pipeline after parallelizing
`cosa generate-hashlist` and `cosa buildextend-qemu` (related:
coreos/fedora-coreos-pipeline#554).

What would be interesting would be lock timeouts instead: e.g. have
`with Lock(...)` automatically throw if it still hasn't acquired the
lock after X duration. That'd allow us to error out earlier than the
Jenkins timeout in case of something taking much longer than it really
should.

We could do this now but it'd require creating the lock separately from
using it in a context. Latest upstream `flufl` supports specifying this
as part of the constructor so let's wait for that instead:

https://gitlab.com/warsaw/flufl.lock/-/merge_requests/47
(cherry picked from commit 1151356)
@openshift-merge-robot
Copy link

@dustymabe: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link

openshift-ci bot commented Sep 16, 2022

@dustymabe: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/rhcos dd4f704 link true /test rhcos

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dustymabe dustymabe changed the base branch from main to rhcos-4.11 September 16, 2022 14:44
Copy link
Member

@travier travier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from a quick review

@dustymabe
Copy link
Member Author

CI is known broken for rhcos* PRs. Merging.

@dustymabe dustymabe merged commit b71d282 into coreos:rhcos-4.11 Sep 16, 2022
@dustymabe dustymabe deleted the dusty-cosa-remote-support-4.11 branch September 16, 2022 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants