Skip to content

cmd: check-host-config rewritten in Go (HMS-9805)#2043

Merged
supakeen merged 5 commits intoosbuild:mainfrom
lzap:check-host-config
Jan 13, 2026
Merged

cmd: check-host-config rewritten in Go (HMS-9805)#2043
supakeen merged 5 commits intoosbuild:mainfrom
lzap:check-host-config

Conversation

@lzap
Copy link
Contributor

@lzap lzap commented Nov 26, 2025

Yeah. Cozy winter evening happened.

I wrote the whole framework and implemented two checks and then asked AI to implement the rest. Only checkers in the check package were generated and reviewed by me, all the rest is my own code. Let me know how you like it. I intentionally did not make any changes to the behavior, this is pure refactoring which will open doors to more features like using YAML to find which tests are missing etc.

Also the output is just pure text, but it could be maybe gotest now or something that CI understands and can parse for nicer results. But one step after another, I would rather keep it simple for now.

Now, I really wanted this to be fast so everything runs in parallel, but logs from all goroutines are collected and presented in a nicer way sequentially so the output is not all messed up. I heavily use what is called context function mocking where the platform functions (Exec, Exists, Grep) can be overriden with a mock via special context value. The key type can be not exported making it impossible to mess around with the functions from other packages.

The hard part is now to integrate this in the CICD. We want to build the binaries for each PR and then upload them somewhere and take it from here. For now, I am settling up with simply running "go build" for each test and then copying the binaries over to the image the same way as we did for the script. If we cache GOCACHE maybe this might not be a bad idea at all.

You can safely run this locally, I have reviewed and tested everything:

$ go run ./cmd/check-host-config -config test/configs/all-customizations.json 
[main           ] Loading build config from test/configs/all-customizations.json
[files          ] Checking existence of file: /etc/systemd/system/custom.service
[files          ] Checking if file exists: /etc/systemd/system/custom.service
[files          ] File does not exist: /etc/systemd/system/custom.service
[hostname       ] Executing: hostname
[hostname       ] Command succeeded: hostname
[hostname       ] Comparing 'dev.home.lan' with expected hostname 'my-host'
[cacerts        ] Parsing CA cert 1
[cacerts        ] Extracting serial from CA cert 1: 27894af897dd2423607045716438a725f28a6d0b
[cacerts        ] Extracting CN from CA cert 1: Test CA for osbuild
[cacerts        ] Checking CA cert 1 anchor file serial '27894af897dd2423607045716438a725f28a6d0b'
[cacerts        ] Checking if file exists: /etc/pki/ca-trust/source/anchors/27894af897dd2423607045716438a725f28a6d0b.pem
[cacerts        ] File does not exist: /etc/pki/ca-trust/source/anchors/27894af897dd2423607045716438a725f28a6d0b.pem
[srv-disabled   ] Checking disabled service: bluetooth.service
[srv-disabled   ] Executing: systemctl is-enabled bluetooth.service
[srv-disabled   ] Command succeeded: systemctl
[srv-masked     ] Executing: systemctl list-unit-files --state=masked
[srv-masked     ] Command succeeded: systemctl
[srv-masked     ] Checking masked service: nfs-server
[users          ] Checking user: user1
[users          ] Executing: id user1
[users          ] Command failed: id (exit code: exit status 1)
[srv-enabled    ] Checking enabled service: sshd.service
[srv-enabled    ] Executing: systemctl is-enabled sshd.service
[srv-enabled    ] Command succeeded: systemctl
[srv-enabled    ] Service was enabled service=sshd.service state=enabled
[srv-enabled    ] Checking enabled service: custom.service
[srv-enabled    ] Executing: systemctl is-enabled custom.service
[srv-enabled    ] Command failed: systemctl (exit code: exit status 4)
[fw-srv-disabled] Checking disabled firewall service: telnet
[fw-srv-disabled] Executing: sudo firewall-cmd --query-service=telnet
[fw-srv-disabled] Command failed: sudo (exit code: exit status 1)
[fw-srv-disabled] Firewall service was disabled service=telnet state=no
[fw-srv-enabled ] Checking enabled firewall service: ftp
[fw-srv-enabled ] Executing: sudo firewall-cmd --query-service=ftp
[fw-srv-enabled ] Command failed: sudo (exit code: exit status 1)
[fw-ports       ] Checking enabled firewall port: 1337/udp
[fw-ports       ] Executing: sudo firewall-cmd --query-port=1337/udp
[fw-ports       ] Command failed: sudo (exit code: exit status 1)
Results:
✅ Firewall Services Disabled Check
⚠️  Hostname Check (warn: hostname does not match, got dev.home.lan expected my-host)
❌ Files Check (fail: file does not exist: /etc/systemd/system/custom.service)
❌ CA Certs Check (fail: file missing for cert 1 at /etc/pki/ca-trust/source/anchors/27894af897dd2423607045716438a725f28a6d0b.pem)
❌ Services Disabled Check (fail: service is not disabled: bluetooth.service state: enabled)
❌ Services Masked Check (fail: service is not masked: nfs-server)
❌ Users Check (fail: user does not exist: user1)
❌ Services Enabled Check (fail: service is not enabled: custom.service error: exit status 4)
❌ Firewall Services Enabled Check (fail: firewall service is not enabled: ftp error: exit status 1)
❌ Firewall Ports Check (fail: firewall port is not enabled: 1337:udp error: exit status 1)

The result codes indicate a test which failed as the first one, starting from 10 (1-9 are reserved). These goroutines can fail in random order, so it must not be used for any workarounds it is purely informative.

@lzap lzap added the WIP Work in progress. Don't run Gitlab CI. label Nov 26, 2025
@lzap lzap force-pushed the check-host-config branch from 78a731d to c58b3e5 Compare November 26, 2025 10:18
@lzap lzap removed the WIP Work in progress. Don't run Gitlab CI. label Nov 26, 2025
@lzap lzap force-pushed the check-host-config branch 2 times, most recently from 9c447e4 to f385256 Compare November 26, 2025 16:49
@lzap lzap changed the title cmd: check-host-config rewritten in Go cmd: check-host-config rewritten in Go (HMS-9805) Nov 26, 2025
@lzap lzap force-pushed the check-host-config branch 3 times, most recently from 26f944b to c8c2cae Compare November 27, 2025 09:59
@lzap lzap force-pushed the check-host-config branch from c8c2cae to 274a1ba Compare November 27, 2025 11:43
@lzap lzap force-pushed the check-host-config branch 11 times, most recently from 70673da to 3b8f8cb Compare December 2, 2025 17:10
@lzap lzap force-pushed the check-host-config branch 4 times, most recently from f507ee9 to 0eb7945 Compare December 3, 2025 14:29
Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see this split into multiple commits. The changes outside of cmd/check-host-config/ can be separate from the introduction of the new tool.

The introduction of the tool itself could also be multiple commits, like introducing the main parts and then the individual checks, but I don't feel as strongly about this (one commit for the whole thing isn't too bad).

More generally: I like the check interface. Each little module is nicely defined and clear. But looking at everything going on in main() I have to wonder what it's all for. There seems to be a lot going on for something that should mostly be "call Run() on a list of checks and collect the results". Maybe it could be broken down a bit into functions so its easier to read if it's all necessary.

Comment on lines 25 to 27
if f := ExecFunc(ctx); f != nil {
return f(name, arg...)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not very clear to me why this (and the other overrides like it) is needed. Do we use it anywhere, or is it just for mocking?

Copy link
Contributor Author

@lzap lzap Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for thread-safe mocking, yes.

See, I value a good mocking absolutely essential in this case where CICD is very expensive, I want to be able to easily create as many reliable tests as I need in order to improve chance there are no bugs before I push and wait 3 hours until I get my results. The context mocking is not very common, but pretty useful approach.

Btw this rewrite alone revealed two issues both in modularity and OpenSCAP when the checks were not entirely correct. So it already paid off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a package.go with further explanation:

// Package mockos provides mockable OS functions for testing. Use this package in
// host checks to allow mocking of OS interactions during unit tests. The package
// provides WithXXXFunc functions to set mock implementations in a context.Context,
// and corresponding XXXFunc functions to retrieve them. If no mock is set, the real
// OS functions are used.
package mockos

@lzap
Copy link
Contributor Author

lzap commented Dec 4, 2025

The introduction of the tool itself could also be multiple commits, like introducing the main parts and then the individual checks, but I don't feel as strongly about this (one commit for the whole thing isn't too bad).

While generally I agree with isolated commits, here it does not make too much sense as the commit makes sense as a whole. I do not like creating "artificial history" when it was not how the code was really written. Can do this if you insist.

But looking at everything going on in main() I have to wonder what it's all for.

Yeah, I see two confusing parts. Argument parsing is a big unreadable, I was aiming for both because I thought it is good to be backward compatible if someone uses this directly. But now that I think about it, let's break it. I am settling with Go flags because these are shorter to parse than argument + environmental variable.

The special goroutine which collects logs is not necessary, but I really want this to be readable. Because all checks are executed in parallel, the output would be totally unreadable without this. We really, really should do something with unreadable logs on CICD and we have to start somewhere. This check binary will be very often the one that fails and I want a clear output with as much logging as possible. This is why logging is quite elaborated and is being collected and ordered correctly.

Maybe it could be broken down a bit into functions so its easier to read if it's all necessary.

Absolutely, good idea. I moved check slice into checks.go and extracted code into functions.

@achilleas-k
Copy link
Member

This will make reviews harder

For me at least, a single final commit that changes some config files and modifies all the manifest checksums doesn't affect the reviewing much. As long as it's isolated from the rest of the changes, I can review the PR as if it's not there by viewing the rest of the commits individually (or in ranges, now that GH web UI supports it). So don't worry about that.

Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check always succeeds.

@achilleas-k
Copy link
Member

You mentioned at some point (can't remember where or when) that you had a hard time rebasing and rewriting this because you had to wait for CI jobs that can take hours.

Just ftr, for anyone reading this, it's possible to quickly test it locally by:

  1. Building an image: ./test/scripts/build-image fedora-43 generic-qcow2 ./test/configs/all-customizations.json.
  2. pip install . (in a venv preferably) to make vmtest available.
  3. Running the boot test on the image: ./test/scripts/boot-image fedora-43 generic-qcow2 ./test/configs/all-customizations.json

Step 3 takes about 15 secs to run on my machine, making it very easy to iterate locally on the code.

@lzap
Copy link
Contributor Author

lzap commented Jan 10, 2026

Damn, yeah, nice catch. I am bumping the seed once again then. Rebased.

@lzap lzap force-pushed the check-host-config branch 4 times, most recently from 1000881 to adcb6fa Compare January 11, 2026 16:34
achilleas-k
achilleas-k previously approved these changes Jan 12, 2026
Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again! Nice work. I went through everything carefully again and have a few nitpicks and minor comments. Approving anyway, since they can be handled in follow-ups.

// still in the activating state. It calls systemctl list-units to get the list.
// This is only used in case of timeout to help with debugging.
func listBadUnits() string {
stdout, _, _, err := check.Exec("systemctl", "list-units", "--state=activating,failed", "--plain", "--no-legend", "--no-pager")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

systemctl list-units supports --output=json, which would be cleaner here. Though, I don't know what version introduced the feature, so maybe it's not available on RHEL 8.10.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is RHEL9+ only.

Comment on lines +17 to +26
// Note that this test only checks for the existance of the filesystem
// customizatons target path not the content. For the simple case when
// "data" is provided we could check but for the "uri" case we do not
// know the content as the file usually comes from the host. The
// existing testing framework makes the content check difficult, so we
// settle for this for now. There is an alternative approach in
// https://github.com/osbuild/images/pull/1157/commits/7784f3dc6b435fa03951263e48ea7cfca84c2ebd
// that may eventually be considered that is more direct and runs
// runs locally but different from the existing paradigm so it
// needs further discussion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check permissions and ownership as well. And check data when it's embedded in the blueprint. Let's do it in a follow-up though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I aim for no changes in this PR, except fixing obvious bugs. There is much more we can do now pretty easily.

@lzap
Copy link
Contributor Author

lzap commented Jan 12, 2026

Rebased and squashed all your comments, thanks. The only exception being listing unit tests but this is not a big deal it is only in case of error. Any other new functionality I would like to do as a followups.

supakeen
supakeen previously approved these changes Jan 12, 2026
@supakeen supakeen enabled auto-merge January 12, 2026 18:56
lzap added 5 commits January 13, 2026 11:48
The check script is rewritten in Go to be more modular and easier to test.
No changes have been made to the checks themselves. Comments were carried
over from the script.
Switch over to the Go implementation in the boot-image script.
The package "jq" is no longer needed. Drop it from configs.

The sssd service is masked on RHEL 8.8+ because it does not start, see
the discussion in: https://access.redhat.com/solutions/7017538
@lzap
Copy link
Contributor Author

lzap commented Jan 13, 2026

Only resolved manifest conflicts after auto-merge was enabled, can I get acks again @achilleas-k @supakeen thanks :-)

@supakeen supakeen added this pull request to the merge queue Jan 13, 2026
Merged via the queue into osbuild:main with commit 27eddcd Jan 13, 2026
25 checks passed
@lzap lzap deleted the check-host-config branch January 13, 2026 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants