nixos/grub: generate BLS entries#95901

Merged
rnhmjoj merged 4 commits into NixOS:master from rnhmjoj:grub-bls
Feb 28, 2025

@rnhmjoj
Contributor

@rnhmjoj rnhmjoj commented Aug 21, 2020

Motivation for this change

See issue #94038. In short, this makes it possible to use bootctl and systemctl kexec with GRUB.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via nixosTests.grub
  • Tested via nixosTests.installer
  • Tested kexec works
  • Tested old entries are removed when running nixos-rebuild switch
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@rnhmjoj rnhmjoj changed the title Grub bls nixos/grub: generate BLS entries Aug 21, 2020
@rnhmjoj rnhmjoj requested review from nh2 and samueldr August 21, 2020 12:00
@samueldr
Member

Can't or shouldn't we re-use the same generator for all BLS entries generation, rather than maybe accidentally having diverging implementations?

@ofborg ofborg bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. 10.rebuild-linux: 1 This PR causes 1 package to rebuild on Linux. labels Aug 21, 2020
@rnhmjoj
Contributor Author

rnhmjoj commented Aug 21, 2020

Uhm, I'm not sure: it certainly makes sense, but the systemd-boot generator is in python. Using it would make the bootloader script depend on a shell, a perl and a python interpreter, which is not great. If install-grub.pl were to be rewritten in python, it would be trivial, but I don't have enough familiarity with either perl or grub to do that.

@samueldr
Member

samueldr commented Aug 21, 2020

It would make the bootloader script depend on python, but isn't NixOS' base closure already bringing in the same python? I guess that would be the question to answer.

Because, sure it would increase this specific NixOS module's closure size, but re-implementing the code increases the code complexity.

That is my main concern with the PR, but I guess it could be fine to go with the duplicated code for the time being, as long as others don't feel strongly against that.

@xaverdh
Contributor

xaverdh commented Aug 21, 2020

After reading #48378, if nothing changed in the meantime, python should not be in the minimal closure unless using systemd-boot (due to the python generator).
Also I think people are slowly trying to replace perl in the core tools / closure, but it's not there yet.

@danielfullmer
Contributor

I also have some incoming changes to systemd-boot-builder.py to support "boot counters" here: #84204
This involves keeping track of the boot counter state associated with each entry, which are optional counters included in the filename after a + symbol: e.g. nixos-generation-200+2-1.conf.
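
As an illustration of the filename convention described above, here is a minimal Python sketch that splits a loader entry filename into its base name and optional boot counters. It assumes the systemd boot-counting layout `<name>+<tries left>[-<tries done>].conf`; the names `EntryName` and `parse_entry_filename` are hypothetical and are not taken from systemd-boot-builder.py.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "<name>+<left>[-<done>].conf", where the counter suffix
# is optional. The base name itself must not contain a "+".
ENTRY_RE = re.compile(
    r"^(?P<name>[^+]+?)(?:\+(?P<left>\d+)(?:-(?P<done>\d+))?)?\.conf$"
)


class EntryName(NamedTuple):
    name: str
    tries_left: Optional[int]  # None when the entry has no boot counter
    tries_done: Optional[int]  # None when the entry has no boot counter


def parse_entry_filename(filename: str) -> EntryName:
    """Split a loader entry filename into base name and boot counters."""
    m = ENTRY_RE.match(filename)
    if m is None:
        raise ValueError(f"not a loader entry filename: {filename!r}")
    left, done = m.group("left"), m.group("done")
    if left is None:
        return EntryName(m.group("name"), None, None)
    # A missing "-<done>" part is treated as zero attempts so far.
    return EntryName(m.group("name"), int(left), int(done) if done else 0)


if __name__ == "__main__":
    print(parse_entry_filename("nixos-generation-200+2-1.conf"))
```

A generator that rewrites entries would need to carry these counters over from the old filename to the new one, which is the state-tracking burden the comment refers to.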

Having the loader entry generation in two different languages would make changes like this much more difficult. I wonder if the aversion to python in the minimal closure has changed at all since the apparently successful refactor to enable mypy type checking in nixops.

@samueldr
Member

I believe the closure concerns are entirely about size.

@rnhmjoj
Contributor Author

rnhmjoj commented Jan 7, 2021

So, it doesn't look like there is a consensus on whether perl, python or neither should be used instead.

I agree it's not ideal to have the bootloader code written in two different languages, but I wouldn't stall changes on this because I don't think the conflict will be resolved anytime soon. Moreover, both systemd-boot-builder.py and install-grub.pl are complex and rewriting or unifying them is already a huge task.

@xaverdh
Contributor

xaverdh commented Jan 8, 2021

So, it doesn't look like there is a consensus on whether perl, python or neither should be used instead.

I agree it's not ideal to have the bootloader code written in two different languages, but I wouldn't stall changes on this because I don't think the conflict will be resolved anytime soon. Moreover, both systemd-boot-builder.py and install-grub.pl are complex and rewriting or unifying them is already a huge task.

I agree.
I am currently learning rust by trying to rewrite setup-etc and update-users-groups and eventually switch-to-configuration in rust (here if you want to take a look at the current state). If the community decides that it's a good road to follow, I will have a look at these boot loader generators as well. But that's for another day.

@rnhmjoj
Contributor Author

rnhmjoj commented Jan 8, 2021

I am currently learning rust by trying to rewrite setup-etc and update-users-groups and eventually switch-to-configuration in rust

Using a compiled language with static typing sounds like a good idea: the scripts are quite complex and a proper compiler could help. It would also remove the interpreter from the closure.

@nh2
Contributor

nh2 commented Jan 8, 2021

If install-grub.pl were to be rewritten in python

That would be amazing. The bootloader script has the potential to brick everyone's system, yet with the Perl stuff it is extremely easy to make mistakes in there.

In the PR I contributed to install-grub.pl recently, I spent multiple hours testing and consulted somebody who actually knew some Perl to get it right, and I still broke something and had to make a followup PR.

Similarly, I don't feel equipped to review this PR.

As a desktop user, and owner of important server infrastructure that needs to boot reliably, I'd accept a slightly larger system any day if in turn that Perl script disappears from my "stuff that has a high potential to screw us" risk list.

If consensus can be reached for install-grub.pl to be rewritten in Python, I'm happy to spend some hours to implement it (I think I know quite well what that script does, I just can't modify it with confidence in Perl), or to sponsor e.g. 300 EUR to help fund the move.

@nh2
Contributor

nh2 commented Jan 8, 2021

Using a compiled language with static typing sounds like a good idea: the scripts are quite complex and a proper compiler could help.

👍 For Python, optional static typing support (mypy) is good in nixpkgs, and that could be used e.g. in a NixOS test that doesn't increase the closure vs plain Python (we use mypy for Python scripts at work).

That is not to discourage the Rust approaches (good stuff!), but if Python is something that can be agreed on easily, I'd prefer that immediately vs waiting e.g. another year for a potential Rust solution to be ready (we can still switch to something better afterwards).

@rnhmjoj
Contributor Author

rnhmjoj commented Jan 8, 2021

I'm not familiar with rust, but I could help with python too.

It's important to reach a consensus on the language beforehand, though: without it all the efforts could go to waste.
What do you think of making an RFC? For example, this has been done for choosing the new Nixpkgs/NixOS docs format.

@xaverdh
Contributor

xaverdh commented Jan 8, 2021

I'm not familiar with rust, but I could help with python too.

It's important to reach a consensus on the language beforehand, though: without it all the efforts could go to waste.
What do you think of making an RFC? For example, this has been done for choosing the new Nixpkgs/NixOS docs format.

Well for compiled languages we do not have to restrict ourselves that much, but it would indeed be a good idea to agree on a canonical interpreted language while / if we have critical interpreted code present at runtime.

@stale

stale bot commented Nov 9, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 9, 2021
@ghost

ghost commented Mar 22, 2023

The GRUB BLS module (blscfg.mod) does not support ZFS. Source 1 (search in page for blscfg); therefore we must disable BLS if ZFS support is enabled.

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 22, 2023
@rnhmjoj
Contributor Author

rnhmjoj commented Mar 22, 2023

@ne9z This PR has nothing to do with grub support for BLS. It's about adding some code to generate the BLS entries when switching to a new NixOS generation.

@ghost

ghost commented Mar 22, 2023 via email

@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 19, 2024
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Dec 23, 2024
@github-actions github-actions bot added 8.has: documentation This PR adds or changes documentation 8.has: changelog This PR adds or changes release notes 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1 This PR causes 1 package to rebuild on Linux. labels Dec 23, 2024
@rnhmjoj
Contributor Author

rnhmjoj commented Dec 23, 2024

@GrahamcOfBorg test grub nixos-rebuild-install-bootloader

@rnhmjoj
Contributor Author

rnhmjoj commented Dec 23, 2024

So, it's been 4 years, install-grub.pl is still here and I'm still feeling the lack of boot loader entries. Even if the script is ported to python (or whatever else) at some point, generating the boot loader entries is very simple and I'm sure this patch will not make that work any harder than it already is. So, I'm going ahead with this.

I've rebased, added more tests and a release note. I'm now pretty confident that the implementation is correct.
Despite more systemd shenanigans compared to 2020, I managed to get systemctl kexec working out of the box with GRUB, both EFI and non-EFI. The boot loader entries are generated using the bootspec, so they will match the ones produced by the systemd-boot module.
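
To make "generated using the bootspec" concrete, the following Python sketch renders a Type #1 loader entry from a bootspec-like dictionary. It is not the code from this PR (which is Perl); the function name `render_bls_entry`, the `version` wording and all sample values are hypothetical. It only assumes the `org.nixos.bootspec.v1` fields `label`, `kernel`, `initrd`, `init` and `kernelParams`.

```python
from typing import Any, Dict


def render_bls_entry(bootspec: Dict[str, Any], generation: int) -> str:
    """Render a minimal Type #1 loader entry from a bootspec document."""
    spec = bootspec["org.nixos.bootspec.v1"]
    # The init path is passed on the kernel command line, followed by the
    # generation's own kernel parameters.
    options = " ".join([f"init={spec['init']}", *spec["kernelParams"]])
    lines = [
        f"title {spec['label']}",
        f"version Generation {generation}",
        # Paths are emitted verbatim here; a real generator would have to
        # rewrite them relative to the file system holding the entry.
        f"linux {spec['kernel']}",
        f"initrd {spec['initrd']}",
        f"options {options}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical sample data, not taken from a real system.
SAMPLE = {
    "org.nixos.bootspec.v1": {
        "label": "NixOS example",
        "kernel": "/kernels/example-bzImage",
        "initrd": "/kernels/example-initrd",
        "init": "/example/init",
        "kernelParams": ["quiet", "loglevel=4"],
    }
}

if __name__ == "__main__":
    print(render_bls_entry(SAMPLE, 42), end="")
```

Because both generators would read the same bootspec fields, the entry contents can match even when the surrounding install logic is implemented separately.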

(unsurprisingly nixos-rebuild-install-bootloader times out: this test is completely insane.)

@rnhmjoj rnhmjoj marked this pull request as ready for review December 23, 2024 13:32
@rnhmjoj
Contributor Author

rnhmjoj commented Dec 23, 2024

The test appears to be broken with system.includeBuildDependencies = true:

ERROR: cptofs failed. diskSize might be too small for closure.

but it passes when run in the test driver with network access.

@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Dec 23, 2024
@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Jan 4, 2025
@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Feb 28, 2025
@rnhmjoj rnhmjoj merged commit 6bf084c into NixOS:master Feb 28, 2025
25 of 27 checks passed
@samueldr
Member

samueldr commented Feb 28, 2025

This should either be reverted or fixed quickly, before it gets into release channels, as it can lead to breakage on users' systems.

Test available here, please cherry-pick into the fixing PR:


This new behaviour is breaking assumptions under the BLS:

The Boot Loader Specification

This document defines a set of file formats and naming conventions that allow the boot loader menu entries to be shared between multiple operating systems and boot loaders installed on one device.

(Emphasis my own.)

The changes here are currently breaking multi-boot setups where a BLS-aware bootloader is handling some boot options, and a NixOS is booted with GRUB.

Note that this also differs in behaviour compared to:

Not re-using the same generator for all BLS entries generation has led to accidentally having diverging implementations.

Additionally, this behaviour should be behind an option, so it can be disabled. It probably should be a default-false option, too. The added BLS entries will be shown in the BLS entries menu, which may not be desirable when specifically configuring NixOS for GRUB usage.

Contributor

@ElvishJerricco ElvishJerricco left a comment

In hindsight, I reverted this PR thinking there was more wrong with it than there ended up being. That said, I do still think it was correct to revert it.

The issues with NixOS taking excessive ownership over BLS must be addressed. We cannot be deleting other OSes' entries, and we cannot be deleting another OS's loader.conf for a grub that won't use it. Additionally, I think at minimum it must be possible to disable this, and I'm not convinced it's good for it to be enabled by default.

But in the bigger picture, I question the premise of the PR. Rather than implementing BLS twice, it seems more reasonable to me to have a separate module for implementing BLS that both systemd-boot and grub can use. That way they share the same logic for populating and garbage collecting BLS files.

Finally, @rnhmjoj, this PR touches a system critical component. It had been inactive for years. It hadn't been reviewed in years. And you didn't request any reviews when you recently restarted work on it. Yet you self-merged it anyway. That is out of line and IMO grounds for removal of the commit bit.

Comment on lines +53 to +55
bootPath = if cfg.mirroredBoots != [ ]
  then (builtins.head cfg.mirroredBoots).path
  else "/boot";
Contributor

This part confused me for several reasons. I thought there was an error here, but after looking closer, I see there isn't. But I do still have some nitpicks about what confused me.

  • There is an assertion that mirroredBoots cannot be empty, so I don't think we need to account for that case with a bogus value.
  • I think there should be a comment explaining why just any of the mirroredBoots is acceptable.
  • We shouldn't reuse the same bootPath variable name that is already used nearby in this module

Comment on lines +579 to +582
# mark the default entry
if (readlink($link) eq $defaultConfig) {
    writeFile("$bootPath/loader/loader.conf", "default $name.conf");
}
Contributor

FWIW, loader.conf is not actually part of BLS. That's specifically a systemd-boot thing.

But much more importantly, if another OS owns the loader.conf file on this machine, overwriting it is not acceptable. We cannot do that. We take NixOS's ownership of loader.conf as a given in systemd-boot-builder.py, but that is not a given with grub, because it doesn't use it. It is also a significant change in behavior for NixOS's existing grub support.

Comment on lines +681 to +685
# Atomically replace the BLS entries directory
my $entriesDir = "$bootPath/loader/entries";
rename $entriesDir, "$entriesDir.bak" or die "cannot rename $entriesDir to $entriesDir.bak: $!\n";
rename "$entriesDir.tmp", $entriesDir or die "cannot rename $entriesDir.tmp to $entriesDir: $!\n";
rmtree "$entriesDir.bak" or die "cannot remove $entriesDir.bak: $!\n";
Contributor

This is similar to the loader.conf issue, but more severe. We do not necessarily own all the entries on the system. We can't just delete the whole entries directory. The systemd-boot-builder.py takes great care to only remove entries and files known to be owned by NixOS.
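
The ownership concern above can be sketched in Python as follows. This is not how systemd-boot-builder.py is actually written; it is a minimal illustration that assumes NixOS-owned entries can be recognised by a `nixos-generation-<N>` filename prefix, and the function name `remove_stale_nixos_entries` is hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterable, List

# Assumption: only files named like this belong to NixOS. Anything else in
# the entries directory is treated as foreign and never touched.
NIXOS_ENTRY_RE = re.compile(r"^nixos-generation-\d+(\+\d+(-\d+)?)?\.conf$")


def remove_stale_nixos_entries(entries_dir: Path, keep: Iterable[str]) -> List[str]:
    """Delete NixOS-owned entries not listed in `keep`; return removed names."""
    keep_set = set(keep)
    removed = []
    for path in sorted(entries_dir.iterdir()):
        if not path.is_file():
            continue
        if NIXOS_ENTRY_RE.match(path.name) is None:
            continue  # entry of another OS or tool: leave it alone
        if path.name in keep_set:
            continue  # still a live generation
        path.unlink()
        removed.append(path.name)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        entries = Path(tmp)
        for name in ("nixos-generation-1.conf", "nixos-generation-2.conf", "other-os.conf"):
            (entries / name).write_text("")
        print(remove_stale_nixos_entries(entries, keep=["nixos-generation-2.conf"]))
        print(sorted(p.name for p in entries.iterdir()))
```

In contrast to renaming the whole entries directory, this per-file approach is no longer a single atomic swap, so a real implementation would have to decide how to handle an interrupted run.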

Comment on lines +548 to +555
if ($grubEfi eq "" && !$copyKernels) {
    # workaround for https://github.com/systemd/systemd/issues/35729
    make_path("$bootPath/kernels", { mode => 0755 });
    symlink($bootspec{kernel}, "$bootPath/kernels/$kernel");
    symlink($bootspec{initrd}, "$bootPath/kernels/$initrd");
    $copied{"$bootPath/kernels/$kernel"} = 1;
    $copied{"$bootPath/kernels/$initrd"} = 1;
}
Contributor

The correctness of this is highly suspect. BLS doesn't specify how to interpret files in the tree that are symlinks to an absolute path like this. What root should be the root for resolving those symlinks? Obviously, in the case of running bootctl list or systemctl kexec on the booted NixOS system, this happens to use the system's root FS as the root. But that isn't a given in the BLS and doesn't really make sense for an arbitrary BLS boot loader.

Additionally, why must this not be an EFI setup to take this code path? You can still do copyKernels = false; on an EFI system, if $bootPath is on the root fs and efiSysMountPoint is something else.

Contributor Author

@rnhmjoj rnhmjoj Mar 1, 2025

This is working around a bug in systemd, of course it's not "correct" according to the BLS spec.
See what I wrote in the issue I linked:

systemctl kexec expects the paths of the kernel and initrd to be relative to /run/boot-loader-entries/ and fails with File not found. Without an ESP I would expect the path to be absolute (relative to /), given the updated specification even says:

linux specifies the Linux kernel image to execute. The value is a path relative to the root of the file system containing the boot entry snippet itself.

Additionally, why must this not be an EFI setup to take this code path? You can still do copyKernels = false; on an EFI system, if $bootPath is on the root fs and efiSysMountPoint is something else.

It's unrelated to copyKernels: it has to do with linking the entries to /run/boot-loader-entries on non-EFI systems to expose the entries to systemd. In this (edge) case the BLS taken literally says to interpret paths relative to /, while systemd interprets them relative to /run/boot-loader-entries.
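
The disagreement between the two readings can be shown with a small Python sketch. It resolves the same `linux` value once against `/` (the literal reading of the specification described above for this edge case) and once against `/run/boot-loader-entries` (the behaviour attributed to systemd above). The helper `resolve_entry_path` and the sample value are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_entry_path(value: str, root: str) -> PurePosixPath:
    """Resolve a loader entry `linux`/`initrd` value against a chosen root."""
    # Strip the leading slash so the join stays below `root` instead of
    # replacing it.
    return PurePosixPath(root) / value.lstrip("/")


# Hypothetical value of a `linux` key in a loader entry.
LINUX_VALUE = "/kernels/example-bzImage"

if __name__ == "__main__":
    print(resolve_entry_path(LINUX_VALUE, "/"))
    print(resolve_entry_path(LINUX_VALUE, "/run/boot-loader-entries"))
```

The two results name different files unless something makes them coincide, which is the gap the symlinks under `$bootPath/kernels` are meant to bridge.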

Contributor

I understand the reasoning for this. It's still highly suspect, and should probably not be considered proper BLS.

And yes, it does have to do with copyKernels. Try it out. I did. If you make an EFI system with /boot on the root fs and efiSysMountPoint = "/efi"; or something similar, this will produce a broken result that does not work. The BLS files will point to files that should be in the boot tree, but are not, because you disabled creating these symlinks in the boot tree for EFI systems.

IMO the feature simply shouldn't work without copyKernels enabled, because like I said, these symlinks don't make sense in the context of BLS. The fact that it doesn't work with copyKernels disabled on EFI systems is just something else that's wrong.


environment.systemPackages = mkIf (grub != null) [ grub ];

# Link /boot under /run/boot-loder-entries to make
Contributor

typo: loder -> laoder

Member

typo: laoder -> loader

@rnhmjoj
Contributor Author

rnhmjoj commented Mar 1, 2025

Finally, @rnhmjoj, this PR touches a system critical component. It had been inactive for years. It hadn't been reviewed in years. And you didn't request any reviews when you recently restarted work on it. Yet you self-merged it anyway. That is out of line and IMO grounds for removal of the commit bit.

You're right. If I have to defend myself, I've become increasingly frustrated with how slow NixOS is moving lately, regarding both pushing new features and getting feedback. It seems the only way to get some comments is by accidentally breaking someone's setup...

@ElvishJerricco
Contributor

It seems the only way to get some comments is by accidentally breaking someone's setup...

No. You get feedback by requesting review from the relevant people. The systemd team would have been a start, in this case, since the purpose of this PR is to work around a systemd problem.

@rnhmjoj
Contributor Author

rnhmjoj commented Mar 1, 2025

The systemd team would have been a start, in this case, since the purpose of this PR is to work around a systemd problem.

I just didn't think of this work as systemd-related: the workaround is a relatively minor part. I would have appreciated feedback from people using grub, but it's officially unmaintained.

@emilazy
Member

emilazy commented Mar 1, 2025

“Unmaintained package” isn’t the same thing as “free‐for‐all for self‐merges two years after anyone last looked at the PR”…

This is the third or fourth time in the past few months that you’ve broken things by self‐merging PRs without anyone giving approval or even substantive review? Everyone without the commit bit has to use the review request room and thread. I don’t see how this is anything but a continued pattern of reckless abuse of committer privileges.

Labels

6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: changelog This PR adds or changes release notes 8.has: documentation This PR adds or changes documentation 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux.

9 participants