systems/architecture: bump default architecture to x86-64-v2 #202526

Closed


SuperSandro2000
Member

@SuperSandro2000 SuperSandro2000 commented Nov 23, 2022

Compile everything by default with SSE4.2 to gain free performance.
This also removes support for any CPU older than Nehalem.
Nehalem was chosen because the bootstrap gcc is too old to know Westmere.

TODO:

  • Do we want to enable this?
  • Do all the hydra builders support this?
  • Write a changelog entry
Description of changes
Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.11 Release Notes (or backporting 22.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

@SuperSandro2000
Member Author

I'm marking this as a draft so that no one merges it by accident, but I still want feedback and discussion on this.

@K900
Contributor

K900 commented Nov 23, 2022

RHEL is on x86_64-v2 by default now: https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-linux-9-for-the-x86-64-v2-microarchitecture-level

@K900
Contributor

K900 commented Nov 23, 2022

Also we should probably add the x86_64-vX feature levels instead of using uarch names, because we want the featureset without the uarch specific codegen.

@NickCao
Member

NickCao commented Nov 23, 2022

The baseline is mostly defined by the gcc-N package, which is configured to produce baseline binaries when options like -march= are not used. - https://wiki.debian.org/ArchitectureSpecificsMemo#Architecture_baselines

Is that what this change does?

@thiagokokada
Contributor

x86-64-v2 (this PR) should be safe-ish to enable and brings a good performance boost.

I think Arch uses x86-64-v3 (haswell), but maybe this is too much.

@SamLukeYes
Member

> x86-64-v2 (this PR) should be safe-ish to enable and brings a good performance boost.
>
> I think Arch uses x86-64-v3 (haswell), but maybe this is too much.

AFAIK, most Arch packages are still using x86-64, though an x86_64_v3 architecture was added to devtools. The RFC is to keep both x86_64 and x86_64_v3 packages, instead of bumping the default architecture.

@SuperSandro2000
Member Author

SuperSandro2000 commented Nov 27, 2022

> Also we should probably add the x86_64-vX feature levels instead of using uarch names, because we want the featureset without the uarch specific codegen.

I added a suggestion from me to this PR.

> Is that what this change does?

Yes, on first look.

> x86_64_v3 packages, instead of bumping the default architecture.

We are currently not doing this because it would be equivalent to adding a new architecture: we would build everything twice.


I am also currently (trying) to build my systems with haswell/skylake on my private hydra to see how things go. I'll probably upstream some fixes along the way, too. The first broken package I encountered was libxcrypt, which is already fixed on staging. Also, when bumping to skylake, I couldn't run the jemalloc tests on a znver3 machine.

@Tungsten842
Member

Tungsten842 commented Nov 28, 2022

Some more cpu features should be added: POPCNT, LAHF-SAHF, CMPXCHG16B...
https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels
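
To make the level definitions concrete, here is a minimal sketch that maps a set of CPU flags to the highest x86-64 microarchitecture level it satisfies. The flag spellings are an assumption: they follow what Linux prints in /proc/cpuinfo (for example `pni` for SSE3, `lahf_lm` for LAHF-SAHF, `cx16` for CMPXCHG16B, `abm` for LZCNT), and `xsave` is used as a stand-in for OSXSAVE. The function and variable names are illustrative only.

```python
# Required flags per x86-64 microarchitecture level, in /proc/cpuinfo spelling
# (assumed naming; other operating systems expose these differently).
LEVEL_FLAGS = [
    (1, {"lm", "cmov", "cx8", "fpu", "fxsr", "mmx", "syscall", "sse2"}),
    (2, {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    (3, {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    (4, {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]


def x86_64_level(flags):
    """Return the highest level N such that levels 1..N are all satisfied."""
    available = set(flags)
    level = 0
    for candidate, required in LEVEL_FLAGS:
        if not required <= available:
            break  # levels are cumulative, so stop at the first gap
        level = candidate
    return level


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flag set of the first CPU listed (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On a Linux machine, `x86_64_level(read_cpu_flags())` would report the level; a result of 1 means the CPU would be excluded by a v2 baseline.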

@thiagokokada thiagokokada changed the title systems/architecture: bump default architecture to westmere systems/architecture: bump default architecture to x86_64-v2 Nov 28, 2022
Contributor

@thiagokokada thiagokokada left a comment


LGTM.

IMO, we should merge this ASAP to get early feedback on what it is going to break.

@Flakebi
Member

Flakebi commented Jun 14, 2023

To clarify, I did compare the runtime performance of an application compiled with gcc and znver3.
That application happened to be clang (because compilation is where I care most about performance). The flags I passed to clang to compile SuperTuxKart were the same in both cases I compared.

I was hoping that any improvement I see with znver3 is at least as good as x86-64-v2, which should be a subset/more general optimizations.
Apparently, clang is not a workload that benefits from that.

@K900
Contributor

K900 commented Jun 14, 2023

The difference between v2 and higher feature levels is mostly vector extensions, so it's to be expected a compiler (which is mostly pointer chasing) doesn't gain much here.

@oxalica
Contributor

oxalica commented Jul 28, 2023

> The difference between v2 and higher feature levels is mostly vector extensions, so it's to be expected a compiler (which is mostly pointer chasing) doesn't gain much here.

I want to note that v3 DOES have some non-vector ops that benefit compilers, including the three-operand BMI2 shifts SHLX, SHRX and SARX (plus the rotate RORX). Since they have a separate destination register and, for the shifts, do not require the count to be in CL, they can reduce register pressure and MOVs, especially when the shift count is not a constant. Bit shifts are heavily used everywhere, including time-sensitive hashing algorithms.
SHRX is also relied on by a 1-cycle-per-byte shift-based DFA algorithm, which can be used for lexing, UTF-8 validation, or any other small-state DFA.
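
For readers unfamiliar with the shift-based DFA mentioned above, here is a minimal sketch of the idea in Python. The DFA itself (accepting non-empty ASCII decimal strings) and all names are hypothetical and chosen only for illustration. Each state is represented by a bit offset into a 64-bit table entry, so a transition is one table load, one variable-count right shift, and one mask; in a compiled language targeting x86-64-v3, that variable shift is the kind of operation SHRX would typically serve.

```python
# States are bit offsets (multiples of 6) into a 64-bit word, not plain indices.
START, DIGITS, ERROR = 0, 6, 12


def pack(transitions):
    """Pack a {from_offset: to_offset} mapping into one 64-bit table entry."""
    word = 0
    for src, dst in transitions.items():
        word |= dst << src  # the 6 bits at offset `src` hold the next offset
    return word


# One row per input class: what each state becomes on a digit or on anything else.
DIGIT_ROW = pack({START: DIGITS, DIGITS: DIGITS, ERROR: ERROR})
OTHER_ROW = pack({START: ERROR, DIGITS: ERROR, ERROR: ERROR})

# 256-entry table indexed by input byte (0x30..0x39 are ASCII '0'..'9').
TABLE = [DIGIT_ROW if 0x30 <= b <= 0x39 else OTHER_ROW for b in range(256)]


def is_decimal(data: bytes) -> bool:
    """Run the DFA over `data`; accept only if it ends in the DIGITS state."""
    state = START
    for byte in data:
        # The whole transition: load, shift by the current state, mask 6 bits.
        state = (TABLE[byte] >> state) & 63
    return state == DIGITS
```

With 6 bits per slot, a 64-bit entry has room for up to ten states, which is why the technique suits small-state automata.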

@peterhoeg
Member

Here are some benchmarks from arch where the person running it kindly did a tl;dr which I am pasting here:

- there is no or negligible performance benefit of *-march=nehalem*, which corresponds to x86_64-v2,

- there is a moderate benefit of *-march=haswell* (x86_64-v3) - of around 10%-20% as compared to baseline for the tests performed

Link: https://lists.archlinux.org/pipermail/arch-general/2021-March/048739.html

I know it's a sample size of 1, but I would argue that the burden of proof with regards to demonstrating any benefit now lies with those proposing this change. The argument that "well, everyone else is doing it" doesn't carry much weight in this.

@peterhoeg
Member

One thing I forgot (apologies for the comment spam): if we do start looking at potential benefits, there are two places where we stand to gain something from this (full disclosure: I'm very much not a compiler guy).

  1. either the act of building software is faster because the compilers benefit from this (use less energy, can get critical fixes shipped faster, can build more with less and so on), or
  2. the resulting software runs faster (or uses fewer resources) - who doesn't like free speed?

I would imagine, that most of the software that really stands to benefit already has optimizations for various cpus/cpu features where we would gain nothing from fiddling with the baseline flags.

So could we compile variants of the compilers with a better baseline and use that when available? This way there would be no sudden breakage for end users?

@vcunat
Member

vcunat commented Jul 29, 2023

You can't do this "when available" that easily. What would people with older computers do? Impurely switch their compiler variant? Or would we compile everything twice (e.g. base + -v3)?

@RaitoBezarius
Member

> Here are some benchmarks from arch where the person running it kindly did a tl;dr which I am pasting here:
>
> - there is no or negligible performance benefit of *-march=nehalem*, which corresponds to x86_64-v2,
>
> - there is a moderate benefit of *-march=haswell* (x86_64-v3) - of around 10%-20% as compared to baseline for the tests performed
>
> Link: lists.archlinux.org/pipermail/arch-general/2021-March/048739.html
>
> I know it's a sample size of 1, but I would argue that the burden of proof with regards to demonstrating any benefit now lies with those proposing this change. The argument that "well, everyone else is doing it" doesn't carry much weight in this.

This was mentioned in #202526 (comment) :).
It is not sufficient though.

> I would imagine, that most of the software that really stands to benefit already has optimizations for various cpus/cpu features where we would gain nothing from fiddling with the baseline flags.
>
> So could we compile variants of the compilers with a better baseline and use that when available? This way there would be no sudden breakage for end users?

Someone needs to do the rigorous work of looking into which packages benefit the most currently, see Guix prior art on that.

@peterhoeg
Member

peterhoeg commented Jul 29, 2023 via email

@ghost

ghost commented Jul 31, 2023

> (full disclosure: I'm very much not a compiler guy).

There's a bit of a self-selection problem here; the people who really care about this stuff are already building everything themselves. I build with -march= and -mcpu= set to exactly what -march=native and -mcpu=native would choose, on four different architectures.

But my nix is patched to remove the Hydra key, so I don't really care what cache.nixos.org does.

@kjeremy
Contributor

kjeremy commented Aug 28, 2023

It looks like CentOS is investigating what a jump to v3 might look like: https://blog.centos.org/2023/08/centos-isa-sig-performance-investigation/

@sergv
Contributor

sergv commented Aug 28, 2023

I'd be interested to revisit the reasons other distros, e.g. Red Hat, had for switching to x86-64-v2. Do they apply in NixOS's case? Or are they completely inapplicable to NixOS? It's a "do it because everyone else is doing it" kind of argument, but my point is: why is everyone else doing it? They face dropping support for old hardware just like NixOS does, but they still decide to go ahead. What makes NixOS's situation different from everyone else's?

@RaitoBezarius
Member

RaitoBezarius commented Aug 28, 2023 via email

@vcunat
Member

vcunat commented Aug 29, 2023

AFAIK the other distros don't do it without a replacement and still also provide binaries for older CPUs. This PR's proposal didn't do that. (And actually I don't think this is worth doubling build and storage costs for the x86 infra parts.)

@nyabinary
Contributor

https://www.phoronix.com/news/CentOS-ISA-Experiment-Perform
CentOS is looking at v3 baseline

@K900
Contributor

K900 commented Sep 5, 2023

As I said on Matrix, CentOS can afford to do that because CentOS has very long release cycles, so people on CentOS N-1 will remain supported for many years to come. For NixOS, the switchover would take at most 6 months, which is not nearly enough time to cut off support for all pre-2015 hardware.

@lucasew
Contributor

lucasew commented Sep 6, 2023

Isn't there some kind of compiler flag to automatically create variants that use these features and decide at runtime which function set will be used?

The examples cited in [1] use explicit definitions, but is there something that can automatically ramp up to more recent instructions?

For example, a math heavy library could generate more than one variant of loop heavy objects and some kind of logic to decide at runtime using the CPUID instruction which code variant will run.

IDK tbh if this is possible in single file binaries. The common approach for BLAS-like libraries for example is to generate a dynamic library for each variant then decide at runtime which one to dlopen.

[1] https://lwn.net/Articles/691932/
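
The runtime-selection idea described above can be sketched as follows. This is a hedged illustration in Python rather than the compiler mechanism itself (GCC's function multiversioning, for instance via the `target_clones` attribute, does the equivalent at the machine-code level); the function names, the variant table and the flag names are hypothetical. The point is only the shape of the logic: query CPU features once, then bind the call site to the best variant that the CPU can run.

```python
def select_variant(cpu_flags, variants):
    """Return the first variant whose required flags are all available.

    `variants` is ordered from most to least specialised; the last entry
    should require nothing so that a baseline fallback always exists.
    """
    available = set(cpu_flags)
    for required, func in variants:
        if set(required) <= available:
            return func
    raise RuntimeError("no usable variant; add a baseline fallback")


def dot_baseline(a, b):
    """Portable implementation that runs on any CPU."""
    return sum(x * y for x, y in zip(a, b))


def dot_avx2(a, b):
    """Stand-in for a build of the same routine that assumes AVX2 and FMA."""
    return sum(x * y for x, y in zip(a, b))


# Most specialised first, baseline last.
DOT_VARIANTS = [
    ({"avx2", "fma"}, dot_avx2),
    (set(), dot_baseline),
]
```

A program would call `select_variant` once at startup (the analogue of an ifunc resolver or of choosing which shared library to dlopen) and then use the returned function everywhere, so users on older CPUs keep working while newer CPUs get the faster path.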

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/pre-rfc-moving-nixos-x86-64-baseline-to-x86-64-v3/35924/2

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/pre-rfc-gradual-transition-of-nixos-x86-64-baseline-to-x86-64-v3-with-an-intermediate-step-to-x86-64-v2/35924/1

@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 19, 2024
@LDprg
Contributor

LDprg commented Apr 30, 2024

Any news on this? Would really like to see a v2 baseline for NixOS.

@Atemu
Member

Atemu commented Apr 30, 2024

See the discourse thread above.

To summarise:

To our current knowledge, there is no good data that proves a benefit from bumping the generic target to x86_64-v2. There might be a benefit but we don't actually know whether it exists or not. If it exists, it will be rather small though. (Note that there is a decent amount of bad data on this.)

There is no good data on how many users a bump beyond x86_64-v1 would exclude either. Anecdata suggests that there are at least some.

There are some packages where even the flawed existing data shows a significant benefit from march tuning beyond a reasonable doubt. glibc-hwcaps could be leveraged to enable such march tuning for those specific packages without excluding any users.

IMHO, a generic bump such as the one in this PR will require an RFC. Perhaps we should close this to signal that a bit more clearly.
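
As a rough illustration of how glibc-hwcaps avoids excluding anyone: the dynamic loader looks for optimised builds of a library in `glibc-hwcaps/x86-64-vN` subdirectories of each library directory before falling back to the baseline build. The sketch below only models that search order; the helper name, the example paths and the integer `cpu_level` parameter are assumptions made for illustration, not part of glibc or nixpkgs.

```python
import posixpath

# Subdirectory names, most demanding level first.
HWCAPS_LEVELS = ["x86-64-v4", "x86-64-v3", "x86-64-v2"]


def hwcaps_candidates(libdir, soname, cpu_level):
    """List the paths to try, in order, for a CPU at x86-64-v<cpu_level>."""
    # Only levels the CPU actually supports are eligible.
    usable = [name for name in HWCAPS_LEVELS if int(name[-1]) <= cpu_level]
    paths = [posixpath.join(libdir, "glibc-hwcaps", name, soname) for name in usable]
    # The baseline build is always the last resort, so old CPUs still work.
    paths.append(posixpath.join(libdir, soname))
    return paths
```

A package that measurably benefits from tuning could ship a v3 build alongside its baseline build, and only that package would grow in size.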

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Apr 30, 2024
@LDprg
Contributor

LDprg commented Apr 30, 2024

What about duplicated NixOS channels (at least for stable), where one is the generic one and the other is compiled with x86-64-v3? This wouldn't deprecate any platform, and it is also the way CachyOS handles this.

The problem with selectively compiling with x86-64-v3 only the packages that would improve the most is that if I, for example, compile gcc, this will trigger a recompilation of my whole system. I did not find a way around this problem.

Maybe there should be a flag to prevent such a rebuild at the cost of reproducibility?

@K900
Contributor

K900 commented Apr 30, 2024

We do not have the hardware to build multiple channels. You can use system.replaceRuntimeDependencies if you really want to. But x86_64-v3 also does not improve things much, and we can likely get very close in terms of performance with some dynamic dispatch.

@K900 K900 closed this Apr 30, 2024
@SuperSandro2000 SuperSandro2000 deleted the architecture-sse42-avx branch April 30, 2024 07:42