Proposal: Function multi-versioning #1018
Comments
This use case is in scope of Zig. Whether we use LLVM IFuncs or something else is still up for research, as well as the syntax and how it fits in with comptime and other features. |
Ifuncs only work on ELF platforms (and not all of them either) so Zig would need a fallback for platforms where they're not supported, such as Windows, FreeBSD < 12, etc. I'd attack this bottom-up: start with a generic solution and later on add ifuncs as a compiler optimization. |
Using hidden function pointers set by the compiler for individual functions would hurt cache friendliness and destroy the potential for inlining. |
Edit: Never mind. Thanks for clarifying @tiehuis! The way this is solved in Go is with a build tag at the file level. When the build runs, some build properties are defined depending on the platform and optional build arguments. The entire file is then either included or excluded depending on its build tags. Build tags such as This is simple to use, extensible with custom tags, and forces the user to put all platform-specific stuff in files having a tag immediately at the top, so there's no custom build stuff intermixed with normal code. |
@binary132 That is different from the issue here, since it is a compile-time selection. This is possible in Zig right now by using standard if statements with comptime values. This issue is about function selection at runtime, based on runtime CPU feature detection. This is useful for generating a single portable binary that runs on different CPU microarchitectures (e.g. Core 2 Duo vs. i7 Skylake), while still allowing CPU-specific performance optimizations for hot functions. |
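To make the compile-time vs. runtime distinction concrete, here is a minimal C sketch of runtime selection done by hand. It is not from the thread: the names `sum`, `sum_generic`, `sum_avx2` and `select_sum` are hypothetical, and it assumes GCC or Clang on x86 for `__builtin_cpu_supports` and the `target` attribute (other targets fall back to the generic version).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Baseline version: compiled for whatever the build target guarantees. */
static int64_t sum_generic(const int32_t *xs, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += xs[i];
    return total;
}

#if HAVE_X86_DISPATCH
/* Same source, but the compiler may emit AVX2 instructions in this body. */
__attribute__((target("avx2")))
static int64_t sum_avx2(const int32_t *xs, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += xs[i];
    return total;
}
#endif

typedef int64_t (*sum_fn)(const int32_t *, size_t);

/* Runtime selection: ask the CPU the program is actually running on. */
static sum_fn select_sum(void) {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_generic;
}

/* Hypothetical public entry point; resolves the pointer on first use. */
int64_t sum(const int32_t *xs, size_t n) {
    static sum_fn impl = NULL; /* not thread-safe; acceptable for a sketch */
    assert(xs != NULL || n == 0);
    if (impl == NULL) impl = select_sum();
    return impl(xs, n);
}
```

Every call to `sum` goes through a pointer, which is the indirection cost the rest of the thread is concerned about.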
Or when using Vulkan? 😏 |
I propose the following way to solve this, to make the code easier to read and write:
The compiler can then generate 3 versions of this function and generate the LLVM IFuncs based on that information. Solving the multi-versioning problem the way Go does (build tags, https://golang.org/pkg/go/build/#hdr-Build_Constraints ) makes the code difficult to read, write, and refactor. (Have I defined all the versions I need? Do I have to change the function name in every version? Where is the Linux version of that function?) |
I think the switch as specified by @bronze1man seems most in line with how Zig would work, given it uses all standard syntax. Ignoring the ifunc optimization, we first need CPU feature detection of some sort. Below is some example code that provides runtime CPU feature detection, to get a better idea of potential issues.
Example
Implementation
cpuid.c
// This is done separately for now since zig's multi-return inline asm was a pain.
#include <cpuid.h>
int cpuid(unsigned int leaf, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx)
{
return __get_cpuid(leaf, eax, ebx, ecx, edx);
}
cpu.zig
Notes
Architectures on which a feature cannot exist are known at compile time, which allows us to avoid compiling in incompatible branches. For the following example:
We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.
Links |
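As a rough illustration of what runtime detection on top of `<cpuid.h>` can look like, here is a C sketch that checks for AVX2. It is an assumption-laden example rather than the code from the comment above: `cpu_has_avx2` and `read_xcr0` are hypothetical names, and it assumes GCC or Clang on x86. Leaf 7 takes a subleaf in ECX, so the sketch uses `__get_cpuid_count`, and it also checks that the OS preserves YMM state before reporting AVX2 as usable.

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Read XCR0, which reports the register state the OS saves on context switch.
 * Only valid to call once CPUID has reported OSXSAVE. */
static uint64_t read_xcr0(void) {
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns 1 if AVX2 is usable on this machine, 0 otherwise. */
int cpu_has_avx2(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1: ECX bit 27 = OSXSAVE, ECX bit 28 = AVX. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28))) return 0;

    /* XCR0 bits 1 and 2: the OS preserves XMM and YMM state. */
    if ((read_xcr0() & 0x6) != 0x6) return 0;

    /* Leaf 7, subleaf 0: EBX bit 5 = AVX2. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (ebx & (1u << 5)) != 0;
}
#else
/* Not x86 (or an unknown compiler): the feature cannot be present. */
int cpu_has_avx2(void) { return 0; }
#endif
```

The `#else` branch mirrors the note above: on architectures where the feature cannot exist, the answer is known at compile time and no detection code is built.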
Does LLVM have any CPU feature detection capabilities we could leverage instead of rolling our own? |
yes, but it is x86-specific and we already have it in the zig tree as |
We should probably model this after Linux's |
See also the GCC
|
I think there should be a way to disable this at compile time. The use cases discussed so far assume you don't know the specifics of your target CPU. However, there are cases where you write code knowing exactly what hardware it will run on. If the standard library starts using function multi-versioning, yet you know at compile time that you will only need one specific version, you should be able to select that version at compile time and make sure that no others make it into the binary. Otherwise, you'll waste executable size (and potentially incur cache penalties if this is implemented via pointers). |
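One way such an opt-out could look when done by hand is sketched below in C. Everything here is hypothetical: the `FMV_FIXED_VERSION` build switch, the `count_zeros*` functions, and the `runtime_prefers_wide` probe (a stub standing in for real CPU detection). When the switch is defined, only the chosen version is compiled and it is called directly, with no pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build switch: pass e.g. -DFMV_FIXED_VERSION=FMV_WIDE to pin
 * one version at compile time. Leave it undefined for runtime dispatch. */
#define FMV_GENERIC 0
#define FMV_WIDE    1

#if defined(FMV_FIXED_VERSION) && FMV_FIXED_VERSION != FMV_GENERIC && FMV_FIXED_VERSION != FMV_WIDE
#error "unknown FMV_FIXED_VERSION"
#endif

#if !defined(FMV_FIXED_VERSION) || FMV_FIXED_VERSION == FMV_GENERIC
static size_t count_zeros_generic(const unsigned char *p, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) count += (p[i] == 0);
    return count;
}
#endif

#if !defined(FMV_FIXED_VERSION) || FMV_FIXED_VERSION == FMV_WIDE
/* Stand-in for a CPU-specific version: processes four bytes per iteration. */
static size_t count_zeros_wide(const unsigned char *p, size_t n) {
    size_t count = 0, i = 0;
    for (; i + 4 <= n; i += 4)
        count += (p[i] == 0) + (p[i + 1] == 0) + (p[i + 2] == 0) + (p[i + 3] == 0);
    for (; i < n; i++) count += (p[i] == 0);
    return count;
}
#endif

#if defined(FMV_FIXED_VERSION) && FMV_FIXED_VERSION == FMV_WIDE
/* Pinned: direct call, and the generic version is not in the binary. */
size_t count_zeros(const unsigned char *p, size_t n) { return count_zeros_wide(p, n); }
#elif defined(FMV_FIXED_VERSION)
/* Pinned: direct call, and the wide version is not in the binary. */
size_t count_zeros(const unsigned char *p, size_t n) { return count_zeros_generic(p, n); }
#else
/* Hypothetical probe; a real one would query the CPU. */
static int runtime_prefers_wide(void) { return 0; }

/* Not pinned: both versions are compiled and one is chosen on first use. */
size_t count_zeros(const unsigned char *p, size_t n) {
    static size_t (*impl)(const unsigned char *, size_t) = NULL;
    assert(p != NULL || n == 0);
    if (impl == NULL) impl = runtime_prefers_wide() ? count_zeros_wide : count_zeros_generic;
    return impl(p, n);
}
#endif
```

A language-level feature could presumably do the same from the target CPU description, without a user-visible macro.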
This has been a big pain point for Bun.
Our current solution is to compile two versions of bun for Linux x64 and macOS x64. One which targets This unfortunately breaks when someone installs Bun without using the official install script that checks for this - like if using a package manager for a Linux distribution. Having to separate by CPU features makes it more difficult for people to distribute Bun. |
@Jarred-Sumner I'm interested in this issue too, but if you have certain specialized hot paths you can do runtime SIMD detection to avoid having to build two binaries. It's a bit of work but it isn't too bad. You have to abandon all of the Zig intrinsics and write raw assembly, but I've been doing that for some SIMD work and it's been going along fine. If you're trying to get the compiler to auto-vectorize into special instructions though, then yeah, this is a pain 😄 Ping me on Discord if you need help; I have some example code I can share with you. Edit: here is how simdjson does ISA detection: https://sourcegraph.com/github.com/simdjson/simdjson/-/blob/include/simdjson/internal/isadetection.h Pair something like that with a Zig dynamic dispatch interface (ptr + vtable) and it's pretty much just as fast if the interface is right. |
FYI zig already has cpuid code in the standard library zig/lib/std/zig/system/x86.zig Line 25 in 694d883
Though only the high-level function is pub.
|
Helpful! It's much faster to just pull out the bits you care about than to make a generalized "parse all features of |
When implemented, this feature should probably allow selecting based not only on CPU features, but on the CPU family too. An example of the utility of this would be how AMD's processors implemented the BMI2 PDEP/PEXT instructions in microcode until Zen 3. This made the instructions incredibly slow, and therefore unfit for purpose. A project would likely want to treat this case as-if the instructions weren't available at all. |
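A small C sketch of what selecting on CPU family, and not only on feature bits, might involve. The function names are hypothetical, and the policy encodes an assumption: that Zen 3 corresponds to AMD family 0x19, so AMD parts with a lower family number are treated as having slow PDEP/PEXT. The inputs would come from CPUID (vendor string from leaf 0, family from leaf 1 EAX, BMI2 from leaf 7 EBX bit 8); the sketch takes them as parameters so the policy stays separate from the detection.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode the CPU family from CPUID leaf 1 EAX: the extended family
 * (bits 27:20) is added only when the base family (bits 11:8) is 0xF. */
unsigned cpu_family(uint32_t leaf1_eax) {
    unsigned base = (leaf1_eax >> 8) & 0xF;
    unsigned ext = (leaf1_eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Policy: should the PDEP/PEXT code path be selected on this CPU? */
int use_pdep_path(const char *vendor, unsigned family, int has_bmi2) {
    assert(vendor != NULL);
    if (!has_bmi2) return 0;
    /* Assumption: AMD families before 0x19 implement PDEP/PEXT in microcode,
     * so treat the instructions as if they were unavailable there. */
    if (strcmp(vendor, "AuthenticAMD") == 0 && family < 0x19) return 0;
    return 1;
}
```

For example, `use_pdep_path("AuthenticAMD", 0x17, 1)` returns 0 even though the BMI2 feature bit is set.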
This is such an extremely obscure thing to want that I think it makes perfect sense to require implementing it manually via function pointers. |
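better yet,
const std = @import("std");
const builtin = @import("builtin");

pub fn someMathFunction(vec: Vector) Vector {
    if (comptime builtin.cpu.arch == .x86_64) {
        // Check the most capable feature set first; otherwise a target that
        // has both AVX2 and SSE 4.2 would always take the SSE 4.2 branch.
        if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .avx2)) {
            // ...
            // optimized for AVX2
            //
            return vec; // placeholder result
        }
        if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sse4_2)) {
            // ...
            // optimized for SSE 4.2
            //
            return vec; // placeholder result
        }
    }
    // ...
    // no asm/intrinsics optimization (also the path for non-x86 targets)
    //
    return vec; // placeholder result
} |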
For this original issue, this might work, but I just want to note that this won't work for the imo more useful and general case of trying to compile software for a In my use case (which I don't think this issue is trying to solve and function pointers are probably the way to go, being very clear!), I want to be able to build and package my software such that it works on any I end up writing something that looks like this:
/// This is purposely incomplete, just an example for comment.
const ISA = enum {
    generic,
    neon,
    avx2,

    /// Detect the ISA to use using compile-time information as well
    /// as runtime information (i.e. cpuid).
    pub fn detect() !ISA {
        return switch (builtin.cpu.arch) {
            // Neon is mandatory on aarch64. No runtime checks necessary.
            .aarch64 => .neon,

            // X86 we have to call out to cpuid
            // Note: I don't have .x86/.i386 here because there was a
            // recent change to the name of the enum. But, in general,
            // you'd have x86 here too.
            inline .x86_64 => detectX86(),

            // Unknown, assume generic
            else => .generic,
        };
    }

    fn detectX86() ISA {
        // Magic constants below come from Intel ISA Vol 2A 3-218
        const id = cpuid.initEx(7, 0);
        if (id.ebx & (@as(c_uint, 1) << 5) > 0) return .avx2;
        return .generic;
    }
}; |
To be clear @mitchellh your use case is exactly what this issue is intended to solve. Example would be swapping out memcpy on program startup to an implementation that takes full advantage of the CPU features detected at runtime. The main challenge is doing so in a way that does not cause function calls to become virtual and thereby compromise perf which is the main purpose of this feature. |
The reason this wouldn't be great is that users of this feature probably want at least some monomorphization of callers of multi-versioned functions. @andrewrk gave an example of memcpy, which should probably both be inlined into the caller and be a different implementation depending on the CPU. It's also fairly difficult to do it manually right now. Within your own code, you can pass around a comptime argument that specifies the level of ISA extensions to specialize to, but if you end up using libraries that take callbacks that do not pass back a user-provided comptime ctx, then you lose this information at library boundaries. edit: I guess this is actually doable by putting your callback function in a struct that has a member that specifies the level of extensions. But this will only get you specialization of your own SIMD kernels, not memcpy or similar. I spitballed that one could try implementing multi-versioning manually by mutating a comptime global (does this even work?) but that also breaks because comptime function evaluations are cached based on the values of their arguments, not the values of their arguments plus all the globals they read or could read. |
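The point about wanting callers specialized, and not just the leaf function, can be approximated by hand in C by moving the dispatch outward. The sketch below is only an illustration under assumptions: all names are hypothetical, and it assumes GCC or Clang on x86 for the `target` attribute and `__builtin_cpu_supports`. The small helper `mix` has no target attribute, so the compiler may inline it into each copy of the outer loop, and the feature check happens once per call to the outer function and never inside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Small hot helper. It carries no target attribute, so the compiler is free
 * to inline it into callers built for any feature level. */
static inline uint32_t mix(uint32_t h, uint32_t x) {
    return (h ^ x) * 16777619u;
}

/* The whole loop is stamped out once per feature level, so each copy can
 * inline mix() and be optimized for its own instruction set. */
#define DEFINE_HASH_ALL(name, attrs)                            \
    attrs static uint32_t name(const uint32_t *xs, size_t n) {  \
        uint32_t h = 2166136261u;                               \
        for (size_t i = 0; i < n; i++) h = mix(h, xs[i]);       \
        return h;                                               \
    }

DEFINE_HASH_ALL(hash_all_generic, )
#if HAVE_X86_DISPATCH
DEFINE_HASH_ALL(hash_all_avx2, __attribute__((target("avx2"))))
#endif

/* One feature check per call to the outer function, none inside the loop. */
uint32_t hash_all(const uint32_t *xs, size_t n) {
    assert(xs != NULL || n == 0);
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return hash_all_avx2(xs, n);
#endif
    return hash_all_generic(xs, n);
}
```

This only works when you control the outer function, which is the limitation described above: a library that calls back into your code sits between the dispatch point and the kernel, and the specialization is lost at that boundary.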
No, global mutable comptime memory is explicitly disallowed, as per #7396. |
Although this proposal deals with versioning functions based on CPU capabilities, I'd like to hear opinions on extending it to versions of the Zig compiler itself.
Zig is a rapidly changing language, and it is difficult to support code that compiled on a previous version and does not compile on the new one. Pros:
Cons:
|
A really interesting concept is function multi-versioning. The general idea is to support implementing multiple versions of a function for different hardware and having the correct version of the function selected at run time. Made up sample code:
There are ways to simulate this using function pointers, but the compiler would be better at optimizing this, plus implementing that over and over by hand would suck.
LLVM https://llvm.org/docs/LangRef.html#ifuncs
GCC https://lwn.net/Articles/691932/