Port __clzsi2 from compiler_rt #4195

mlarouche · 2020-01-16T04:25:30Z

It's required for using std.fmt.format on some ARM architecture such as ARM7TDMI in Thumb mode. Also fix regression with __aeabi_read_tp that only works with OS Linux

Tested with ZigGBA and ran zig build test-compiler-rt on Windows, Linux and qemu on Linux.

LemonBoy · 2020-01-16T08:03:28Z

lib/std/special/compiler_rt.zig

@@ -177,7 +179,9 @@ comptime {
        @export(@import("compiler_rt/arm.zig").__aeabi_memclr, .{ .name = "__aeabi_memclr4", .linkage = linkage });
        @export(@import("compiler_rt/arm.zig").__aeabi_memclr, .{ .name = "__aeabi_memclr8", .linkage = linkage });

-        @export(@import("compiler_rt/arm.zig").__aeabi_read_tp, .{ .name = "__aeabi_read_tp", .linkage = linkage });
+        if (builtin.os == .linux) {


There's nothing linux-specific here, what error message did you get? I suppose it's because the GBA cpu doesn't support the mrc opcode we're using here.

We only export that symbol under linux.

But I can't see anything about that symbol being exclusive to Linux in the specs.

LemonBoy · 2020-01-16T08:04:57Z

lib/std/special/compiler_rt/clzsi2.zig

+fn __clzsi2_generic(a: i32) callconv(.C) i32 {
+    @setRuntimeSafety(builtin.is_test);
+
+    var x: c_uint = @bitCast(c_uint, a);


Please use the iN/uN types as it makes it easier to reason about their effective size.

I tried to use them as first but it caused too many interger overflow, once I noted that __udivsi3 was using c_uint and c_int to avoid this problem and be closer to the C original source

Don't sweep the overflow problems under the rug! Bitcast the i32 argument to u32 and use it.

LemonBoy · 2020-01-16T08:05:50Z

lib/std/special/compiler_rt/clzsi2.zig

+    @setRuntimeSafety(builtin.is_test);
+
+    var x: c_uint = @bitCast(c_uint, a);
+    var t: c_int = @intCast(c_int, @boolToInt((x & 0xFFFF0000) == 0)) << 4; // if (x is small) t = 16 else 0


Minor nit, can you please put the comments on their own line if it becomes too long?

I can, but this is how it was in the original C code

But this is not the original C code :)

LemonBoy · 2020-01-16T08:15:31Z

lib/std/special/compiler_rt/clzsi2.zig

+}
+
+fn __clzsi2_arm_clz(a: i32) callconv(.C) i32 {
+    return asm volatile(


If you aim for efficiency you probably want something like

fn __clzsi2_arm_clz() callconv(.Naked) void { asm volatile( \\ clz r0, r0 \\ bx lr ); unreachable; }

This way you don't have any frame setup/teardown overhead and no redundant input/output/clobbers in the code (remember that the final set of clobbered registers is the union of the in/out/clobber indications).

I tried that but the test suite would fail on ARM platforms because you can't call a naked function from a non-naked function

~~Was the problem in the test code?~~
I can't read

you can't call a naked function from a non-naked function

You have to @ptrCast the function to a different calling convention first. If the assembly code in your naked function matches one of the calling conventions then it is safe to cast the function to that and call it.

LemonBoy · 2020-01-16T08:28:42Z

lib/std/special/compiler_rt/clzsi2.zig

+    );
+}
+
+fn __clzsi2_arm32(a: i32) callconv(.C) i32 {


Ditto as before WRT the use of the asm block.

This is the first builtin to have asm-optimized implementations, are you sure the generic Zig version doesn't compile down to something sensible (I'd say that since you're targeting GBA you're interested in code size) ?
The two algorithms are exactly the same with the exception of the the generic one being branch-less, and I guess LLVM should do a good job at optimizing it out.

Likewise, this is how it was in LLVM compiler-rt code, I did try implementing a generic code closer to the ARM non-clz version but I think the goal is to be closer to the original compiler-rt source

LemonBoy · 2020-01-16T08:32:55Z

lib/std/special/compiler_rt/clzsi2.zig

+}
+
+const can_use_arm_clz = switch (builtin.arch) {
+    .arm => |sub_arch| switch (sub_arch) {


You can merge the two branches as .arm, .armeb

See here (search for "ACLE 6.4.5 CLZ"), it seems that v6m doesn't support clz.

LemonBoy · 2020-01-16T08:35:42Z

lib/std/special/compiler_rt/clzsi2.zig

+        .v4t => false,
+        else => true,
+    },
+    .thumb => |sub_arch| switch (sub_arch) {


You can merge the .thumb, .thumbeb branches.

Strictly speaking this is true for every revision that supports thumb2, but until #3927 lands we cannot simplify this :(

Done and yeah can't wait for #3927 to be merged

LemonBoy · 2020-01-16T15:22:29Z

My counter-proposal here is to use something simpe & generic such as

export fn my_clz(i: i32) i32 {
    @setRuntimeSafety(false);
    var x = @bitCast(u32, i);
    var n: i32 = 32;

    // Count first bit set using binary search, from Hacker's Delight
    var y = x >> 16;
    if (y != 0) {
        n = n - 16;
        x = y;
    }
    y = x >> 8;
    if (y != 0) {
        n = n - 8;
        x = y;
    }
    y = x >> 4;
    if (y != 0) {
        n = n - 4;
        x = y;
    }
    y = x >> 2;
    if (y != 0) {
        n = n - 2;
        x = y;
    }
    y = x >> 1;
    if (y != 0) {
        n = n - 1;
        x = y;
    }

    return n - @bitCast(i32, x);
}

This produces something sensible for ARM32 (with --release-fast, the generated ASM is hidden under the spoiler) and keep the "asm-accelerated" clz implementation even though this is a first in compiler-rt (and LLVM should be smart enough to emit that opcode directly instead of doing a libcall).

00000000 my_clz:
       0: 00 10 a0 e3                  	mov	r1, #0
       4: 20 08 51 e1                  	cmp	r1, r0, lsr #16
       8: 10 20 a0 e3                  	mov	r2, #16
       c: 20 08 a0 11                  	lsrne	r0, r0, #16
      10: 20 20 00 03                  	movweq	r2, #32
      14: 20 04 51 e1                  	cmp	r1, r0, lsr #8
      18: 20 04 a0 11                  	lsrne	r0, r0, #8
      1c: 08 20 42 12                  	subne	r2, r2, #8
      20: 20 02 51 e1                  	cmp	r1, r0, lsr #4
      24: 20 02 a0 11                  	lsrne	r0, r0, #4
      28: 04 20 42 12                  	subne	r2, r2, #4
      2c: 20 01 51 e1                  	cmp	r1, r0, lsr #2
      30: 20 01 a0 11                  	lsrne	r0, r0, #2
      34: 02 20 42 12                  	subne	r2, r2, #2
      38: a0 00 51 e1                  	cmp	r1, r0, lsr #1
      3c: 01 20 42 12                  	subne	r2, r2, #1
      40: a0 00 a0 11                  	lsrne	r0, r0, #1
      44: 00 00 42 e0                  	sub	r0, r2, r0
      48: 1e ff 2f e1                  	bx	lr

mlarouche · 2020-01-16T15:28:59Z

export fn my_clz(i: i32) i32 {
    @setRuntimeSafety(false);
    var x = @bitCast(u32, i);
    var n: i32 = 32;

    // Count first bit set using binary search, from Hacker's Delight
    var y = x >> 16;
    if (y != 0) {
        n = n - 16;
        x = y;
    }
    y = x >> 8;
    if (y != 0) {
        n = n - 8;
        x = y;
    }
    y = x >> 4;
    if (y != 0) {
        n = n - 4;
        x = y;
    }
    y = x >> 2;
    if (y != 0) {
        n = n - 2;
        x = y;
    }
    y = x >> 1;
    if (y != 0) {
        n = n - 1;
        x = y;
    }

    return n - @bitCast(i32, x);
}

Even better

export fn my_clz(i: i32) i32 {
    @setRuntimeSafety(false);
    var x = @bitCast(u32, i);
    var n: i32 = 32;

    // Count first bit set using binary search, from Hacker's Delight
    var y: u32 = 0;
    inline for ([_] i32 { 16, 8, 4, 2, 1}) |shift| {
        y = x >> shift;
        if (y != 0) {
            n = n - shift;
            x = y;
        }
    }

    return n - @bitCast(i32, x);
}

LemonBoy · 2020-01-16T15:31:04Z

Oh that's even nicer, I keep forgetting about inline for :)

andrewrk · 2020-01-16T19:20:15Z

When @LemonBoy gives this the 👌 then I'll hit merge

…me ARM architecture.

LemonBoy

LGTM, I only have a few nits that I can address in a PR later:

__clzsi2_arm32 is not needed as __clzsi2_generic compiles down to something very similar
After some digging in the LLVM code __clzsi2_arm_clz won't be ever called as __clzsi2 is only called if 1) the target is ARM < 5 2) if the target only supports the Thumb1 instruction set and in both cases the clz opcode is not available.

daurnimator added the arch-arm 32-bit ARM label Jan 16, 2020

LemonBoy reviewed Jan 16, 2020

View reviewed changes

mlarouche force-pushed the add_clzsi2_compiler_rt branch from ccbdff2 to 969fd01 Compare January 16, 2020 13:19

Port clzsi2 from compiler_rt, required for using std.fmt.format on so…

d563cf7

…me ARM architecture.

mlarouche force-pushed the add_clzsi2_compiler_rt branch from 969fd01 to d563cf7 Compare January 17, 2020 00:52

LemonBoy approved these changes Jan 17, 2020

View reviewed changes

andrewrk merged commit d9be6e5 into ziglang:master Jan 17, 2020

mlarouche deleted the add_clzsi2_compiler_rt branch February 8, 2020 18:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port __clzsi2 from compiler_rt #4195

Port __clzsi2 from compiler_rt #4195

mlarouche commented Jan 16, 2020

LemonBoy Jan 16, 2020

daurnimator Jan 16, 2020

LemonBoy Jan 16, 2020

LemonBoy Jan 16, 2020

mlarouche Jan 16, 2020

LemonBoy Jan 16, 2020

LemonBoy Jan 16, 2020

mlarouche Jan 16, 2020

LemonBoy Jan 16, 2020

LemonBoy Jan 16, 2020

mlarouche Jan 16, 2020 •

edited

Loading

LemonBoy Jan 16, 2020 •

edited

Loading

andrewrk Jan 16, 2020

LemonBoy Jan 16, 2020

mlarouche Jan 16, 2020

LemonBoy Jan 16, 2020

LemonBoy Jan 16, 2020

mlarouche Jan 16, 2020

LemonBoy Jan 16, 2020

LemonBoy Jan 16, 2020

mlarouche Jan 16, 2020 •

edited

Loading

LemonBoy commented Jan 16, 2020

mlarouche commented Jan 16, 2020 •

edited

Loading

LemonBoy commented Jan 16, 2020

andrewrk commented Jan 16, 2020

LemonBoy left a comment

Port __clzsi2 from compiler_rt #4195

Port __clzsi2 from compiler_rt #4195

Conversation

mlarouche commented Jan 16, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlarouche Jan 16, 2020 • edited Loading

Choose a reason for hiding this comment

LemonBoy Jan 16, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlarouche Jan 16, 2020 • edited Loading

Choose a reason for hiding this comment

LemonBoy commented Jan 16, 2020

mlarouche commented Jan 16, 2020 • edited Loading

LemonBoy commented Jan 16, 2020

andrewrk commented Jan 16, 2020

LemonBoy left a comment

Choose a reason for hiding this comment

mlarouche Jan 16, 2020 •

edited

Loading

LemonBoy Jan 16, 2020 •

edited

Loading

mlarouche Jan 16, 2020 •

edited

Loading

mlarouche commented Jan 16, 2020 •

edited

Loading