Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port __clzsi2 from compiler_rt #4195

Merged
merged 1 commit into from
Jan 17, 2020

Conversation

mlarouche
Copy link
Contributor

It's required for using std.fmt.format on some ARM architecture such as ARM7TDMI in Thumb mode. Also fix regression with __aeabi_read_tp that only works with OS Linux

Tested with ZigGBA and ran zig build test-compiler-rt on Windows, Linux and qemu on Linux.

@daurnimator daurnimator added the arch-arm 32-bit ARM label Jan 16, 2020
@@ -177,7 +179,9 @@ comptime {
@export(@import("compiler_rt/arm.zig").__aeabi_memclr, .{ .name = "__aeabi_memclr4", .linkage = linkage });
@export(@import("compiler_rt/arm.zig").__aeabi_memclr, .{ .name = "__aeabi_memclr8", .linkage = linkage });

@export(@import("compiler_rt/arm.zig").__aeabi_read_tp, .{ .name = "__aeabi_read_tp", .linkage = linkage });
if (builtin.os == .linux) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing linux-specific here, what error message did you get? I suppose it's because the GBA cpu doesn't support the mrc opcode we're using here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only export that symbol under linux.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I can't see anything about that symbol being exclusive to Linux in the specs.

fn __clzsi2_generic(a: i32) callconv(.C) i32 {
@setRuntimeSafety(builtin.is_test);

var x: c_uint = @bitCast(c_uint, a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the iN/uN types as it makes it easier to reason about their effective size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use them as first but it caused too many interger overflow, once I noted that __udivsi3 was using c_uint and c_int to avoid this problem and be closer to the C original source

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't sweep the overflow problems under the rug! Bitcast the i32 argument to u32 and use it.

@setRuntimeSafety(builtin.is_test);

var x: c_uint = @bitCast(c_uint, a);
var t: c_int = @intCast(c_int, @boolToInt((x & 0xFFFF0000) == 0)) << 4; // if (x is small) t = 16 else 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, can you please put the comments on their own line if it becomes too long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can, but this is how it was in the original C code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is not the original C code :)

}

fn __clzsi2_arm_clz(a: i32) callconv(.C) i32 {
return asm volatile(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you aim for efficiency you probably want something like

fn __clzsi2_arm_clz() callconv(.Naked) void {
    asm volatile(
        \\ clz r0, r0
        \\ bx lr
    );
    unreachable;
}

This way you don't have any frame setup/teardown overhead and no redundant input/output/clobbers in the code (remember that the final set of clobbered registers is the union of the in/out/clobber indications).

Copy link
Contributor Author

@mlarouche mlarouche Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that but the test suite would fail on ARM platforms because you can't call a naked function from a non-naked function

Copy link
Contributor

@LemonBoy LemonBoy Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the problem in the test code?
I can't read

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can't call a naked function from a non-naked function

You have to @ptrCast the function to a different calling convention first. If the assembly code in your naked function matches one of the calling conventions then it is safe to cast the function to that and call it.

);
}

fn __clzsi2_arm32(a: i32) callconv(.C) i32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto as before WRT the use of the asm block.

This is the first builtin to have asm-optimized implementations, are you sure the generic Zig version doesn't compile down to something sensible (I'd say that since you're targeting GBA you're interested in code size) ?
The two algorithms are exactly the same with the exception of the the generic one being branch-less, and I guess LLVM should do a good job at optimizing it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, this is how it was in LLVM compiler-rt code, I did try implementing a generic code closer to the ARM non-clz version but I think the goal is to be closer to the original compiler-rt source

}

const can_use_arm_clz = switch (builtin.arch) {
.arm => |sub_arch| switch (sub_arch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can merge the two branches as .arm, .armeb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here (search for "ACLE 6.4.5 CLZ"), it seems that v6m doesn't support clz.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.v4t => false,
else => true,
},
.thumb => |sub_arch| switch (sub_arch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can merge the .thumb, .thumbeb branches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking this is true for every revision that supports thumb2, but until #3927 lands we cannot simplify this :(

Copy link
Contributor Author

@mlarouche mlarouche Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and yeah can't wait for #3927 to be merged

@LemonBoy
Copy link
Contributor

My counter-proposal here is to use something simpe & generic such as

export fn my_clz(i: i32) i32 {
    @setRuntimeSafety(false);
    var x = @bitCast(u32, i);
    var n: i32 = 32;

    // Count first bit set using binary search, from Hacker's Delight
    var y = x >> 16;
    if (y != 0) {
        n = n - 16;
        x = y;
    }
    y = x >> 8;
    if (y != 0) {
        n = n - 8;
        x = y;
    }
    y = x >> 4;
    if (y != 0) {
        n = n - 4;
        x = y;
    }
    y = x >> 2;
    if (y != 0) {
        n = n - 2;
        x = y;
    }
    y = x >> 1;
    if (y != 0) {
        n = n - 1;
        x = y;
    }

    return n - @bitCast(i32, x);
}

This produces something sensible for ARM32 (with --release-fast, the generated ASM is hidden under the spoiler) and keep the "asm-accelerated" clz implementation even though this is a first in compiler-rt (and LLVM should be smart enough to emit that opcode directly instead of doing a libcall).

00000000 my_clz:
       0: 00 10 a0 e3                  	mov	r1, #0
       4: 20 08 51 e1                  	cmp	r1, r0, lsr #16
       8: 10 20 a0 e3                  	mov	r2, #16
       c: 20 08 a0 11                  	lsrne	r0, r0, #16
      10: 20 20 00 03                  	movweq	r2, #32
      14: 20 04 51 e1                  	cmp	r1, r0, lsr #8
      18: 20 04 a0 11                  	lsrne	r0, r0, #8
      1c: 08 20 42 12                  	subne	r2, r2, #8
      20: 20 02 51 e1                  	cmp	r1, r0, lsr #4
      24: 20 02 a0 11                  	lsrne	r0, r0, #4
      28: 04 20 42 12                  	subne	r2, r2, #4
      2c: 20 01 51 e1                  	cmp	r1, r0, lsr #2
      30: 20 01 a0 11                  	lsrne	r0, r0, #2
      34: 02 20 42 12                  	subne	r2, r2, #2
      38: a0 00 51 e1                  	cmp	r1, r0, lsr #1
      3c: 01 20 42 12                  	subne	r2, r2, #1
      40: a0 00 a0 11                  	lsrne	r0, r0, #1
      44: 00 00 42 e0                  	sub	r0, r2, r0
      48: 1e ff 2f e1                  	bx	lr

@mlarouche
Copy link
Contributor Author

mlarouche commented Jan 16, 2020

export fn my_clz(i: i32) i32 {
    @setRuntimeSafety(false);
    var x = @bitCast(u32, i);
    var n: i32 = 32;

    // Count first bit set using binary search, from Hacker's Delight
    var y = x >> 16;
    if (y != 0) {
        n = n - 16;
        x = y;
    }
    y = x >> 8;
    if (y != 0) {
        n = n - 8;
        x = y;
    }
    y = x >> 4;
    if (y != 0) {
        n = n - 4;
        x = y;
    }
    y = x >> 2;
    if (y != 0) {
        n = n - 2;
        x = y;
    }
    y = x >> 1;
    if (y != 0) {
        n = n - 1;
        x = y;
    }

    return n - @bitCast(i32, x);
}

Even better

export fn my_clz(i: i32) i32 {
    @setRuntimeSafety(false);
    var x = @bitCast(u32, i);
    var n: i32 = 32;

    // Count first bit set using binary search, from Hacker's Delight
    var y: u32 = 0;
    inline for ([_] i32 { 16, 8, 4, 2, 1}) |shift| {
        y = x >> shift;
        if (y != 0) {
            n = n - shift;
            x = y;
        }
    }

    return n - @bitCast(i32, x);
}

@LemonBoy
Copy link
Contributor

Oh that's even nicer, I keep forgetting about inline for :)

@andrewrk
Copy link
Member

When @LemonBoy gives this the 👌 then I'll hit merge

Copy link
Contributor

@LemonBoy LemonBoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I only have a few nits that I can address in a PR later:

  • __clzsi2_arm32 is not needed as __clzsi2_generic compiles down to something very similar
  • After some digging in the LLVM code __clzsi2_arm_clz won't be ever called as __clzsi2 is only called if 1) the target is ARM < 5 2) if the target only supports the Thumb1 instruction set and in both cases the clz opcode is not available.

@andrewrk andrewrk merged commit d9be6e5 into ziglang:master Jan 17, 2020
@mlarouche mlarouche deleted the add_clzsi2_compiler_rt branch February 8, 2020 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm 32-bit ARM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants