Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174

Closed
kjetilkjeka opened this issue May 19, 2022 · 9 comments
Labels
C-bug Category: This is a bug. O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html

Comments

@kjetilkjeka
Copy link
Contributor

kjetilkjeka commented May 19, 2022

I tried this code (compiling with rustc +nightly main_rs.rs --target nvptx64-nvidia-cuda --crate-type=cdylib --emit=asm):

#![feature(abi_ptx)]
#![no_std]

#[panic_handler]
unsafe fn breakpoint_panic_handler(_: &::core::panic::PanicInfo) -> ! {
    loop {}
    core::hint::unreachable_unchecked();
}

#[derive(Clone, Copy)]
pub struct Foo {
    a: u8,
    b: u8,
    c: u8,
}

#[inline(never)]
fn device(v: u8) -> Foo {
    Foo {
        a: v,
        b: v,
        c: v,
    }
}

#[no_mangle]
pub unsafe extern "ptx-kernel" fn kernel(input: *const u8, output: *mut Foo) {
    *output = device(input.read());
}

I expected to see this happen: A well formed .ptx file

Instead, this happened: The following error (from ptx linker)

LLVM ERROR: Cannot select: 0x7f8743a820d0: ch = NVPTXISD::StoreRetval<(store (s24), align 1)> 0x7f8743a82410, Constant:i32<0>, 0x7f8743a827b8
  0x7f8743a822d8: i32 = Constant<0>
  0x7f8743a827b8: i32 = or 0x7f8743a82a28, 0x7f8743a829c0
    0x7f8743a82a28: i32 = or 0x7f8743a82000, 0x7f8743a824e0
      0x7f8743a82000: i32 = shl 0x7f8743a82340, Constant:i32<8>
        0x7f8743a82340: i32,ch = load<(dereferenceable load (s8) from %ir.6 + 1), zext from i8> 0x7f8743a82618, 0x7f8743a82820, undef:i64
          0x7f8743a82820: i64 = or FrameIndex:i64<0>, Constant:i64<1>
            0x7f8743a82c98: i64 = FrameIndex<0>
            0x7f8743a82a90: i64 = Constant<1>
          0x7f8743a823a8: i64 = undef
        0x7f8743a826e8: i32 = Constant<8>
      0x7f8743a824e0: i32,ch = load<(dereferenceable load (s8) from %ir.6), zext from i8> 0x7f8743a82618, FrameIndex:i64<0>, undef:i64
        0x7f8743a82c98: i64 = FrameIndex<0>
        0x7f8743a823a8: i64 = undef
    0x7f8743a829c0: i32 = shl 0x7f8743a82478, Constant:i32<16>
      0x7f8743a82478: i32,ch = load<(dereferenceable load (s8) from %ir.6 + 2), anyext from i8> 0x7f8743a82618, 0x7f8743a82750, undef:i64
        0x7f8743a82750: i64 = or FrameIndex:i64<0>, Constant:i64<2>
          0x7f8743a82c98: i64 = FrameIndex<0>
          0x7f8743a82888: i64 = Constant<2>
        0x7f8743a823a8: i64 = undef
      0x7f8743a82c30: i32 = Constant<16>
In function: _ZN7main_rs6device17h13d51938dd622c57E

Meta

rustc --version --verbose:

rustc 1.63.0-nightly (4c5f6e627 2022-05-17)

Comment

It seems like there are problems with the NVPTXISD::StoreRetval being used with a s24. I assume the problem is that a s24 is not a valid type at all. If anyone knows where this problems originate, and if it's on the rustc or LLVM side I'm very thankful.

I will do some experiments with clang, possibly this weekend, to see how device functions are being called with such types.

@kjetilkjeka kjetilkjeka added the C-bug Category: This is a bug. label May 19, 2022
@kjetilkjeka
Copy link
Contributor Author

@rustbot label +O-NVPTX

@rustbot rustbot added the O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html label May 19, 2022
@kjetilkjeka
Copy link
Contributor Author

I did a test with the corresponding cuda c++ code compiled with clang into llvm-ir (using the commandclang-12 main_cpp2.cu -S -emit-llvm --cuda-gpu-arch=sm_52 --cuda-device-only -std=c++11)

#include <stdint.h>
#include <stdio.h>

struct foo {
    uint8_t a;
    uint8_t b;
    uint8_t c;
};

__attribute__((noinline)) __device__ struct foo device(uint8_t v) {
    struct foo s = {
        .a = v,
        .b = v,
        .c = v
    };
    return s;
}

extern "C" __global__  void kernel(struct foo* output, uint8_t const* input) {
    *output = device(*input);
}

A similarity between the llvm-ir is that both produces the struct as a type:

  • rustc: %Foo = type { i8, i8, i8 }
  • clang: %struct.foo = type { i8, i8, i8 }

A big difference is that clang returns the struct while rustc returns a i24 from the device function:

rustc:

; main_rs::device
; Function Attrs: noinline nounwind
define internal i24 @_ZN7main_rs6device17h13d51938dd622c57E(i8 %v) unnamed_addr #2 {
start:
  %0 = alloca %Foo, align 1
  %1 = bitcast %Foo* %0 to i8*
  store i8 %v, i8* %1, align 1
  %2 = getelementptr inbounds %Foo, %Foo* %0, i32 0, i32 1
  store i8 %v, i8* %2, align 1
  %3 = getelementptr inbounds %Foo, %Foo* %0, i32 0, i32 2
  store i8 %v, i8* %3, align 1
  %4 = bitcast %Foo* %0 to i24*
  %5 = load i24, i24* %4, align 1
  ret i24 %5
}

clang:

; Function Attrs: convergent noinline nounwind optnone mustprogress
define dso_local %struct.foo @_Z6deviceh(i8 zeroext %0) #0 {
  %2 = alloca %struct.foo, align 1
  %3 = alloca i8, align 1
  store i8 %0, i8* %3, align 1
  %4 = getelementptr inbounds %struct.foo, %struct.foo* %2, i32 0, i32 0
  %5 = load i8, i8* %3, align 1
  store i8 %5, i8* %4, align 1
  %6 = getelementptr inbounds %struct.foo, %struct.foo* %2, i32 0, i32 1
  %7 = load i8, i8* %3, align 1
  store i8 %7, i8* %6, align 1
  %8 = getelementptr inbounds %struct.foo, %struct.foo* %2, i32 0, i32 2
  %9 = load i8, i8* %3, align 1
  store i8 %9, i8* %8, align 1
  %10 = load %struct.foo, %struct.foo* %2, align 1
  ret %struct.foo %10
}

When compiling the equivalent rust code for x86_64-unknown-linux-gnu the i24 type is still used. It seems therefore like something related to nvptx and llvm is especially bad at dealing with "non power of two types". The "right" thing is most likely to fix this in LLVM.

Are there any reasons for not using this struct as the llvm return type in rustc? The bitcast that happens right before the return seems both deliberate and strange.

@kjetilkjeka
Copy link
Contributor Author

kjetilkjeka commented May 29, 2022

Opened an issue in llvm llvm/llvm-project#55764

@kjetilkjeka
Copy link
Contributor Author

Short status update: I'm looking into two alternative solutions.

The most proper one is to add a Legalizer pass in LLVM for the ptx target. This transforms illegal instructions into something that is guaranteed to be selectable. This is what makes i24 and friends supported on other LLVM targets.

The more hacky and "rustc-centric" solution is to add a field backend_quirk_requires_pow_of_two_regs to rustc_target::src::spec::TargetOptions that is false as default but set to true for nvptx64-nvidia-cuda. This flag must then be checked in rustc_target::abi::call::nvptx64::compute_abi_info() and this code must promote the register to next power of two must be done if it is set.

@kjetilkjeka
Copy link
Contributor Author

kjetilkjeka commented Jun 24, 2022

I wanted to find out if the "passing as immediate" optimization made sense also for the NVPTX backend. I did a test between a compiler that disabled the passing as immediate optimization (no-opt) and one that promoted i24 to i32 before passing as an immediate today.

No opt time - warmup: 502.139603ms, normal: 465.524613ms
With opt time - warmup: 227.233111ms, normal: 233.259671ms

I'm surprised how much of a difference the optimization is even on a target like the nvptx which contains several levels of abstractions and thus also opportunities for optimizations. The conclusion is that disabling the optimization is not an alternative. Why doesn't llvm do this optimization themselves, are there no way to select an unspecified ABI and they always must follow the C abi?

Test code

Device

#![feature(abi_ptx)]
#![no_std]


#[panic_handler]
unsafe fn breakpoint_panic_handler(_: &::core::panic::PanicInfo) -> ! {
    loop {}
    core::hint::unreachable_unchecked();
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct ThreeU8 {
    a: u8,
    b: u8,
    c: u8,
}

// ptx linker is inlining the device function even if it is tagged as `never`
// I have checked that a combination of --emit=llvm-ir actually produces a function 
// in llvm-ir and compiling with llc keeps the functions into ptx assembly.
// TODO: verify that this function is not inlined after ptx-linker is fixed
#[inline(never)]
#[no_mangle]
pub fn device_three_u8(v: ThreeU8) -> ThreeU8 {
    ThreeU8{
        a: v.b,
        b: v.a,
        c: (v.a + v.b)/2,
    }
}

#[inline(never)]
#[no_mangle]
// CHECK: kernel_three_u8
pub unsafe extern "ptx-kernel" fn kernel_three_u8(input: *const ThreeU8, output: *mut ThreeU8) {
    for i in 0..1_000_000 {
        output.write_volatile(device_three_u8(*input));
    }
}

Host

The kernel above was spawned in a single thread on a stream and timed until synchronized

use cust::prelude::*;

use cust::stream::{
    Stream,
    StreamFlags
};

const NO_OPT_PTX: &str = include_str!("no_opt.ptx");
const OPT_PTX: &str = include_str!("opt.ptx");

#[repr(C)]
#[derive(Clone, Copy, Default, cust::DeviceCopy)]
pub struct ThreeU8 {
    a: u8,
    b: u8,
    c: u8,
}

fn main() {
    let ctx = cust::quick_init().unwrap();
    let module_no_opt = Module::from_ptx(NO_OPT_PTX, &[]).unwrap();
    let module_opt = Module::from_ptx(OPT_PTX, &[]).unwrap();
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None).unwrap();

    let i = cust::memory::DeviceBox::new(&ThreeU8 {a: 4, b: 5, c: 6}).unwrap();
    let o = cust::memory::DeviceBox::new(&ThreeU8::default()).unwrap();

    let func_no_opt = module_no_opt.get_function("kernel_three_u8").unwrap();
    let func_opt = module_opt.get_function("kernel_three_u8").unwrap();

    // warm up
    let mut before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_no_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let no_opt_warmup = std::time::Instant::now() - before_run;

    before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_no_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let no_opt = std::time::Instant::now() - before_run;
    
    before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let opt_warmup = std::time::Instant::now() - before_run;
    
    before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let opt = std::time::Instant::now() - before_run;


    println!("No opt time - warmup: {:?}, normal: {:?}", no_opt_warmup, no_opt);
    println!("With opt time - warmup: {:?}, normal: {:?}", opt_warmup, opt);
}

@kjetilkjeka
Copy link
Contributor Author

A fix have been merged in LLVM (https://reviews.llvm.org/D129291). Next step is to get it into rustc

@kjetilkjeka
Copy link
Contributor Author

The patch is included in the LLVM 15 upgrade currently in progress #99464

I should add a test for this after the LLVM 15 upgrade is completed

@nikic
Copy link
Contributor

nikic commented Dec 18, 2022

The LLVM 15 update has since happened, does this work now?

@kjetilkjeka
Copy link
Contributor Author

Yes! This do work after LLVM 15. Just forgot to go back and close. Thanks for reminder!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html
Projects
None yet
Development

No branches or pull requests

3 participants