NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174

kjetilkjeka · 2022-05-19T13:58:53Z

I tried this code (compiling with rustc +nightly main_rs.rs --target nvptx64-nvidia-cuda --crate-type=cdylib --emit=asm):

#![feature(abi_ptx)]
#![no_std]

#[panic_handler]
unsafe fn breakpoint_panic_handler(_: &::core::panic::PanicInfo) -> ! {
    loop {}
    core::hint::unreachable_unchecked();
}

#[derive(Clone, Copy)]
pub struct Foo {
    a: u8,
    b: u8,
    c: u8,
}

#[inline(never)]
fn device(v: u8) -> Foo {
    Foo {
        a: v,
        b: v,
        c: v,
    }
}

#[no_mangle]
pub unsafe extern "ptx-kernel" fn kernel(input: *const u8, output: *mut Foo) {
    *output = device(input.read());
}

I expected to see this happen: A well formed .ptx file

Instead, this happened: The following error (from ptx linker)

LLVM ERROR: Cannot select: 0x7f8743a820d0: ch = NVPTXISD::StoreRetval<(store (s24), align 1)> 0x7f8743a82410, Constant:i32<0>, 0x7f8743a827b8
  0x7f8743a822d8: i32 = Constant<0>
  0x7f8743a827b8: i32 = or 0x7f8743a82a28, 0x7f8743a829c0
    0x7f8743a82a28: i32 = or 0x7f8743a82000, 0x7f8743a824e0
      0x7f8743a82000: i32 = shl 0x7f8743a82340, Constant:i32<8>
        0x7f8743a82340: i32,ch = load<(dereferenceable load (s8) from %ir.6 + 1), zext from i8> 0x7f8743a82618, 0x7f8743a82820, undef:i64
          0x7f8743a82820: i64 = or FrameIndex:i64<0>, Constant:i64<1>
            0x7f8743a82c98: i64 = FrameIndex<0>
            0x7f8743a82a90: i64 = Constant<1>
          0x7f8743a823a8: i64 = undef
        0x7f8743a826e8: i32 = Constant<8>
      0x7f8743a824e0: i32,ch = load<(dereferenceable load (s8) from %ir.6), zext from i8> 0x7f8743a82618, FrameIndex:i64<0>, undef:i64
        0x7f8743a82c98: i64 = FrameIndex<0>
        0x7f8743a823a8: i64 = undef
    0x7f8743a829c0: i32 = shl 0x7f8743a82478, Constant:i32<16>
      0x7f8743a82478: i32,ch = load<(dereferenceable load (s8) from %ir.6 + 2), anyext from i8> 0x7f8743a82618, 0x7f8743a82750, undef:i64
        0x7f8743a82750: i64 = or FrameIndex:i64<0>, Constant:i64<2>
          0x7f8743a82c98: i64 = FrameIndex<0>
          0x7f8743a82888: i64 = Constant<2>
        0x7f8743a823a8: i64 = undef
      0x7f8743a82c30: i32 = Constant<16>
In function: _ZN7main_rs6device17h13d51938dd622c57E

Comment

It seems like there are problems with the NVPTXISD::StoreRetval being used with a s24. I assume the problem is that a s24 is not a valid type at all. If anyone knows where this problems originate, and if it's on the rustc or LLVM side I'm very thankful.

I will do some experiments with clang, possibly this weekend, to see how device functions are being called with such types.

The text was updated successfully, but these errors were encountered:

kjetilkjeka · 2022-05-19T14:00:04Z

@rustbot label +O-NVPTX

kjetilkjeka · 2022-05-29T18:22:15Z

I did a test with the corresponding cuda c++ code compiled with clang into llvm-ir (using the commandclang-12 main_cpp2.cu -S -emit-llvm --cuda-gpu-arch=sm_52 --cuda-device-only -std=c++11)

#include <stdint.h>
#include <stdio.h>

struct foo {
    uint8_t a;
    uint8_t b;
    uint8_t c;
};

__attribute__((noinline)) __device__ struct foo device(uint8_t v) {
    struct foo s = {
        .a = v,
        .b = v,
        .c = v
    };
    return s;
}

extern "C" __global__  void kernel(struct foo* output, uint8_t const* input) {
    *output = device(*input);
}

A similarity between the llvm-ir is that both produces the struct as a type:

rustc: %Foo = type { i8, i8, i8 }
clang: %struct.foo = type { i8, i8, i8 }

A big difference is that clang returns the struct while rustc returns a i24 from the device function:

rustc:

; main_rs::device
; Function Attrs: noinline nounwind
define internal i24 @_ZN7main_rs6device17h13d51938dd622c57E(i8 %v) unnamed_addr #2 {
start:
  %0 = alloca %Foo, align 1
  %1 = bitcast %Foo* %0 to i8*
  store i8 %v, i8* %1, align 1
  %2 = getelementptr inbounds %Foo, %Foo* %0, i32 0, i32 1
  store i8 %v, i8* %2, align 1
  %3 = getelementptr inbounds %Foo, %Foo* %0, i32 0, i32 2
  store i8 %v, i8* %3, align 1
  %4 = bitcast %Foo* %0 to i24*
  %5 = load i24, i24* %4, align 1
  ret i24 %5
}

clang:

; Function Attrs: convergent noinline nounwind optnone mustprogress
define dso_local %struct.foo @_Z6deviceh(i8 zeroext %0) #0 {
  %2 = alloca %struct.foo, align 1
  %3 = alloca i8, align 1
  store i8 %0, i8* %3, align 1
  %4 = getelementptr inbounds %struct.foo, %struct.foo* %2, i32 0, i32 0
  %5 = load i8, i8* %3, align 1
  store i8 %5, i8* %4, align 1
  %6 = getelementptr inbounds %struct.foo, %struct.foo* %2, i32 0, i32 1
  %7 = load i8, i8* %3, align 1
  store i8 %7, i8* %6, align 1
  %8 = getelementptr inbounds %struct.foo, %struct.foo* %2, i32 0, i32 2
  %9 = load i8, i8* %3, align 1
  store i8 %9, i8* %8, align 1
  %10 = load %struct.foo, %struct.foo* %2, align 1
  ret %struct.foo %10
}

When compiling the equivalent rust code for x86_64-unknown-linux-gnu the i24 type is still used. It seems therefore like something related to nvptx and llvm is especially bad at dealing with "non power of two types". The "right" thing is most likely to fix this in LLVM.

Are there any reasons for not using this struct as the llvm return type in rustc? The bitcast that happens right before the return seems both deliberate and strange.

kjetilkjeka · 2022-05-29T20:39:00Z

Opened an issue in llvm llvm/llvm-project#55764

kjetilkjeka · 2022-06-17T08:55:18Z

Short status update: I'm looking into two alternative solutions.

The most proper one is to add a Legalizer pass in LLVM for the ptx target. This transforms illegal instructions into something that is guaranteed to be selectable. This is what makes i24 and friends supported on other LLVM targets.

The more hacky and "rustc-centric" solution is to add a field backend_quirk_requires_pow_of_two_regs to rustc_target::src::spec::TargetOptions that is false as default but set to true for nvptx64-nvidia-cuda. This flag must then be checked in rustc_target::abi::call::nvptx64::compute_abi_info() and this code must promote the register to next power of two must be done if it is set.

kjetilkjeka · 2022-06-24T12:51:21Z

I wanted to find out if the "passing as immediate" optimization made sense also for the NVPTX backend. I did a test between a compiler that disabled the passing as immediate optimization (no-opt) and one that promoted i24 to i32 before passing as an immediate today.

No opt time - warmup: 502.139603ms, normal: 465.524613ms
With opt time - warmup: 227.233111ms, normal: 233.259671ms

I'm surprised how much of a difference the optimization is even on a target like the nvptx which contains several levels of abstractions and thus also opportunities for optimizations. The conclusion is that disabling the optimization is not an alternative. Why doesn't llvm do this optimization themselves, are there no way to select an unspecified ABI and they always must follow the C abi?

Test code

Device

#![feature(abi_ptx)]
#![no_std]


#[panic_handler]
unsafe fn breakpoint_panic_handler(_: &::core::panic::PanicInfo) -> ! {
    loop {}
    core::hint::unreachable_unchecked();
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct ThreeU8 {
    a: u8,
    b: u8,
    c: u8,
}

// ptx linker is inlining the device function even if it is tagged as `never`
// I have checked that a combination of --emit=llvm-ir actually produces a function 
// in llvm-ir and compiling with llc keeps the functions into ptx assembly.
// TODO: verify that this function is not inlined after ptx-linker is fixed
#[inline(never)]
#[no_mangle]
pub fn device_three_u8(v: ThreeU8) -> ThreeU8 {
    ThreeU8{
        a: v.b,
        b: v.a,
        c: (v.a + v.b)/2,
    }
}

#[inline(never)]
#[no_mangle]
// CHECK: kernel_three_u8
pub unsafe extern "ptx-kernel" fn kernel_three_u8(input: *const ThreeU8, output: *mut ThreeU8) {
    for i in 0..1_000_000 {
        output.write_volatile(device_three_u8(*input));
    }
}

Host

The kernel above was spawned in a single thread on a stream and timed until synchronized

use cust::prelude::*;

use cust::stream::{
    Stream,
    StreamFlags
};

const NO_OPT_PTX: &str = include_str!("no_opt.ptx");
const OPT_PTX: &str = include_str!("opt.ptx");

#[repr(C)]
#[derive(Clone, Copy, Default, cust::DeviceCopy)]
pub struct ThreeU8 {
    a: u8,
    b: u8,
    c: u8,
}

fn main() {
    let ctx = cust::quick_init().unwrap();
    let module_no_opt = Module::from_ptx(NO_OPT_PTX, &[]).unwrap();
    let module_opt = Module::from_ptx(OPT_PTX, &[]).unwrap();
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None).unwrap();

    let i = cust::memory::DeviceBox::new(&ThreeU8 {a: 4, b: 5, c: 6}).unwrap();
    let o = cust::memory::DeviceBox::new(&ThreeU8::default()).unwrap();

    let func_no_opt = module_no_opt.get_function("kernel_three_u8").unwrap();
    let func_opt = module_opt.get_function("kernel_three_u8").unwrap();

    // warm up
    let mut before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_no_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let no_opt_warmup = std::time::Instant::now() - before_run;

    before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_no_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let no_opt = std::time::Instant::now() - before_run;
    
    before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let opt_warmup = std::time::Instant::now() - before_run;
    
    before_run = std::time::Instant::now();
    unsafe {
        launch!(
            // slices are passed as two parameters, the pointer and the length.
            func_opt<<<1, 1, 0, stream>>>(i.as_device_ptr(), o.as_device_ptr())
        ).unwrap();
    }
    stream.synchronize().unwrap();
    let opt = std::time::Instant::now() - before_run;


    println!("No opt time - warmup: {:?}, normal: {:?}", no_opt_warmup, no_opt);
    println!("With opt time - warmup: {:?}, normal: {:?}", opt_warmup, opt);
}

kjetilkjeka · 2022-08-08T12:09:37Z

A fix have been merged in LLVM (https://reviews.llvm.org/D129291). Next step is to get it into rustc

kjetilkjeka · 2022-08-09T10:42:17Z

The patch is included in the LLVM 15 upgrade currently in progress #99464

I should add a test for this after the LLVM 15 upgrade is completed

nikic · 2022-12-18T15:32:27Z

The LLVM 15 update has since happened, does this work now?

kjetilkjeka · 2022-12-27T14:48:29Z

Yes! This do work after LLVM 15. Just forgot to go back and close. Thanks for reminder!

kjetilkjeka added the C-bug Category: This is a bug. label May 19, 2022

rustbot added the O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html label May 19, 2022

kjetilkjeka mentioned this issue May 29, 2022

NVPTX: Cannot select: = NVPTXISD::LoadParam<(load (s24), align 4)> llvm/llvm-project#55764

Closed

kjetilkjeka mentioned this issue Jun 17, 2022

Tracking issue for the "ptx-kernel" ABI #38788

Open

14 tasks

kjetilkjeka mentioned this issue Aug 16, 2022

make Condvar, Mutex, RwLock const constructors work with the unsupported impl #98457

Merged

kjetilkjeka closed this as completed Dec 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174

NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174

kjetilkjeka commented May 19, 2022 •

edited

Loading

kjetilkjeka commented May 19, 2022

kjetilkjeka commented May 29, 2022

kjetilkjeka commented May 29, 2022 •

edited

Loading

kjetilkjeka commented Jun 17, 2022

kjetilkjeka commented Jun 24, 2022 •

edited

Loading

kjetilkjeka commented Aug 8, 2022

kjetilkjeka commented Aug 9, 2022

nikic commented Dec 18, 2022

kjetilkjeka commented Dec 27, 2022

NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174

NVPTX: "LLVM ERROR: Cannot select" when returning struct with 3byte size from "device function" #97174

Comments

kjetilkjeka commented May 19, 2022 • edited Loading

Meta

Comment

kjetilkjeka commented May 19, 2022

kjetilkjeka commented May 29, 2022

rustc:

clang:

kjetilkjeka commented May 29, 2022 • edited Loading

kjetilkjeka commented Jun 17, 2022

kjetilkjeka commented Jun 24, 2022 • edited Loading

Test code

Device

Host

kjetilkjeka commented Aug 8, 2022

kjetilkjeka commented Aug 9, 2022

nikic commented Dec 18, 2022

kjetilkjeka commented Dec 27, 2022

kjetilkjeka commented May 19, 2022 •

edited

Loading

kjetilkjeka commented May 29, 2022 •

edited

Loading

kjetilkjeka commented Jun 24, 2022 •

edited

Loading