Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rocBLAS calls do not produce correct results #419

Open
evaleev opened this issue Sep 25, 2023 · 5 comments
Open

rocBLAS calls do not produce correct results #419

evaleev opened this issue Sep 25, 2023 · 5 comments

Comments

@evaleev
Copy link
Member

evaleev commented Sep 25, 2023

HIP/ROCm support introduced in #418 is only minimally functional at the moment (but already sufficient to provide HIP support in https://github.com/devreal/ttg/tree/ttg-device-support-master-coro-with-stream-tasks) but when trying to use rocBLAS (via ICL's blaspp C++ API) it seems that nothing happens. Here's a simplified version of examples/device/device_task:

// copy data from arg.data() to result.data()
blas::copy(result.size(), arg.data(), 1, device_data(result.storage()), 1,
             queue);
hipStreamSynchronize(queue.stream());
TA_ASSERT(result.data()[0] == arg.data()[0]);

It fails in the assertion. Meanwhile

hipMemcpyAsync(result.data(),arg.data(),result.size()*sizeof(double),device::MemcpyDefault, stream);
hipStreamSynchronize(stream);
TA_ASSERT(result.data()[0] == arg.data()[0]);

succeeds.

Note that result.data() and arg.data() point to the unified memory (allocated via hipMallocManaged). So the only working hypothesis is that rocBLAS does not support operations on data in UM ...

@evaleev
Copy link
Member Author

evaleev commented Sep 25, 2023

@dmcdougall : the issue seems to be resolved by setting env var HIP_VISIBLE_DEVICES=0. TA does not read this env var.

@dmcdougall
Copy link

I can't reproduce the failure locally:

$ cat repro.cpp 
#include <cassert>
#include <rocblas/rocblas.h>

#define hipCheck(s) \
do {\
  hipError_t err = s;\
  if (err != hipSuccess) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

#define rocblasCheck(s) \
do {\
  rocblas_status err = s;\
  if (err != rocblas_status_success) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

int main(int argc, char ** argv)
{
  int N = 1;
  size_t size = N * sizeof(double);

  double * arg, * result;

  hipCheck(hipMallocManaged(&arg, size));
  hipCheck(hipMallocManaged(&result, size));

  hipCheck(hipMemset(arg, 1, size));
  hipCheck(hipMemset(result, 0, size));

  hipStream_t stream;
  hipCheck(hipStreamCreate(&stream));

  rocblas_handle handle;
  rocblasCheck(rocblas_create_handle(&handle));
  rocblasCheck(rocblas_set_stream(handle, stream));

  rocblasCheck(rocblas_dcopy(handle, N, arg, 1, result, 1));  // copy arg into result
  hipCheck(hipStreamSynchronize(stream));
  assert(result[0] == arg[0]);  //fails?
  
  rocblas_destroy_handle(handle);
  hipCheck(hipStreamDestroy(stream));
  hipCheck(hipFree(arg));
  hipCheck(hipFree(result));
  return 0;
}
$ hipcc repro.cpp -L/opt/rocm-5.7.0/lib -o repro -lrocblas
$ env | grep HIP
$ env | grep ROCR
$ ./repro 
$ echo $?
0
$ rocminfo | grep gfx9
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-

Your observation about setting the device visibility in the environment is interesting. Are you launching your job with slurm with the cgroups plugin enabled?

@dmcdougall
Copy link

dmcdougall commented Sep 29, 2023

Could you also re-run your example with AMD_LOG_LEVEL=3 set in the environment? I want to see if there are any hip runtime calls in your example that aren't present in my example.

There will be quite a lot of output to the screen (stderr, I think), so I recommend piping to a file.

@evaleev
Copy link
Member Author

evaleev commented Oct 1, 2023

@dmcdougall thanks for investigating ... I invoke the executable directly, so no slurm involved, and HIP_VISIBLE_DEVICES is not set by default.

Unfortunately my attempts to make the example more representative of the "real" example did not succeed to trigger the problem. In the real app all calls happen in thread pool so I thought maybe some thread-local state was not being initialized properly ... to no avail.

For the record, here's the most recent form of the example:

[ICL:dopamine ~]$ cat repro-thread.cc
#include <cassert>
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <thread>
#include <iostream>
#include <atomic>

#define hipCheck(s) \
do {\
  hipError_t err = s;\
  if (err != hipSuccess) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

#define rocblasCheck(s) \
do {\
  rocblas_status err = s;\
  if (err != rocblas_status_success) {\
    printf( "Failed to run error %d ", __LINE__);\
    return -1;\
  }\
} while(0)\

using task_ptr = void (*)();

std::atomic<bool> done{false};
std::atomic<task_ptr> current_task{nullptr};

hipStream_t stream;
rocblas_handle handle;
const int N = 1000000;
double * arg, * result;

void do_work(){
    rocblas_dcopy(handle, N, arg, 1, result, 1);  // copy arg into result
    auto err = hipStreamSynchronize(stream);
    assert(result[0] == arg[0]);
  };

int main(int argc, char ** argv)
{
  size_t size = N * sizeof(double);

  // start worker thread
  std::thread worker([&]() {
    while (!done) {
      if (current_task != nullptr) {
        (*current_task)();
	done = true;
	current_task = nullptr;
      }
    }
  });

  hipCheck(hipMallocManaged(&arg, size));
  hipCheck(hipMallocManaged(&result, size));

  hipCheck(hipMemset(arg, 1, size));
  hipCheck(hipMemset(result, 0, size));

  hipCheck(hipStreamCreate(&stream));

  rocblasCheck(rocblas_create_handle(&handle));
  rocblasCheck(rocblas_set_stream(handle, stream));

  current_task = do_work;
  worker.join();

  rocblas_destroy_handle(handle);
  hipCheck(hipStreamDestroy(stream));
  hipCheck(hipFree(arg));
  hipCheck(hipFree(result));
  return 0;
}
[ICL:dopamine ~]$ hipcc -g -O0 repro-thread.cc -L/opt/rocm-5.7.0/lib -o repro-thread -lrocblas
[ICL:dopamine ~]$ ./repro-thread

@dmcdougall
Copy link

Ok, thanks.

Can you either:

  1. re-run your example with AMD_LOG_LEVEL=3 set in the environment? There are probably some hip runtime pieces I am missing; or
  2. show me how you're building TA and that example? That way I can experiment with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants