If a kernel operation fails but doesn't kill my application, is it possible to detect that an error has occurred? In my sample code below, I have 2 test kernels - DivideByZeroKernel and IndexOutOfRangeKernel.
public static void Run()
{
    int[] inputData = new int[] { 0, -100 };
    using (Context context = Context.CreateDefault())
    using (Accelerator acc = context.CreateCudaAccelerator(0))
    using (MemoryBuffer1D<int, Stride1D.Dense> deviceData = acc.Allocate1D(inputData))
    using (MemoryBuffer1D<int, Stride1D.Dense> deviceOutput = acc.Allocate1D<int>(deviceData.Length))
    {
        Action<Index1D, ArrayView<int>, ArrayView<int>> kernel = DivideByZeroKernel; // DivideByZeroKernel or IndexOutOfRangeKernel
        Action<Index1D, ArrayView<int>, ArrayView<int>> loadedKernel = acc.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>, ArrayView<int>>(kernel);
        loadedKernel((int)deviceOutput.Length, deviceData.View, deviceOutput.View);
        try
        {
            acc.Synchronize();
            int[] hostOutput = deviceOutput.GetAsArray1D();
            for (int i = 0; i < hostOutput.Length; ++i)
            {
                Console.WriteLine($"100 / {inputData[i]} = {hostOutput[i]}");
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    } // using IndexOutOfRangeKernel, Visual Studio 2022 breaks on this line with ILGPU.Runtime.Cuda.CudaException: 'device-side assert triggered'
}

private static void DivideByZeroKernel(Index1D i, ArrayView<int> data, ArrayView<int> output)
{
    output[i] = 100 / data[i];
}

private static void IndexOutOfRangeKernel(Index1D i, ArrayView<int> data, ArrayView<int> output)
{
    output[i] = 100 / data[i + 1];
}
The output when running DivideByZeroKernel is below. Even though I divide by zero, the operation returned -1 instead of infinity, NaN or crashing. Is there a way to detect that -1 is an error value in the first case but a valid result in the second case?
100 / 0 = -1
100 / -100 = -1
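One workaround I can think of is to have the kernel check the divisor itself and record a per-element flag in a second buffer, so the host can tell an error apart from a genuine -1. This is only a sketch under my own assumptions, not something ILGPU provides out of the box: `CheckedDivideKernel` and the `errors` buffer are hypothetical names, and the 0 written to `output` on error is an arbitrary placeholder.

```csharp
using System;
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

public static class CheckedDivide
{
    // Hypothetical kernel: guards the division manually and writes
    // 1 into errors[i] when the divisor is zero, 0 otherwise.
    private static void CheckedDivideKernel(
        Index1D i,
        ArrayView<int> data,
        ArrayView<int> output,
        ArrayView<int> errors)
    {
        int divisor = data[i];
        if (divisor == 0)
        {
            errors[i] = 1;
            output[i] = 0; // placeholder value, never meant to be read
        }
        else
        {
            errors[i] = 0;
            output[i] = 100 / divisor;
        }
    }

    public static void Run()
    {
        int[] inputData = new int[] { 0, -100 };
        using (Context context = Context.CreateDefault())
        using (Accelerator acc = context.CreateCudaAccelerator(0))
        using (MemoryBuffer1D<int, Stride1D.Dense> deviceData = acc.Allocate1D(inputData))
        using (MemoryBuffer1D<int, Stride1D.Dense> deviceOutput = acc.Allocate1D<int>(deviceData.Length))
        using (MemoryBuffer1D<int, Stride1D.Dense> deviceErrors = acc.Allocate1D<int>(deviceData.Length))
        {
            Action<Index1D, ArrayView<int>, ArrayView<int>, ArrayView<int>> loadedKernel =
                acc.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>, ArrayView<int>, ArrayView<int>>(
                    CheckedDivideKernel);

            loadedKernel((int)deviceData.Length, deviceData.View, deviceOutput.View, deviceErrors.View);
            acc.Synchronize();

            int[] hostOutput = deviceOutput.GetAsArray1D();
            int[] hostErrors = deviceErrors.GetAsArray1D();
            for (int i = 0; i < hostOutput.Length; ++i)
            {
                // Only trust hostOutput[i] when the matching flag is clear.
                string result = hostErrors[i] != 0
                    ? "error (division by zero)"
                    : hostOutput[i].ToString();
                Console.WriteLine($"100 / {inputData[i]} = {result}");
            }
        }
    }
}
```

This costs an extra buffer and a branch in every kernel that divides, which is why I am asking whether there is a built-in way to detect the failure instead.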
The output when running IndexOutOfRangeKernel is below. Based on #1268 it would appear that my application will crash even if acc.Synchronize() is wrapped in a try/catch. I could use a sidecar app to monitor if my GPU app has crashed - are there any other ways to detect that a GPU kernel has crashed my app?
C:\temp\IlgpuApp\GpuErrors.cs:46: block: [0,0,0], thread: [1,0,0] Assertion `Index out of range` failed.
ILGPU.Runtime.Cuda.CudaException: device-side assert triggered
   at ILGPU.Runtime.Cuda.CudaAccelerator.SynchronizeInternal()
   at ILGPU.Runtime.Accelerator.Synchronize()
   at IlgpuApp.GpuErrors.Run() in C:\temp\IlgpuApp\GpuErrors.cs:line 24
Unhandled exception. ILGPU.Runtime.Cuda.CudaException: device-side assert triggered
   at ILGPU.Runtime.Cuda.CudaMemoryBuffer.DisposeAcceleratorObject(Boolean disposing)
   at ILGPU.Runtime.AcceleratorObject.DisposeAcceleratorObject_Accelerator(Boolean disposing)
   at ILGPU.Runtime.Accelerator.DisposeChildObject_AcceleratorObject(AcceleratorObject acceleratorObject, Boolean disposing)
   at ILGPU.Runtime.AcceleratorObject.Dispose(Boolean disposing)
   at ILGPU.Util.DisposeBase.DisposeDriver(Boolean disposing)
   at ILGPU.Util.DisposeBase.Dispose()
   at ILGPU.Runtime.MemoryBuffer`1.DisposeAcceleratorObject(Boolean disposing)
   at ILGPU.Runtime.AcceleratorObject.DisposeAcceleratorObject_Accelerator(Boolean disposing)
   at ILGPU.Runtime.Accelerator.DisposeChildObject_AcceleratorObject(AcceleratorObject acceleratorObject, Boolean disposing)
   at ILGPU.Runtime.AcceleratorObject.Dispose(Boolean disposing)
   at ILGPU.Util.DisposeBase.DisposeDriver(Boolean disposing)
   at ILGPU.Util.DisposeBase.Dispose()
   at IlgpuApp.GpuErrors.Run() in C:\temp\IlgpuApp\GpuErrors.cs:line 36
   at IlgpuApp.Program.Main(String[] args) in C:\temp\IlgpuApp\Program.cs:line 14