CUDA backend for the DNN module #14827

YashasSamaga · 2019-06-18T13:11:31Z

There are many ways to use multiple GPUs. Here is the one I think is the safest and least complex. It relies on the fact that the CUDA runtime keeps track of the current device separately for each CPU thread.

1. Suppose you have N devices.
2. Create N threads.
3. Assign a CUDA device to each thread by calling cudaSetDevice or cv::cuda::setDevice in that thread. Each thread is now associated with a device.
4. You can create any number of cv::dnn::Net objects in any of those threads, and each network will use the device associated with its thread for memory and computation.
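
The steps above can be sketched roughly as follows. This is only an illustration of the one-thread-per-device pattern, not code from this PR: `set_device_stub`, `get_device_stub` and `run_one_thread_per_device` are hypothetical names, and the `thread_local` variable merely stands in for the per-thread current device that the CUDA runtime tracks internally. In a real program the stub call would be replaced by cudaSetDevice or cv::cuda::setDevice, and the `work` callback is where the cv::dnn::Net objects would be created and run.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for the CUDA runtime's per-thread "current device".
// Assumption for illustration only: real code would not keep this state
// itself, it would call cudaSetDevice(id) or cv::cuda::setDevice(id).
thread_local int g_current_device = -1;

void set_device_stub(int id) { g_current_device = id; }    // hypothetical stub
int  get_device_stub()       { return g_current_device; }  // hypothetical stub

// Starts one worker thread per device. Each worker binds itself to its
// device first and only then runs `work`. Returns the device that each
// worker observed as its current device.
std::vector<int> run_one_thread_per_device(int device_count,
                                           const std::function<void(int)>& work)
{
    assert(device_count >= 0);
    std::vector<int> observed(device_count, -1);
    std::vector<std::thread> workers;
    workers.reserve(device_count);

    for (int id = 0; id < device_count; ++id) {
        workers.emplace_back([id, &observed, &work] {
            set_device_stub(id);               // real code: cv::cuda::setDevice(id);
            work(id);                          // real code: create a Net, call forward()
            observed[id] = get_device_stub();  // each worker writes only its own slot
        });
    }
    for (auto& worker : workers) worker.join();
    return observed;
}
```

Calling `run_one_thread_per_device(3, ...)` gives every worker its own device index, while the calling thread's current device is left untouched because the setting is per thread.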

Benchmarks

Demo Video: https://www.youtube.com/watch?v=ljCfluWYymM

Project summary/benchmarks: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d

Support Matrix for this PR

## Current Support Matrix: (not updated)

| Blip | Meaning |
| --- | --- |
| ✔️ | supports all the configurations that are supported by all the existing backends (and might support more than what's currently supported) |
| 🔵 | partially supported (fallback to CPU for unsupported configurations) |
| ❌ | not supported (fallback to CPU) |

| Layer | Status | Constraints | Notes |
| --- | --- | --- | --- |
| Activations | ✔️ | | |
| Batch Normalization | ✔️ | | |
| Blank Layer | ✔️ | | |
| Concat Layer | ✔️ | | |
| Const Layer | ✔️ | | |
| Convolution 2d | ✔️ | | asymmetric padding is disabled in layer constructor but the backend supports it |
| Convolution 3d | ✔️ | | asymmetric padding is disabled in layer constructor but the backend supports it |
| Crop and resize | ❌ | | |
| Crop Layer | ✔️ | | forwarded to Slice Layer |
| Detection Output Layer | ❌ | | |
| Deconvolution 2d | 🔵 | padding configuration should not lead to extra uneven padding | |
| Deconvolution 3d | 🔵 | padding configuration should not lead to extra uneven padding | |
| Elementwise Layers | ✔️ | | |
| Eltwise Layer | ✔️ | | |
| Flatten Layer | ✔️ | | |
| Fully Connected Layer | ✔️ | | |
| Input Layer | ❌ | | |
| Interp Layer | ✔️ | | |
| Local Response Normalization | ✔️ | | |
| Max Unpooling 2d | ✔️ | | |
| Max Unpooling 3d | ✔️ | | |
| MVN Layer | ❌ | | |
| Normalize Layer | 🔵 | Only L1 and L2 norm supported | |
| Padding Layer | ✔️ | | |
| Permute Layer | ✔️ | | |
| Pooling 2d | 🔵 | Only max and average pooling supported | supports asymmetric padding |
| Pooling 3d | 🔵 | Only max and average pooling supported | supports asymmetric padding |
| Prior Box Layer | ✔️ | | |
| Proposal Layer | ❌ | | |
| Region Layer | ✔️ | NMS performed using CPU | |
| Reorg Layer | ✔️ | | |
| Reshape Layer | ✔️ | | |
| Resize Layer | ✔️ | | |
| Scale Layer | ✔️ | | |
| Shift Layer | ✔️ | | forwarded to Scale Layer |
| Shuffle Channel Layer | ✔️ | | |
| Slice Layer | ✔️ | | |
| Softmax Layer | ✔️ | | |
| Split Layer | ✔️ | | |
| LSTM Layer | ❌ | | |

Known issues:

Tests for some of the SSD based networks fail on Jetson Nano

References: #14585

Results:

force_builders_only=Custom,linux,docs
buildworker:Custom=linux-4
docker_image:Custom=ubuntu-cuda:18.04

alalek

Good progress!

Please note that we usually do not merge large pieces of code without corresponding tests.
We also prefer to merge completed tasks rather than isolated helper parts.

So, consider working on this GSoC task in a single PR (if you don't have another agreement with your mentor).

Some build-related comments are below.

modules/dnn/src/cuda4dnn/csl/cudnn.cpp

modules/dnn/src/cuda4dnn/csl/stream.cpp

modules/dnn/include/opencv2/dnn/csl/cublas.hpp

YashasSamaga · 2019-06-21T05:43:34Z

Do I have to use CV_OVERRIDE and CV_FINAL? I presume they were added for portability, but since both final and override are keywords in C++11, do they still need to be used?

Can I use std::shared_ptr instead of cv::Ptr? There isn't a make_shared equivalent and makePtr doesn't do what std::make_shared does.

Is it fine to force push occasionally when there isn't any dependent stuff like reviews in between?

alalek · 2019-06-21T10:05:07Z

> CV_OVERRIDE and CV_FINAL

They are used to avoid excessive merge conflicts with the 3.4 branch.
Since your code is in the master branch only, this problem does not apply, so you can use the C++ keywords/modifiers.

> use std::shared_ptr instead of cv::Ptr

Feel free to use std::shared_ptr (but it is not supported by the bindings generator, so be careful with the public API).

> makePtr doesn't do what std::make_shared does.

In the master branch it is just a wrapper, so it should do the same thing.

> Is it fine to force push

It is OK.
Rebasing is also preferred over "merge" commits (this is easy to do with one squashed commit: squash first, then rebase).
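
For readers unfamiliar with the C++11 features discussed in this exchange, here is a minimal sketch of `override`, `final` and `std::make_shared` in plain standard C++. The class names `LayerBase` and `ReluLayer` and the function `make_layer` are hypothetical and exist only for this illustration; nothing here is taken from the PR or from OpenCV.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical base class, used only to illustrate the keywords.
struct LayerBase {
    virtual ~LayerBase() = default;
    virtual std::string name() const { return "base"; }
};

// `final` on the class: no other class may derive from ReluLayer.
// `override` on the method: compilation fails if the declaration does not
// actually override a virtual function of the base class (for example
// after a typo in the name or a missing `const`).
struct ReluLayer final : LayerBase {
    std::string name() const override { return "relu"; }
};

// std::make_shared constructs the object and returns a std::shared_ptr
// that owns it; implementations typically allocate the object and its
// reference-count block together.
std::shared_ptr<LayerBase> make_layer()
{
    std::shared_ptr<LayerBase> layer = std::make_shared<ReluLayer>();
    assert(layer->name() == "relu");  // virtual dispatch reaches the override
    return layer;
}
```

Copying the returned pointer shares ownership, so the object lives until the last copy goes away.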

modules/dnn/src/cuda/math.cu

modules/dnn/src/cuda4dnn/csl/tensor_ops.hpp

modules/dnn/src/layers/eltwise_layer.cpp

modules/dnn/src/layers/convolution_layer.cpp

modules/dnn/src/cuda4dnn/csl/tensor_ops.hpp

modules/dnn/src/cuda4dnn/csl/kernel_utils.hpp

modules/dnn/src/cuda/types.hpp

modules/dnn/include/opencv2/dnn/dnn.hpp

modules/dnn/src/cuda4dnn/cxx_utils/is_iterator.hpp

davisking · 2019-07-21T19:25:53Z

Seems like it would be implementation defined at worst, rather than UB. You sure it’s UB? If it’s ok in c++17 and works in our case I think it’s fine. I would be surprised if some compilers defined std::iterator_traits<T>::iterator_category for non iterators in c++11.
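
To make the discussion concrete, here is a minimal sketch of an `is_iterator` trait built on `std::iterator_traits<T>::iterator_category`. It is not the code from is_iterator.hpp in this PR; the names `make_void`, `void_t` and `is_iterator` below are written for this illustration. The sketch assumes that `std::iterator_traits` is SFINAE-friendly, which the standard guarantees only from C++17 onward; with earlier standards it depends on the standard library implementation.

```cpp
#include <iterator>
#include <type_traits>
#include <vector>

// C++11-compatible stand-in for std::void_t.
template <class...> struct make_void { using type = void; };
template <class... Ts> using void_t = typename make_void<Ts...>::type;

// Primary template: T is treated as "not an iterator" by default.
template <class T, class = void>
struct is_iterator : std::false_type {};

// Selected only when std::iterator_traits<T>::iterator_category names a
// type. For a non-iterator T this substitution fails quietly only if the
// library's iterator_traits is SFINAE-friendly (assumption, see above).
template <class T>
struct is_iterator<T, void_t<typename std::iterator_traits<T>::iterator_category>>
    : std::true_type {};

static_assert(is_iterator<int*>::value, "raw pointers are iterators");
static_assert(is_iterator<std::vector<int>::iterator>::value,
              "container iterators are iterators");
static_assert(!is_iterator<int>::value, "int is not an iterator");
```

The checks run at compile time, so a standard library that does not behave as assumed shows up as a build error rather than a wrong answer at runtime.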

modules/dnn/src/cuda/permute.cu

MonocleSecurity · 2020-01-21T11:47:17Z

@isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.

YashasSamaga · 2020-01-25T17:00:55Z

@molyswu The exception will have a fairly detailed message stating exactly what caused it to be raised. If you still haven't solved it, I think your question would be more suitable for answers.opencv.org.

@MonocleSecurity I have opened an issue for discussion. There are a few problems which need to be sorted out.

pgirgis · 2020-01-28T11:21:38Z

Tested on a Jetson TX2 running JetPack 4.2 with CUDA 10.0 and cuDNN 7.6.5, and it works fine. The above test using YOLOv3 delivers about 5 FPS, a substantial improvement (roughly 15x) over the CPU, which ran at about 0.33 FPS.

YashasSamaga · 2020-01-28T11:51:05Z

@pgirgis what was the size of the input image you used?

pgirgis · 2020-01-28T20:02:32Z

The original image was 872x586. I resized the input image to 416x416.

I tested with CUDA_FP16 and got slightly higher results (6FPS).
https://miro.medium.com/max/1744/1*EYFejGUjvjPcc4PZTwoufw.jpeg

With only 256 CUDA cores on the TX2, this is roughly what I was expecting. I get 7FPS when using FP16.

isra60 · 2020-01-28T20:06:00Z

> @isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.

Do you have any tutorial or example?

It would also be a good improvement if this feature request were implemented:
#15999

I now get really good performance using a GStreamer pipeline with the new DeepStream hardware decoder from NVIDIA.

pgirgis · 2020-01-28T20:47:44Z

> > @isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.
>
> Do you have any tutorial or example???
>
> Also it could be a good improvement if this feature request is implemented
> #15999
>
> Now I have really good performance by using a gstreamer pipeline with the new deepstream hardware decoder from NVIDIA.

I was getting 300FPS on the TX2 using VideoCapture on 4.2.0. I have not looked into it yet (it was just a side test), but I assume it is GPU-enabled. I believe this is comparable to the NVIDIA SDK.

isra60 · 2020-01-28T21:11:57Z

> > > @isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.
> >
> > Do you have any tutorial or example???
> > Also it could be a good improvement if this feature request is implemented
> > #15999
> > Now I have really good performance by using a gstreamer pipeline with the new deepstream hardware decoder from NVIDIA.
>
> I was getting 300FPS using VideoCapture on 4.2.0. I have not checked into it yet (just a side test) but assume it has been GPU enabled. This is comparable to the Nvidia SDK I believe.

But are you using the standard VideoCapture? Which backend? FFmpeg? AFAIK FFmpeg is not hardware-accelerated on a Jetson TX2 (maybe there has been an update).

MohamedAliRashad · 2020-02-18T09:38:03Z

Does anyone know how to check whether my build has CUDA acceleration support?

I found this, but it's for Windows and I am using Ubuntu 18.04.

alalek · 2020-02-18T11:31:38Z

You should check getBuildInformation() output.
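
One way to automate this check is to scan the text returned by cv::getBuildInformation() for the relevant line. The helper below is a sketch in plain standard C++, so it does not call OpenCV itself; `build_info_reports_yes` is a hypothetical name, and the label strings you pass in (for example "NVIDIA CUDA:" or "cuDNN:") are assumptions about how the build summary is worded, so verify them against the actual output of your build.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if the first line of `build_info` that contains `key`
// also contains "YES" somewhere after the key. Returns false if no line
// contains the key at all.
bool build_info_reports_yes(const std::string& build_info, const std::string& key)
{
    assert(!key.empty());
    std::istringstream lines(build_info);
    std::string line;
    while (std::getline(lines, line)) {
        const std::size_t key_pos = line.find(key);
        if (key_pos == std::string::npos) continue;  // key not on this line
        return line.find("YES", key_pos + key.size()) != std::string::npos;
    }
    return false;
}
```

In an application this would be called as `build_info_reports_yes(cv::getBuildInformation(), "NVIDIA CUDA:")`. A build can report CUDA support and still lack the DNN CUDA backend if cuDNN was not found, so it may be worth checking the cuDNN line as well.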

molyswu · 2020-02-19T03:55:49Z

Hi, what is the reason for this?
I am using VS2017 + OpenCV 4.2.0 + CUDA 10.0 + cuDNN 7.6.5, and running a YOLOv3 model hits this bug:

net.setPreferableBackend(DNN_BACKEND_CUDA);
net.setPreferableTarget(DNN_TARGET_CUDA);
Unhandled exception at 0x00007FFD17B04048 (in test2.exe): Microsoft C++ exception: cv::dnn::cuda4dnn::csl::CUDAException at memory location 0x000000FE94BADDD0.

test2.exe prints the following error:
OpenCV(4.2.0) Error: Parsing error (Failed to parse NetParameter file: yolov3_1_80000.weights) in cv::dnn::dnn4_v20191202::readNetFromDarknet, file E:\tools\opencv-4.2.0\modules\dnn\src\darknet\darknet_importer.cpp, line 214
.\pei010718
[ INFO:0] global E:\tools\opencv-4.2.0\modules\core\src\ocl.cpp (891) cv::ocl::haveOpenCL Initialize OpenCL runtime...
[ INFO:0] global E:\tools\opencv-4.2.0\modules\dnn\src\dnn.cpp (2204) cv::dnn::dnn4_v20191202::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "_input" of type NetInputLayer

Thanks!

andyrey · 2020-02-21T16:12:31Z

Hello Yashas, I failed to build your code on Windows because it can't find opencv_dnn420.lib. I have built OpenCV 4.2.0 with several of the latest contrib modules using CMake and Visual Studio 2015, but the build never produced this .lib or the .dll. Can you tell me how to build it, or just upload the lib?

YashasSamaga · 2020-02-21T16:26:00Z

@andyrey https://jamesbowley.co.uk/accelerate-opencv-4-2-0-build-with-cuda-and-python-bindings/

andyrey · 2020-02-26T12:02:11Z

@YashasSamaga Thank you, Yashas! I used your sample code and opencv_world.dll from the reference you recommended (the pre-built case), and achieved 2 ms/frame with the Yolo-tiny configuration and a 320x320 input blob! Before, I had 28-32 ms. It is magic, great work!

tuteming · 2020-03-09T11:25:05Z

My configuration is OpenCV 4.2 with contrib on a 1080 Ti, compiled with CUDA using CMake and MSVS 2015 (CUDA 10.0, cuDNN 7.4.2).
Everything builds fine, but when I run your code yolov3_opencv_dnn_cuda.cpp I get:
[ WARN:0] global E:\opencv_cuda_4.2\opencv-4.2.0\modules\dnn\src\dnn.cpp (1363) cv::dnn::dnn4_v20191202::Net::Impl::setUpNet DNN module was not built with CUDA backend; switching to CPU
It runs in CPU mode, not GPU mode. Can you tell me what is wrong?
Thanks

andyrey · 2020-03-11T11:08:01Z

I have run into a strange phenomenon: when I call cv::imshow("Window_name", frame_show), my processing time is ~9 ms; when I remove the graphical output by commenting out this call, the proc time increases (!) to ~15 ms. I can't understand why. Before OpenCV 4.2.0, I used to remove the output to decrease the proc time!
I work on Windows 10, with yolo-tiny and a 320x320 input.

YashasSamaga · 2020-03-11T11:18:36Z

@tuteming https://www.pyimagesearch.com/2020/02/03/how-to-use-opencvs-dnn-module-with-nvidia-gpus-cuda-and-cudnn/

@andyrey

1. Were you using CPU for inference prior to 4.2.0 and GPU since 4.2.0?
2. Can you share the overall structure of your code around cv::imshow? Is it in a loop?
3. What do you mean by proc time?

I am going to take a guess here, but I think it might have to do with your CPU inference being treated as a compute-bound process and GPU inference being treated as an IO-bound process (at least from the scheduler's perspective). OSes generally have different scheduling policies for IO-bound and CPU-bound processes.

andyrey · 2020-03-11T12:03:51Z

@YashasSamaga
1. I previously used Darknet YOLO based inference and OpenCV 3.3.0, but since your current release outperformed it, I based my code on yours. In the former setup, the time always decreased when I switched off the OpenCV graphical output.
2. My cv::imshow call is at the end of the per-frame loop, in main(). I use:
my_draw(frame_show, my_parameters...);
cv::imshow("My window",frame_show);

I use your postprocess(Mat& frame,...) with this loop removed:
for (size_t i = 0; i < indices.size(); ++i)
{
...
drawPred(classIds[idx],...
...
}
because I now call your code outside main() and do my own drawing in main().
But if I remove my drawing and switch yours back on, I observe the same effect!
3. "Proc time" means processing time; sorry for the abbreviation.

andyrey · 2020-03-11T14:02:52Z

@YashasSamaga
Yashas, I went back from my possibly flawed code to your original code and repeated the experiment.
Results on a video of 3200 frames:
Average time 2.55061 ms/frame when I remove cv::imshow(...) from your code,
Average time 2.13507 ms/frame with it.
Again, adding the graphical output makes the program run faster, which is strange.
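
One way to narrow down a measurement like this is to time the inference call on its own, separately from drawing and display, and to average over many frames. The class below is a minimal sketch of such a timer in standard C++; `FrameTimer` is a hypothetical helper written for this illustration and is not part of the sample code discussed in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Accumulates per-frame durations and reports the mean in milliseconds.
class FrameTimer {
public:
    using Clock  = std::chrono::steady_clock;
    using Millis = std::chrono::duration<double, std::milli>;

    // Records an already measured duration as one frame.
    void add(Millis elapsed) { total_ += elapsed; ++frames_; }

    // Runs `stage` once, measures how long it took, and records the
    // result as one frame.
    template <class Stage>
    void measure(Stage&& stage) {
        const Clock::time_point start = Clock::now();
        stage();
        add(Clock::now() - start);
    }

    std::size_t frames() const { return frames_; }

    // Mean time per recorded frame; requires at least one frame.
    double average_ms() const {
        assert(frames_ > 0);
        return total_.count() / static_cast<double>(frames_);
    }

private:
    Millis total_{0.0};
    std::size_t frames_ = 0;
};
```

In the frame loop, wrapping only the forward pass (for example `timer.measure([&] { net.forward(outs, names); });`, where `net`, `outs` and `names` come from the surrounding application) and using a second timer for the display code would show which stage actually changes when cv::imshow is removed.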

GTRwolf · 2020-03-13T08:54:19Z

@tuteming
Did you solve it? I also face this problem.

andyrey · 2020-03-13T09:15:38Z

@tuteming
@GTRwolf
I had the same problem, on Windows 10 with VS2015 and C++.
It doesn't matter whether you succeeded in building the proper DLL from OpenCV 4.2.0 (I failed).
Following Yashas's advice, I took a pre-built one (the VS2019 build works too) from
https://jamesbowley.co.uk/accelerate-opencv-4-2-0-build-with-cuda-and-python-bindings/
Download opencv_world420.dll from there (its size is 647277 KB), put it in your working directory, and run! If you have CUDA and cuDNN installed on your computer, your program will use your NVIDIA GPU.

ccl-private · 2020-03-25T10:37:03Z

> My configuration is OpenCV 4.2 with contrib on a 1080 Ti, compiled with CUDA using CMake and MSVS 2015 (CUDA 10.0, cuDNN 7.4.2).
> Everything builds fine, but when I run your code yolov3_opencv_dnn_cuda.cpp I get:
> [ WARN:0] global E:\opencv_cuda_4.2\opencv-4.2.0\modules\dnn\src\dnn.cpp (1363) cv::dnn::dnn4_v20191202::Net::Impl::setUpNet DNN module was not built with CUDA backend; switching to CPU
> It runs in CPU mode, not GPU mode. Can you tell me what is wrong?
> Thanks

That is because your cuDNN version is unsuitable. Check your CMake log; it will tell you that the cuDNN version should be at least 7.5.

qlong1505 · 2020-04-14T13:35:22Z

Has anyone tested the OpenCV DNN module with the CUDA backend on a Jetson Nano? I used the https://gist.github.com/YashasSamaga/6d37bc403c0934329b078b4bad98c7f2 script and it compiled successfully. But when I tested it, it showed this error message:

what(): OpenCV(4.3.0) /home/user/opencv/modules/core/src/cuda_info.cpp:62: error: (-217:Gpu API call) unknown error in function 'getCudaEnabledDeviceCount'

dkurt · 2020-04-14T14:47:22Z

@qlong1505, please use the forum for usage questions: https://answers.opencv.org/questions/. This PR already has 236 messages.

CUDA backend for the DNN module

* stub cuda4dnn design
* minor fixes for tests and doxygen
* add csl public api directory to module headers
* add low-level CSL components
* add high-level CSL components
* integrate csl::Tensor into backbone code
* switch to CPU iff unsupported; otherwise, fail on error
* add fully connected layer
* add softmax layer
* add activation layers
* support arbitary rank TensorDescriptor
* pass input wrappers to `initCUDA()`
* add 1d/2d/3d-convolution
* add pooling layer
* reorganize and refactor code
* fixes for gcc, clang and doxygen; remove cxx14/17 code
* add blank_layer
* add LRN layer
* add rounding modes for pooling layer
* split tensor.hpp into tensor.hpp and tensor_ops.hpp
* add concat layer
* add scale layer
* add batch normalization layer
* split math.cu into activations.cu and math.hpp
* add eltwise layer
* add flatten layer
* add tensor transform api
* add asymmetric padding support for convolution layer
* add reshape layer
* fix rebase issues
* add permute layer
* add padding support for concat layer
* refactor and reorganize code
* add normalize layer
* optimize bias addition in scale layer
* add prior box layer
* fix and optimize normalize layer
* add asymmetric padding support for pooling layer
* add event API
* improve pooling performance for some padding scenarios
* avoid over-allocation of compute resources to kernels
* improve prior box performance
* enable layer fusion
* add const layer
* add resize layer
* add slice layer
* add padding layer
* add deconvolution layer
* fix channelwise ReLU initialization
* add vector traits
* add vectorized versions of relu, clipped_relu, power
* add vectorized concat kernels
* improve concat_with_offsets performance
* vectorize scale and bias kernels
* add support for multi-billion element tensors
* vectorize prior box kernels
* fix address alignment check
* improve bias addition performance of conv/deconv/fc layers
* restructure code for supporting multiple targets
* add DNN_TARGET_CUDA_FP64
* add DNN_TARGET_FP16
* improve vectorization
* add region layer
* improve tensor API, add dynamic ranks
  1. use ManagedPtr instead of a Tensor in backend wrapper
  2. add new methods to tensor classes
     - size_range: computes the combined size of for a given axis range
     - tensor span/view can be constructed from a raw pointer and shape
  3. the tensor classes can change their rank at runtime (previously rank was fixed at compile-time)
  4. remove device code from tensor classes (as they are unused)
  5. enforce strict conditions on tensor class APIs to improve debugging ability
* fix parametric relu activation
* add squeeze/unsqueeze tensor API
* add reorg layer
* optimize permute and enable 2d permute
* enable 1d and 2d slice
* add split layer
* add shuffle channel layer
* allow tensors of different ranks in reshape primitive
* patch SliceOp to allow Crop Layer
* allow extra shape inputs in reshape layer
* use `std::move_backward` instead of `std::move` for insert in resizable_static_array
* improve workspace management
* add spatial LRN
* add nms (cpu) to region layer
* add max pooling with argmax ( and a fix to limits.hpp)
* add max unpooling layer
* rename DNN_TARGET_CUDA_FP32 to DNN_TARGET_CUDA
* update supportBackend to be more rigorous
* remove stray include from preventing non-cuda build
* include op_cuda.hpp outside condition #if
* refactoring, fixes and many optimizations
* drop DNN_TARGET_CUDA_FP64
* fix gcc errors
* increase max. tensor rank limit to six
* add Interp layer
* drop custom layers; use BackendNode
* vectorize activation kernels
* fixes for gcc
* remove wrong assertion
* fix broken assertion in unpooling primitive
* fix build errors in non-CUDA build
* completely remove workspace from public API
* fix permute layer
* enable accuracy and perf. tests for DNN_TARGET_CUDA
* add asynchronous forward
* vectorize eltwise ops
* vectorize fill kernel
* fixes for gcc
* remove CSL headers from public API
* remove csl header source group from cmake
* update min. cudnn version in cmake
* add numerically stable FP32 log1pexp
* refactor code
* add FP16 specialization to cudnn based tensor addition
* vectorize scale1 and bias1 + minor refactoring
* fix doxygen build
* fix invalid alignment assertion
* clear backend wrappers before allocateLayers
* ignore memory lock failures
* do not allocate internal blobs
* integrate NVTX
* add numerically stable half precision log1pexp
* fix indentation, following coding style, improve docs
* remove accidental modification of IE code
* Revert "add asynchronous forward" This reverts commit 1154b9d.
* [cmake] throw error for unsupported CC versions
* fix rebase issues
* add more docs, refactor code, fix bugs
* minor refactoring and fixes
* resolve warnings/errors from clang
* remove haveCUDA() checks from supportBackend()
* remove NVTX integration
* changes based on review comments
* avoid exception when no CUDA device is present
* add color code for CUDA in Net::dump

YashasSamaga force-pushed the cuda4dnn-csl-low branch 2 times, most recently from 5717c7f to 359bf93 (June 18, 2019 13:29)

alalek reviewed Jun 18, 2019
modules/dnn/src/cuda4dnn/csl/cudnn.cpp (outdated, resolved)
modules/dnn/src/cuda4dnn/csl/stream.cpp (outdated, resolved)
modules/dnn/include/opencv2/dnn/csl/cublas.hpp (outdated, resolved)

YashasSamaga changed the title from "add low-level CSL components for cuda4dnnn" to "[WIP] CUDA backend for the DNN module" Jun 21, 2019

YashasSamaga force-pushed the cuda4dnn-csl-low branch from 46db2b1 to fbd05d3 (June 21, 2019 05:38)

YashasSamaga commented Jun 23, 2019
modules/dnn/src/cuda/math.cu (outdated, resolved)

YashasSamaga commented Jun 23, 2019
modules/dnn/src/cuda/math.cu (outdated, resolved)

YashasSamaga force-pushed the cuda4dnn-csl-low branch from c8fd75b to 30b294e (June 25, 2019 11:56)

alalek mentioned this pull request Jun 26, 2019

dnn: fix BNLL layer implementation #14899

Merged

YashasSamaga commented Jun 26, 2019
modules/dnn/src/cuda4dnn/csl/tensor_ops.hpp (outdated, resolved)

YashasSamaga commented Jun 26, 2019
modules/dnn/src/layers/eltwise_layer.cpp (outdated, resolved)

YashasSamaga commented Jun 28, 2019
modules/dnn/src/layers/convolution_layer.cpp (outdated, resolved)

YashasSamaga force-pushed the cuda4dnn-csl-low branch 3 times, most recently from 79c65f0 to 2941d74 (July 2, 2019 06:29)

YashasSamaga commented Jul 10, 2019
modules/dnn/src/cuda4dnn/csl/tensor_ops.hpp (outdated, resolved)

YashasSamaga commented Jul 10, 2019
modules/dnn/src/cuda4dnn/csl/kernel_utils.hpp (outdated, resolved)

YashasSamaga commented Jul 16, 2019
modules/dnn/src/cuda/types.hpp (resolved)

YashasSamaga force-pushed the cuda4dnn-csl-low branch from 39837c8 to b89d7e0 (July 16, 2019 13:31)

YashasSamaga commented Jul 17, 2019
modules/dnn/include/opencv2/dnn/dnn.hpp (outdated, resolved)

dkurt reviewed Jul 19, 2019
modules/dnn/include/opencv2/dnn/dnn.hpp (outdated, resolved)

YashasSamaga commented Jul 21, 2019
modules/dnn/src/cuda4dnn/cxx_utils/is_iterator.hpp (resolved)

YashasSamaga commented Jul 23, 2019
modules/dnn/src/cuda/permute.cu (outdated, resolved)

YashasSamaga force-pushed the cuda4dnn-csl-low branch 2 times, most recently from a818297 to 3584d72 (July 25, 2019 18:45)

YashasSamaga force-pushed the cuda4dnn-csl-low branch 2 times, most recently from 9279922 to becb664 (August 7, 2019 05:50)

grzegorzk mentioned this pull request Feb 22, 2020

opencv 4.2 and support for cuda opencv/opencv-python#295

Closed

YashasSamaga mentioned this pull request Mar 4, 2020

DNN : cannot set GPU device #16725

Closed

opencv locked as too heated and limited conversation to collaborators Apr 14, 2020

emgucv mentioned this pull request Jul 9, 2020

SEHException in net.Forward() with CUDA background. emgucv/emgucv#343

Closed

tushardhadiwal mentioned this pull request Jan 7, 2021

Improving inference time jkjung-avt/tensorrt_demos#321

Closed
