Run Tensorflow on OpenCL™ devices. UNDER CONSTRUCTION!!!
This repo was created from the original Tensorflow repository at:
Please see the main repository for full Tensorflow documentation. This readme will only focus on the OpenCL porting aspects of Tensorflow.
- per-element binary operators: `add`, `sub`, `mul`, `div`, `pow`, `minimum`, `maximum`, `squared_difference`, as per test_tf3.py
- per-element unary operators: `tanh`, `abs`, `acos`, `asin`, `atan`, `ceil`, `cos`, `exp`, `floor`, `inverse`, `isfinite`, `isinf`, `isnan`, `log`, `neg`, `sign`, `sin`, `sqrt`, `square`, `tan` (test: test_tf4.py)
- Variables can be placed on GPU
- `matmul` (using CLBlast)
- some gradients
- `reduce_sum`, `reduce_prod`, `reduce_max`, `reduce_mean`, `reduce_min` working, in beta test_reductions.py
- training works :-)))
- device name and memory reported correctly now
- Aymeric Damien's 2_BasicModels run ok on NVIDIA K520 now (not working on Intel HD5500 yet).
- types:
  - float32 is the primary supported type
  - int32 is also supported, as a second priority
  - int8 (or uint8, haven't decided yet) will probably be supported too
  - out of scope: complex, double, half
- fix bugs...
- add convolutions
- For now, Ubuntu 16.04 is supported. In the future, I plan to support Mac OS X too
- You need:
  - the tensorflow non-gpu installation pre-requisites,
  - an OpenCL 1.2-enabled GPU, and OpenCL 1.2-enabled drivers
  - python 3
- Download https://github.com/hughperkins/tensorflow-cl/releases/download/v0.11.0/tensorflow-0.11.0rc0-py3-none-any.whl
- Install using pip: `pip install --upgrade tensorflow-0.11.0rc0-py3-none-any.whl`
If you want, you can build from source
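Once the wheel is installed, a quick smoke test can confirm that an op is really placed on the OpenCL device. The following is a minimal sketch, not part of this repo: the function name `run_squared_difference_on_gpu` is hypothetical, and it assumes that the OpenCL device is exposed as `/gpu:0` and that the standard TensorFlow 0.11 session options `log_device_placement` and `allow_soft_placement` behave the same way in this port.

```python
def run_squared_difference_on_gpu(a_values, b_values):
    """Build a tiny graph pinned to the first GPU device and run it.

    Hypothetical helper for illustration only; it is not part of tensorflow-cl.
    """
    # Imported lazily, so this file can be loaded without TensorFlow installed.
    import tensorflow as tf

    # Assumption: the OpenCL device is registered under the usual GPU name.
    with tf.device('/gpu:0'):
        a = tf.constant(a_values, dtype=tf.float32)
        b = tf.constant(b_values, dtype=tf.float32)
        c = tf.squared_difference(a, b)

    config = tf.ConfigProto(
        log_device_placement=True,    # log the device each op was assigned to
        allow_soft_placement=False)   # raise an error rather than fall back to CPU
    with tf.Session(config=config) as sess:
        return sess.run(c)
```

Calling `run_squared_difference_on_gpu([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])` from a Python 3 session should log the placement of each op. With soft placement disabled, an op that has no kernel on the device raises an error instead of silently running on the CPU.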
- test per-element binary operations tensorflow/stream_executor/cl/test/test_tf3.py:
cd
source ~/env3/bin/activate
python ~/git/tensorflow-cl/tensorflow/stream_executor/cl/test/test_tf3.py
- test per-element unary operations tensorflow/stream_executor/cl/test/test_tf4.py:
cd
source ~/env3/bin/activate
python ~/git/tensorflow-cl/tensorflow/stream_executor/cl/test/test_tf4.py
- test blas test_blas.py :
cd
source ~/env3/bin/activate
python ~/git/tensorflow-cl/tensorflow/stream_executor/cl/test/test_blas.py
- training example test_gradients.py
cd
source ~/env3/bin/activate
python ~/git/tensorflow-cl/tensorflow/stream_executor/cl/test/test_gradients.py
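The four test scripts above can also be run in one go. This is a sketch under assumptions: the helper names `build_commands` and `run_all` are hypothetical, the repo is assumed to be cloned to `~/git/tensorflow-cl` as in the commands above, and the script is assumed to be started from the already-activated `~/env3` virtualenv so that `sys.executable` is the right interpreter.

```python
import os
import subprocess
import sys

# Location and names taken from the commands listed above.
TEST_DIR = os.path.expanduser(
    '~/git/tensorflow-cl/tensorflow/stream_executor/cl/test')
TEST_SCRIPTS = ['test_tf3.py', 'test_tf4.py', 'test_blas.py', 'test_gradients.py']


def build_commands(test_dir=TEST_DIR, scripts=TEST_SCRIPTS, python=sys.executable):
    """Return one [interpreter, script_path] command per test script."""
    return [[python, os.path.join(test_dir, script)] for script in scripts]


def run_all(commands):
    """Run each command in turn; return {script name: exit code}."""
    results = {}
    for command in commands:
        script_name = os.path.basename(command[-1])
        results[script_name] = subprocess.call(command)
    return results
```

For example, `run_all(build_commands())` returns a dict in which any non-zero exit code marks a failing script.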
Piccie of running Aymeric Damien's linear_regression.py:
test | Intel HD5500 | NVIDIA K520 |
---|---|---|
test_tf.py | ok | ok |
test_tf2.py | ok | ok |
test_tf3.py | fails for pow | ok |
test_tf4.py | fails for all | ok |
test_blas.py | ok | ok |
test_reductions.py | fails for all except reduce_mean | ok |
linear_regression.py | runs, but cost seems wrong | ok |
logistic_regression.py | epoch 1 ok, then memory error | ok |
nearest_neighbor.py | accuracy 0.12, seems a bit low... | ok |
multilayer_perceptron.py | cost is nan | a bit slow, otherwise seems ok |
recurrent_network.py | loss nan, accuracy broken | cost looks ok, accuracy seems broken |
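When a device fails a test, such as the reductions on the HD5500 in the table above, it can help to compare the device output with a plain host-side reference. The sketch below is pure Python and is not taken from test_reductions.py. The helper names are hypothetical, and it assumes that "inner" means the last axis of a 2-D tensor and "outer" means the first.

```python
from functools import reduce
import operator

# Host-side reference implementations, keyed by the TensorFlow op name.
REDUCERS = {
    'reduce_sum': lambda xs: sum(xs),
    'reduce_prod': lambda xs: reduce(operator.mul, xs, 1.0),
    'reduce_max': max,
    'reduce_min': min,
    'reduce_mean': lambda xs: sum(xs) / len(xs),
}


def reference_reduce(matrix, name, axes):
    """Reduce a 2-D list of floats; axes is 'inner', 'outer' or 'all'."""
    f = REDUCERS[name]
    if axes == 'inner':    # collapse the last axis: one value per row
        return [f(row) for row in matrix]
    if axes == 'outer':    # collapse the first axis: one value per column
        return [f(column) for column in zip(*matrix)]
    if axes == 'all':      # collapse everything to a single scalar
        return f([x for row in matrix for x in row])
    raise ValueError('unknown axes: %r' % (axes,))


def max_abs_error(expected, actual):
    """Largest absolute difference between two scalars or two flat lists."""
    if not isinstance(expected, list):
        expected, actual = [expected], [actual]
    return max(abs(e - a) for e, a in zip(expected, actual))
```

A device result can then be checked with something like `max_abs_error(reference_reduce(m, 'reduce_sum', 'inner'), device_result) < 1e-4`, where the tolerance is a judgment call for float32.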
- tensorflow code stays 100% NVIDIA® CUDA™
- cuda-on-cl compiles the CUDA code into OpenCL
- Cedric Nugteren's CLBlast provides BLAS (matrix multiplications)
- CLBlast: BLAS for OpenCL
- cuda-on-cl: Compile CUDA apps for OpenCL
- EasyCL: Handles running kernels, passing in arguments etc, on OpenCL
- Nov 1:
  - building clew, CLBlast, easycl, cocl as shared libraries now, rather than static
    - hopefully this will facilitate debugging things on the HD5500 on my laptop, since I don't need to build/install the entire wheel for `libcocl` tweaks
  - turned on `clew`
    - this means the build process no longer needs `libOpenCL.so`
    - might facilitate building on Mac, since there is no longer any need to link to `libOpenCL.so`, which was outside the Bazel build tree
- Oct 30:
  - new wheel v0.11.0
    - fixes critical bug in v0.10.0 release, where the number of devices was hard-coded to be 0 :-P
    - Aymeric Damien's 2_BasicModels all run now, on NVIDIA K520. Seem broken on Intel HD5500 for now
    - bunch of fixes underneath to get 2_BasicModels working ok on K520
- Oct 29:
  - `reduce_min` working now, and test_reductions.py tests three types of reduction axes: inner, outer, all
  - Wheel v0.10.0 released:
    - Aymeric Damien's linear_regression runs fairly ok now (a bit slow, but not monstrously slow, maybe 3-4 times slower than on CUDA)
    - kernels cached between kernel launches (this gives a hugggeee speed boost, compared to earlier)
    - bunch of behind-the-scenes ops added, like Cast
    - memory and device name reported correctly now
    - `reduce_min` working now
    - `softmax` added
- Oct 28:
  - training working :-) test_gradients.py
  - `reduce_sum`, `reduce_prod`, `reduce_max`, `reduce_mean` added, in beta test_reductions.py
- Oct 25:
  - fixed BLAS wrapper, working now, on GPU, test script: test_blas.py
  - int32 constant works on gpu now, test_ints.py
- Oct 24:
  - hmmm, just discovered some new options, to ensure operations really are on the gpu, and ... many are not :-P, so back to the drawing board a bit
    - the good news is that component-wise add really is on the gpu
    - the bad news is that everything else is not :-P
  - (re-)added following per-element binary operators: `sub`, `mul`, `div`, `pow`, `minimum`, `maximum`, `squared_difference`. This time, they actually are really running on the gpu :-) (test: test_tf3.py)
  - (re-)added following per-element unary operators, which really are running on gpu now :-), test_tf4.py: `tanh`, `abs`, `acos`, `asin`, `atan`, `ceil`, `cos`, `exp`, `floor`, `inverse`, `isfinite`, `isinf`, `isnan`, `log`, `neg`, `sign`, `sin`, `sqrt`, `square`, `tan`
  - Variables can be placed on gpu now, test_gradients.py
- Oct 23:
  - can use component wise addition from Python now :-)
  - fixed critical bug involving `float4`s, that meant that tensors larger than, say, 3 :-P, could not be added correctly
  - added following per-element binary operators: `sub`, `mul`, `div`, `not_equal`, `minimum`, `maximum`, `pow`, `squared_difference` (test: test_tf3.py)
  - added following per-element unary operators: `tanh`, `abs`, `acos`, `asin`, `atan`, `ceil`, `cos`, `exp`, `floor`, `inverse`, `isfinite`, `isinf`, `isnan`, `log`, `neg`, `sigmoid`, `sign`, `sin`, `sqrt`, `square`, `tan` (test: test_tf4.py)
  - added following comparison operators: `equal_to`, `greater`, `greater_equal`, `less`, `less_equal`
  - added in BLAS (using Cedric Nugteren's CLBlast). Not very tested yet. Test script test_blas.py
- Oct 22:
  - componentwise addition working, when called from c++
  - commit `0db9cc2e`: re-enabled `-fPIC`, `-pie`
    - this is a pre-requisite for being able to run from python at some point
    - but if you built prior to this, you need to deeeeep clean, and rebuild from scratch: `rm -Rf third_party/cuda-on-cl/build`, then `bazel clean --expunge`
  - python working (as of commit 5e67304c3c)
    - you'll need to do `bazel clean`, and rebuild from scratch, if you already did a build prior to this commit
- Oct 20:
  - removed requirement for CUDA Toolkit
  - updated build slightly: added https://github.com/hughperkins/cuda-on-cl as a submodule
- Oct 18:
  - stream executor up
  - crosstool working
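The Oct 29 entry above credits kernel caching with a large speed-up. The idea is to build each kernel once and reuse the built object on later launches. The sketch below only illustrates that idea in pure Python. It is not the implementation used in this repo: `KernelCache` and `fake_compile` are hypothetical names, and the real build step would be an OpenCL program compilation.

```python
class KernelCache:
    """Memoize built kernels, keyed by kernel name and source text."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.compile_count = 0   # how many real builds were needed

    def get(self, kernel_name, source):
        key = (kernel_name, source)
        if key not in self._cache:
            # Cache miss: pay the build cost once and remember the result.
            self._cache[key] = self._compile_fn(kernel_name, source)
            self.compile_count += 1
        return self._cache[key]


def fake_compile(kernel_name, source):
    """Stand-in for an expensive OpenCL program build (assumption)."""
    return 'compiled:' + kernel_name
```

With `cache = KernelCache(fake_compile)`, calling `cache.get('add', src)` many times triggers only one build, so the per-launch cost drops to a dictionary lookup.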