Getting Caffe to use the GPU #147

Closed
robertnishihara opened this issue Feb 7, 2016 · 20 comments

@robertnishihara

Hi, thanks for all of your help so far!

I have a question about getting Caffe to use the GPU. A minimal example is below, and a project that compiles and runs the code below is attached.

This code creates a network for Cifar10 and then calls caffeNet.ForwardBackward(inputs) some number of times. You can see by running top in a separate window that CPU usage is very high. Furthermore, each call to ForwardBackward takes about 0.3s (on my machine), which is much slower than you would expect for one minibatch of Cifar10 on a GPU. This suggests that the GPU is not being used. A call to nvidia-smi also does not show any GPU usage.

The line Caffe.set_mode(Caffe.GPU) doesn't seem to make any difference. Are there any obvious mistakes here? Thanks!

import org.bytedeco.javacpp.caffe._

object ExampleGPU {
  def main(args: Array[String]) {
    val netParam = new NetParameter();
    ReadProtoFromTextFileOrDie("cifar10_quick.prototxt", netParam);
    val caffeNet = new FloatNet(netParam)
    val inputSize = netParam.input_size

    // this block of code just sets up the input to the net
    val inputs = new FloatBlobVector(inputSize)
    val inputRef = new Array[FloatBlob](inputSize)
    for (i <- 0 to inputSize - 1) {
      val dims = new Array[Int](netParam.input_shape(i).dim_size)
      for (j <- dims.indices) {
        dims(j) = netParam.input_shape(i).dim(j).toInt
      }
      inputRef(i) = new FloatBlob(dims)
      inputs.put(i, inputRef(i))
    }

    val t1 = System.currentTimeMillis()
    for (i <- 0 to 20) {
      Caffe.set_mode(Caffe.GPU) // this line makes no difference
      val tops = caffeNet.ForwardBackward(inputs)
    }
    val t2 = System.currentTimeMillis()
    print("iters took " + ((t2 - t1) * 1F / 1000F).toString + " s\n")
  }
}
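For anyone reproducing the measurement, a mean per-call time with a few untimed warm-up calls is easier to compare across runs than a single total. A minimal Java sketch follows; `IterationTimer` and `meanMillis` are hypothetical names, not part of JavaCPP or Caffe, and the sleeping workload only stands in for the ForwardBackward call.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: times a repeated call and reports the mean time per call. */
final class IterationTimer {

    /**
     * Runs step `warmup` times untimed, then `iters` times timed, and returns
     * the mean wall-clock time per timed call in milliseconds.
     */
    static double meanMillis(Runnable step, int warmup, int iters) {
        if (iters <= 0) {
            throw new IllegalArgumentException("iters must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            step.run(); // keep one-time setup costs out of the timed region
        }
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            step.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1e6 / iters;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): replace with the call being measured.
        Runnable step = () -> {
            try {
                TimeUnit.MILLISECONDS.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.printf("mean per call: %.2f ms%n", meanMillis(step, 2, 10));
    }
}
```

The warm-up calls matter because the first iterations often include allocation and initialization work that does not reflect steady-state speed.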

ExampleGPU.zip

@saudet
Member

saudet commented Feb 8, 2016

Any ideas @cypof?

@cypof
Contributor

cypof commented Feb 8, 2016

It would be useful to put traces in common.cpp, where the thread-local instance of Caffe is created. That will show whether the GPU context is created correctly. Also, Caffe.set_mode might need to be called earlier, before the creation of the FloatNet.
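As a rough illustration of what a thread-local instance implies: state set on one thread is not visible from another, so a mode set on the main thread would not carry over to a thread that ends up with its own instance. The sketch below is a plain-Java analogy built on `ThreadLocal`. `ThreadLocalModeSketch` and its methods are made up for illustration and are not Caffe's or JavaCPP's API, and the sketch assumes a fresh instance defaults to CPU mode.

```java
/** Plain-Java analogy (not Caffe code): a per-thread mode that defaults to CPU. */
final class ThreadLocalModeSketch {

    enum Mode { CPU, GPU }

    // Each thread gets its own copy, initialized to CPU (assumed default).
    private static final ThreadLocal<Mode> MODE = ThreadLocal.withInitial(() -> Mode.CPU);

    static void setMode(Mode mode) {
        MODE.set(mode);
    }

    static Mode mode() {
        return MODE.get();
    }

    /** Starts a new thread and reports the mode that thread observes. */
    static Mode modeSeenByNewThread() throws InterruptedException {
        Mode[] seen = new Mode[1];
        Thread worker = new Thread(() -> seen[0] = mode());
        worker.start();
        worker.join(); // join makes the worker's write visible to the caller
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        setMode(Mode.GPU);
        System.out.println("calling thread sees: " + mode());              // GPU
        System.out.println("new thread sees:     " + modeSeenByNewThread()); // CPU
    }
}
```

In this analogy the setting has to be made on the same thread that later does the work, which is one reason the order and placement of the call could matter.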

@robertnishihara
Author

Thanks @saudet and @cypof! We tried two things.

  1. Moving Caffe.set_mode(Caffe.GPU) to right before val caffeNet = new FloatNet(netParam), which resulted in the following crash.
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd90c8bf090, pid=35395, tid=140570775308032
#
# JRE version: OpenJDK Runtime Environment (7.0_91-b02) (build 1.7.0_91-b02)
# Java VM: OpenJDK 64-Bit Server VM (24.91-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.3
# Distribution: Ubuntu 14.04 LTS, package 7u91-2.6.3-0ubuntu0.14.04.1
# Problematic frame:
# C  [libcaffe.so+0x3c1090]  caffe::Caffe::RNG::generator()+0x0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

We get the same error when we run the src/main/java/caffe.java code from the README https://github.com/bytedeco/javacpp-presets/blob/master/caffe/README.md.

  2. To check if the problem was with CUDA, we compiled Caffe (not using JavaCPP), both with and without USE_CUDNN := 1, and found that ./examples/mnist/train_lenet.sh worked perfectly (and used the GPU as shown by nvidia-smi).
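The nvidia-smi check can also be run from the same JVM while the training loop is running. The following is a minimal sketch, assuming nvidia-smi is on the PATH. `GpuUtilizationProbe` is a hypothetical helper, and it returns an empty list when the tool is missing or exits with an error.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: asks nvidia-smi for the current utilization of each GPU. */
final class GpuUtilizationProbe {

    /** Returns one line per GPU as printed by nvidia-smi, or an empty list on failure. */
    static List<String> queryUtilization() {
        ProcessBuilder pb = new ProcessBuilder(
                "nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader");
        pb.redirectErrorStream(true);
        List<String> lines = new ArrayList<>();
        try {
            Process process = pb.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        lines.add(trimmed);
                    }
                }
            }
            if (process.waitFor() != 0) {
                return new ArrayList<>(); // treat a failing nvidia-smi as "no data"
            }
        } catch (IOException e) {
            return new ArrayList<>(); // nvidia-smi not installed or not on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new ArrayList<>();
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> utilization = queryUtilization();
        if (utilization.isEmpty()) {
            System.out.println("nvidia-smi unavailable or returned no data");
        } else {
            utilization.forEach(u -> System.out.println("GPU utilization: " + u));
        }
    }
}
```

The value is a momentary sample, so it only says something about the network if it is polled from another thread while ForwardBackward is executing.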

@saudet
Copy link
Member

saudet commented Feb 9, 2016

It also crashes in the same way here. It wasn't like that before. I wonder what changed...

@pcmoritz
Contributor

I'm in the process of debugging this right now. @saudet or @cypof, can you give me a date or commit where you think this worked (and also a commit/date/version of Caffe that worked)?

@cypof
Contributor

cypof commented Feb 20, 2016

It worked with both master branches a couple of months ago. Do you have a way to debug C code called from inside the JVM? It can be done by launching the JVM from a C program, using jni.h, so that gdb can be attached. Let me know if you are all set.
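A simpler alternative to launching the JVM from C, swapped in here purely for illustration, is to have the Java program print its own process ID and optionally pause, so that gdb can be attached with `gdb -p <pid>` before the native call runs. This sketch assumes Java 9 or later for `ProcessHandle`. `AttachPoint` and the `--wait` argument are hypothetical.

```java
import java.io.IOException;

/** Hypothetical helper: prints the JVM's PID so a native debugger can attach. */
final class AttachPoint {

    /** Process ID of the running JVM (requires Java 9+). */
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    /** Command line that attaches gdb to the given process. */
    static String gdbCommand(long pid) {
        return "gdb -p " + pid;
    }

    public static void main(String[] args) throws IOException {
        long pid = currentPid();
        System.out.println("JVM pid: " + pid);
        System.out.println("attach with: " + gdbCommand(pid));

        // Only block when asked to, so the program also runs unattended.
        if (args.length > 0 && "--wait".equals(args[0])) {
            System.out.println("press Enter once the debugger is attached...");
            System.in.read();
        }

        // The native call under investigation would go here.
    }
}
```

Pausing before the first native call gives time to set breakpoints in the shared library before the crashing code path is reached.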

@saudet
Member

saudet commented Feb 20, 2016

Thanks for the input guys! I've tried it in GDB and it crashes in Boost, on a null-pointer assert trying to dereference a shared_ptr from the RNG module. It's really strange. It has nothing to do with CUDA or Java.

I remember it working with CUDA 7.0, but I'm not sure that's related. CUDA 7.5 has been out for a while already...

@pcmoritz
Contributor

Ah! Thanks a lot guys, that explains why I couldn't find a version of caffe/javacpp that worked. I didn't suspect that CUDA 7.5 was the problem! Trying 7.0 now...

@pcmoritz
Contributor

Unfortunately, I observed the same crash with CUDA 7.0. I set the .jar up like this: https://github.com/amplab/SparkNet/blob/0cbff42ec8e072be215da9f302b3ad55f96e5679/doc/creating-jars.md

I can also try an older version of boost and see if that fixes the problem.

@saudet
Member

saudet commented Feb 22, 2016

Thanks for testing! I didn't think it was related to the CUDA version, and I don't think it's Boost either, but it's easy enough to test, so why not.

If I had a bit of time though, the next thing I'd try to do is to convert the main() function of the original caffe.cpp sample into a JNI function, and call that and only that from a main() method of Java. If it behaves the same by crashing in the RNG, then we could be sure that it has nothing to do with JavaCPP and that the JDK and Caffe with CUDA are conflicting in some way. It can happen, weirder things have happened:
http://stackoverflow.com/questions/9627333/jvm-does-not-work-as-expected-with-jni-c-code-containing-a-class-named-node/9781340#9781340

@pcmoritz
Contributor

Thanks a lot for your help!

The approach I'm trying is doing a bisection to see if I can pinpoint where the problem was introduced. For this to work, I need a configuration where it doesn't crash. The oldest configuration I succeeded building was

commit 5070834
Author: Samuel Audet
Date: Fri Sep 25 16:44:12 2015 +0900

and caffe/javacpp-presets from the same day, which still seems to crash (that was with CUDA 7.5 however). I'll keep you updated about how it goes.

@saudet
Member

saudet commented Feb 23, 2016

Yes, I understand, but I'm pretty sure it used to work with the release of JavaCPP 1.1. Since then, OpenJDK and the kernel have been updated a few times, and who knows what else at this point.

@bfoust

bfoust commented Mar 1, 2016

This error happened for me after updating Caffe, but before I updated to JavaCPP 1.1 (that's why I tried updating).

@saudet
Member

saudet commented Mar 1, 2016

@bfoust So it's something that changed in Caffe itself?

@brentsony

@saudet Not yet verified; I haven't reverted Caffe yet to test that. I'm also on a different machine, so it is quite possibly just a g++ version mismatch (32-/64-bit) issue. I currently suspect JavaCPP is using a different version of g++ than Caffe, as Caffe would not compile without the previous version of gcc/g++ (3.8).

@saudet
Member

saudet commented Mar 3, 2016

As a small experiment just now, I renamed the main() function of the original caffe.cpp example to execute() (or something) and called that directly from the main() method of Java, and it runs fine! So it's not the C++ compiler, the JVM, or something weird with the dependencies. The caffe.cpp file has been updated in subtle ways for a while, and I haven't kept the Java translation up to date, so there might be something that's not being set properly anymore. Or there's a bug in the way some of the functions are called via JavaCPP...

saudet added a commit that referenced this issue Mar 4, 2016
 * Make OpenBLAS build for Caffe more generic (issue #154)
@saudet
Member

saudet commented Mar 4, 2016

Ok, I've found the issue! We just have to remove the CPU_ONLY define... :-/

@robertnishihara
Author

Awesome! We'll give this a try!

@robertnishihara
Author

It worked for us. Thanks a lot!

@liaocs2008

I couldn't find where you removed "CPU_ONLY" in your "caffe/src/main/java/org/bytedeco/javacpp/presets/caffe.java", so I tried editing it myself and removing "CPU_ONLY" from this line:

@Platform(value = {"linux", "macosx"}, define = {"NDEBUG", "CPU_ONLY", ...

Now it is working!
