Question about multiple CPU usage. #137

Open
charlesjsun opened this issue Apr 28, 2020 · 3 comments

Comments

@charlesjsun
Contributor

What's the proper way of limiting CPU (or GPU) usage? I tried setting --cpus 6 or --trial-cpus 6 or both, but all of them seem to use all 12 of the CPUs. Also, from my understanding, in softlearning, only 1 trial is ever created, and only one environment is created, so what are the reasons why more than 1 CPU is ever needed?

@hartikainen
Member

We use Ray Tune for running all the trials/experiments, meaning that all the resources configurable through the command line correspond to Tune's resources. You can basically configure two things: 1) what resources are available to the Tune runner, and 2) what resources are required for each trial to run. These resources are not hard constraints but are only used for scheduling purposes. I.e., if your machine has 40 cpus available and you specify --trial-cpus=8, then softlearning will run 5 trials in parallel, but all 5 of those trials can still access all 40 cpus. Note that GPUs are an exception here: depending on your --trial-gpus flag, Tune will actually set CUDA_VISIBLE_DEVICES for you such that each trial only uses the specific GPU(s) assigned to it. Also note that a CUDA_VISIBLE_DEVICES you specify manually takes priority.

For example, if you have a machine with 24 cpus and 8 gpus, then:

  • --trial-cpus=1 --trial-gpus=1 will run 8 trials, i.e. 1 trial per gpu (assuming you haven't limited CUDA_VISIBLE_DEVICES manually).
  • --trial-cpus=4 --trial-gpus=1 will run 24/4 = 6 trials, because cpus are the bottleneck (the sketch after this list walks through the same arithmetic).
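
To make the scheduling arithmetic concrete, here's a toy sketch of the bookkeeping (illustration only; Ray's actual scheduler handles this internally):

def max_concurrent_trials(machine_cpus, machine_gpus, trial_cpus, trial_gpus):
    # The tighter of the CPU and GPU constraints decides how many
    # trials Tune schedules at once.
    by_cpu = machine_cpus // trial_cpus if trial_cpus else float('inf')
    by_gpu = machine_gpus // trial_gpus if trial_gpus else float('inf')
    return min(by_cpu, by_gpu)

# The 24-CPU / 8-GPU machine from the example above:
assert max_concurrent_trials(24, 8, trial_cpus=1, trial_gpus=1) == 8
assert max_concurrent_trials(24, 8, trial_cpus=4, trial_gpus=1) == 6  # cpus are the bottleneck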

By default, I'd say you only ever want to configure the trial resources (2) above) and not the machine resources themselves, since Ray determines the machine resources automatically.

Machine resources 1) from above are passed into ray.init here:

resources=example_args.resources or {},
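
For context, that line sits inside the ray.init call roughly like the sketch below. num_cpus, num_gpus, and resources are real ray.init parameters, but the example_args.cpus / example_args.gpus attribute names are my assumption about how the --cpus / --gpus flags get wired in:

import ray

ray.init(
    num_cpus=example_args.cpus,    # --cpus flag; None lets Ray auto-detect (assumed name)
    num_gpus=example_args.gpus,    # --gpus flag; None lets Ray auto-detect (assumed name)
    resources=example_args.resources or {})  # custom resources, if any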

Trial resources 2) from above are passed into tune.run here (in experiment_kwargs):

tune.run(
    trainable_class,
    **experiment_kwargs,
    with_server=example_args.with_server,
    server_port=example_args.server_port,
    scheduler=None,
    reuse_actors=True)
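
For completeness, the trial resources end up inside experiment_kwargs as Tune's per-trial resource request, roughly like this (resources_per_trial is Tune's real keyword; exactly how softlearning builds the dict is my assumption):

experiment_kwargs = {
    'resources_per_trial': {
        'cpu': example_args.trial_cpus,  # --trial-cpus (assumed attribute name)
        'gpu': example_args.trial_gpus,  # --trial-gpus (assumed attribute name)
    },
    # ... plus the run config, stopping criteria, etc.
}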

... in softlearning, only 1 trial is ever created, and only one environment is created, so what are the reasons why more than 1 CPU is ever needed?

I think there should be more than one trial created if you set the trial resources correctly. Make sure you're running with softlearning run_example_local ... instead of softlearning run_example_debug .... Debug mode by default limits runs to only 1 trial at a time to make debugging more manageable.

Also note that even though we currently use only one environment for sampling, all the numerical frameworks (i.e. numpy and tensorflow) will still automatically run across multiple cpus. With SAC, the environment sampling is rarely the bottleneck (even when training from images), which is why I haven't implemented parallel environment sampling yet. It's on my todo list though. Let me know if you think it should be prioritized.
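
As an aside, if you want to actually cap how many cpus numpy/tensorflow grab (the Tune flags above are only scheduling hints), you can restrict their thread pools yourself. A rough sketch, assuming TensorFlow 2.x and that this runs before either framework has spun up its threads:

import os

# Cap the BLAS/OpenMP pools numpy uses; must be set before numpy
# initializes its backend.
os.environ['OMP_NUM_THREADS'] = '3'

import tensorflow as tf

# Cap TensorFlow's own thread pools (TF 2.x API); must run before the
# first TF op executes.
tf.config.threading.set_intra_op_parallelism_threads(3)
tf.config.threading.set_inter_op_parallelism_threads(1)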

Did that answer your questions? Let me know if any of that is still unclear.

@charlesjsun
Contributor Author

Thanks! So just so I'm understanding this right, even if I have 4 trials, only one environment is created, so the different trials just speed up numpy and tensorflow calculations. In this case, say I have 12 CPUs and 4 GPUs, is there a difference between using 4 trials each with 3 CPUs and 1 GPU, versus using 1 trial with all 12 CPUs and 4 GPUs?

Also, you said the Tune runner will try to use all the available resources to run experiments if the --cpus/--gpus flags aren't given. What if I don't want to consume all of a machine's resources because I'm sharing it with others or want to run multiple experiments? Which settings should I use then?

@hartikainen
Member

hartikainen commented Apr 28, 2020

even if I have 4 trials, only one environment is created

Not exactly. We still create 1 environment for each trial. That is, each trial is completely independent of other trials (unless you use some fancier hyperparameter tuning).

12 CPUs and 4 GPUs, is there a difference between using 4 trials each with 3 CPUs and 1 GPU, versus using 1 trials with all 12 CPUs and 4 CPUs?

Yeah, there's a difference here. Imagine you sweep over four different Q_lr values with tune.grid_search([1e-4, 3e-4, 1e-3, 3e-3]) and you run 3 samples (with --num-samples=3). Then you effectively have 4*3 = 12 trials to run. If you run 4 trials each with 3 CPUs and 1 GPU, then your 12 trials will finish roughly 3 times faster than they would if you only queued 1 trial at a time with 12 CPUs and 4 GPUs. That's because it's hard for 1 trial to actually fully leverage so many cpus/gpus at once.
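
Back-of-the-envelope for that example (all numbers illustrative):

import math

total_trials = 4 * 3       # 4 grid values x --num-samples=3
concurrent = 4             # 4 trials at a time, each with 3 CPUs and 1 GPU
waves_parallel = math.ceil(total_trials / concurrent)  # 3 waves of trials
waves_serial = total_trials                            # 12 waves with one big trial at a time
# Even if the single 12-CPU/4-GPU trial runs somewhat faster per trial,
# 3 waves vs. 12 is why the parallel setup finishes roughly 3x sooner.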

Here are some very rough estimates of how I allocate my resources. For runs that use low-level state (i.e. not vision observations), I typically run 3-6 trials per GPU, as long as each trial has at least 1 or 2 CPUs. If you have no GPUs available, you need more than 1 CPU per trial, with the optimum probably somewhere around 4. For vision-based experiments, it really depends on your GPU, image size, and convnet size. In some cases you can just run as many trials as fit in GPU memory, but typically I run something like 2-3 trials per GTX 1080 with 64x64 images and a couple of convnet layers.

These numbers are meant to minimize the cost per trial. Obviously, if you want to maximize the speed of a single trial without caring about the cost, you'd just allocate all your resources to one trial at a time :)
