Allow to set a device when loading a model #154
Conversation
Edited the save and load test to also test the load method with all possible devices. Added the changes to the changelog.
# check if params are still the same after load
new_params = model.policy.state_dict()
# Check if the model loads as expected for every possible choice of device:
for device in ["auto", "cpu", "cuda"]:
I noticed that the git code comparison looks quite messy, so I'm elaborating on the changes I've made here to ease the review process for you:
The actual change that I made here is the added 'for' loop that goes over all possible devices, and at each iteration the device parameter is passed to the call of 'load' (line 76). At the end of each iteration I delete the model (line 92) so it can be loaded cleanly at the next iteration.
Everything else is the same as before, i.e., I've used the exact same test (inside the new 'for' loop) to ensure proper loading and tested with all possible values of the new argument 'device'.
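To make the structure concrete, here is a minimal sketch of that loop (illustrative only: the model class and save path are placeholders, not the repo's actual test fixtures):

```python
import os

from stable_baselines3 import PPO

# placeholder model and save path, just to illustrate the loop structure
save_path = "tmp_save_load_test.zip"
PPO("MlpPolicy", "CartPole-v1").save(save_path)

for device in ["auto", "cpu", "cuda"]:
    # the new `device` argument is passed to the call of load()
    model = PPO.load(save_path, device=device)
    # ... the same parameter / prediction checks as before go here ...
    del model  # delete the model so it can be loaded cleanly at the next iteration

os.remove(save_path)
```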
It seems that you are not actually testing that the device parameter was successfully used.
Also, you should skip the cuda device if no GPU is available.
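One way that skip could look, as a sketch (not necessarily the exact change being requested):

```python
import torch as th

for device in ["auto", "cpu", "cuda"]:
    if device == "cuda" and not th.cuda.is_available():
        # no GPU on this machine, so the cuda case cannot be exercised
        continue
    # ... load the model with `device` and run the usual checks ...
```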
- You're right. I will work on improving the test.
- What should be the expected behavior when a user uses "device='cuda'" on a machine with no GPU?
I noticed that the c'tor defaults to using the CPU in that case without notifying the user.
Anyway, I think the test should include all possible inputs while verifying that the outcome matches your expectations. Do you agree?
In my test I've used the utils.get_device() function (which is also used inside the constructor) to determine the expected device. This way, if, for example, the behavior of get_device changes, the test won't break.
… policy would change, it wouldn't break the test.
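Roughly, the idea is the following (a sketch; `save_path` stands in for the test's actual save location and PPO for whatever algorithm the test instantiates):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.utils import get_device

for device in ["auto", "cpu", "cuda"]:
    # derive the expected device from the same helper the constructor uses,
    # so the assertion tracks get_device() if its behavior ever changes
    expected_device = get_device(device)
    model = PPO.load(save_path, device=device)
    assert model.device.type == expected_device.type
    del model
```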
@araffin's suggestion during the PR process
Co-authored-by: Antonin RAFFIN <[email protected]>
running on GPU, it yields this error:
Co-authored-by: Antonin RAFFIN <[email protected]>
Thanks for all your help! I'll look into it.
…et_device() doesn't provide device index. Now the code loads all of the model parameters from the saved state dict straight into the required device. (fixed load_from_zip_file).
When comparing the devices in the test, I restored the comparison of types only, since the "get_device()" function doesn't fill the device index, which causes problems.
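As a rough sketch of that idea (not the library's exact code, and the file name below is a placeholder): the resolved device is handed to torch when deserializing, so the tensors land directly on it, and the check then compares device types only.

```python
import torch as th

from stable_baselines3.common.utils import get_device

device = get_device("auto")

# load the saved state dict straight onto the requested device
with open("policy_params.pth", "rb") as file_handler:  # placeholder file name
    state_dict = th.load(file_handler, map_location=device)

# compare device *types* only: get_device() does not fill in a device index,
# so e.g. "cuda" and "cuda:0" would otherwise compare as different devices
some_param = next(iter(state_dict.values()))
assert some_param.device.type == device.type
```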
@@ -352,6 +352,7 @@ def load_from_pkl(path: Union[str, pathlib.Path, io.BufferedIOBase], verbose=0)
def load_from_zip_file(
    load_path: Union[str, pathlib.Path, io.BufferedIOBase],
    load_data: bool = True,
    device: Union[th.device, str] = "auto",
Does the order of the arguments here make sense? I'm not sure whether I should have added the new argument last, for cases where users didn't use explicit keyword arguments.
On the other hand, I think it makes more sense for it to sit in front of 'verbose'...
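For illustration, the two orderings being weighed (simplified signatures, not the actual annotated one from the diff above):

```python
# option used in this PR: the new argument sits in front of `verbose`
def load_from_zip_file(load_path, load_data=True, device="auto", verbose=0):
    ...

# alternative: append it last, so older positional calls keep their meaning
def load_from_zip_file(load_path, load_data=True, verbose=0, device="auto"):
    ...
```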
Now I'm observing another concerning issue related to this: on my GPU-capable machine, "test_predict.test_predict" fails on the same assertion (
I see... but you could easily fix that by passing
It doesn't fix the issue, unfortunately.
I think, as it is fast and easy to fix, please update the tests to use
…dated the assertion to consider only types of devices. Also corrected a related bug in 'get_device()' method.
LGTM, thanks =)
Added a 'device' keyword argument to BaseAlgorithm.load(), to enable users to load the model onto their device of choice.
Edited test_save_load to also test the load method with all possible devices.
Added the changes to the changelog (I'm not completely confident about the choice of words though).
Description
Added a 'device' keyword argument to BaseAlgorithm.load() with a default value of 'auto' (matching the hard-coded value used before my change), and forwarded it to the constructor call inside the load method instead of the hard-coded string.
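A simplified sketch of what that looks like (abridged and approximate, not the full load() implementation):

```python
@classmethod
def load(cls, load_path, env=None, device="auto", **kwargs):
    # the device now comes from the new keyword argument ("auto" by default)
    # instead of being hard-coded as the string "auto"
    data, params, pytorch_variables = load_from_zip_file(load_path, device=device)
    model = cls(policy=data["policy_class"], env=env, device=device, _init_setup_model=False)
    ...
```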
Motivation and Context
I'm using this repo for my research, and when I work with simpler models (a tiny MLP, for example) it is actually faster to use the CPU, even though my machine has a powerful GPU, since the overhead of the GPU calls outweighs the benefits.
Thus, when I load the models I'm training, I need to be able to easily force them to load on the CPU.
Since the code is already there, but the device is currently chosen by a hard-coded string inside the load method, I suggested making this small but significant change (:
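For example (illustrative file name), forcing a saved model onto the CPU at load time becomes a one-liner:

```python
from stable_baselines3 import PPO

# load a previously trained model onto the CPU, even on a GPU machine
model = PPO.load("my_trained_model.zip", device="cpu")
```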
Closes #153
Types of changes
Checklist:
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass (required)