[Feature] Multicollector interruptor #963
Conversation
LGTM, I left some comments.
torchrl/collectors/collectors.py (Outdated)

@@ -67,6 +68,44 @@ def __call__(self, td: TensorDictBase) -> TensorDictBase:
        return td.set("action", self.action_spec.rand())


class Interruptor:
If it's public it should be in the docs.
In that case I think it makes more sense for Interruptor and InterruptorManager to be private. I don't see the Interruptor class being useful to users beyond the scope of the collector.
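For context, the Interruptor discussed here can be thought of as a small shared stop flag for collector workers. A minimal sketch follows; the method names and internal layout are assumptions based on the diff fragments in this thread, not TorchRL's exact implementation:

```python
class Interruptor:
    """Tells multi-process data collectors whether to keep gathering frames.

    Sketch only: the real class is shared across processes (see the
    InterruptorManager discussion below in this thread).
    """

    def __init__(self):
        # collection is enabled by default
        self._collect = True

    def start_collection(self):
        # (re-)enable collection, e.g. at the start of an iteration
        self._collect = True

    def stop_collection(self):
        # signal stragglers to stop early
        self._collect = False

    def collection_stopped(self):
        # mirrors the `return self._collect is False` line from the diff
        return self._collect is False
```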
torchrl/collectors/collectors.py (Outdated)

        return self._collect is False


class InterruptorManager(SyncManager):
ditto
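The InterruptorManager in the diff above subclasses multiprocessing.managers.SyncManager so the flag can live in a manager process and be shared with workers via proxies. A minimal sketch of that stdlib registration pattern, with a simplified stand-in Interruptor (not TorchRL's actual code):

```python
from multiprocessing.managers import SyncManager


class Interruptor:
    # simplified stand-in flag; the real class carries collector state
    def __init__(self):
        self._collect = True

    def stop_collection(self):
        self._collect = False

    def collection_stopped(self):
        return self._collect is False


class InterruptorManager(SyncManager):
    """Manager whose .Interruptor() returns a proxy usable by workers."""


# expose Interruptor through the manager; proxies forward method calls
InterruptorManager.register("Interruptor", Interruptor)

if __name__ == "__main__":
    with InterruptorManager() as manager:
        flag = manager.Interruptor()  # proxy, shareable across processes
        flag.stop_collection()
        assert flag.collection_stopped()
```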
test/test_collector.py (Outdated)

for batch in collector:
    assert (
        batch["collector"]["traj_ids"][
Let's rewrite this in a more straightforward way; it's a bit hard to read.
What if none is -1? Shouldn't we test that we have at least one traj_id set to -1 in the batch?
I simplified the code for clarity.
Regarding the possibility of having no -1 values, I have addressed that as well by setting the preemptive threshold to 0.0 in the test instead of 0.25. This way, all collectors stop after the first iteration and only the very first frame of each sync collector is valid, so we know for sure there are -1's in the batch.
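The reasoning above can be checked with a toy example. The batch layout here is assumed for illustration and is not the actual test code:

```python
# With preemptive_threshold=0.0, every worker is interrupted right after
# its first frame, so each worker's slice has exactly one valid traj id
# followed by -1 markers (assumed layout, one row per sync collector).
batch_traj_ids = [
    [0, -1, -1, -1],  # worker 0: one valid frame, then interrupted
    [1, -1, -1, -1],  # worker 1: same pattern
]

for worker in batch_traj_ids:
    assert worker[0] != -1   # the very first frame is always valid
    assert -1 in worker[1:]  # interruption marked the remaining frames
```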
Can you merge main into this branch? The tests are failing because they're looking for a deprecated function in tensordict.
Can you have a look at how it would play out with the distributed collectors I integrated yesterday?
We don't necessarily need everything to be fully compatible but I want to make sure that we're not missing some obvious point, e.g. how we handle the traj-ids in the distributed collector may need a bit of refactoring.
(I will merge this sooner than that as the solution is neat and usable as of now)
torchrl/collectors/collectors.py (Outdated)

@@ -1399,6 +1462,18 @@ def iterator(self) -> Iterator[TensorDictBase]:

            i += 1
            max_traj_idx = None

            if self.interruptor:
For clarity, can we have if self.interruptor is not None?
Description
This PR implements a preemptive mechanism to early-stop stragglers in multi-sync data collectors. The invalid data can be identified because its trajectory ids are -1.
For now, the preemptive mechanism is not compatible with split_trajs=True, but this could easily be adapted by ignoring trajectory ids equal to -1 in the split_trajectories method, i.e.:

splits = [(splits == i).sum().item() for i in splits.unique_consecutive() if i != -1]  (changing line 39 in collectors/utils.py)
out_splits = rollout_tensordict.view(-1)[traj_ids != -1].split(splits, 0)  (changing line 54 in collectors/utils.py)
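The two-line adaptation sketched above can be mirrored in plain Python to see its effect, using itertools.groupby in place of torch's unique_consecutive and a manual cut in place of Tensor.split (illustrative stand-in, not the collectors/utils.py code):

```python
from itertools import groupby

# a flat rollout where frames 4-5 were preempted (traj id -1)
traj_ids = [0, 0, 1, 1, -1, -1, 2]
frames = list(range(7))  # stand-in for the rollout data

# lengths of each valid trajectory, skipping the preempted runs,
# mirroring: [(splits == i).sum().item() for i in ... if i != -1]
splits = [len(list(group)) for tid, group in groupby(traj_ids) if tid != -1]

# drop invalid frames, then cut the flat rollout into trajectories,
# mirroring: rollout.view(-1)[traj_ids != -1].split(splits, 0)
valid = [f for f, tid in zip(frames, traj_ids) if tid != -1]
out_splits, start = [], 0
for length in splits:
    out_splits.append(valid[start:start + length])
    start += length

# splits → [2, 2, 1]; out_splits → [[0, 1], [2, 3], [6]]
```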
Motivation and Context
Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #15213 if this solves the issue #15213.

Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an x in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!