Support PettingZoo Parallel API and action mask #305

Merged

Conversation

@nkzawa (Contributor) commented Sep 15, 2024

As the title says, this adds support for PettingZoo by providing a wrapper class, plus support for action masks.

Creating PettingZoo env

```python
from typing import Optional

from sample_factory.envs.pettingzoo_envs import PettingZooParallelEnv

def make_pettingzoo_env(full_env_name, cfg=None, env_config=None, render_mode: Optional[str] = None):
    return PettingZooParallelEnv(some_env.parallel_env(render_mode=render_mode))
```

Currently it supports only the Parallel API, since supporting the AEC API would require a different execution flow.
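For context, the Parallel API steps all agents at once with dict-keyed actions, while the AEC API advances one agent per turn; that is the different execution flow in question. Here is a minimal sketch of the parallel rollout loop, using a toy stand-in env (`ToyParallelEnv` is hypothetical, not part of PettingZoo or sample-factory):

```python
class ToyParallelEnv:
    """Toy stand-in that mimics the shape of the PettingZoo Parallel API."""

    def __init__(self, episode_len=3):
        self.agents = []
        self._episode_len = episode_len
        self._t = 0

    def reset(self, seed=None):
        self.agents = ["player_0", "player_1"]
        self._t = 0
        observations = {a: 0 for a in self.agents}
        infos = {a: {} for a in self.agents}
        return observations, infos

    def step(self, actions):
        # All agents act simultaneously; this simultaneous step is what
        # distinguishes the Parallel API from the turn-based AEC API.
        self._t += 1
        done = self._t >= self._episode_len
        observations = {a: self._t for a in self.agents}
        rewards = {a: 1.0 for a in self.agents}
        terminations = {a: done for a in self.agents}
        truncations = {a: False for a in self.agents}
        infos = {a: {} for a in self.agents}
        if done:
            self.agents = []  # PettingZoo envs clear the agent list when the episode ends
        return observations, rewards, terminations, truncations, infos

# The standard parallel loop a wrapper such as PettingZooParallelEnv has to drive:
env = ToyParallelEnv()
observations, infos = env.reset(seed=42)
steps = 0
while env.agents:
    actions = {agent: 0 for agent in env.agents}  # a real policy would pick these
    observations, rewards, terminations, truncations, infos = env.step(actions)
    steps += 1
```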

Action mask

Action masking works when you add an action_mask key to the dict observation.

```python
from typing import Optional

import gymnasium as gym
import numpy as np

class CustomEnv(gym.Env):
    def __init__(self, full_env_name, cfg, render_mode: Optional[str] = None):
        self.observation_space = gym.spaces.Dict({
            "obs": gym.spaces.Box(low=0, high=1, shape=(3, 3, 2), dtype=np.int8),
            "action_mask": gym.spaces.Box(low=0, high=1, shape=(9,), dtype=np.int8),
        })
        self.action_space = gym.spaces.Discrete(9)

    def step(self, action):
        ...
        return {"obs": obs, "action_mask": action_mask}, reward, terminated, truncated, info
```

It seems this is the most common interface for providing action masks in PettingZoo, so I think it makes sense to follow it. It's also common to provide the value in info as info["action_mask"], but that's not supported for now, since sample-factory needs to know the shape up front to allocate buffers, as far as I understand.
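If an env only exposes the mask via info["action_mask"], a thin wrapper can lift it into the dict observation to match the convention above. A hypothetical sketch (`MaskToObsWrapper` and `_DummyEnv` are made-up names, not part of sample-factory):

```python
import numpy as np

class MaskToObsWrapper:
    """Lift info["action_mask"] into the dict observation."""

    def __init__(self, env, num_actions):
        self.env = env
        self.num_actions = num_actions

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Fall back to an all-ones mask (every action valid) if the env
        # did not provide one in info.
        mask = info.get("action_mask", np.ones(self.num_actions, dtype=np.int8))
        return {"obs": obs, "action_mask": mask}, reward, terminated, truncated, info

class _DummyEnv:
    def step(self, action):
        info = {"action_mask": np.array([1, 0, 1], dtype=np.int8)}
        return np.zeros(3), 0.0, False, False, info

wrapped = MaskToObsWrapper(_DummyEnv(), num_actions=3)
obs, reward, terminated, truncated, info = wrapped.step(0)
```

A real wrapper would also have to add the mask entry to observation_space, since that is exactly where sample-factory gets the shape it needs for buffer allocation.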

I added an example that trains Tic-Tac-Toe, but I'm not sure about the configuration, so I'd appreciate any suggestions.


btw, I followed CONTRIBUTING.md but make check-codestyle fails with the error:

.../python3.11/site-packages/sympy/polys/numberfields/resolvent_lookup.py: "pyflakes[F]" failed during execution due to RecursionError('maximum recursion depth exceeded')

Also, make test fails when mujoco is involved.

OSError: dlopen(/System/Library/OpenGL.framework/OpenGL, 0x0006): tried: '/System/Library/OpenGL.framework/OpenGL' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/System/Library/OpenGL.framework/OpenGL' (no such file), '/System/Library/OpenGL.framework/OpenGL' (no such file, not in dyld cache)

Python version is 3.11.9 on an Intel Mac.

@alex-petrenko (Owner) commented Sep 16, 2024

hi @nkzawa!
thank you for this contribution

would it be possible to make sure pre-commit passes, as well as the Ubuntu tests?
There are some details here: https://www.samplefactory.dev/12-community/contribution/

A short doc page would be very nice to have too :) I will do a proper review soon!

EDIT: just noticed your comment about pyflakes
I will take a look

@alex-petrenko (Owner) left a comment

Great contribution, thank you!
There are a couple of small comments to address.
If you have the time to add a small documentation page, that'd be very helpful! Something similar to other contributed env integrations, like https://www.samplefactory.dev/09-environment-integrations/nethack/ (it doesn't have to be as elaborate as that; installation instructions and a training example would already be great, and a wandb run/report or some videos - doubly great!)

```python
epsilons = torch.full_like(probs, 1e-6)
probs = torch.where(all_zero, epsilons, probs)  # ensure sum of probabilities is non-zero

samples = torch.multinomial(probs, 1, True)
```
@alex-petrenko (Owner) commented:

just checking if we don't have to re-normalize the probabilities here so they add up to 1.
Does torch.multinomial do this internally?

@nkzawa (Contributor, Author) commented:

Technically, it seems there is no need for them to add up to 1, according to the docs:

> The rows of input do not need to sum to one (in which case we use the values as weights), ...

https://pytorch.org/docs/stable/generated/torch.multinomial.html

But honestly I'm not so sure (I'm an RL newbie), so please feel free to fix it if you see something wrong with the code.

@nkzawa (Contributor, Author) commented:

Actually, it seems the values do need to be normalized with softmax, as far as I understand, so I implemented that.

```diff
@@ -321,4 +336,13 @@ def create_actor_critic(cfg: Config, obs_space: ObsSpace, action_space: ActionSp
     from sample_factory.algo.utils.context import global_model_factory

     make_actor_critic_func = global_model_factory().make_actor_critic_func
-    return make_actor_critic_func(cfg, obs_space, action_space)
+    return make_actor_critic_func(cfg, obs_space_without_action_mask(obs_space), action_space)
```
@alex-petrenko (Owner) commented:

I think it would be a bit cleaner to add special treatment for action_mask inside make_actor_critic_func but I'm fine with this solution too 👍

@nkzawa (Contributor, Author) commented:

I assumed this meant doing it inside default_make_actor_critic_func. Fixed it that way anyway 🙏

```python
def register_pettingzoo_fixture(self):
    register_custom_components()
    yield  # this is where the actual test happens
    reset_global_context()
```
@alex-petrenko (Owner) commented:

Added this (similar to other env tests). It resets the global encoder factory; otherwise other tests can fail if this one runs first.

(I'd be the first to admit this is not a perfect solution, but hey, I wrote this years ago.)
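The fixture relies on pytest's setup/yield/teardown pattern: everything before the yield runs before the test, everything after runs afterwards, so the global context gets reset for subsequent tests. A minimal stand-alone illustration, where REGISTRY, register_custom_components, and reset_global_context are simplified stand-ins for sample-factory's actual globals:

```python
REGISTRY = {}  # stand-in for the global encoder factory / context

def register_custom_components():
    REGISTRY["encoder"] = "custom"

def reset_global_context():
    REGISTRY.clear()

def register_pettingzoo_fixture():
    register_custom_components()
    yield  # with @pytest.fixture, the test body runs at this point
    reset_global_context()

# Drive the generator by hand to show the setup/teardown ordering:
fixture = register_pettingzoo_fixture()
next(fixture)                      # setup phase
state_during_test = dict(REGISTRY)
try:
    next(fixture)                  # teardown phase
except StopIteration:
    pass
```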

@nkzawa (Contributor, Author) commented Sep 17, 2024

Will add docs as well 👍

EDIT:
Added docs, though no video or report. Feel free to fix/improve 🙏

@nkzawa (Contributor, Author) commented Oct 3, 2024

Based on this paper and the Maskable PPO implementation in Stable-Baselines3, it appears that action masks should also be applied when calculating log probabilities. This helps the model learn to avoid selecting invalid actions, as far as I understand. I'll modify the code accordingly.

EDIT: done
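The idea from Maskable PPO, as I understand it, is that the mask has to enter the log-probabilities too, so the same masked distribution is used for both sampling and the policy-gradient loss. A numpy sketch of the principle (`masked_log_softmax` here is an illustrative name, not sample-factory's actual function):

```python
import numpy as np

def masked_log_softmax(logits, mask, neg=-1e9):
    # Invalid actions get a huge negative logit, so they carry ~zero
    # probability mass and their log-probs never dominate the loss.
    z = np.where(mask > 0, logits, neg)
    z = z - z.max()                      # standard stabilization before exp
    return z - np.log(np.exp(z).sum())

logits = np.array([2.0, -1.0, 0.5])
mask = np.array([1, 0, 1])               # action 1 is invalid
log_probs = masked_log_softmax(logits, mask)
probs = np.exp(log_probs)                # valid actions share all the mass
```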

```python
def __init__(self, full_env_name, cfg, render_mode=None):
    ...
    self.observation_space = gym.spaces.Dict({
        "obs": gym.spaces.Box(low=0, high=1, shape=(3, 3, 2), dtype=np.int8),
```
@alex-petrenko (Owner) commented:

Small nit: I wonder if low=0, high=1 here is intentional - would this mean binary observations?

I understand 0/1 in action_mask since this is a binary mask

@nkzawa (Contributor, Author) commented:

This is intentional, since it's taken from PettingZoo's Tic-Tac-Toe env, but I think we can change it if it's confusing.

@@ -0,0 +1,38 @@
# Action Masking
@alex-petrenko (Owner) commented:

This documentation is wonderful, thank you!

@@ -0,0 +1,46 @@
# PettingZoo
@alex-petrenko (Owner) commented:

Love this. Thank you!

```python
# https://github.com/allenai/allennlp/blob/80fb6061e568cb9d6ab5d45b661e86eb61b92c82/allennlp/nn/util.py#L243
def masked_softmax(logits, mask):
    # To limit numerical errors from large vector elements outside the mask, we zero these out.
    result = functional.softmax(logits * mask, dim=-1)
```
@alex-petrenko (Owner) commented:

Can you help me understand this please?

I think logits in general can be negative, or positive but close to 0, in which case multiplying them by zero does not achieve the desired effect.

I'd say we should probably use something like this instead?

```python
def masked_softmax(logits, mask):
    # Mask out the invalid logits by adding a large negative number (-1e9)
    logits = logits + (mask == 0) * -1e9
    result = functional.softmax(logits, dim=-1)
    result = result * mask
    result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)
    return result
```

The choice of -1e9 is arbitrary here, but it could be something like -max(abs(logits)) * 1e6 to make this universal.
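A small numeric check of the concern above: when the valid logits are negative, multiplying by the mask turns the invalid logit into 0, which then beats the valid ones, while the additive version correctly zeroes it out (numpy stands in for the torch code here):

```python
import numpy as np

def masked_softmax_mult(logits, mask):
    # AllenNLP-style version: zero out masked logits, then softmax
    z = logits * mask
    e = np.exp(z - z.max())
    return e / e.sum()

def masked_softmax_add(logits, mask):
    # additive large-negative mask, then re-mask and renormalize
    z = logits + (mask == 0) * -1e9
    e = np.exp(z - z.max())
    p = e / e.sum()
    p = p * mask
    return p / (p.sum() + 1e-13)

logits = np.array([-5.0, -1.0, 2.0])
mask = np.array([1.0, 1.0, 0.0])  # the last action is invalid

p_mult = masked_softmax_mult(logits, mask)  # masked logit becomes 0, beating -5 and -1
p_add = masked_softmax_add(logits, mask)    # masked action gets exactly zero mass
```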

@nkzawa (Contributor, Author) commented:

It's taken from AllenNLP, including the comment, so I don't fully understand it, but as far as I investigated, your version seems safer in some cases, even though the results are usually identical in both versions 👍

@nkzawa (Contributor, Author) commented:

fixed

```python
# vector + mask.log() is an easy way to zero out masked elements in logspace, but it
# results in nans when the whole vector is masked. We need a very small value instead of a
# zero in the mask for these cases.
logits = logits + (mask + 1e-13).log()
```
@alex-petrenko (Owner) commented:

This makes more sense to me: it essentially adds log(1e-13), which is about -30, to non-valid elements. I'm not sure this is universally correct, but it should most likely work. Why can't we just explicitly add a large negative constant though, like -1e9 or -max(abs(logits)) * 1e6 as in the previous example?
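A quick numpy check of the limitation: log(1e-13) only subtracts about 30, so an invalid action can still end up with the highest probability whenever the logits spread over a wider range than that (numpy stands in for the torch code here):

```python
import numpy as np

mask = np.array([1.0, 0.0])        # the second action is invalid
penalty = np.log(mask + 1e-13)     # approximately [0, -29.9]: only ~-30 for masked slots

logits = np.array([-40.0, 10.0])   # spread wider than 30
z = logits + penalty               # approximately [-40, -19.9]: the masked slot still wins

e = np.exp(z - z.max())
probs = e / e.sum()                # nearly all mass lands on the invalid action
```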

@nkzawa (Contributor, Author) commented:

It seems you're correct. This version causes problems in extreme cases, as far as I tested.

@nkzawa (Contributor, Author) commented:

fixed

@alex-petrenko alex-petrenko merged commit abbc459 into alex-petrenko:master Oct 23, 2024
4 of 8 checks passed