Best way to organize self-play #391
We experimented a bit with self-play in the 2018 edition of the competition and didn't get anywhere with it (but that could just be our shabby agent code, which didn't work). At the moment I do not have access to the code/notes to comment further, but I recall things like training being much slower and running into deadlocks and whatnot (sometimes a player or two stopped responding to ViZDoom's commands, sometimes they segfaulted, etc.). But we did run longer experiments with smaller games, so this could be feasible as long as you are fine with the ASYNC_PLAYER way. I can give better comments once I get my hands on the notes/code again.
Hi, @Miffyli! Thanks a lot for the feedback! My current plan is to host a game in sync (PLAYER) mode and run the separate game instances for the clients in the same process, or in different threads/processes. Then, after the host does its step(), it will send a signal (e.g. via a condition variable) to all other clients so that they can do their step. If this works, the game can proceed in the normal RL step-by-step fashion, provided that the clients and the server are able to communicate their state between steps. At this point I am not sure whether I should run the clients in sync or async mode.
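For context, a minimal sketch of how such a host/client setup might look with ViZDoom's standard multiplayer game arguments, one process per player (an illustration of the plan above, not code from this thread; the scenario, port, timeout, and player count are arbitrary choices, and the condition-variable signalling between steps is omitted):

```python
# Sketch: one hosting process and one joining process, both in synchronous PLAYER mode.
from multiprocessing import Process

import vizdoom as vzd

NUM_PLAYERS = 2
PORT = 5029  # arbitrary port

def make_game(is_host):
    game = vzd.DoomGame()
    game.load_config("cig.cfg")          # any multiplayer-capable scenario
    game.set_mode(vzd.Mode.PLAYER)       # synchronous mode
    if is_host:
        game.add_game_args(f"-host {NUM_PLAYERS} -port {PORT} -deathmatch "
                           f"+viz_connect_timeout 60 +timelimit 10.0")
    else:
        game.add_game_args(f"-join 127.0.0.1 -port {PORT}")
    game.init()
    return game

def player_loop(is_host):
    game = make_game(is_host)
    noop = [0.0] * game.get_available_buttons_size()
    while not game.is_episode_finished():
        # A real agent would compute an action from game.get_state() here.
        game.make_action(noop)
    game.close()

if __name__ == "__main__":
    processes = [Process(target=player_loop, args=(i == 0,)) for i in range(NUM_PLAYERS)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```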
Actually, now I remember we also used SYNC mode, because nowadays SYNC mode is supported for multiplayer as well. In that sense you should be good to go, and with SYNC mode you should be fine with just
That'd be awesome! I would love to chat more about that, so I'll be waiting for your feedback.
I went through our code, and indeed it does not seem to be anything more than just setting
This project should be doable with ViZDoom, but you have to be careful with the networking, as it was a tad fragile. I would also manually check the states/observations from the ViZDoom envs to see if they make sense (all agents progressed by X timesteps, agents executed the correct actions, etc.).
Ok, I guess I am starting to realize how this all works. When you do game.init() in Python it actually spawns a separate Doom process. Therefore, regardless of whether the mode is sync or async, the host is always listening on the network port, independent of the "agent loop". This is what allows it to function in sync mode.

If you implement this "multi-agent" loop naively, I think there's nothing to guarantee that game state updates are broadcast to all the clients before the next game step is made. I want all my clients to have the latest information about the game without any lag, but if the sync-mode loop is too fast, there might be just no time for the changes to propagate to the clients between steps. A proper way to fix this would be to introduce some part of the game state that is synchronized between the server and the clients and can be queried through the Python interface, something like a "state ID". Then the process is the following:
Edit:
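To make the state-ID idea above concrete, here is a minimal sketch of how such a check could look from the Python side (hypothetical code, assuming an ACS script increments global variable 1 every tic, which ViZDoom exposes as GameVariable.USER1, and that the training loop tracks the host's expected tic count):

```python
# Sketch: after each step, a client verifies it has advanced to the expected "state ID".
import vizdoom as vzd

def step_and_verify(game, action, expected_state_id):
    """Take one action and check that this instance reached the expected state ID."""
    reward = game.make_action(action)
    state_id = int(game.get_game_variable(vzd.GameVariable.USER1))
    if state_id != expected_state_id:
        # This client lagged behind (or ran ahead of) the host.
        print(f"State-ID mismatch: got {state_id}, expected {expected_state_id}")
    return reward
```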
If you use
@Miffyli thank you so much, your input was very helpful! The only problem I have is that I am not able to use make_action(..., skip) where skip is >1. The other approach to multi-agent Doom, rendering multiple viewpoints within the same game instance, seems even more tempting now. I know that Doom has a split-screen feature, so it should already be supported by the engine on some level: https://www.youtube.com/watch?v=k-fjc8hZaJA
Ok, I was able to partially work around that by manipulating the game settings (details are in the code). Here's my current implementation: https://gist.github.com/alex-petrenko/5cf4686e6494ad3260c87f00d27b7e49
I also recall networking being a bottleneck when we ran our experiments (something along the lines of processes waiting on messages). Unless I am terribly mistaken, ZDoom uses the original P2P networking, which does not cope well when you try to squeeze in many players or a lot of frames. The implementation looks quite nifty and compact, though! With a bit of tidying up it could be a nice example. Perhaps also a small write-up on the challenges/requirements/limitations of multi-agent ViZDoom training, @mwydmuch?
Definitely happy to collaborate! I am planning to tidy it up and get rid of the "vizdoomgym" dependency at some point (or maybe we should actually make something like this in the ViZDoom repo, because for every project I end up using some kind of Gym wrapper). After this, it can probably be added to the examples. Performance-wise, I was able to push pure experience collection to around ~11,500 environment frames per second on a 10-core 20-thread CPU. That is ~2,850 observations/sec with 16 parallel workers, each running an 8-agent environment, so 128 Doom processes in total. Doom renders at 256x144 resolution, later downsampled to 128x72. The standard 160x120 -> 84x84 pipeline would give an additional ~5% improvement, but I am sticking with widescreen for now. So it is evident that the old ZDoom P2P networking is a major bottleneck. I might look into ways to speed this up, but not right now.

An "official" Gym environment in this repo would be nice indeed, but integrating it into the code is not too straightforward, since all the ViZDoom code is in C/C++. There have been separate repos for Gym envs, but they have died/quieted down. An easy way could be to just add a Python example that implements the Gym API. The numbers sound promising! Naturally, Google had a lot of hardware to throw at their training, but I am willing to bet that on local machines (like your 10-core) using ViZDoom is much faster :). I would like to hear when/if you get any results!
Thanks for the encouragement, I definitely will share results! Side note: it looks like the reward mechanism behaves weirdly in multiplayer. If I give my agent a reward of +1 every tick, e.g. in the cig.acs script:

then all of my agents also get the reward for everyone else. E.g. if I have 8 agents in the environment, then every single agent gets a +8 reward every tick. Please correct me if I am wrong :)
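One possible workaround (an illustration, not something from this thread): instead of relying on a global reward handed out by the ACS script, compute per-agent rewards on the Python side from each agent's own game variables, since every player's DoomGame instance reports its own values. The variables and weights below are arbitrary examples:

```python
# Sketch: per-agent shaping computed from deltas of that agent's own game variables.
import vizdoom as vzd

TRACKED_VARS = {
    vzd.GameVariable.FRAGCOUNT: 1.0,   # +1 per frag scored by this agent
    vzd.GameVariable.HEALTH: 0.01,     # small bonus/penalty for health changes
}

def shaped_reward(game, prev_values):
    """Reward for one agent, based only on its own DoomGame instance."""
    reward = 0.0
    for var, weight in TRACKED_VARS.items():
        value = game.get_game_variable(var)
        reward += weight * (value - prev_values.get(var, value))
        prev_values[var] = value
    return reward
```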
Ah yes, forgot to mention this. Indeed the ACS scripts behave wonky, quite possibly for the reason you said (it applies the reward globally to all players rather than just the one that triggered it).
BTW, this worked out. This repo contains a full implementation of working multi-agent training in ViZDoom: https://github.com/alex-petrenko/sample-factory
Excuse me, have you ever used RLlib to wrap your project "sample-factory" for the game ViZDoom? Or did you use the vizdoomgym project?
@Maxwell2017 I did not use any of that. All of the code is handwritten and is available at the URL above. The training command looks like this:

```
python -m algorithms.appo.train_appo --env=doom_duel --train_for_seconds=360000 --algo=APPO --gamma=0.995 --env_frameskip=2 --use_rnn=True --num_workers=72 --num_envs_per_worker=16 --num_policies=8 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --res_w=128 --res_h=72 --wide_aspect_ratio=False --benchmark=False --pbt_replace_reward_gap=0.5 --pbt_replace_reward_gap_absolute=0.35 --pbt_period_env_steps=5000000 --with_pbt=True --pbt_start_mutation=100000000 --experiment=doom_duel_full
```
In fact, I want to use a reinforcement learning framework that is compatible with TF and PyTorch and can integrate various game environments, where I can either implement the algorithm myself or use a built-in one. Do you have any good suggestions?
@Maxwell2017 Implement an OpenAI Gym interface over ViZDoom (see the old example here). This will make it easy to use ViZDoom with existing libraries like stable-baselines, and easier to implement algorithms.
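For illustration, a minimal sketch of what such a Gym interface could look like (hypothetical code, not an API shipped with ViZDoom; the class name, config path, frame skip, and one-hot action set are arbitrary choices, and the old gym step/reset API is assumed):

```python
import gym
import numpy as np
import vizdoom as vzd
from gym import spaces

class VizDoomGymEnv(gym.Env):
    """Minimal Gym wrapper around a single-player ViZDoom scenario (sketch)."""

    def __init__(self, config="scenarios/basic.cfg", frame_skip=4):
        self.game = vzd.DoomGame()
        self.game.load_config(config)
        self.game.set_window_visible(False)
        self.game.init()
        self.frame_skip = frame_skip
        # One action per available button, encoded as one-hot button vectors.
        n_buttons = self.game.get_available_buttons_size()
        self.actions = np.eye(n_buttons, dtype=np.uint8).tolist()
        self.action_space = spaces.Discrete(n_buttons)
        c = self.game.get_screen_channels()
        h = self.game.get_screen_height()
        w = self.game.get_screen_width()
        self.observation_space = spaces.Box(0, 255, (c, h, w), dtype=np.uint8)

    def reset(self):
        self.game.new_episode()
        return self.game.get_state().screen_buffer

    def step(self, action):
        reward = self.game.make_action(self.actions[action], self.frame_skip)
        done = self.game.is_episode_finished()
        if done:
            obs = np.zeros(self.observation_space.shape, dtype=np.uint8)
        else:
            obs = self.game.get_state().screen_buffer
        return obs, reward, done, {}

    def close(self):
        self.game.close()
```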
@Maxwell2017 @Miffyli SampleFactory also comes with a multi-agent ViZDoom wrapper, which is a pretty non-trivial thing to implement. Besides SampleFactory, there are not that many frameworks that support multi-agent training and self-play out of the box. One other option is RLlib, which you can configure for multi-agent training, but keep in mind that experiments will be a lot slower (a 3-10x difference depending on the environment). Basically, an attempt to train these bots with RLlib led to the development of SampleFactory, because RLlib was just a bit too slow for that (it is a pretty powerful codebase otherwise). If you're not interested in multi-agent learning, you can try any other RL framework; stable-baselines or rlpyt are good examples. I would still consider using the wrappers from SampleFactory, as this will save you a lot of time.
Does the repo ViZDoomGym (https://github.com/shakenes/vizdoomgym) have the same Doom version as gym-doom (https://github.com/ppaquette/gym-doom)? In fact, I don't know the difference between the new and the old version. Can you point it out for me? @Miffyli
There is another question. I found in your paper (https://arxiv.org/pdf/2006.11751.pdf) that there is a ViZDoom experiment based on RLlib.
@Maxwell2017
@alex-petrenko
Yeah, that's pretty much it. Just copy-paste the SampleFactory gym implementation (or install it as a local pip package from sources), and set up the training parameters.
Thank you, I'll try it now :)
@Maxwell2017 first of all, what paper and what experiment in the paper are you referring to? In short, RL is still more art than mature technology. You generally can't just plug an environment into a learning system and expect it to work right away 100% of the time. Things need tuning. If you're trying to reproduce a result from the SampleFactory paper, I suggest that you try to do this with SampleFactory first.
In fact, the paper I refer to is "ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning". For the basic experiment, I follow the neural network architecture and learning settings from the paper; the policy uses DQN with RLlib. In this repo (ViZDoom) I can't find the complete network structure that matches the paper, and I don't know the stride of the conv layers, so I set it to 1 by default. @Miffyli
I see, so you're talking about the "basic" scenario. The full reward function subtracts 1 point for every action, so 80 is actually close to the maximum reward that can be expected (i.e. a monster killed in only ~20 steps). I think ~80 is the performance of the optimal policy.

If you are using these rewards directly, the first thing I'd do is reduce the reward scale. I believe in SampleFactory we used a 0.01 scale for these rewards, i.e. 101 turns into 1.01, -5 turns into -0.05, etc. Neural networks used in training typically have a much easier time learning from small quantities like that.
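As a concrete illustration of that scaling idea, here is a minimal sketch of a reward-scaling wrapper (hypothetical code assuming the old gym API; the 0.01 factor follows the comment above):

```python
import gym

class ScaleReward(gym.RewardWrapper):
    """Multiply every environment reward by a constant factor."""

    def __init__(self, env, scale=0.01):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return reward * self.scale
```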
Good suggestion for scaling the reward!
Increasing the size usually leads to slower learning (but probably better results) later on. I suggest you look at OpenAI Spinning Up for practical information on DRL.
Sorry, I only saw your reply just now, @Miffyli. By the network defined in the example file, you mean examples/python/learning_tensorflow.py#L232?
Ah, I just recalled that some of those were updated only recently... Yes, even a simple/small network should learn the basic environment (the original Theano code has such a small network). A larger network might actually make it much slower (more parameters to tune).
Which paper are you referring to? You can find the scenario file in the scenarios directory. If you want example code that learns in that scenario, you can modify the example learning code to support it. Learning better policies in health-gathering supreme requires providing the current health information to the network, which needs a bit more modification (see e.g. this comment). Some of the papers related to the competition ran experiments with these environments (see the references in this paper). This paper also used the health-gathering task.
In fact, the paper I refer to is "ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning". I found the game settings in Section IV-B2-b, but I don't know how to set them up for the DQN network.
Ah, it seems the only change compared to the other experiments is the use of RMSProp. Note that the example code in this repo is not the code used in the paper (I do not think that is available). I recommend you use an existing implementation of DQN for your experiments, e.g. the one from stable-baselines/stable-baselines3.
Yup! This is exactly what he suggested.
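For reference, a minimal sketch of what that recommendation could look like with stable-baselines3's DQN, assuming a Gym-wrapped ViZDoom env such as the hypothetical VizDoomGymEnv sketched earlier in this thread (the hyperparameters are illustrative, not the paper's settings):

```python
from stable_baselines3 import DQN

# Hypothetical module containing the VizDoomGymEnv wrapper sketched earlier in this thread.
from vizdoom_gym_env import VizDoomGymEnv

env = VizDoomGymEnv("scenarios/basic.cfg")

model = DQN(
    "CnnPolicy",              # image observations -> CNN feature extractor
    env,
    learning_rate=2.5e-4,
    buffer_size=50_000,
    learning_starts=1_000,
    exploration_fraction=0.1,
    verbose=1,
)
model.learn(total_timesteps=200_000)
```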
Maybe I didn't explain it clearly. In fact, I want to know how, in the paper I refer to, "the nonvisual inputs" are fed into the network.
Ah, right. The traditional way to do it is to concatenate such 1D features into the feature vector that comes out of the CNN (inside the network). In the example PyTorch code you would do something like this (around line 200):

```python
x = self.conv3(x)
x = self.conv4(x)
x = x.view(-1, 192)
# Combine picture features and 1D features into one vector
# (concatenate along the feature dimension, not the batch dimension).
x = th.cat((x, your_1d_features), dim=1)
# Note that these split sizes would need changing as well,
# since the combined vector is now longer than 192...
x1 = x[:, :96]  # input for the net that calculates the state value
x2 = x[:, 96:]  # relative advantage of actions in the state
```

You can find cleaner implementations in the Unity ML-Agents code, rllib, or the experimental stable-baselines3 PR for supporting so-called "dictionary observations" (see the comment I linked above and the related PR).
For Health in game_variables, it is a scalar (in observation_space it could be defined as spaces.Box(0, np.Inf, (1,)), or obtained through self.game.get_available_game_variables()), so here I can directly concat it with the features after the convolutions. Am I understanding this right? :)
Yes, you can concatenate the health information in that spot (but remember to adjust the other code around it). The example code is very hardcoded for basic.py, so, again, I recommend taking a look at established libraries.
Sorry, I only saw your reply just now. As shown in the code above, concatenating the 1D health info with the convolutional features, I tried it with the neural network used for the basic scenario, and it didn't seem to converge.
Health gathering supreme is a muuuuuch harder task than the basic one, especially if you do not add the aforementioned reward shaping (giving/negating reward upon picking up a medkit/vial). I recommend adding this reward shaping to see if it starts to learn anything (+1 reward for a medkit, -1 for a vial, everything else zero). Even with this it might take hundreds of thousands of steps to train. Note that it is generally hard to say whether "method X should learn scenario Y", especially if it has not been done before. You might need to tune hyperparameters to get it working.
@Miffyli
Yes, the best way to do reward shaping with Gym environments is through wrappers. health_gathering.cfg should be very easy to learn, even without the reward shaping; health_gathering_supreme is way more difficult and I recommend starting with shaping.
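A minimal sketch of such a shaping wrapper (hypothetical code, not from this thread): it assumes the wrapped env exposes the underlying DoomGame as env.game, uses the old gym API, and the pickup threshold is an arbitrary guess to separate medkit/vial pickups from gradual health decay:

```python
import gym
import vizdoom as vzd

class HealthPickupShaping(gym.Wrapper):
    """Adds +1 reward for picking up a medkit and -1 for a poison vial (sketch)."""

    PICKUP_THRESHOLD = 5.0  # assumed: pickups change health much faster than decay

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.prev_health = self.env.game.get_game_variable(vzd.GameVariable.HEALTH)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        health = self.env.game.get_game_variable(vzd.GameVariable.HEALTH)
        delta = health - self.prev_health
        self.prev_health = health
        if delta > self.PICKUP_THRESHOLD:       # likely picked up a medkit
            reward += 1.0
        elif delta < -self.PICKUP_THRESHOLD:    # likely picked up a poison vial
            reward -= 1.0
        return obs, reward, done, info
```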
Hi @alex-petrenko, I'm currently building a vectorized multi-agent ViZDoom env but encountered the same issues (the difference is that I used the C++ interface):

I dug into #417 and also sample-factory's source code but still cannot find either a reason or a solution for the above cases. Do you have other insights?
Hi! I fixed these in my fork: https://github.com/alex-petrenko/ViZDoom/commits/doom_bot_project
This is why SampleFactory instructs installing ViZDoom from this branch. This was supposed to be merged into ViZDoom as a part of the Sound-RL project (https://github.com/mwydmuch/ViZDoom/pull/486/files), but it looks like we forgot to include it. Alternatively, @Trinkle23897, I would also really appreciate it if you could submit these as a PR if they fix your issue!
I am planning to experiment with population-based training and self-play, similar to DeepMind's recent Q3 CTF paper. The obvious requirement is the ability to train agents that play against copies of other agents on the same map at the same time.

I could probably wrap a multiplayer session into a single multi-agent interface and use ASYNC_PLAYER mode, maybe with an increased tickrate (#209).

However, the optimal way to implement this would be to render multiple observations for different agents within the same tick, in the same process, in synchronous mode, similar to how it's done in single-player.

Any thoughts on the right course of action here? Does a multi-agent SYNC mode seem feasible, or would it require changing half the codebase?