
Suggestion: API/commands for fetching audio #225

Closed
Miffyli opened this issue Jul 20, 2017 · 19 comments · Fixed by #486

Comments

@Miffyli
Collaborator

Miffyli commented Jul 20, 2017

E.g. by setting an "enable_audio" option before initializing the game, and then receiving an additional object in the State object which holds the audio samples played during that time frame.

I know it is ViZDoom, but this could possibly allow bots to "home in" on high-action areas and/or hear nearby enemies behind them.

@mwydmuch
Member

Hi @Miffyli,
this idea has been on our minds for some time now and I'd love to add it. It would be easy to just pass the OpenAL buffer as-is to the state, but I have no idea whether that would be convenient to work with (unfortunately, I've never done any serious sound processing). So we need some help deciding what we should take care of and what should be configurable (format? stereo/some 3D sound? channels? frequency? sample rate/size?).

If anyone has any ideas about these things I'll be happy to hear them :)

@Miffyli
Collaborator Author

Miffyli commented Jul 21, 2017

@mwydmuch
I have a background in speech processing, but I'm still stuck deciphering the structure of the ViZDoom source ^^'.

Anywho, I do not think we need anything fancy, especially considering Doom was originally intended to run on old machines. I think these would be enough, at least for a start:

  • 8 kHz sampling rate (Doom has a very low-frequency soundscape, in my experience)
  • 16-bit sample size
  • Two channels (stereo)
  • Simple PCM, without modifications. I guess OpenAL exposes something like this.

And as for what the API would give the user: a 2xN matrix where N is the number of samples played in that state's timeframe. I think "timeframe" = "since the last call of get_state", for simplicity. Users can then build a longer buffer in Python for analyzing longer pieces of audio.

I can create example scenarios/scripts, and generally test the implementation if this is added.
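The proposed per-state API could be sketched as follows. This is only an illustration of the idea: the attribute name `state.audio_buffer`, the (2, N) int16 layout, and the 8 kHz rate are assumptions taken from this thread, not an existing ViZDoom API.

```python
import numpy as np

SAMPLE_RATE = 8000      # proposed sampling rate
CHANNELS = 2            # stereo
TICS_PER_SECOND = 35    # Doom's native tic rate

# Each state would carry roughly 8000 / 35 ≈ 228 samples per channel.
samples_per_tic = SAMPLE_RATE // TICS_PER_SECOND

# In a real loop this would be:
#   state = game.get_state()
#   chunks.append(state.audio_buffer)
# Here we fake three states' worth of silent stereo audio:
chunks = [np.zeros((CHANNELS, samples_per_tic), dtype=np.int16)
          for _ in range(3)]

# The user can then concatenate per-state chunks into a longer buffer
# for analyzing longer stretches of audio:
episode_audio = np.concatenate(chunks, axis=1)
print(episode_audio.shape)  # (2, 684)
```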

@mwydmuch
Member

Alright, thank you @Miffyli for the tips! For now I'm pretty busy, but I think I will be able to add this by the end of August, and then I will ask you for some small tests and a review :)

@mwydmuch mwydmuch added this to the 1.2.0 milestone Jul 22, 2017
@piquirez

piquirez commented Dec 3, 2019

Hi @mwydmuch :
Any news on a way to get the sound buffer as part of Doom.get_state()? The research that adding audio would enable is very interesting. If anyone knows of a method to obtain the sound as part of the inputs, it would be much appreciated.

@Miffyli
Collaborator Author

Miffyli commented Dec 3, 2019

@piquirez

I did some further digging on this subject earlier, and I think it hits a roadblock: ZDoom uses the OpenAL library to create the sound samples from sound sources/listeners and their locations. You'd have to start messing around with OpenAL (and its drivers) to hijack these samples somewhere along the way before they are fed into a common buffer.

A hacky way to do this would be to create a sound device per ViZDoom instance and capture the audio there, but syncing this up with frames would be difficult, if not impossible.

@piquirez

piquirez commented Dec 4, 2019

@Miffyli Thanks for your answer.
As you describe, it does seem quite complicated. I did notice that the sound plays at the same speed no matter what speed the screen updates run at, which makes syncing them very hard if you wanted to capture the audio, since the speed during training will differ from inference.
However, this gave me an idea: I presume each sound is triggered by a Doom instruction in a particular frame, and we know that in real time Doom should run at 35 FPS. In that case it should be possible to save the audio triggers on each frame, and then divide each sound into small chunks at 35 FPS. That way we would obtain one audio sample per frame, which is what we are after. Is this feasible? Even just getting the audio triggers per frame would be very helpful, and I could sort out the audio-splitting part.

@Miffyli
Collaborator Author

Miffyli commented Dec 4, 2019

@piquirez

Theoretically that could work. However, since it would skip the audio library completely, it would not include any of the positional-audio processing (e.g. how strongly a sound plays on the left/right, how faint it is). Now that you mention it, the "sped up" game also makes things harder: if you go through the audio library, it (probably) plays sounds at natural speed, and thus far too slow for ZDoom running at lightspeed (thousands of FPS).

@piquirez

piquirez commented Dec 5, 2019

@Miffyli
I believe the stereo information for a sound should be part of the command that executes it. So if all sounds are stored in mono, the stereo version would simply be a mathematical relation based on the player's position. If we have the information of the frame that plays a sound, which should include where it has to play spatially (the stereo information), we could create, as you mentioned earlier, a "2xN matrix where N is the number of samples played in that state's timeframe". In this case N would be 1/35th of a second of the audio file. This way, no matter at which speed Doom runs, the agent will always get the same sound synchronized to 35 FPS, which will allow it to learn.
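The mono-to-stereo relation suggested above could be approximated with a simple constant-power pan. This is a deliberate simplification (the function name and pan convention are illustrative assumptions, not anything from ZDoom or OpenAL): it ignores distance attenuation, occlusion, and HRTF effects that a real audio library applies.

```python
import numpy as np

def pan_mono(chunk, pan):
    """Constant-power pan of a mono chunk into a (2, N) stereo array.

    `pan` is in [-1, 1]: -1 = fully left, 0 = center, +1 = fully right.
    """
    theta = (pan + 1.0) * np.pi / 4.0       # map [-1, 1] -> [0, pi/2]
    left = np.cos(theta) * chunk
    right = np.sin(theta) * chunk
    return np.stack([left, right])

# One frame's worth (8000 / 35 ≈ 228 samples) of a 440 Hz tone:
tone = np.sin(2 * np.pi * 440 * np.arange(228) / 8000)
stereo = pan_mono(tone, pan=-1.0)  # fully left: right channel is silent
```

The pan value itself would come from the angle between the player's facing direction and the sound source, which the triggering frame should know.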

@Miffyli
Collaborator Author

Miffyli commented Dec 5, 2019

@piquirez

Hmm you are right, this could work. I am not sure how easy all the "positional audio processing" would be, but the part of providing samples of sounds-being-played should be possible. It is not perfect but it would be a start.

As for implementing something like this: I am not intimately familiar with that side of ZDoom and do not have time to work on this for at least a couple of months, sadly :(

@piquirez

piquirez commented Dec 6, 2019

@Miffyli
A couple of months doesn't sound bad. I don't have much time myself either, but I will research how ZDoom handles sounds whenever I get the chance, and hopefully I'll be able to help you if you're interested.

@hegde95

hegde95 commented Jun 10, 2020

Hey, was anyone able to get this working?

@Miffyli
Collaborator Author

Miffyli commented Jun 10, 2020

I have not worked on this since my last posts; my attention shifted to other projects, sadly :(. The above issues are still complex to handle, as playing audio (or sound, as it were) is so tightly tied to our "natural passage" of time.

@hegde95

hegde95 commented Jun 10, 2020

Would it be possible to get audio in "real time" by using the fix in #40?

@mwydmuch
Member

Hi @hegde95, as described in #40, the audio can be enabled, so for sure it's possible to obtain it from the OS somehow. On Linux, you can probably access the PulseAudio sink using some Python library. But I guess that's all we know about the topic right now.

@mwydmuch
Member

This approach will require using ViZDoom's async mode to have the audio played correctly.

@hegde95

hegde95 commented Jun 11, 2020

So I'm guessing that if we have multiple games running in parallel, it won't be possible to isolate the sound produced by each game this way?

@Miffyli
Collaborator Author

Miffyli commented Jun 11, 2020

You can create virtual outputs in PulseAudio, and then with some commands direct a program's audio to the sink you want (I cannot find those commands right now). It is doable, but a bit of a mess.
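One way to realize the per-instance routing mentioned above is a PulseAudio null sink per game, with the game's output redirected via the standard `PULSE_SINK` environment variable and recorded from the sink's monitor source with `parec`. The sink naming scheme here is arbitrary, and this sketch only builds the command lines rather than running them:

```python
def sink_commands(instance_id):
    """Build the pactl/parec command lines to isolate one instance's audio."""
    sink = f"vizdoom_{instance_id}"
    # Create a virtual (null) sink for this instance:
    create = ["pactl", "load-module", "module-null-sink",
              f"sink_name={sink}"]
    # Record raw audio from the sink's monitor source:
    record = ["parec", "-d", f"{sink}.monitor", f"{sink}.raw"]
    # The game itself would be launched with PULSE_SINK={sink} in its
    # environment so its audio goes to this sink, not the default output.
    return create, record

create_cmd, record_cmd = sink_commands(0)
```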

If it does not have to be ViZDoom per se, Unity's ML-Agents can be tuned to include audio in the observations by creating the necessary AudioListeners etc. in the Unity game. We did this in some experiments and it worked quite well.

@hegde95

hegde95 commented Aug 13, 2020

If I had to push the audio buffers collected in async mode into ViZDoomPythonModule.cpp as part of the game state, what changes would I have to make?

@Miffyli
Collaborator Author

Miffyli commented Aug 13, 2020

@mwydmuch Could you provide quick pointers for the above?

@mwydmuch mwydmuch modified the milestones: 1.2.0, 1.1.11 Nov 30, 2022