Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

posix_ipc.BusyError: Semaphore is busy #285

Closed
jaydenmilne opened this issue Jun 21, 2019 · 9 comments
Closed

posix_ipc.BusyError: Semaphore is busy #285

jaydenmilne opened this issue Jun 21, 2019 · 9 comments
Assignees
Labels
bug cant-reproduce We can't find reproduce this bug known-issue A known issue with no fix currently available
Milestone

Comments

@jaydenmilne
Copy link
Contributor

Describe the bug
Running integration tests on Linux has an intermittent failure with the error posix_ipc.BusyError: Semaphore is busy

To Reproduce
Run the integration tests a few times on Linux.

I haven't observed this outside of testing.

Logs

_ ERROR at setup of test_all_agents_and_sensors_present[CyberPunkCity-Follow] __

request = <SubRequest 'env_scenario' for <Function test_all_agents_and_sensors_present[CyberPunkCity-Follow]>>

    @pytest.fixture
    def env_scenario(request):
        """Gets an environment for the scenario matching request.param. Creates the env
        or uses a cached one. Calls .reset() for you
        """
        global envs
        scenario = request.param
        if scenario in envs:
            env = envs[scenario]
            env.reset()
            return env, scenario
    
>       env = holodeck.make(scenario, show_viewport=False)

scenarios/conftest.py:38: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../.local/lib/python3.6/site-packages/holodeck/holodeck.py:89: in make
    return HolodeckEnvironment(**param_dict)
../../.local/lib/python3.6/site-packages/holodeck/environments.py:118: in __init__
    self._client.acquire()
../../.local/lib/python3.6/site-packages/holodeck/holodeckclient.py:97: in acquire
    self._get_semaphore_fn(self._semaphore2)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

sem = posix_ipc.Semaphore("/HOLODECK_SEMAPHORE_CLIENT3493712f-3c43-4a09-b06d-44dd5ffd366f", mode=0600)

    def posix_acquire_semaphore(sem):
>       sem.acquire(self.timeout)
E       posix_ipc.BusyError: Semaphore is busy

../../.local/lib/python3.6/site-packages/holodeck/holodeckclient.py:78: BusyError

@jaydenmilne
Copy link
Contributor Author

Still seeing this bug

@jaydenmilne jaydenmilne added this to the 0.3.1 milestone Aug 30, 2019
@Sypherd
Copy link

Sypherd commented Nov 14, 2019

I've had the same issue training an agent on MazeWorld-FinishMazeSphere. Holodeck 0.3.0, DefaultWorlds installed.

To Reproduce

Make "MazeWorld-FinishMazeSphere" and begin training agent (DQN) for ~200,000 steps

Logs

(note that the same error has occurred in env.step as well as env.reset):

Traceback (most recent call last):
  File "./Holo.py", line 161, in <module>
    state = env.reset()# Reset environment
  File "/home/chris/anaconda3/envs/Maze/lib/python3.7/site-packages/holodeck/environments.py", line 270, in reset
    self.tick()
  File "/home/chris/anaconda3/envs/Maze/lib/python3.7/site-packages/holodeck/environments.py", line 339, in tick
    self._client.acquire()
  File "/home/chris/anaconda3/envs/Maze/lib/python3.7/site-packages/holodeck/holodeckclient.py", line 97, in acquire
    self._get_semaphore_fn(self._semaphore2)
  File "/home/chris/anaconda3/envs/Maze/lib/python3.7/site-packages/holodeck/holodeckclient.py", line 78, in posix_acquire_semaphore
    sem.acquire(self.timeout)
posix_ipc.BusyError: Semaphore is busy

@jaydenmilne
Copy link
Contributor Author

jaydenmilne commented Nov 14, 2019

Uh oh, this means we have to fix it now ;)

What distro are you using? What version of holodeck (holodeck.util.get_holodeck_version())?

Now that we know that its an issue in the wild we'll prioritize it. I'm going to try and reproduce the exact failure you're seeing.

@craig-m-k
Copy link

Same problem here. OpenSUSE 15.0, Unreal 4.22.3, Holodeck 0.2.2.

To reproduce

Build ExampleLevel package for Linux. Install where Holodeck installs its worlds, create the JSON files. Invoke holodeck.make with this world within python3 shell.

Logs

[2019.11.16-01.57.32:786][  0]LogHolodeck: Reset signal registered
[2019.11.16-01.57.32:786][  0]LogHolodeck: CommandCenter::CommandCenter() constructed
[2019.11.16-01.57.32:786][  0]LogHolodeck: CommandCenter:: is getting command buffer
[2019.11.16-01.57.32:786][  0]LogHolodeck: Mallocing 8388608 bytes for key command_buffer
[2019.11.16-01.57.32:786][  0]LogHolodeck: Mallocing 1 bytes for key command_bool
Signal 7 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=7
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=105648 LargeMemoryPoolOffset=236784 
[2019.11.16-01.57.32:801][  0]LogCore: === Critical error: ===
Unhandled Exception: SIGBUS: invalid attempt to access memory at address 0x000000001bfc2000

[2019.11.16-01.57.32:801][  0]LogCore: Fatal error!

0x0000000002870a5c Holodeck!UCommandCenter::GetCommandBuffer() [/mnt/data/HAA/holodeck-engine/Source/Holodeck/ClientCommands/Private/CommandCenter.cpp:43]
0x00000000028772cc Holodeck!AHolodeckGameMode::StartPlay() [/mnt/data/HAA/holodeck-engine/Source/Holodeck/HolodeckCore/Private/HolodeckGameMode.cpp:52]
0x00000000054bb659 Holodeck!UWorld::BeginPlay() [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Engine/Private/World.cpp:3947]
0x00000000053c0886 Holodeck!UEngine::LoadMap(FWorldContext&, FURL, UPendingNetGame*, FString&) [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Engine/Private/UnrealEngine.cpp:12412]
0x00000000053bcb20 Holodeck!UEngine::Browse(FWorldContext&, FURL, FString&) [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Engine/Private/UnrealEngine.cpp:11643]
0x0000000004be491c Holodeck!UGameInstance::StartGameInstance() [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Engine/Private/GameInstance.cpp:499]
0x0000000004bcfdae Holodeck!UGameEngine::Start() [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Engine/Private/GameEngine.cpp:840]
0x000000000241286c Holodeck!FEngineLoop::Init() [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Launch/Private/LaunchEngineLoop.cpp:3305]
0x000000000241ce67 Holodeck!GuardedMain(char16_t const*) [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Launch/Private/Launch.cpp:155]
0x0000000005ae3dcf Holodeck!CommonUnixMain(int, char**, int (*)(char16_t const*)) [/mnt/data/HAA/UnrealEngine/Engine/Source/Runtime/Unix/UnixCommonStartup/Private/UnixCommonStartup.cpp:243]
0x00007f061a446f4a libc.so.6!__libc_start_main(+0xe9)
0x000000000240c029 Holodeck!_start()

[2019.11.16-01.57.32:821][  0]LogExit: Executing StaticShutdownAfterError
[2019.11.16-01.57.32:840][  0]LogCore: Warning: Unable to statfs('/home/craig/.local/share/holodeck/0.2.2/worlds/NewLevel/LinuxNoEditor/Holodeck/Saved/Crashes/crashinfo-Holodeck-pid-7950-003DEE0B08D769F54A27838AEEBDF05B/HolodeckLog.txt'): errno=2 (No such file or directory)
Malloc Size=74874 LargeMemoryPoolOffset=311674 
Engine crash handling finished; re-raising signal 7 for the default handler. Good bye.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda/anaconda3/envs/python36/lib/python3.6/site-packages/holodeck/holodeck.py", line 89, in make
    return HolodeckEnvironment(**param_dict)
  File "/home/anaconda/anaconda3/envs/python36/lib/python3.6/site-packages/holodeck/environments.py", line 118, in __init__
    self._client.acquire()
  File "/home/anaconda/anaconda3/envs/python36/lib/python3.6/site-packages/holodeck/holodeckclient.py", line 97, in acquire
    self._get_semaphore_fn(self._semaphore2)
  File "/home/anaconda/anaconda3/envs/python36/lib/python3.6/site-packages/holodeck/holodeckclient.py", line 78, in posix_acquire_semaphore
    sem.acquire(self.timeout)
posix_ipc.BusyError: Semaphore is busy

jaydenmilne added a commit that referenced this issue Jan 9, 2020
I picked this timeout at random, it seems that is was too short.

The original purpose was to prevent tests from hanging indefinitely.
60s is long enough that it shouldn't happen under normal usage
but will prevent tests from hanging.
@jaydenmilne
Copy link
Contributor Author

For those who are interested, it seems part of the issue was a timeout was set too low, that would cause the issue when starting up an environment. That wouldn't explain why it would fail after 200,000 steps, I have been unable to reproduce that issue. Still trying though

jaydenmilne added a commit that referenced this issue Jan 28, 2020
I picked this timeout at random, it seems that is was too short.

The original purpose was to prevent tests from hanging indefinitely.
60s is long enough that it shouldn't happen under normal usage
but will prevent tests from hanging.
@daniekpo
Copy link
Contributor

@jaydenmilne what's the status on this issue? @kolbytn said he hasn't had any more issues. Can we close this?

@jaydenmilne
Copy link
Contributor Author

There are two separate issues that cause this error.

  1. Busy error caused by the initialization timeout being set too low, so holodeck would fail to initialize
  2. Busy error occurring after initialization and ticking a few hundred thousand times (see @Sypherd's comment above), someone else in the lab has run into this too.

We thought they were related, but they weren't. We fixed (1) but haven't figured out (2). I have tried to reproduce (2) since it was reported but haven't been able. I am planning on running a few more tests to try and reproduce (2), since it broke someone's research and is thus a high priority issue.

@jaydenmilne jaydenmilne self-assigned this Jan 30, 2020
@jaydenmilne
Copy link
Contributor Author

I've been unable to reproduce this issue resetting the world 14,000 times and ticking it 13,380,000 times. This is the script I am trying to use to reproduce it:

import holodeck
import datetime
import traceback

i = 0

with holodeck.make("MazeWorld-FinishMazeSphere", show_viewport=False) as env:
    try:
        while True:
            env.tick(1000)
            env.reset()
            i += 1
            if i % 5 == 0:
                print("ticks = {} resets = {}".format(i * 1000, i))
                print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    except Exception as e:
        print("ticks = {} resets = {}".format(i * 1000, i))
        print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

        track = traceback.format_exc()
        print(track)
        print(e)

Should I be ticking it more times per episode?

@jaydenmilne jaydenmilne modified the milestones: 0.3.1, 0.4.0 Mar 24, 2020
@jaydenmilne jaydenmilne added the cant-reproduce We can't find reproduce this bug label Mar 31, 2020
@daniekpo daniekpo assigned daniekpo and unassigned jaydenmilne May 18, 2020
@daniekpo daniekpo added the known-issue A known issue with no fix currently available label Jun 12, 2020
@daniekpo
Copy link
Contributor

Ok. I spent a fair amount of time digging into this issue. I was able to reproduce it consistently when running tests on my Linux box. It has something to do with OpenGL when we run Holodeck headless (but with rendering enabled. i.e delete the DISPLAY environment variable). All the logs that I got had something to do with either OpenGL or some graphics-related problem.

I ran the same tests but with DISPLAY set. Instead, I passed -nullrhi as a parameter to the engine and it never failed. My observation remained true across multiple runs. It looks like some projects that rely on Unreal Engine (like AirSim) have the same unresolved issue. I'm not sure what exactly triggers it but the engine simply fails to initialize OpenGL correctly. Unfortunately, we can't use the -nullrhi flag to run headless since it doesn't render at all, so our camera sensor would be empty. Also, unfortunately, Vulkan currently doesn't support rendering without a display so I couldn't' test that.

Let us know if you run into this problem while running holodeck with a GUI I'm going to close this issue now for housekeeping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug cant-reproduce We can't find reproduce this bug known-issue A known issue with no fix currently available
Projects
None yet
Development

No branches or pull requests

4 participants