Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing global semaphore deadlock for *NIX platforms #1603

Closed

Conversation

colincornaby
Copy link
Contributor

If the program crashed or behaved unexpectedly while a global semaphore was locked, that semaphore would stay locked until the machine was restarted. Unlinking the semaphore upon creation will cause the semaphore to be disposed of once the last process using that semaphore is no longer running.

If the program crashed or behaved unexpectedly while a global semaphore was locked, that semaphore would stay locked until the machine was restarted. Unlinking the semaphore upon creation will cause the semaphore to be disposed of once the last process using that semaphore is no longer running.
@colincornaby colincornaby requested a review from dpogue July 13, 2024 07:02
@colincornaby colincornaby changed the title Fixing global semaphore deadlock for NIX platforms Fixing global semaphore deadlock for *NIX platforms Jul 13, 2024
@colincornaby
Copy link
Contributor Author

FYI - the 'sem_init' in the alternate path isn't just deprecated on macOS, but it's outright non functional.

@@ -106,6 +106,8 @@ hsGlobalSemaphore::hsGlobalSemaphore(int initialValue, const ST::string& name)

/* Named semaphore shared between processes */
fPSema = sem_open(semName.c_str(), O_CREAT, 0666, initialValue);
// Unlink it immediately so it will be freed if we unexpectedly leave it locked
sem_unlink(semName.c_str());
Copy link
Member

@Hoikas Hoikas Jul 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. This will cause other processes to get a new semaphore that isn't connected to this semaphore, breaking the process shared nature of the global semaphore.

From sem_unlink documentation:

     The named semaphore named name is removed.  If the semaphore is in use by
     other processes, then name is immediately disassociated with the semaphore,
     but the semaphore itself will not be removed until all references
     to it have been closed.  Subsequent calls to sem_open() using name will
     refer to or create a new semaphore named name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may actually be the desired behavior here. The Windows version, when used with logging, creates a process local semaphore. Not a machine local semaphore.

However:

  • That may not have been what was intended on Windows.
  • The Windows version can still create a global semaphore if the semaphore name is prepended with 'global' (have not tested- that's what the docs claim.)

Certainly open to suggestions here. Not seeing a great way to close the semaphore automatically on an unexpected exit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have mixed thoughts on this right now... on one hand, this is the easy solution that solves the common case, and I don't think we currently have any cases where we need multi-process semaphores, but this does potentially introduce a limitation where semaphores can't actually be used across multiple processes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirical evidence suggests that Win32 semaphores are created in either the global or session namespace without the local namespace prefix. That is basically the point of the hsGlobalSemaphore - to be able to wait and signal across processes. The big difference is that Win32 semaphores automatically destroy themselves when the last process holding a HANDLE to the semaphore terminates. I'm not really sure what to do here, unfortunately. We may need to rethink how we synchronize access to log files when multiple clients are running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Just to be clear here - not a shot in the dark. When @dpogue was describing the didn't-launch-til-restart behavior I knew exactly what it was because I've seen it happen before.

It shouldn't happen in normal operation but every once in a while if the client crashes it just gets stuck. Especially worrisome - if the client locked up due to a deadlock it will stay deadlocked when it relaunches.

Copy link
Member

@Hoikas Hoikas Jul 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't doubt that this fixes the deadlock on launch at all. The problem is that this fixes it by breaking the global semaphore mechanism.

@dpogue
Copy link
Member

dpogue commented Jul 17, 2024

I guess this issue probably exists on Linux too, but is it worth looking into Mach semaphores for Darwin-based systems? Supposedly this is the primitive on which Darwin implemented named POSIX semaphores: https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/synchronization/synchronization.html

@colincornaby
Copy link
Contributor Author

I had started to look at Mach semaphores, but didn't see anything directly analogous to a named semaphore in my cursory glance. There was some reference about passing semaphores across process boundaries, but I didn't see anything that directly mapped to named. Let me know if you find anything.

Another suggestion I found - that would work well here but completely upend the interface - is to establish a file based lock on the log file. File system locks - supposedly - are released when the process dies. Technically - you may even be able to rebuild named locks over file system locks. I have not looked too deeply at this path though.

@colincornaby
Copy link
Contributor Author

Looking into file locks as a possible replacement - especially in since the guard here is specifically intended to gate file access.

On Windows this function would be LockFileEx:
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex

If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system. However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them.

In POSIX the function would be flock. In Linux it at least seems that file looks are kernel managed will be released on unexpected exit.
https://stackoverflow.com/questions/12651068/release-of-flock-in-case-of-errors

I would have to check the behavior on macOS.

@colincornaby
Copy link
Contributor Author

I think the other PR is likely the direction to go - so I'll close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants