Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mac kext: Workaround for Kextlog/IOSharedDataQueue stall issue #161

Closed
wants to merge 3 commits into from

Conversation

pmj
Copy link
Contributor

@pmj pmj commented Aug 14, 2018

This is a refresh of my earlier PR - no code changes other than that I've reapplied the patches on latest master.

Background:
I did some deep digging and it turned out the stall problems we were having with the logging system were a regression in IOSharedDataQueue itself. When a message is enqueue()d on the ring buffer, the writer thread must decide whether the reader process needs to be woken up with a Mach message. The decision is based on whether the ring buffer was empty (head==tail) before enqueueing or was emptied during the enqueue operation as both of those situations imply that the reader thread has run out of messages to dequeue, and thus will likely have gone to sleep.

The bug relates to updates to the head and tail fields of the IODataQueueMemory struct, i.e. the core mechanism of the ring buffer. There seem to be 2 regressions since OS X 10.11.6:

  1. In macOS 10.12, the kernel side IOSharedDataQueue class was changed to use C11 atomics instead of OSAtomic functions from libkern. The C11 atomics chosen mostly use __ATOMIC_RELAXED semantics with weak memory ordering guarantees. This causes rare occurrences of the bug.
  2. Something seems to have changed on the user space side between 10.12.6 and 10.13.6. I can't see any differences until 10.13.3 (the last version with source) but they would at minimum have been compiled with different compilers. It's possible the IODataQueueDequeue() function itself has been changed between 10.13.3 and 10.13.6. In any case, this change makes the stall bug to become extremely frequent.

The changes cause the reader and writer thread to end up with inconsistent views of the shared IODataQueueMemory data structure. For example:

  1. Reader thread sees head = 100, tail = 200. This means there are messages to be dequeued.
  2. Writer thread wants to enqueue a new message, notes the initial state of the ring buffer is head = 100, tail = 200.
  3. Reader dequeues a message, updates head to 200.
  4. Writer has queued a new message, updates tail to 300.
  5. Reader checks for more messages. head and tail are found to be equal at 200, the tail write of 300 has not propagated to this thread yet.
  6. Writer checks whether the ring buffer was emptied during the enqueue operation. head is still 100, as the write has not yet propagated.

This can be avoided by enforcing strict ordering on the reads and writes, which used to be the case in OS X 10.11.6. I guess those OSAtomic functions are being deprecated so the implementation was switched to fancy new C11 atomics.

The 2 commits in this PR work around the problem on 10.13.6 by subclassing IOSharedDataQueue and overriding its enqueue() implementation with the version from 10.11.6, and by replacing the IODataQueueDequeue function from the system's IOKit.Framework with the version from 10.13.3.

I have reported this bug to Apple with Radar number 43093190. A public copy of this bug report is on OpenRadar.

@msftclas
Copy link

msftclas commented Aug 14, 2018

CLA assistant check
All CLA requirements met.

@microsoft microsoft deleted a comment from pmj Aug 14, 2018
@pmj pmj force-pushed the kextlog-stall-fix branch from 765d4a3 to b317148 Compare September 1, 2018 10:43
pmj added 3 commits September 19, 2018 16:42
This adds the implementation of the IODataQueueDequeue() function from IODataQueueClient.c in the IOKitUser-1445.40.1 bundle from the macOS 10.13.3 source code - see https://opensource.apple.com/source/IOKitUser/IOKitUser-1445.40.1/

This mostly works around the kext logging stall issue which in turn has been reported to Apple as Radar issue 43093190.
…lementation

A subclass of IOSharedDataQueue is added which overrides the enqueue() member function with the implementation from OS X 10.11.6, prior to the introduction of a regression which causes inconsistent memory ordering across threads.

Both user client subclasses are updated to use the new class in place of IOSharedDataQueue.

This works around the remaining occurrences of the IOSharedDataQueue stall issue which has been reported to Apple as Radar issue 43093190.
@pmj pmj force-pushed the kextlog-stall-fix branch from b317148 to 72cf85e Compare September 19, 2018 14:43
@pmj
Copy link
Contributor Author

pmj commented Sep 19, 2018

Just a note that I've rebased these patches on latest master, and that I thus far haven't had any official feedback on the bug filed with Apple, nor has there been a fix in a publicly released macOS version yet.

@pmj
Copy link
Contributor Author

pmj commented Sep 28, 2018

I've had a response from Apple on the Radar, indicating that there should be a fix for this in an upcoming OS release. I suspect that means we'll never see a fix in 10.13.x, only on 10.14.1+, so for supporting High Sierra we should definitely include some kind of workaround. The user space side fix is, if I remember correctly, much more important than the kernel side, which only rarely causes issues under very high pressure. So if the problem is inclusion/distribution of APSL source code and binaries would be OK, we could ship the fix as a dylib which we look for at runtime and can maintain out of tree. (This would be trickier to do in a kext.)

@pmj
Copy link
Contributor Author

pmj commented Oct 5, 2018

I've just tested with macOS 10.14.1 beta 2 (18B50c) and the bug does indeed appear to be fixed there. I've asked whether there are any plans to backport the fix to a 10.13.6 update, no idea how likely that is.

Incidentally, it looks like the severe regression (user space side) was introduced with 10.13.6 itself - source code for the full 10.13 series is now online, and the IODataQueue functions were changed to use C11 atomics in IOKitUser-1445.71.1, which I guess caused the bug. (Or rather, using incorrect memory ordering semantics with those C11 atomics.)

@pmj
Copy link
Contributor Author

pmj commented Oct 10, 2018

I've got as close to a definitive statement from Apple as I'm likely to get regarding this bug on 10.13: apparently they won't be providing a fix against 10.13.6 as it's not a security issue. So I'd strongly recommend we implement some kind of workaround for that version, unless Apple can be persuaded to change their mind. (Presumably via Microsoft's contact at Apple's DTS?) Any news on the license issue or thoughts on asking DTS @sanoursa? As I mentioned in an earlier comment, I think we can keep the workaround in a separate binary which we would attempt to load at runtime on 10.13.6 only. I've thought of a way we could do this for the kext-side fix too.

@sanoursa
Copy link
Contributor

Thanks for following up on this. From what I understand, Office will be moving to 10.14.1 pretty quickly after it is available, but in the meantime a temporary fix (the separate binary) would definitely be helpful. We could probably stick that binary in a nuget package that we put in our dev feed, unless you have a better way to distribute it.

@pmj
Copy link
Contributor Author

pmj commented Oct 11, 2018

We could probably stick that binary in a nuget package that we put in our dev feed, unless you have a better way to distribute it.

I don't really know much about nuget, but if our installer is already set up to download stuff from nuget, then that seems like a good way forward assuming it doesn't matter that there's not actually any .NET based component to it. A few things to clarify before I put this together:

  • I'm assuming the code for this workaround would live in a separate repository?
  • If I put together an Xcode based project that builds the dylib, can you take it from there to bundle it into a nuget package that our installer downloads?
  • There will be a small chunk of code required in the PrjFSLib native library that actually looks for and loads the workaround dylib on affected systems, but this won't be tainted license-wise in any way.

@nickgra
Copy link
Member

nickgra commented Oct 15, 2018

@pmj That plan sounds good to me. Let me know if you need any help in getting that put together, I can handle getting the dylib packaged from the separate repo and consumed in this repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants