Mac: sporadic vnode I/O errors #328
@pmj Is the following test case that failed on a Mac functional test related? Or should I create a separate issue?
@derrickstolee that one is covered by #315
Thanks for the help! Sorry for the noise.
I think the two failure cases that we're describing are the same thing. I've only ever seen this callstack (line numbers will be off for your branches/HEAD):
The line number for HandleFileOpOperation will show that we're in the file handle close path. I hit this all the time when working on large builds: at least a 50% hit rate during the most parallel parts of the build. One thought that comes to mind is that, because there's so much I/O flying around in the system, we hit a vnode that is incomplete and missing attributes (which would be an OS bug, IMO). Alternatively, some system process produces a VNodeType that we're not supporting, and since we intercept every file handle close we're bound to hit this eventually. I've found machines that I've left idle for days panicked on this call stack, so that might be the better explanation. I think we need to take some fix for this particular panic soon regardless. I know @wilbaker has some opinions about 'hiding' the problem as your current PR does; I'm curious to see what others think.
On certain Mac Pros (but not the ones sitting on my desk), I can repro this 100% of the time. Having the kext loaded and then running our PrepFunctionalTests.sh script (which first mounts a .dmg and runs a .pkg in it) will hit this panic. Just another data point: it doesn't change my explanation for what's going on here, it just proves that I/O on disk image volumes is triggering interactions in our kext when it shouldn't be. Edit: For closure, I figured out what's different. Someone installed System Center Endpoint Protection on these machines, which brings its own kext. So the root cause here might be similar to the issue with the Google Drive kext.
I've just hit this condition, with the same trace as @nickgra's, during development/testing. In my case it failed with err == 9 (EBADF, Bad file descriptor), and vn_getpath failed on the vnode. I'm going to put some extra logging in my local version that prints more diagnostic data about the vnode, plus the path passed in rather than trying to generate it. I suspect this is somehow a vnode that ends up deleted/recycled before we're done with it? I've not got any AV installed on this test machine (quad-core 2012 Mac Mini).
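Roughly the kind of logging I mean (a minimal sketch of a kauth fileop listener, not our actual handler; it just assumes the standard KAUTH_FILEOP_CLOSE argument layout and the vnode KPIs from sys/vnode.h):

```c
#include <sys/kauth.h>
#include <sys/vnode.h>
#include <libkern/libkern.h>

// Sketch: log extra vnode diagnostics on file-handle close. For
// KAUTH_FILEOP_CLOSE, arg0 is the vnode and arg1 is the path the kernel
// already resolved, so we can log it instead of calling vn_getpath().
static int DiagnosticFileOpCallback(kauth_cred_t credential, void *idata,
                                    kauth_action_t action,
                                    uintptr_t arg0, uintptr_t arg1,
                                    uintptr_t arg2, uintptr_t arg3)
{
    if (action == KAUTH_FILEOP_CLOSE)
    {
        vnode_t vp = (vnode_t)arg0;
        const char *path = (const char *)arg1;

        printf("vfs4g-debug: close: vtype=%d vid=%u recycled=%d path=%s\n",
               (int)vnode_vtype(vp), vnode_vid(vp), vnode_isrecycled(vp),
               path != NULL ? path : "(null)");
    }

    return KAUTH_RESULT_DEFER;
}
```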
I've run into cases where
I've just run into another error, this time getting flags in the
What's the best way (assuming there is a way) to programmatically check that a vnode has been recycled?
With the current vnode cache proposal we don't cache anything about the file itself. Even if we did (e.g. cache fsid and inode), I think the cache would only help us if the vnode in the cache happened to have the same vid. Did you have any ideas in mind for something more that the cache could do?
Although (I believe) it's asynchronous, I wonder if the File Events API would be more reliable than listening to FileOps in the kext. Any thoughts on that approach? Another approach might be to rely on the vnode operations rather than the FileOps. If we wanted to stick with the current approach, are you aware of any machine/macOS settings we could tweak to make this less likely to occur?
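For reference, this is the kind of user-space listening I had in mind; a minimal sketch against the stock CoreServices FSEvents API (the watched path and latency below are made up):

```c
#include <CoreServices/CoreServices.h>
#include <stdio.h>

// Sketch: watch a directory tree for per-file change events via FSEvents.
static void OnFSEvents(ConstFSEventStreamRef stream, void *info, size_t numEvents,
                       void *eventPaths, const FSEventStreamEventFlags flags[],
                       const FSEventStreamEventId ids[])
{
    char **paths = (char **)eventPaths;
    for (size_t i = 0; i < numEvents; i++)
    {
        if (flags[i] & (kFSEventStreamEventFlagItemModified | kFSEventStreamEventFlagItemCreated))
        {
            printf("changed: %s\n", paths[i]);
        }
    }
}

int main(void)
{
    CFStringRef root = CFSTR("/path/to/virtualization/root"); // made-up path
    CFArrayRef watchPaths = CFArrayCreate(NULL, (const void **)&root, 1, &kCFTypeArrayCallBacks);

    FSEventStreamRef stream = FSEventStreamCreate(
        NULL, OnFSEvents, NULL, watchPaths,
        kFSEventStreamEventIdSinceNow,
        0.1,                                  // latency in seconds (coalescing window)
        kFSEventStreamCreateFlagFileEvents);  // per-file events, not just per-directory

    FSEventStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(), kCFRunLoopDefaultMode);
    FSEventStreamStart(stream);
    CFRunLoopRun();
    return 0;
}
```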
As long as there are references (
I don't think the vid will still match once the recycling process has begun; it should have been incremented once at that point. We can technically detect that, but it's an implementation detail rather than something that's guaranteed. However, it's vanishingly unlikely that once we get a
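To illustrate with the documented KPIs (sketch only, placeholder function name): vnode_getwithvid() only takes an iocount if the vnode still has the vid we captured earlier, and vnode_isrecycled() reports whether it's already dead or being recycled.

```c
#include <sys/vnode.h>
#include <stdbool.h>

// Sketch: detect a recycled vnode by re-validating the vid we saw earlier.
static bool TryUseVnode(vnode_t vp, uint32_t vidWhenFirstSeen)
{
    if (vnode_isrecycled(vp) || vnode_getwithvid(vp, vidWhenFirstSeen) != 0)
    {
        // Either mid-recycle, or the vid no longer matches (ENOENT):
        // this vnode no longer refers to the file we originally saw.
        return false;
    }

    // ... safe to use vp here while we hold the iocount ...

    vnode_put(vp);
    return true;
}
```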
Yeah, fsevents would work for simply detecting changes to files. The async nature is fairly irrelevant, as our kext to provider messaging is also currently async.
Yes, I think that's probably not a bad idea anyway, though I'm not sure. I assume it's not a problem if we end up with false positives for modified/created messages? What about timing? Are these notifications used lazily on the next git command, or do they need to arrive promptly? (If false positives are OK, we can send a modified message from the vnode listener.)
No, I don't think this is something we can influence via settings.
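To make the vnode-listener idea concrete, something along these lines (a hypothetical sketch, not existing VFS4G code; a real implementation would queue a "modified" message to the provider where this only logs):

```c
#include <sys/kauth.h>
#include <sys/vnode.h>
#include <libkern/libkern.h>

// Sketch: a KAUTH_SCOPE_VNODE listener that notices write-ish authorizations.
// Note the possible false positives: authorization for a write does not
// guarantee that a write actually happens.
static int VnodeScopeCallback(kauth_cred_t credential, void *idata,
                              kauth_action_t action,
                              uintptr_t arg0, uintptr_t arg1,
                              uintptr_t arg2, uintptr_t arg3)
{
    vnode_t vp = (vnode_t)arg1; // for the vnode scope, arg1 is the vnode being authorized

    if ((action & (KAUTH_VNODE_WRITE_DATA | KAUTH_VNODE_APPEND_DATA | KAUTH_VNODE_DELETE)) != 0)
    {
        printf("vfs4g-sketch: would send 'modified' for vnode %p\n", vp);
    }

    // Never make the authorization decision ourselves.
    return KAUTH_RESULT_DEFER;
}

static kauth_listener_t gVnodeListener;

static void RegisterVnodeListener(void)
{
    gVnodeListener = kauth_listen_scope(KAUTH_SCOPE_VNODE, VnodeScopeCallback, NULL);
}
```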
I've looked this up in the xnu source now: the caller of the
Thanks for pointing me to your changes, I clearly need to catch up on your PRs!
I checked Apple's documentation for vnode_getwithref(), and if it succeeded I'm not sure why the vnode would have been recycled. From the Apple docs (emphasis mine):
If
If I'm following you correctly I think you're suggesting something like this:
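(Rough sketch of what I mean, with placeholder names; not actual kext code.)

```c
#include <sys/vnode.h>
#include <sys/errno.h>

// Placeholder sketch: compare the vid we captured when the vnode was first
// seen against the current vid after taking an iocount with
// vnode_getwithref(); a changed vid means it was recycled underneath us.
static int UseVnodeIfNotRecycled(vnode_t vp, uint32_t vidWhenCached)
{
    int err = vnode_getwithref(vp);
    if (err != 0)
    {
        return err; // couldn't take an iocount
    }

    if (vnode_vid(vp) != vidWhenCached)
    {
        // The vid changed, so the vnode was recycled since we cached it and
        // no longer refers to the file we originally saw.
        vnode_put(vp);
        return ENOENT;
    }

    // ... use vp here while the iocount is held ...

    vnode_put(vp);
    return 0;
}
```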
Is that correct? I do share your concern that if we went with this approach we'd be relying on an undocumented implementation detail (and based on the documentation of
Last time I checked the source, from the comments/naming it sounded like fsevents notifications are delivered after the I/O has already happened. I ask because there is a subtle difference when it comes to how we are asynchronous vs. how fsevents is asynchronous. Although it's true that our user-mode process will pick up the message from the kext asynchronously, the kext will block the FileOp callback until it receives a response from user-mode (making the kext's handling of the event synchronous from the perspective of the caller).
False positives are not a problem as long as we don't end up with too many of them (especially when it comes to modified messages). We're able to keep git commands quick by restricting the set of files that git considers, and each false positive means there's one more file that git needs to consider. If this is infrequent then there's no problem, but we don't want git to end up checking all of the files in a repo (unless they've all actually been modified). On the Windows platform we aggressively add files to the set that git considers and this has not caused issues (files are added when a handle is opened with write access, and writes may never occur).
VFS4G will immediately add the modified/created files to a list that it maintains. Whenever a git command runs, VFS4G provides this list of files to git so that it knows what files it should consider. So, if a user has a script that modifies a file and then immediately runs a git command, we want to make sure that VFS4G has been notified of that modification before git runs.
That sounds like a good option to me. I think we'll simply need to give it a try and see how frequent false positives are. My hope is that we can sufficiently reduce false positives by only notifying VFS4G when the
False positives are okay, but I think the bigger issue/concern is completely missing events for recycled vnodes that we're unable to reason about.
We already had #209 (partially) open for this change. I've updated it with the details from this discussion.
I'm not really expecting anyone to review #555. :-)
As so often, I think the terminology is used inconsistently, so that might be what's causing confusion: In this specific case, the vnode won't die between
Precisely.
Ah, indeed, I forgot that the kext waits for a response for the modified messages too. Your example with the script makes sense. So
Yeah; I seem to remember having some issues with the auth cache in that context, where ACCESS results were cached and I never got a non-ACCESS callback. I might be misremembering (this was on a project in 2012/2013) but as we've disabled the cache, it shouldn't affect us either way. Of course, we or another vnode listener kext could return DENY from the callback, in which case
Following out-of-band discussion, I'm going to wrap up most of the error logging I added with #555 into a PR we can merge into master, so that we get a better overall picture of these errors and so that they are logged automatically on end-user machines (cf. #396), as they may help with diagnosing problems. Additionally, we want to start mitigating the known problems, particularly the name cache misses (with a view to implementing something along the lines of #338).
Note to self: add a functional-tests CI step that scans the kext log for errors and fails the job if any are found.
Not sure if this is of interest, but I had the logger running to test some theories, and while I was not interacting with my virtualization root we hit some of the new errors from #643:
Yeah, this has turned up before - it seems that by the time we get the
There seem to be a number of places in the kext's vnode/fileop handler where I/O (reading attributes, etc.) occasionally fails. In a few places there are asserts that even take down the system in case of such failures. We need to track down all the failure cases and their common root causes, and handle them correctly and gracefully in the kext instead of crashing the system.

Known failure cases:
Reports of more cases appreciated.
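As a rough illustration of the desired shape of a fix (hypothetical code, assuming an attribute read via vnode_getattr inside the handler): log the failure and bail out instead of asserting.

```c
#include <sys/vnode.h>
#include <libkern/libkern.h>
#include <stdbool.h>

// Hypothetical example of handling an attribute-read failure gracefully
// instead of asserting (function and log message are placeholders).
static bool TryReadFileFlags(vnode_t vp, vfs_context_t ctx, uint32_t *outFlags)
{
    struct vnode_attr attrs;
    VATTR_INIT(&attrs);
    VATTR_WANTED(&attrs, va_flags);

    int err = vnode_getattr(vp, &attrs, ctx);
    if (err != 0 || !VATTR_IS_SUPPORTED(&attrs, va_flags))
    {
        // Previously this would be assert(err == 0). Instead, record the
        // failure and let the caller skip this vnode rather than panicking.
        printf("vfs4g: vnode_getattr failed, err=%d\n", err);
        return false;
    }

    *outFlags = attrs.va_flags;
    return true;
}
```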