docker tasks on generic worker sometimes hit issues with caches #538
@matt-boris and I dug into this a bunch today. Here's what we found:
There are two very confusing parts of this still:
I think the next step here is to add a bunch of additional logging.
Analysis done in #538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
I managed to reproduce this issue with additional debugging information in this task. In it, we have one cache configured:
My additional debugging consisted of dumping all the files in the cache directory, among other things. Interestingly (and annoyingly), the immediately previous run was in a very odd state: it was claimed by a worker that didn't seem to exist in GCP (at least, not under the worker id that was given). It's not clear to me whether or not this is related to the issue, but it's coincidental enough to note here. (The previous time we saw this, the errors came after a well-handled spot termination.) It does seem like there's a legitimate bug in generic-worker here.
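For illustration, the debug dump is nothing more elaborate than walking the cache directory and printing what's there. A minimal sketch of that kind of dump (the path is a placeholder, not the actual mounted cache location on these workers) might look like this:

```python
#!/usr/bin/env python3
"""Dump the contents of a worker cache directory for debugging.

Illustrative sketch only: CACHE_DIR is a placeholder, not the real
location used on these instances.
"""
import os
import time

CACHE_DIR = "/path/to/cache"  # hypothetical; substitute the mounted cache directory


def dump_cache(root: str) -> None:
    """Print size, mtime, and path for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except FileNotFoundError:
                # Files can disappear mid-walk if something else is cleaning up.
                continue
            mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(st.st_mtime))
            print(f"{st.st_size:>12}  {mtime}  {full}")


if __name__ == "__main__":
    dump_cache(CACHE_DIR)
```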
Is the cache local to the worker, or downloaded? If local, I'm especially confused how two runs of a task on different workers could interfere with one another in terms of local cache. A few guesses:
Thanks Dustin; you were the closest thing we had to an expert based on the blame 😬
In this case, we have a mounted cache volume, which (I believe) is used across multiple workers, which could explain this part?
I'll check into these theories; the last one in particular is interesting.
I apologize, I only barely remember this! But from my reading of the code, I think the place to look when this occurs is the previous run on the same worker. Another theory, and this one I'm more hopeful about: there is some background processing in the worker to clean up "old" caches. I would imagine that doing so involves walking the directory tree and deleting things.
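To make that theory concrete, the imagined cleanup would look roughly like the sketch below. This is not generic-worker's actual code; the directory layout, the age threshold, and the mtime-based "last used" bookkeeping are all assumptions. The point is only that a walk like this deletes a cache out from under any task that still expects it whenever that bookkeeping is stale or wrong.

```python
"""Hypothetical cache cleanup, sketched to illustrate the theory above.

This is NOT generic-worker's implementation; names, layout, and the
age threshold are invented for illustration.
"""
import os
import shutil
import time

CACHES_ROOT = "/worker/caches"       # hypothetical location of all named caches
MAX_AGE_SECONDS = 7 * 24 * 60 * 60   # hypothetical "old" threshold: one week


def clean_old_caches(root: str = CACHES_ROOT, max_age: int = MAX_AGE_SECONDS) -> None:
    """Delete any cache directory that hasn't been 'used' recently."""
    now = time.time()
    for entry in os.scandir(root):
        if not entry.is_dir():
            continue
        # If "last used" is tracked via mtime (or similar bookkeeping) and that
        # bookkeeping is stale, a cache that a queued task still expects to
        # find gets deleted here -- the failure mode the theory predicts.
        last_used = entry.stat().st_mtime
        if now - last_used > max_age:
            shutil.rmtree(entry.path, ignore_errors=True)


if __name__ == "__main__":
    clean_old_caches()
```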
Sorry, I only saw after pinging you that it was... a decade ago that you reviewed it 😬
Right, right, thank you for pointing this out! I kept reading the cache definitions as mount definitions, but after a re-read that's clearly not the case.
And indeed, that's right where we find this in my most recent tests:
Curiously though, the original task this was reported in doesn't seem to have this correlation. Unfortunately, we no longer have history for those workers :(.
That does sound very plausible, indeed! Is garbageCollection what you're referring to here?
One thing I realized while looking at things just now is that the reason we don't hit this on the non-d2g tasks is that none of them have caches configured. The d2g ones have caches configured ostensibly for the docker images, but they end up getting used for the VCS checkout as well, AFAICT.
That sounds like a promising lead! I don't know the relevant GW code, though.
I had another quick look at this today. I think it's unlikely that garbage collection is a factor here, as we've been seeing this on instances where the task directory is located on a 300GB disk, and the earlier tasks certainly don't fill it up enough to get GC running. There's also no evidence in the logs that GC runs prior to the cache error being detected. In the absence of other ideas, I'm tempted to say that the best path forward here may be to see if we can get rid of the checkout cache for the d2g tasks. This repository is small enough that it really doesn't give us much benefit.
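As a side note, the "the disk never gets full enough to trigger GC" reasoning is easy to double-check on a worker with something like the snippet below. The mount point is a placeholder, and the actual GC trigger is whatever the worker is configured with, so treat this as a rough sanity check rather than a definitive test.

```python
"""Quick check of free space on the task-directory disk.

The mount point and the notion of "enough free space" are assumptions;
the real GC trigger is whatever the worker is configured with.
"""
import shutil

TASK_DISK = "/"  # hypothetical mount point of the 300GB task-directory disk


def report_disk_usage(path: str = TASK_DISK) -> None:
    """Print total/used/free space for the given mount point in GiB."""
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    print(f"total: {usage.total / gb:.1f} GiB")
    print(f"used:  {usage.used / gb:.1f} GiB")
    print(f"free:  {usage.free / gb:.1f} GiB")


if __name__ == "__main__":
    report_disk_usage()
```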
I got inspired and ended up setting up my own worker with some extra instrumentation. I'm pretty darn sure this is a generic-worker bug at this point; I've filed taskcluster/taskcluster#7128 about it.
As an update here, we've landed a couple of fixes upstream already. We're waiting on one edge case to be addressed, after which we'll be unblocked on picking up a new generic-worker version that should be free of this issue.
With some help from the Taskcluster team, I did some testing with the latest version of generic-worker, and it appears to fix this issue. We'll need to upgrade the version we're using on our images, and then we should be able to call this fixed, and switch to d2g/generic-worker for CPU tasks. (Switching GPU tasks to d2g/generic-worker is #710.)
CPU workers are switching to the new image in #949. We still have a bit more work to do on the image before it's ready for the GPU workers.
For example: https://firefox-ci-tc.services.mozilla.com/tasks/IvbeCQBuRuKIOaeOIGEfHg
We had an initial run of this task which got spot killed. Subsequent runs failed with: