docker tasks on generic worker sometimes hit issues with caches #538
@matt-boris and I dug into this a bunch today. Here's what we found:
There are two very confusing parts of this still:
I think the next step here is to add a bunch of additional logging.
Analysis done in #538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
I managed to reproduce this issue with additional debugging information in this task. In it, we have one cache configured:
My additional debugging consisted of dumping all the files in the cache directory, among other things. Interestingly (and annoyingly), the immediately previous run was in a very odd state: it was claimed by a worker that didn't seem to exist in GCP (at least, not under the worker id that was given). It's not clear to me whether or not this is related to the issue, but it's coincidental enough to note here. (The previous time we saw this, the errors came after a well-handled spot termination.) It does seem like there's a legitimate bug in generic-worker here.
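For illustration, the debug dump is nothing more elaborate than walking the cache directory and printing what's there. A minimal sketch of that kind of dump (the path is a placeholder, not the actual mounted cache location on these workers) might look like this:

```python
#!/usr/bin/env python3
"""Dump the contents of a worker cache directory for debugging.

Illustrative sketch only: CACHE_DIR is a placeholder, not the real
location used on these instances.
"""
import os
import time

CACHE_DIR = "/path/to/cache"  # hypothetical; substitute the mounted cache directory


def dump_cache(root: str) -> None:
    """Print size, mtime, and path for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except FileNotFoundError:
                # Files can disappear mid-walk if something else is cleaning up.
                continue
            mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(st.st_mtime))
            print(f"{st.st_size:>12}  {mtime}  {full}")


if __name__ == "__main__":
    dump_cache(CACHE_DIR)
```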
Is the cache local to the worker, or downloaded? If local, I'm especially confused how two runs of a task on different workers could interfere with one another in terms of local cache. A few guesses:
Thanks Dustin; you were the closest thing we had to an expert based on the blame 😬
In this case, we have a mounted cache volume, which (I believe) is used across multiple workers, which could explain this part?
I'll check into these theories; the last one in particular is interesting.
I apologize, I only barely remember this! But from my reading of the code, I think the place to look when this occurs is the previous run on the same worker. Another theory, and this one I'm more hopeful about: there is some background processing in the worker to clean up "old" caches. I would imagine that doing so involves walking the directory tree and deleting things.
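To make that theory concrete, the imagined cleanup would look roughly like the sketch below. This is not generic-worker's actual code; the directory layout, the age threshold, and the mtime-based "last used" bookkeeping are all assumptions. The point is only that a walk like this deletes a cache out from under any task that still expects it whenever that bookkeeping is stale or wrong.

```python
"""Hypothetical cache cleanup, sketched to illustrate the theory above.

This is NOT generic-worker's implementation; names, layout, and the
age threshold are invented for illustration.
"""
import os
import shutil
import time

CACHES_ROOT = "/worker/caches"       # hypothetical location of all named caches
MAX_AGE_SECONDS = 7 * 24 * 60 * 60   # hypothetical "old" threshold: one week


def clean_old_caches(root: str = CACHES_ROOT, max_age: int = MAX_AGE_SECONDS) -> None:
    """Delete any cache directory that hasn't been 'used' recently."""
    now = time.time()
    for entry in os.scandir(root):
        if not entry.is_dir():
            continue
        # If "last used" is tracked via mtime (or similar bookkeeping) and that
        # bookkeeping is stale, a cache that a queued task still expects to
        # find gets deleted here -- the failure mode the theory predicts.
        last_used = entry.stat().st_mtime
        if now - last_used > max_age:
            shutil.rmtree(entry.path, ignore_errors=True)


if __name__ == "__main__":
    clean_old_caches()
```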
Sorry, I only saw after pinging you that it was... a decade ago that you reviewed it 😬
Right, right, thank you for pointing this out! I kept reading the cache definitions as mount definitions, but after a re-read that's clearly not the case.
And indeed, that's right where we find this in my most recent tests:
Curiously though, the original task this was reported in doesn't seem to have this correlation. Unfortunately, we no longer have history for those workers :(.
That does sound very plausible, indeed! Is garbageCollection what you're referring to here?
One thing I realized while looking at things just now is that the reason we don't hit this on the non-d2g tasks is that none of them have caches configured. The d2g ones have caches configured ostensibly for the docker images, but they end up getting used for the VCS checkout as well, AFAICT.
That sounds like a promising lead! I don't know the relevant GW code, though.
I had another quick look at this today. I think it's unlikely that garbage collection is a factor here, as we've been seeing this on instances where the task directory is located on a 300GB disk, and the earlier tasks certainly don't fill it up enough to get GC running. There's also no evidence in the logs that GC runs prior to the cache error being detected. In the absence of other ideas, I'm tempted to say that the best path forward here may be to see if we can get rid of the checkout cache for the d2g tasks. This repository is small enough that it really doesn't give us much benefit.
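As a side note, the "the disk never gets full enough to trigger GC" reasoning is easy to double-check on a worker with something like the snippet below. The mount point is a placeholder, and the actual GC trigger is whatever the worker is configured with, so treat this as a rough sanity check rather than a definitive test.

```python
"""Quick check of free space on the task-directory disk.

The mount point and the notion of "enough free space" are assumptions;
the real GC trigger is whatever the worker is configured with.
"""
import shutil

TASK_DISK = "/"  # hypothetical mount point of the 300GB task-directory disk


def report_disk_usage(path: str = TASK_DISK) -> None:
    """Print total/used/free space for the given mount point in GiB."""
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    print(f"total: {usage.total / gb:.1f} GiB")
    print(f"used:  {usage.used / gb:.1f} GiB")
    print(f"free:  {usage.free / gb:.1f} GiB")


if __name__ == "__main__":
    report_disk_usage()
```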
I got inspired and ended up setting up my own worker with some extra instrumentation. I'm pretty darn sure this is a generic-worker bug at this point; I've filed taskcluster/taskcluster#7128 about it.
As an update here, we've landed a couple of fixes upstream already. We're waiting on one edge case to be addressed, after which we'll be unblocked on picking up a new generic-worker version that should be free of this issue.
With some help from the Taskcluster team, I did some testing with the latest version of generic-worker, and it appears to fix this issue. We'll need to upgrade the version we're using on our images, and then we should be able to call this fixed, and switch to d2g/generic-worker for CPU tasks. (Switching GPU tasks to d2g/generic-worker is #710.)
CPU workers are switching to the new image in #949. We still have a bit more work to do on the image before it's ready for the GPU workers.
For example: https://firefox-ci-tc.services.mozilla.com/tasks/IvbeCQBuRuKIOaeOIGEfHg
We had an initial run of this task which got spot killed. Subsequent runs failed with: