Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker tasks on generic worker sometimes hit issues with caches #538

Open
Tracked by #311
bhearsum opened this issue Apr 26, 2024 · 13 comments
Open
Tracked by #311

docker tasks on generic worker sometimes hit issues with caches #538

bhearsum opened this issue Apr 26, 2024 · 13 comments
Assignees
Labels
taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

@bhearsum
Copy link
Collaborator

For example: https://firefox-ci-tc.services.mozilla.com/tasks/IvbeCQBuRuKIOaeOIGEfHg

We had an initial run of this task which got spot killed. Subsequent runs failed with:

error: cache /builds/worker/checkouts is not empty and is missing a .cacherequires file; the cache names for this task are likely mis-configured or TASKCLUSTER_CACHES is not set properly
@bhearsum bhearsum self-assigned this Apr 26, 2024
@bhearsum
Copy link
Collaborator Author

@matt-boris and I dug into this a bunch today. Here's what we found:

  • The task linked above got into a strange loop where it was quickly spot terminated, then the next run failed with the error above, and then repeated that over and over.
  • There was another task in the same graph that had many exceptions - but all of those were spot terminations. The first rerun after the first spot termination had no issue with caches - they end up in the "cache is empty" code path in run-task just like the first run.
  • The most notable differences between the docker-worker machines and the new workers that are hitting this issue are: the new ones are generic-worker, and that they handle spot terminations correctly. This ought to mean that the cache is preserved between runs.
  • The error message we're seeing comes from run-task, and happens when we have a non-empty cache with no .cacherequires file on it.

There are two very confusing parts of this still:

  1. Why does one task that's been spot killed end up with this error, while another one runs fine?
  2. How exactly could get into this state. The runs that get spot killed get far enough that they have already written a .cacherequires file, which suggests that we should end up in the code path in run-task that finds a .cacherequires file

I think the next step here is to add a bunch of additional logging to run-task to help inspect the state of the cache when we hit a failure like this. It _should_be fairly easy to achieve this by simulating host maintance events on workers.

bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue May 3, 2024
Analysis done in mozilla#538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue May 6, 2024
Analysis done in mozilla#538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
@eu9ene eu9ene added the taskcluster Issues related to the Taskcluster implementation of the training pipeline label May 8, 2024
bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue May 8, 2024
Analysis done in mozilla#538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
bhearsum added a commit to bhearsum/firefox-translations-training that referenced this issue May 9, 2024
Analysis done in mozilla#538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
bhearsum added a commit that referenced this issue May 9, 2024
Analysis done in #538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
@bhearsum
Copy link
Collaborator Author

bhearsum commented May 9, 2024

I managed to reproduce this issue with additional debugging information in this task. In it, we have one cache configured:

  cache:
    translations-level-1-checkouts-v3-62df7d36d8ce4e58ade4-bEwo47EJT7-85VhUssFA3Q: /builds/worker/checkouts

My additional debugging consisting of dumping all the files in the cache directory, as well as the .cacherequires and .cachelog files. In the run that hit this issue, we do in fact see that /builds/worker/checkouts has a checkout, but no .cacherequires file.

Interestingly (and annoyingly), the immediate previous run was in a very odd state: it was claimed by a worker that didn't seem to exist in GCP (at least, not under the worker id that was given). It's not clear to me whether or not this is related to the issue, but it's coincidental enough to note here. (The previous time we saw this the errors came after a well handled spot termination.)

It does seem like there's a legitimate bug in run-task somewhere, but this is very difficult to debug given the circumstances required to reproduce :(.

@djmitche
Copy link

Is the cache local to the worker, or downloaded? If local, I'm especially confused how two runs of a task on different workers could interfere with one another in terms of local cache.

A few guesses:

  • Could multiple generic-worker instance be running in parallel, leading to run-task racing against itself?
  • Are these spot-terminated instances "rebooting" somehow, and coming back up with partially populated caches?
  • I notice run-task's comments say this "shouldn't happen" based on an argument about the order things were done in the filesystem. But that ordering isn't always consistent in the face of mountings/unmountings, reboots, etc. - is it possible there's some race condition there? Maybe an fsync would help?

@bhearsum
Copy link
Collaborator Author

Thanks Dustin; you were the closest thing we had to an expert based on the blame 😬

Is the cache local to the worker, or downloaded? If local, I'm especially confused how two runs of a task on different workers could interfere with one another in terms of local cache.

In this case, we have a mounted cache volume, which (I believe) is used across multiple workers, which could explain this part?

A few guesses:

* Could multiple generic-worker instance be running in parallel, leading to run-task racing against itself?

* Are these spot-terminated instances "rebooting" somehow, and coming back up with partially populated caches?

* I notice `run-task`'s comments say this "shouldn't happen" based on an argument about the order things were done in the filesystem. But that ordering isn't always consistent in the face of mountings/unmountings, reboots, etc. - is it possible there's some race condition there?  Maybe an `fsync` would help?

I'll check into these theories. The last in particular is an interesting theory. We already do things in run-task to ensure that, eg: stdout/stderr is flushed, so it wouldn't surprise me if we need this sort of invalidation as well.

@djmitche
Copy link

I apologize, I only barely remember this! But, my reading of the run-task script linked above is that it is not downloading anything. And the workers have no "network fileystem" of any sort that would let them share caches. Instead, it's designed to handle caches shared between workers on the same task, from the cache key in the payload. IIRC gps wrote this functionality (and especially the .cacherequires bit) to avoid problems with some change to the task definition failing when run against old caches. I think caches also change names when their requirements change, so this should "never happen". Judging by the comments, when these requires do not match, the choice is to fail quickly when it does happen.

So, I think the place to look when this occurs is the previous run on the same worker.

Another theory, and this one I'm more hopeful about: there is some background processing in the worker to clean up "old" caches. I would imagine that doing so involves walking the directory tree and deleting things, and it would seem sensible for that to start with .cacherequires and .cachelog and then start diving into the nested directory structure of the checkout. So, if for some reason that processing is running on a cache at the same time that it is in use by a task, it might explain what you're seeing.

@bhearsum
Copy link
Collaborator Author

I apologize, I only barely remember this!

Sorry, I only saw after pinging you that it was....a decade ago that you review it 😬

But, my reading of the run-task script linked above is that it is not downloading anything. And the workers have no "network fileystem" of any sort that would let them share caches.

Right, right, thank you for pointing this out! I kept reading the cache definitions as a mount definition, but clearly this is not the case after a re-read.

Instead, it's designed to handle caches shared between workers on the same task, from the cache key in the payload. IIRC gps wrote this functionality (and especially the .cacherequires bit) to avoid problems with some change to the task definition failing when run against old caches. I think caches also change names when their requirements change, so this should "never happen". Judging by the comments, when these requires do not match, the choice is to fail quickly when it does happen.

So, I think the place to look when this occurs is the previous run on the same worker.

And indeed, that's right where we find this in my most recent tests:

Curiously though, the original task this was reported in don't seem to have this correlation. We no longer have history for those workers though unfortunately :(.

Another theory, and this one I'm more hopeful about: there is some background processing in the worker to clean up "old" caches. I would imagine that doing so involves walking the directory tree and deleting things, and it would seem sensible for that to start with .cacherequires and .cachelog and then start diving into the nested directory structure of the checkout. So, if for some reason that processing is running on a cache at the same time that it is in use by a task, it might explain what you're seeing.

That does sound very plausible, indeed! Is garbageCollection what you're referring to here?

@bhearsum
Copy link
Collaborator Author

One thing I realized while looking at things just now is that the reason we don't hit this on the non-d2g tasks is because none of them have caches configured. The d2g ones have caches configured ostensibly for the docker images, but they end up getting used for the VCS checkout as well AFAICT.

@djmitche
Copy link

That sounds like a promising lead! I don't know the relevant GW code, but garbageCollection looks like what I was thinking of.

@bhearsum
Copy link
Collaborator Author

bhearsum commented Jul 4, 2024

I had another quick look at this today. I think it's unlikely that garbage collection is a factor here, as we've been seeing this on instances where the task directory is located on a 300gb disk, and the earlier tasks certainly don't fill it up enough to get GC running. There's also no evidence in the log that GC runs prior to the cache error being detected.

In the abscence of other ideas, I'm tempted to say that the best path forward here may be to see if we can get rid of the checkout cache for the d2g tasks. This repository is small enough that is really doesn't give us much benefit.

@bhearsum
Copy link
Collaborator Author

bhearsum commented Jul 9, 2024

I got inspired and ended up setting up my own worker with some extra instrumentation. I'm pretty darn sure this is a generic-worker bug at this point; I've filed taskcluster/taskcluster#7128 about it.

bhearsum added a commit to bhearsum/taskcluster that referenced this issue Jul 9, 2024
@bhearsum
Copy link
Collaborator Author

As an update here, we've landed a couple of fixes upstream already. We're waiting on one edge case to be addressed, after which we'll be unblocked on picking up a new generic worker version that should be free of this issue.

@bhearsum
Copy link
Collaborator Author

With some help from the Taskcluster team, I did some testing with the latest version of generic-worker, and it appears to fix this issue. We'll need to upgrade the version we're using on our images, and then we should be able to call this fixed, and switch to d2g/generic-worker for CPU tasks. (Switching GPU tasks to d2g/generic-worker is #710.)

@bhearsum
Copy link
Collaborator Author

bhearsum commented Dec 2, 2024

CPU workers are switching to the new image in #949. We still have a bit more work to do on the image before it's ready for the GPU workers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
taskcluster Issues related to the Taskcluster implementation of the training pipeline
Projects
None yet
Development

No branches or pull requests

3 participants