[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue #806
Two more: @bhearsum, please take a look. This seems to be blocking progress on several remaining models, and I don't think it goes away after a restart.
I see we upload the tmp directory, which shouldn't be uploaded. Could that be the reason? I see you attempted to fix it in #755. I can give it another try in the release branch if this is the cause of the failure.
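For illustration only (this is not the project's actual upload code, and all names here are hypothetical), a minimal Python sketch of what "don't upload the tmp directory" amounts to: collect artifact files while skipping an excluded subdirectory before handing them to the uploader.

```python
from pathlib import Path

def collect_artifacts(artifact_dir, exclude=("tmp",)):
    """Return files under artifact_dir, skipping anything inside an excluded subdirectory."""
    root = Path(artifact_dir)
    kept = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if any(part in exclude for part in parts):
            continue  # e.g. artifacts/tmp/... would be dropped here
        kept.append(path)
    return kept
```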
Notably, this is happening on the new CPU workers.
I don't think I was encountering this error when I opened that - I was just trying to avoid uploading what I thought were unused files, so this pull request isn't likely to be relevant. The relevant bits of the log are:
A couple of things are noteworthy here:
My stab-in-the-dark guess here is that something is going wrong with parallel uploads - perhaps we're getting rate limited by GCS or something of the sort. I think we'll need to log or otherwise find the full response we're getting to make this tractable. @matt-boris - I found worker logs in GCP for this, but they don't appear to log the full response. I didn't see anything in Sentry (but I also may not have access to the right project). Do you know of another place we could possibly get this? I would also appreciate you taking a look over the logs I extracted, which I can send privately.
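As a rough illustration of the kind of logging being asked for, here is a minimal Python sketch that PUTs to a signed URL and records the full error response rather than just the status code. The function name, retry policy, and use of `requests` are assumptions for the example, not the worker's actual code.

```python
import logging
import time

import requests

log = logging.getLogger("artifact-upload")

def put_with_full_error_logging(signed_url, payload, attempts=3):
    """PUT payload to a signed URL; on failure, log the whole response, not just the code."""
    resp = None
    for attempt in range(1, attempts + 1):
        resp = requests.put(signed_url, data=payload)
        if resp.status_code < 400:
            return resp
        # The body and headers usually say *why* the backend returned 400
        # (rate limiting, expired signature, malformed request, ...).
        log.error(
            "upload attempt %d failed: status=%s headers=%s body=%s",
            attempt, resp.status_code, dict(resp.headers), resp.text,
        )
        time.sleep(2 ** attempt)  # crude backoff between retries
    return resp
```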
@matt-boris and I were discussing this a bit. We don't yet have a handle on the root cause, but one thing we did notice is that the alignments files are totally uncompressed. Compressing them (and then decompressing in downstream tasks, of course) might be a way to sidestep this. @eu9ene - are there reasons we avoided compressing these files? Obviously we'd be trading off some time + CPU cycles for artifact size. Based on what they look like though, I suspect that even the worst zstd compression would make a significant difference here.
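A minimal sketch of the compress-then-decompress idea using the `zstandard` Python package (an assumption about tooling for illustration, not what the pipeline currently does):

```python
import zstandard as zstd

def compress_alignments(src, dst, level=3):
    """Stream-compress an alignments file with zstd before it is uploaded as an artifact."""
    with open(src, "rb") as ifh, open(dst, "wb") as ofh:
        zstd.ZstdCompressor(level=level).copy_stream(ifh, ofh)

def decompress_alignments(src, dst):
    """Stream-decompress a .zst alignments file in a downstream task."""
    with open(src, "rb") as ifh, open(dst, "wb") as ofh:
        zstd.ZstdDecompressor().copy_stream(ifh, ofh)
```

Even a low compression level streams the file through without holding it in memory, which is why the time/CPU cost mentioned above is likely modest compared to the reduction in artifact size.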
I removed the temporary files in #841. There's only one output file now. Probably one of the reasons why it's not compressed is because the …
In the worker logs that @bhearsum provided me, I see an artifact upload attempt for the …
Actually I see now that the main artifact is compressed:
It's hard to say how big the final artifact will be, as it depends on the trained language and the amount of data used. I think compressed it will be in line with our regular sizes for dataset-related artifacts.
Looking at the worker logs, I see:
The relevant part seems to be:
It looks to me like the signed URL we get is only valid for a certain amount of time, and we are running out of time when trying to upload the file. However, I do also see …
It looks to me like the credentials should be valid for 45 minutes. That seems like it should be plenty, so I'm curious why the token would be expired.
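For context on how such an expiry is typically set, here is a minimal sketch using the `google-cloud-storage` client to mint a 45-minute V4 signed PUT URL. This is illustrative only and may not match how the worker actually obtains its upload credentials.

```python
from datetime import timedelta

from google.cloud import storage

def signed_put_url(bucket_name, object_name, minutes=45):
    """Mint a V4 signed URL that permits a PUT for the given number of minutes."""
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    # The URL stops working at `expiration`; an upload that starts (or is still
    # in flight) after that point can be rejected with a 4xx by the backend.
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=minutes),
        method="PUT",
    )
```

If the upload of a large, uncompressed file is still running when that window closes, the backend's rejection would look like the intermittent 400 reported here.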
To be clear: we're talking about the artifacts coming out of the alignments tasks, which AFAICT are uncompressed. (I don't see any indications that the inputs to them are uncompressed.)
Possibly of note: the first upload I see happens at Sep 6 16:53:49. The first …
In a meeting today @eu9ene said that he's made some changes to remove the problematic files from the uploaded artifacts, which may fix this specific issue. We should have an answer on that in the next day or two. @petemoore or @matt-boris - regardless of that, is it worthwhile tracking this general issue and/or preemptively adding some more logging?
Submitted taskcluster/taskcluster#7260 for tracking. But please keep us posted on whether the changes you've made help circumvent this issue.
My fix helped. All 3 tasks have completed successfully. Here's how the artifacts look now: https://firefox-ci-tc.services.mozilla.com/tasks/e4CbMFI-SWGQ29z7zyFLiQ. @bhearsum We can close this issue if we're tracking the underlying Taskcluster issue separately.
I'm glad things are unblocked now! taskcluster/taskcluster#7260 should be enough for tracking the underlying issue.
The task has finished but Taskcluster couldn't upload the artifacts.
It might be fixed by the migration to the new CPU pool.
https://firefox-ci-tc.services.mozilla.com/tasks/MHMMjOrnQ0e__tn8zEypAQ/runs/1/logs/public/logs/live.log