
[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue #806

Closed · Tracked by #311
eu9ene opened this issue Aug 20, 2024 · 16 comments
Labels: bug (Something is broken or not correct), taskcluster (Issues related to the Taskcluster implementation of the training pipeline)

Comments

@eu9ene (Collaborator) commented Aug 20, 2024

The task has finished but Taskcluster couldn't upload the artifacts.

[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557

It might be fixed by the migration to the new CPU pool.

https://firefox-ci-tc.services.mozilla.com/tasks/MHMMjOrnQ0e__tn8zEypAQ/runs/1/logs/public/logs/live.log

eu9ene added the bug and taskcluster labels on Aug 20, 2024
@eu9ene (Collaborator, Author) commented Sep 6, 2024

@eu9ene (Collaborator, Author) commented Sep 9, 2024

Two more:
https://firefox-ci-tc.services.mozilla.com/tasks/ZBUKo9_KTEqcQarIGHlALg/runs/1/logs/public/logs/live.log
https://firefox-ci-tc.services.mozilla.com/tasks/TZamYV-bQQaDC7tq8wkbkQ/runs/1/logs/public/logs/live.log

@bhearsum please take a look. It seems to be blocking progress on several of the remaining models, and it doesn't appear to go away after a restart.

@eu9ene (Collaborator, Author) commented Sep 9, 2024

I see we upload the tmp directory, which shouldn't be uploaded. Could that be the reason? I see you attempted to fix it in #755. I can give it another try in the release branch if this is the cause of the failure.
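For illustration, a minimal sketch of the kind of cleanup being discussed: deleting the tmp intermediates before the task finishes so they are never uploaded as artifacts. The paths are illustrative assumptions, not the actual pipeline layout.

```python
# Hypothetical cleanup before the task finishes, so the worker never picks up
# the intermediate alignment files as artifacts. Paths are illustrative only.
import shutil
from pathlib import Path

artifact_dir = Path("artifacts")   # directory the task exposes as artifacts
tmp_dir = artifact_dir / "tmp"     # intermediates such as aln, aln.fwd, aln.rev

if tmp_dir.exists():
    shutil.rmtree(tmp_dir)         # leave only the compressed corpus for upload
```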

@bhearsum (Collaborator)

> It might be fixed by the migration to the new CPU pool.

Notably, this is happening on the new CPU workers.

> I see you attempted to fix it in #755. I can give it another try in the release branch if this is the cause of the failure.

I don't think I was encountering this error when I opened that - I was just trying to avoid uploading what I thought were unused files, so this pull request isn't likely to be relevant.

The relevant bits of the log are:

[taskcluster 2024-09-06T15:59:01.289Z] === Task Finished ===
[taskcluster 2024-09-06T15:59:01.289Z] Task Duration: 61h9m58.772398789s
[taskcluster 2024-09-06T16:01:07.700Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725639667.664
[taskcluster 2024-09-06T16:18:07.742Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725640687.705
[taskcluster 2024-09-06T16:35:07.777Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725641707.743
[taskcluster 2024-09-06T16:52:07.824Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725642727.782
[taskcluster 2024-09-06T16:53:35.651Z] Uploading artifact public/build/tmp/aln.rev.ad from file /task_172541804246164/artifact0/tmp/aln.rev.ad with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.656Z] Uploading artifact public/build/tmp/aln.fwd from file /task_172541804246164/artifact0/tmp/aln.fwd with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.658Z] Uploading artifact public/build/tmp/aln.rev from file /task_172541804246164/artifact0/tmp/aln.rev with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.659Z] Uploading artifact public/build/tmp/aln.rev.af from file /task_172541804246164/artifact0/tmp/aln.rev.af with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.659Z] Uploading artifact public/build/tmp/aln.rev.ab from file /task_172541804246164/artifact0/tmp/aln.rev.ab with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.660Z] Uploading artifact public/build/tmp/aln from file /task_172541804246164/artifact0/tmp/aln with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.660Z] Uploading artifact public/build/tmp/aln.rev.ac from file /task_172541804246164/artifact0/tmp/aln.rev.ac with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.662Z] Uploading artifact public/build/tmp/aln.fwd.ab from file /task_172541804246164/artifact0/tmp/aln.fwd.ab with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.662Z] Uploading artifact public/build/tmp/aln.fwd.aa from file /task_172541804246164/artifact0/tmp/aln.fwd.aa with content encoding "gzip", mime type "audio/x-pn-audibleaudio" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.664Z] Uploading artifact public/build/tmp/aln.fwd.ac from file /task_172541804246164/artifact0/tmp/aln.fwd.ac with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.665Z] Uploading artifact public/build/tmp/aln.rev.aa from file /task_172541804246164/artifact0/tmp/aln.rev.aa with content encoding "gzip", mime type "audio/x-pn-audibleaudio" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.667Z] Uploading artifact public/build/corpus.aln.zst from file /task_172541804246164/artifact0/corpus.aln.zst with content encoding "identity", mime type "application/zstd" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.678Z] Uploading artifact public/build/tmp/aln.fwd.ae from file /task_172541804246164/artifact0/tmp/aln.fwd.ae with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.679Z] Uploading artifact public/build/tmp/aln.rev.ae from file /task_172541804246164/artifact0/tmp/aln.rev.ae with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.681Z] Uploading artifact public/build/tmp/aln.fwd.ad from file /task_172541804246164/artifact0/tmp/aln.fwd.ad with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.693Z] Uploading artifact public/build/tmp/aln.fwd.af from file /task_172541804246164/artifact0/tmp/aln.fwd.af with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T17:09:07.863Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725643747.827
[taskcluster 2024-09-06T17:26:07.907Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725644767.869
[taskcluster 2024-09-06T17:43:07.979Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725645787.924
[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557
[taskcluster 2024-09-06T18:00:08.022Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725646807.991
[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557
[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557
[taskcluster 2024-09-06T18:01:53.449Z] [mounts] Preserving cache: Moving "/task_172541804246164/cache0" to "caches/Pb5xwl8FSnumOKwVAsX1bA"
[taskcluster 2024-09-06T18:01:53.529Z] [mounts] Denying task_172541804246164 access to 'caches/Pb5xwl8FSnumOKwVAsX1bA'
[taskcluster 2024-09-06T18:01:53.625Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2024-12-03T00:21:16.511Z
[taskcluster:error] S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557

A couple of things are noteworthy here:

  • We fairly recently switched to uploading artifacts in parallel. This wasn't the case when the aforementioned https://bugzilla.mozilla.org/show_bug.cgi?id=1394557 was happening.
  • Since the above bug was filed, we've moved from AWS to GCP. We still upload through the S3-compatible API, but I imagine the behaviour of the backend may be different.
  • As I understand it, the files we upload in these alignments tasks are huge, which may be a factor. (They're nowhere near the 5 TB limit, but I think they're by far the largest files we upload as part of this pipeline?) The files that failed to upload (public/build/tmp/aln, public/build/tmp/aln.fwd, and public/build/tmp/aln.rev) are certainly expected to be the largest, based on a comparison with a successful task.
  • Most other alignments tasks don't have the .aa, .ab, and other seemingly sequential files. Maybe this is expected. It certainly shouldn't cause task failures either way, but I want to call it out as a difference or possible change.

My stab-in-the-dark guess here is that something is going wrong with the parallel uploads - perhaps we're getting rate limited by GCS or something of the sort. I think we'll need to log or otherwise find the full response we're getting to make this tractable. @matt-boris - I found worker logs in GCP for this, but they don't appear to log the full response. I didn't see anything in Sentry (but I also may not have access to the right project). Do you know of another place we could possibly get this? I would also appreciate you taking a look over the logs I extracted, which I can send privately.

@bhearsum (Collaborator)

@matt-boris and I were discussing this a bit. We don't yet have a handle on the root cause, but one thing we did notice is that the alignments files are totally uncompressed. Compressing them (and then decompressing in downstream tasks, of course) might be a way to sidestep this. @eu9ene - are there reasons we avoided compressing these files? Obviously we'd be trading off some time + CPU cycles for artifact size. Based on what they look like, though, I suspect that even the lowest zstd compression level would make a significant difference here.
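For illustration, a minimal sketch of that kind of compression round trip using the third-party zstandard package. The file names, paths, and compression level are assumptions, not the pipeline's actual scripts.

```python
# Sketch: compress an alignment file before upload and decompress it again in a
# downstream task. Assumes the `zstandard` package; paths are illustrative.
import zstandard

def compress(src: str, dst: str, level: int = 3) -> None:
    cctx = zstandard.ZstdCompressor(level=level)
    with open(src, "rb") as ifh, open(dst, "wb") as ofh:
        cctx.copy_stream(ifh, ofh)

def decompress(src: str, dst: str) -> None:
    dctx = zstandard.ZstdDecompressor()
    with open(src, "rb") as ifh, open(dst, "wb") as ofh:
        dctx.copy_stream(ifh, ofh)

# Alignments task, before artifacts are collected:
compress("artifacts/tmp/aln.fwd", "artifacts/aln.fwd.zst")

# Downstream task, after fetching the artifact:
decompress("fetches/aln.fwd.zst", "aln.fwd")
```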

@eu9ene (Collaborator, Author) commented Sep 11, 2024

I removed the temporary files in #841; there's only one output file now. One reason it's probably not compressed is that the align.py script is used as an intermediate step in generate-shortlist.sh. I restarted the 3 failed tasks; let's see if it works now.

@matt-boris

In the worker logs that @bhearsum provided me, I see an artifact upload attempt for the aln.fwd file, for example, with content length 8663973832, which is a little over 8.5 GB. This is even after generic-worker gzips the artifact for upload. Is the uncompressed artifact expected to be even bigger than that?

@eu9ene (Collaborator, Author) commented Sep 11, 2024

Actually, I see now that the main artifact is compressed: public/build/corpus.aln.zst. Everything else is in the tmp directory, which will be removed in future runs.

> is a little over 8.5 GB

It's hard to say how big the final artifact will be, as it depends on the trained language and the amount of data used. I think compressed it will be in line with our regular sizes for dataset-related artifacts.

@petemoore (Member)

Looking at worker logs I see:

start-worker[920]: 2024/09/06 18:01:12 PUT /firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=***********************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=294a8d92cf2be3a03ef7eb822c20a725ebad7f362768c8417874cb6f22810635&X-Amz-SignedHeaders=host&x-id=PutObject HTTP/1.1"
start-worker[920]: Host: storage.googleapis.com"
start-worker[920]: User-Agent: Go-http-client/1.1"
start-worker[920]: Content-Length: 8876827031"
start-worker[920]: Content-Encoding: gzip"
start-worker[920]: Content-Type: application/octet-stream"
start-worker[920]: Accept-Encoding: gzip"
start-worker[920]: #015"
start-worker[920]: 2024/09/06 18:01:35 14 put requests issued to https://storage.googleapis.com/firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln.rev?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=***************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=99d1a366d581000979d296a86aa9c91e92d108462eedbf196efac9542ab46a19&X-Amz-SignedHeaders=host&x-id=PutObject"
start-worker[920]: 2024/09/06 18:01:35 Response"
start-worker[920]: 2024/09/06 18:01:35 HTTP/2.0 400 Bad Request"
start-worker[920]: Content-Length: 220"
start-worker[920]: Content-Type: application/xml; charset=UTF-8"
start-worker[920]: Date: Fri, 06 Sep 2024 18:01:35 GMT"
start-worker[920]: Server: UploadServer"
start-worker[920]: Vary: Origin"
start-worker[920]: X-Guploader-Uploadid: AD-8ljvfChDPR86xb9C4DwXlYd7IfTPEq6wmBreOXpfo_pcztArNgLkdjcH9cVf5VAnBv57SrGI"
start-worker[920]: #015"
start-worker[920]: <?xml version='1.0' encoding='UTF-8'?><Error><Code>ExpiredToken</Code><Message>Invalid argument.</Message><Details>The provided token has expired. Request signature expired at: 2024-09-06T17:38:45+00:00</Details></Error>"
start-worker[920]: 2024/09/06 18:01:49 14 put requests issued to https://storage.googleapis.com/firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=********************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=294a8d92cf2be3a03ef7eb822c20a725ebad7f362768c8417874cb6f22810635&X-Amz-SignedHeaders=host&x-id=PutObject"
start-worker[920]: 2024/09/06 18:01:49 Response"
start-worker[920]: 2024/09/06 18:01:49 HTTP/2.0 400 Bad Request"
start-worker[920]: Content-Length: 220"
start-worker[920]: Content-Type: application/xml; charset=UTF-8"
start-worker[920]: Date: Fri, 06 Sep 2024 18:01:49 GMT"
start-worker[920]: Server: UploadServer"
start-worker[920]: Vary: Origin"
start-worker[920]: X-Guploader-Uploadid: AD-8ljsNrOQwDdv1ZZB6czGW5xU2op4C_rm5573nNuCTttJc_RmDbFiyyD6DGeU0wwNqflKQOw4xiYR77g"
start-worker[920]: #015"
start-worker[920]: <?xml version='1.0' encoding='UTF-8'?><Error><Code>ExpiredToken</Code><Message>Invalid argument.</Message><Details>The provided token has expired. Request signature expired at: 2024-09-06T17:38:45+00:00</Details></Error>"
systemd[1]: Starting GCE Workload Certificate refresh..."
gce_workload_cert_refresh[63126]: 2024/09/06 18:01:50: Error getting config status, workload certificates may not be configured: failed to GET \"instance/gce-workload-certificates/config-status\" from MDS with error: error connecting to metadata server, status code: 404"
gce_workload_cert_refresh[63126]: 2024/09/06 18:01:50: Done"
systemd[1]: gce-workload-cert-refresh.service: Deactivated successfully."
systemd[1]: Finished GCE Workload Certificate refresh."

The relevant part seems to be:

<?xml version='1.0' encoding='UTF-8'?><Error><Code>ExpiredToken</Code><Message>Invalid argument.</Message><Details>The provided token has expired. Request signature expired at: 2024-09-06T17:38:45+00:00</Details></Error>"
start-worker[920]: 2024/09/06 18:01:49 14 put requests issued to https://storage.googleapis.com/firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=*************************************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=294a8d92cf2be3a03ef7eb822c20a725ebad7f362768c8417874cb6f22810635&X-Amz-SignedHeaders=host&x-id=PutObject"
start-worker[920]: 2024/09/06 18:01:49 Response"
start-worker[920]: 2024/09/06 18:01:49 HTTP/2.0 400 Bad Request"

It looks to me like the signed URL we get is only valid for a certain amount of time, and we are running out of time when trying to upload the file.

However, I do also see systemd[1]: Starting GCE Workload Certificate refresh in the middle of it all, so I wonder if some background cert refresh process is interfering with the upload.
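As a sanity check, the expiry in the ExpiredToken error can be reproduced from the signed-URL parameters in the log above, assuming X-Amz-Expires is a lifetime in seconds measured from X-Amz-Date:

```python
from datetime import datetime, timedelta, timezone

# Values taken from the PUT request in the worker log above.
signed_at = datetime.strptime("20240906T165335Z", "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
lifetime = timedelta(seconds=2710)  # X-Amz-Expires=2710, i.e. 45 minutes 10 seconds

print(signed_at + lifetime)  # 2024-09-06 17:38:45+00:00
# ...which matches "Request signature expired at: 2024-09-06T17:38:45+00:00".
```

So the URLs appear to have been signed when the upload batch started (16:53:35Z) and expired roughly 45 minutes later, while the large files were still in flight.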

@petemoore (Member) commented Sep 12, 2024

It looks to me like the credentials should be valid for 45 minutes. That seems like it should be plenty, so I'm curious why the token should be expired.

@bhearsum (Collaborator)

> Actually, I see now that the main artifact is compressed: public/build/corpus.aln.zst. Everything else is in the tmp directory, which will be removed in future runs.
>
> > is a little over 8.5 GB
>
> It's hard to say how big the final artifact will be, as it depends on the trained language and the amount of data used. I think compressed it will be in line with our regular sizes for dataset-related artifacts.

To be clear: we're talking about the artifacts coming out of the alignments tasks, which AFAICT are uncompressed. (I don't see any indications that the inputs to it are uncompressed.)

@bhearsum (Collaborator)

> It looks to me like the credentials should be valid for 45 minutes. That seems like it should be plenty, so I'm curious why the token should be expired.

Possibly of note: the first upload I see happens at Sep 6 16:53:49, and the first ExpiredToken failure I see is at Sep 6 17:55:21. Given the size of the data, and the fact that generic-worker is (presumably) gzipping it first, maybe it's not shocking if an individual upload exceeds 45 minutes? If the token is generated before the compression happens, that seems like a plausible cause of this?
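A rough check of that timeline, using the timestamps and Content-Length quoted above:

```python
from datetime import datetime

first_upload = datetime(2024, 9, 6, 16, 53, 49)   # first upload seen in the worker logs
first_failure = datetime(2024, 9, 6, 17, 55, 21)  # first ExpiredToken failure

print(first_failure - first_upload)  # 1:01:32, well past the ~45-minute signature lifetime

# The failing aln object had Content-Length 8876827031 bytes; finishing that one
# file within the 2710-second lifetime would need roughly this sustained rate,
# ignoring the other uploads sharing the same link:
print(f"{8_876_827_031 / 2710 / 1e6:.1f} MB/s")   # ~3.3 MB/s
```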

@bhearsum (Collaborator)

In a meeting today @eu9ene said that he's made some changes to remove the problematic files from the uploaded artifacts, which may fix this specific issue. We should have an answer on that in the next day or two.

@petemoore or @matt-boris - regardless of that, is it worthwhile tracking this general issue and/or preemptively adding some more logging?

@matt-boris

Submitted taskcluster/taskcluster#7260 for tracking. But please keep us posted on whether the changes you guys have made help circumvent this issue.

@eu9ene (Collaborator, Author) commented Sep 13, 2024

My fix helped: all 3 tasks have completed successfully. Here's how the artifacts look now: https://firefox-ci-tc.services.mozilla.com/tasks/e4CbMFI-SWGQ29z7zyFLiQ. @bhearsum, we can close this issue if we're tracking the underlying Taskcluster issue separately.

@bhearsum (Collaborator)

I'm glad things are unblocked now! taskcluster/taskcluster#7260 should be enough for tracking the underlying issue.
