bicleaner-ai-classify intermittently fails to download fasttext model #631

Open
Tracked by #311
eu9ene opened this issue May 24, 2024 · 3 comments
Labels
bug Something is broken or not correct taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

@eu9ene
Collaborator

eu9ene commented May 24, 2024

https://firefox-ci-tc.services.mozilla.com/tasks/IVGFh-gRSaOmOGjkaJGqsw/runs/0/logs/public/logs/live.log

[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:38.234457: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:38.234747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14784 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:38.858408: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:42,211 - INFO - Arguments processed
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:42,211 - INFO - Starting process
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,024 - INFO - Finished
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Total: 66366 rows
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Elapsed time 73.81 s
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Troughput: 899 rows/s
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Program finished
[fetches 2024-05-24T17:10:46.111Z] removing /home/ubuntu/tasks/task_171656999726273/fetches
[fetches 2024-05-24T17:10:47.257Z] finished
[taskcluster 2024-05-24T17:10:47.268Z]    Exit Code: 3
[taskcluster 2024-05-24T17:10:47.268Z]    User Time: 24m49.351441s
[taskcluster 2024-05-24T17:10:47.268Z]  Kernel Time: 2m25.053608s
[taskcluster 2024-05-24T17:10:47.268Z]    Wall Time: 10m49.577491702s
[taskcluster 2024-05-24T17:10:47.268Z]       Result: FAILED
[taskcluster 2024-05-24T17:10:47.268Z] === Task Finished ===
[taskcluster 2024-05-24T17:10:47.268Z] Task Duration: 10m49.579589245s
[taskcluster 2024-05-24T17:10:47.311Z] Uploading artifact public/build/XLEnt_v1_2.scored.zst from file /home/ubuntu/tasks/task_171656999726273/artifacts/XLEnt_v1_2.scored.zst with content encoding "identity", mime type "application/zstd" and expiry 2025-05-24T16:55:28.023Z
[taskcluster 2024-05-24T17:10:47.872Z] [mounts] Preserving cache: Moving "/home/ubuntu/tasks/task_171656999726273/checkouts" to "/home/ubuntu/caches/Ar7S0LJFR6yd4YAYtFVwCQ"
[taskcluster 2024-05-24T17:10:47.901Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2025-05-24T16:55:28.023Z
[taskcluster:error] exit status 3
eu9ene added the bug and taskcluster labels on May 24, 2024
@bhearsum
Collaborator

bhearsum commented Jul 9, 2024

This task exited with status 3, which Taskcluster simply re-reports at the end of the log.

I'm pretty sure the problem is this error, which happens 3 times:

[task 2024-05-24T17:04:06.806Z] 2024-05-24 17:04:01,834 - WARNING - Downloading FastText model...
[task 2024-05-24T17:04:06.806Z] Traceback (most recent call last):
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/bin/bicleaner-ai-classify", line 8, in <module>
[task 2024-05-24T17:04:06.806Z]     sys.exit(main())
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/bicleaner_ai/bicleaner_ai_classifier.py", line 119, in main
[task 2024-05-24T17:04:06.806Z]     perform_classification(args) # Main loop
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/bicleaner_ai/bicleaner_ai_classifier.py", line 108, in perform_classification
[task 2024-05-24T17:04:06.806Z]     nline = classify(args, args.input, args.output)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/bicleaner_ai/classify.py", line 220, in classify
[task 2024-05-24T17:04:06.806Z]     hardrules = Hardrules(args)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/hardrules/hardrules.py", line 115, in __init__
[task 2024-05-24T17:04:06.806Z]     self.fastspell_src = FastSpell(args.source_lang, mode="aggr")
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fastspell/fastspell.py", line 76, in __init__
[task 2024-05-24T17:04:06.806Z]     self.download_fasttext()
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fastspell/fastspell.py", line 93, in download_fasttext
[task 2024-05-24T17:04:06.806Z]     self.model = fasttext.load_model(ft_model_path)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fasttext/FastText.py", line 441, in load_model
[task 2024-05-24T17:04:06.806Z]     return _FastText(model_path=path)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fasttext/FastText.py", line 98, in __init__
[task 2024-05-24T17:04:06.806Z]     self.f.loadModel(model_path)
[task 2024-05-24T17:04:06.806Z] ValueError: /home/ubuntu/.local/lib/python3.10/site-packages/fastspell/lid.176.bin has wrong file format!
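
To check for this failure before bicleaner-ai-classify starts, a minimal sketch of a header check could look like the following. It assumes, based on my reading of the fastText C++ sources and not on anything in this log, that a valid model begins with the 32-bit magic number 793712314 followed by a format version no greater than 12, both little-endian. The function name is hypothetical. It inspects only the first 8 bytes, so it would not detect a file that is truncated after the header.

```python
import struct
from pathlib import Path

# Assumptions taken from the fastText C++ sources, not from this issue:
# the model file starts with this magic number, then a version <= 12.
FASTTEXT_MAGIC = 793712314
FASTTEXT_MAX_VERSION = 12


def looks_like_fasttext_model(path):
    """Return True if the file starts with a plausible fastText header.

    Only the first 8 bytes are checked; a file that is cut off later
    would still pass.
    """
    model = Path(path)
    if not model.is_file():
        return False
    with model.open("rb") as handle:
        header = handle.read(8)
    if len(header) < 8:
        # Empty or nearly empty file, e.g. an aborted download.
        return False
    # Two little-endian signed 32-bit integers: magic, then version.
    magic, version = struct.unpack("<ii", header)
    return magic == FASTTEXT_MAGIC and version <= FASTTEXT_MAX_VERSION
```

A wrapper script could call this on the lid.176.bin path from the traceback and delete the file when it returns False, so that the next run downloads it again.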

That script is run through GNU parallel, so the exit status of 3 matches the three failed runs above. Its man page describes the exit status as follows:

       1-100 Some  of  the jobs failed. The exit status gives the number of failed jobs. If Y% is used the
             exit status is the percentage of jobs that failed.
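
As an illustration of the rule quoted above, here is a small sketch of how per-job exit codes would map to the overall status. The function name is hypothetical, the Y% variant is not modelled, and the cap at 101 reflects my understanding of the man page (101 is reported when more than 100 jobs fail) and is not something this log shows.

```python
def parallel_style_exit_status(job_exit_codes):
    """Mimic the documented rule: the status is the number of failed jobs.

    Assumption: the value is capped at 101 when more than 100 jobs fail.
    """
    failed = sum(1 for code in job_exit_codes if code != 0)
    return min(failed, 101)


# Three classify jobs hit the ValueError and exit non-zero; the others succeed.
status = parallel_style_exit_status([0, 1, 0, 1, 1, 0])
```

With three failing jobs the sketch gives 3, which is consistent with the "Exit Code: 3" reported by Taskcluster.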

bhearsum changed the title from "[taskcluster:error] exit status 3 on Uploading link artifact" to "bicleaner-ai-classify intermittently fails to download fasttext model" on Jul 9, 2024
@eu9ene
Collaborator Author

eu9ene commented Jul 9, 2024

Sometimes fastText fails to download the model. I ran into this issue in OpusCleaner and fixed it by pre-downloading the model: https://github.com/mozilla/firefox-translations-training/blob/6b6b64999edee0e5bb5822bfb6d6e9d6a4e6c94f/pipeline/clean/opuscleaner/clean-corpus.sh#L34
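
For reference, a pre-download step with retries might be sketched as below. This is not what the linked script does; it is only an illustration. The URL is assumed to be the public fastText lid.176.bin location, and the function names, retry count, and delay are hypothetical. The file is written under a temporary name and renamed once complete, so jobs running in parallel should never see a half-written model.

```python
import os
import tempfile
import time
import urllib.request

# Assumed public location of the language-ID model; not confirmed in this thread.
MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def fetch_url(url, dest):
    """Default fetcher: download url to dest with the standard library."""
    urllib.request.urlretrieve(url, dest)


def predownload(dest, url=MODEL_URL, fetch=fetch_url, attempts=3, delay=5.0):
    """Download the model to dest once, retrying on I/O and network errors."""
    if os.path.exists(dest):
        return dest  # already present, nothing to do
    dest_dir = os.path.dirname(os.path.abspath(dest))
    os.makedirs(dest_dir, exist_ok=True)
    last_error = None
    for attempt in range(1, attempts + 1):
        # Temporary file in the same directory so the rename stays atomic.
        fd, tmp = tempfile.mkstemp(dir=dest_dir, suffix=".part")
        os.close(fd)
        try:
            fetch(url, tmp)
            if os.path.getsize(tmp) == 0:
                raise OSError("downloaded file is empty")
            os.replace(tmp, dest)
            return dest
        except OSError as err:  # urllib's URLError is a subclass of OSError
            last_error = err
        finally:
            if os.path.exists(tmp):
                os.remove(tmp)
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(
        f"could not download {url} after {attempts} attempts"
    ) from last_error
```

Running something like this once before the parallel classify jobs start, with dest pointing at the lid.176.bin path shown in the traceback, would mean the jobs find the model already in place.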

@ZJaume
Collaborator

ZJaume commented Oct 21, 2024

If this helps, you can run the fastspell-download command during installation, and that will download the model to the Python path.
