Marian error in translate-corpus #679

Open
Tracked by #311
eu9ene opened this issue Jun 17, 2024 · 2 comments
Labels
bug Something is broken or not correct

Comments

@eu9ene
Collaborator

eu9ene commented Jun 17, 2024

https://firefox-ci-tc.services.mozilla.com/tasks/AsVG4ziaTMKjYq6Z9fhgwg/runs/0/logs/public/logs/live.log
https://firefox-ci-tc.services.mozilla.com/tasks/eScwZPfjS_yHCm6Gf4ufng/runs/0/logs/public/logs/live.log

[task 2024-06-17T06:16:05.464Z] [2024-06-17 06:16:05] [config] workspace: 12000
[task 2024-06-17T06:16:05.464Z] [2024-06-17 06:16:05] [config] Loaded model has been created with Marian v1.12.14 2d067af 2024-02-16 11:44:13 -0500
[task 2024-06-17T06:16:05.466Z] [2024-06-17 06:16:05] [data] Loading SentencePiece vocabulary from file /home/ubuntu/tasks/task_171860488402365/fetches/vocab.spm
[task 2024-06-17T06:16:05.516Z] [2024-06-17 06:16:05] [data] Loading SentencePiece vocabulary from file /home/ubuntu/tasks/task_171860488402365/fetches/vocab.spm
[task 2024-06-17T06:16:05.563Z] [2024-06-17 06:16:05] Loading model from /home/ubuntu/tasks/task_171860488402365/fetches/model1/final.model.npz.best-chrf.npz
[task 2024-06-17T06:16:07.806Z] [2024-06-17 06:16:07] Loading model from /home/ubuntu/tasks/task_171860488402365/fetches/model2/final.model.npz.best-chrf.npz
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.737Z] Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.737Z] [2024-06-17 06:16:09] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-06-17T06:16:09.794Z] 
[task 2024-06-17T06:16:09.794Z] [CALL STACK]
[task 2024-06-17T06:16:09.794Z] [0x64599fabf1af]    marian::CurandRandomGenerator::  CurandRandomGenerator  (unsigned long,  marian::DeviceId) + 0x83f
[task 2024-06-17T06:16:09.794Z] [0x64599fabf849]    marian::  createRandomGenerator  (unsigned long,  marian::DeviceId) + 0x69
[task 2024-06-17T06:16:09.794Z] [0x64599fab8f40]    marian::  BackendByDeviceId  (marian::DeviceId,  unsigned long) + 0xa0
[task 2024-06-17T06:16:09.794Z] [0x64599f7b44f0]    marian::ExpressionGraph::  setDevice  (marian::DeviceId,  std::shared_ptr<marian::Device>) + 0x80
[task 2024-06-17T06:16:09.794Z] [0x64599f639f98]    marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}::  operator()  (marian::DeviceId,  unsigned long) const + 0x1d8
[task 2024-06-17T06:16:09.794Z] [0x64599f63b089]    marian::ThreadPool::enqueue<marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long>(marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long&&)::{lambda()#1}::  operator()  () const + 0x39
[task 2024-06-17T06:16:09.794Z] [0x64599f63bea0]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long>(marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long&&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x30
[task 2024-06-17T06:16:09.794Z] [0x64599f5ea48d]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[task 2024-06-17T06:16:09.794Z] [0x7383d3099ee8]                                                       + 0x99ee8
[task 2024-06-17T06:16:09.794Z] [0x64599f5eb720]    std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long>(marian::Translate<marian::BeamSearch>::Translate(std::shared_ptr<marian::Options>)::{lambda(marian::DeviceId,unsigned long)#1}&,marian::DeviceId&,unsigned long&&)::{lambda()#1},std::allocator<int>,void ()>::  _M_run  () + 0xf0
[task 2024-06-17T06:16:09.794Z] [0x64599f5ecd65]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1a5
[task 2024-06-17T06:16:09.794Z] [0x7383d34dc253]                                                       + 0xdc253
[task 2024-06-17T06:16:09.794Z] [0x7383d3094ac3]                                                       + 0x94ac3
[task 2024-06-17T06:16:09.794Z] [0x7383d3126850]                                                       + 0x126850
[task 2024-06-17T06:16:09.794Z] 
[task 2024-06-17T06:16:10.261Z] /home/ubuntu/tasks/task_171860488402365/checkouts/vcs/pipeline/translate/translate-nbest.sh: line 28: 37694 Aborted                 (core dumped) "${MARIAN}/marian-decoder" -c decoder.yml -m "${models[@]}" -v "${vocab}" "${vocab}" -i "${input}" -o "${input}.nbest" --log "${input}.log" --n-best -d ${GPUS} -w "${WORKSPACE}"
[fetches 2024-06-17T06:16:10.262Z] removing /home/ubuntu/tasks/task_171860488402365/fetches
[fetches 2024-06-17T06:16:12.583Z] finished
[taskcluster 2024-06-17T06:16:12.594Z]    Exit Code: 134
@eu9ene
Collaborator Author

eu9ene commented Jun 18, 2024

The tasks pass on rerun, so the failure looks transient: probably something in random-generator initialization or in the infrastructure. We should add automatic restarts for these tasks.
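A minimal sketch of what an automatic restart could look like, assuming the decoder step is launched from a small Python wrapper. This is not how the pipeline currently works; `run_with_retries`, `RETRYABLE_EXIT_CODES`, and the `run` parameter are hypothetical names used for illustration only. The idea is to rerun the command only when it aborts (exit code 134 as reported in the log, which is 128 + SIGABRT, or -6 when Python launches the process directly) and to pass any other exit code straight through. As far as I know, cuRAND status 203 is `CURAND_STATUS_INITIALIZATION_FAILED`, which would fit a transient failure that a retry can recover from.

```python
import signal
import subprocess
import time
from typing import Callable, Sequence

# Exit codes treated as "aborted, worth retrying" (an assumption, not pipeline config):
# 134 is what a shell reports for SIGABRT (128 + 6); -SIGABRT is what
# subprocess reports when the child itself is killed by SIGABRT.
RETRYABLE_EXIT_CODES = frozenset({134, -signal.SIGABRT})


def _default_run(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 3,
    delay_s: float = 0.0,
    run: Callable[[Sequence[str]], int] = _default_run,
) -> int:
    """Run cmd, rerunning it only when it exits with a retryable code.

    Returns the exit code of the last attempt.
    """
    code = 0
    for attempt in range(1, max_attempts + 1):
        code = run(cmd)
        if code not in RETRYABLE_EXIT_CODES:
            # Success or a non-transient failure: do not retry.
            return code
        if attempt < max_attempts:
            time.sleep(delay_s)
    return code
```

A caller would pass the same command line that `translate-nbest.sh` runs today and exit with the returned code. Restricting retries to the abort codes keeps genuine failures (bad config, missing model) from being rerun needlessly.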

@eu9ene eu9ene added the bug Something is broken or not correct label Aug 21, 2024