The training run crashes on rank 4. The traceback ends in SymPy's relational.py, inside _eval_fuzzy_relation, at the call to is_lt(lhs, rhs). Do you have any idea how to fix this?
[rank4]: File "dummy/open-r1/src/open_r1/grpo.py", line 240, in <module>
[rank4]: main(script_args, training_args, model_args)
[rank4]: File "dummy/open-r1/src/open_r1/grpo.py", line 192, in main
[rank4]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2184, in train
[rank4]: return inner_training_loop(
[rank4]: ^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank4]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: ...
[rank4]: ^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 335, in canonical
[rank4]: r = self.func(*args)
[rank4]: ^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 852, in __new__
[rank4]: return cls._eval_relation(lhs, rhs, **options)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 859, in _eval_relation
[rank4]: val = cls._eval_fuzzy_relation(lhs, rhs)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 1186, in _eval_fuzzy_relation
[rank4]: return is_lt(lhs, rhs)
[rank4]: ^^^^^^^^^^^^^^^
[rank4]: ...
W0203 05:44:57.243000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153217 closing signal SIGTERM
W0203 05:44:57.306000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153218 closing signal SIGTERM
W0203 05:44:57.307000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153219 closing signal SIGTERM
W0203 05:44:57.308000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153220 closing signal SIGTERM
W0203 05:44:57.310000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153222 closing signal SIGTERM
W0203 05:44:57.312000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153223 closing signal SIGTERM
E0203 05:45:01.285000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 4 (pid: 153221) of binary: dummy/envs/openr1/bin/python3.11
Traceback (most recent call last):
File "dummy/envs/openr1/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "dummy/envs/openr1/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "dummy/envs/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1157, in launch_command
deepspeed_launcher(args)
File "dummy/envs/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 845, in deepspeed_launcher
distrib_run.run(args)
File "dummy/envs/openr1/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "dummy/envs/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "dummy/envs/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/open_r1/grpo.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-03_05:44:57
host : dummy
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 153221)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
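A possible workaround, sketched under assumptions: the log above is cut off before the final exception line, so the actual error type is unknown. If the exception is raised while a reward function compares a model answer against the reference with SymPy, catching it per sample would let the run continue with a zero reward instead of killing the rank. The names `safe_reward` and `sympy_less_than` below are hypothetical and are not part of open-r1 or SymPy. The comparison is only a stand-in that reaches the same SymPy code path (`<` on SymPy expressions).

```python
from sympy import sympify


def safe_reward(verify_fn, gold, answer, default=0.0):
    """Call verify_fn(gold, answer) and return `default` if it raises.

    Hypothetical helper: the broad `except Exception` is deliberate, because
    the real exception type is not visible in the truncated traceback.
    """
    try:
        return float(verify_fn(gold, answer))
    except Exception as exc:
        # Log and move on so one bad sample does not abort the whole run.
        print(f"verification failed: {type(exc).__name__}: {exc}")
        return default


def sympy_less_than(gold, answer):
    """Stand-in verifier (assumption): is `answer` strictly less than `gold`?

    bool() raises TypeError when SymPy cannot decide the relation, e.g. for
    a free symbol, and the comparison itself raises for non-real values.
    """
    return bool(sympify(answer) < sympify(gold))


if __name__ == "__main__":
    print(safe_reward(sympy_less_than, "2", "1"))  # decidable comparison
    print(safe_reward(sympy_less_than, "2", "x"))  # undecidable, falls back
    print(safe_reward(sympy_less_than, "2", "I"))  # non-real, falls back
```

This only hides the symptom. To find the root cause, it would still help to print the offending `lhs` and `rhs` in the `except` branch and to capture the complete traceback from rank 4.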