GRPO training error related to sympy/math expressions #168

@guizhenn

GRPO training crashes on rank 4. The traceback ends inside SymPy's relational.py, in _eval_fuzzy_relation at the call to is_lt(lhs, rhs), so the failure happens while SymPy is evaluating an inequality. Does anyone know how to fix this?

[rank4]: File "dummy/open-r1/src/open_r1/grpo.py", line 240, in <module>
[rank4]: main(script_args, training_args, model_args)
[rank4]: File "dummy/open-r1/src/open_r1/grpo.py", line 192, in main
[rank4]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2184, in train
[rank4]: return inner_training_loop(
[rank4]: ^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank4]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank
[rank4]: ^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 335, in canonical
[rank4]: r = self.func(*args)
[rank4]: ^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 852, in __new__
[rank4]: return cls._eval_relation(lhs, rhs, **options)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 859, in _eval_relation
[rank4]: val = cls._eval_fuzzy_relation(lhs, rhs)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "dummy/envs/openr1/lib/python3.11/site-packages/sympy/core/relational.py", line 1186, in _eval_fuzzy_relation
[rank4]: return is_lt(lhs, rhs)
[rank4]: ^^^^^^^^^^^^^^^
[rank4]W0203 05:44:57.243000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153217 closing signal SIGTERM
W0203 05:44:57.306000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153218 closing signal SIGTERM
W0203 05:44:57.307000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153219 closing signal SIGTERM
W0203 05:44:57.308000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153220 closing signal SIGTERM
W0203 05:44:57.310000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153222 closing signal SIGTERM
W0203 05:44:57.312000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 153223 closing signal SIGTERM
E0203 05:45:01.285000 151407 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 4 (pid: 153221) of binary: dummy/envs/openr1/bin/python3.11
Traceback (most recent call last):
File "dummy/envs/openr1/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "dummy/envs/openr1/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "dummy/envs/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1157, in launch_command
deepspeed_launcher(args)
File "dummy/envs/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 845, in deepspeed_launcher
distrib_run.run(args)
File "dummy/envs/openr1/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "dummy/envs/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "dummy/envs/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/open_r1/grpo.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-03_05:44:57
host : dummy
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 153221)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
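
One possible mitigation, sketched here under assumptions: the log above is truncated before the final exception line, so the exact error type is unknown, and it is only an assumption that the SymPy call is reached from a reward function that parses model completions. If that is the case, a single malformed expression can take down the whole run, and wrapping the per-sample scoring in a try/except keeps training alive. The names `safe_rewards` and `toy_score` are hypothetical and are not part of open-r1, TRL, or SymPy.

```python
from typing import Callable, List


def safe_rewards(
    score_one: Callable[[str, str], float],
    completions: List[str],
    solutions: List[str],
    fallback: float = 0.0,
) -> List[float]:
    """Score each completion; use `fallback` whenever scoring raises."""
    rewards = []
    for completion, solution in zip(completions, solutions):
        try:
            rewards.append(float(score_one(completion, solution)))
        except Exception as exc:  # e.g. an error raised deep inside SymPy
            # Log and continue so one bad sample does not kill the rank.
            print(f"reward scoring failed ({type(exc).__name__}): {exc}")
            rewards.append(fallback)
    return rewards


def toy_score(completion: str, solution: str) -> float:
    # Hypothetical stand-in for a SymPy-backed verifier: it raises on
    # inputs containing an inequality, mimicking the failure in the log.
    if "<" in completion:
        raise TypeError("simulated comparison failure")
    return 1.0 if completion.strip() == solution.strip() else 0.0


if __name__ == "__main__":
    print(safe_rewards(toy_score, ["4", "x < I", "5"], ["4", "2", "4"]))
```

This only covers failures that raise an exception; it does not help if the verifier hangs. Giving a failed sample the fallback reward also hides the underlying parsing problem, so it is worth logging the offending completion to find out what SymPy could not compare.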
