
Unable to export checkpoint #340

Closed
JianmingTONG opened this issue Dec 19, 2020 · 5 comments · Fixed by #361 or #420

@JianmingTONG commented Dec 19, 2020

Hi,

I ran the following commands to train the two_ways/bid, intersections/4lane, double_merge/cross, and double_merge/nocross scenarios.

# terminal 1 (path under the smarts directory)
scl envision start -s ./scenarios -p 8081

# terminal 2 (path under smarts/benchmark)
python run.py scenarios/intersections/4lane -f agents/ppo/baseline-continuous-control.yaml

# terminal 3 (path under smarts/benchmark)
python run.py scenarios/two_ways/bid -f agents/ppo/baseline-continuous-control.yaml

# terminal 4 (path under smarts/benchmark)
python run.py scenarios/double_merge/cross -f agents/ppo/baseline-continuous-control.yaml

# terminal 5 (path under smarts/benchmark)
python run.py scenarios/double_merge/nocross -f agents/ppo/baseline-continuous-control.yaml

All scenarios crash after 40+ iterations with the same error. It seems something is wrong with the checkpoint export. Could anyone help?

== Status ==
Memory usage on this node: 21.7/78.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 2/24 CPUs, 1/1 GPUs, 0.0/35.16 GiB heap, 0.0/12.11 GiB objects (0/1.0 GPUType:GTX)
Result logdir: /home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4
Number of trials: 1 (1 RUNNING)
+------------------------+----------+-------------------+--------+------------------+--------+----------+
| Trial name             | status   | loc               |   iter |   total time (s) |     ts |   reward |
|------------------------+----------+-------------------+--------+------------------+--------+----------|
| PPO_Simple_25705_00000 | RUNNING  | 172.16.0.199:4223 |     52 |            14728 | 208000 |  269.281 |
+------------------------+----------+-------------------+--------+------------------+--------+----------+


Traceback (most recent call last):
  File "run.py", line 186, in <module>
    cluster=args.cluster,
  File "run.py", line 139, in main
    analysis = tune.run(**config["run"])
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/tune.py", line 334, in run
    runner.step()
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 339, in step
    self._process_events()  # blocking
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 444, in _process_events
    self._process_trial_save(trial)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 555, in _process_trial_save
    self._execute_action(trial, decision)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 605, in _execute_action
    self.trial_executor.export_trial_if_needed(trial)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 699, in export_trial_if_needed
    DEFAULT_GET_TIMEOUT)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::PPO.export_model() (pid=4223, ip=172.16.0.199)
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trainable.py", line 526, in export_model
    return self._export_model(export_formats, export_dir)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1141, in _export_model
    self.export_policy_checkpoint(path)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1001, in export_policy_checkpoint
    export_dir, filename_prefix, policy_id)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 925, in export_policy_checkpoint
    self.policy_map[policy_id].export_checkpoint(export_dir,
KeyError: 'default_policy'
(pid=4223) /home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/agents/trainer.py:678: ResourceWarning: unclosed file <_io.BufferedWriter name='/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/checkpoint_52/checkpoint-52'>
(pid=4223)   pickle.dump(self.__getstate__(), open(checkpoint_path, "wb"))
(pid=4223) ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib/python3.7/subprocess.py:883: ResourceWarning: subprocess 4141 is still running
  ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/progress.csv' mode='a' encoding='UTF-8'>
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/result.json' mode='a' encoding='UTF-8'>
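Reading the last frame of the traceback, the failure is the policy-map lookup in export_policy_checkpoint(): when no policy_id is given it defaults to "default_policy", but a multi-agent run only registers the ids declared under config["multiagent"]["policies"], and the benchmark presumably uses named policies. A rough sketch of that config shape (the policy id and spaces below are hypothetical, not the benchmark's actual ones):

import gym

# Hypothetical multi-agent PPO config shape (not the benchmark's actual config).
# The rollout worker's policy_map will contain only the ids listed here, so a
# lookup of the implicit "default_policy" id raises KeyError on export.
obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(8,))
act_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))

config = {
    "multiagent": {
        # policy_id -> (policy_cls or None, obs_space, act_space, policy_config)
        "policies": {
            "ppo_policy_0": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: "ppo_policy_0",
    },
}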

Thanks!

@Gamenot (Collaborator) commented Dec 20, 2020

@JianmingTONG, thanks again for the report. I seem to remember this coming up before as a troublesome issue from ray. I looked around, and it appears we have seen it before: ray-project/ray#5339.

I am going to test whether I can find a way around the issue.
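One workaround that could be tested in the meantime (a sketch only; `config` is assumed to be the same trainer config that benchmark/run.py builds from the YAML file): restore the trainer from the Tune checkpoint and export each registered policy explicitly, so the export never falls back to the missing "default_policy" id.

from ray.rllib.agents.ppo import PPOTrainer

# Assumption: `config` is the trainer config built by benchmark/run.py; the
# checkpoint path is the one reported in the log above.
trainer = PPOTrainer(config=config)
trainer.restore(
    "/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/"
    "PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/checkpoint_52/checkpoint-52"
)

# Export every registered policy by id instead of relying on the implicit
# "default_policy" id that Trainer.export_model() uses.
for policy_id in config.get("multiagent", {}).get("policies", {}):
    trainer.export_policy_checkpoint("exported/" + policy_id, policy_id=policy_id)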

@Gamenot Gamenot self-assigned this Dec 23, 2020
@Gamenot Gamenot added this to the 0.5 milestone Dec 23, 2020
@Gamenot (Collaborator) commented Dec 23, 2020

Potentially, solving #167 by upgrading ray to 1.X.X will also solve this issue.

@Gamenot Gamenot linked a pull request Jan 4, 2021 that will close this issue
@Gamenot Gamenot reopened this Jan 13, 2021
@Gamenot Gamenot linked a pull request Jan 13, 2021 that will close this issue
@Gamenot (Collaborator) commented Jan 27, 2021

This needs to be tested again before closing.

@Gamenot Gamenot modified the milestones: 0.5, Backlog Jan 27, 2021
@Gamenot Gamenot added the bug Something isn't working label Feb 1, 2021
@Gamenot (Collaborator) commented Feb 1, 2021

This is fixed in v0.4.13.

@Gamenot Gamenot closed this as completed Feb 1, 2021
@Meta-YZ commented Jun 4, 2021

> Potentially, solving #167 by upgrading ray to 1.X.X will also solve this issue.

I have updated ray to 1.0.1post, but the error (KeyError: 'default_policy') still happens.
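If upgrading ray does not help, one way to sidestep the failure is to stop passing export_formats to tune.run() (that is what routes through export_trial_if_needed in the traceback above) and rely on regular checkpointing instead. A minimal runnable sketch with a placeholder config, not the benchmark's actual run config:

from ray import tune

# Placeholder config; in the benchmark this is built by run.py from the YAML.
config = {"env": "CartPole-v0", "num_workers": 1}

analysis = tune.run(
    "PPO",
    config=config,
    stop={"training_iteration": 2},
    checkpoint_freq=1,
    checkpoint_at_end=True,
    # export_formats=["checkpoint"],  # omitting this avoids PPO.export_model(),
    #                                 # where KeyError: 'default_policy' is raised
)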
