
Unable to export checkpoint #340

Closed
JianmingTONG opened this issue Dec 19, 2020 · 5 comments · Fixed by #361 or #420

@JianmingTONG commented Dec 19, 2020

Hi,

I ran the following commands to train the two_ways/bid, intersections/4lane, double_merge/cross, and double_merge/nocross scenarios.

# terminal 1 (path under the smarts directory)
scl envision start -s ./scenarios -p 8081

# terminal 2 (path under smarts/benchmark)
python run.py scenarios/intersections/4lane -f agents/ppo/baseline-continuous-control.yaml

# terminal 3 (path under smarts/benchmark)
python run.py scenarios/two_ways/bid -f agents/ppo/baseline-continuous-control.yaml

# terminal 4 (path under smarts/benchmark)
python run.py scenarios/double_merge/cross -f agents/ppo/baseline-continuous-control.yaml

# terminal 5 (path under smarts/benchmark)
python run.py scenarios/double_merge/nocross -f agents/ppo/baseline-continuous-control.yaml

All scenarios crash after 40+ iterations with the same error. It seems something is wrong with the checkpoint export. Could anyone help?

== Status ==
Memory usage on this node: 21.7/78.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 2/24 CPUs, 1/1 GPUs, 0.0/35.16 GiB heap, 0.0/12.11 GiB objects (0/1.0 GPUType:GTX)
Result logdir: /home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4
Number of trials: 1 (1 RUNNING)
+------------------------+----------+-------------------+--------+------------------+--------+----------+
| Trial name             | status   | loc               |   iter |   total time (s) |     ts |   reward |
|------------------------+----------+-------------------+--------+------------------+--------+----------|
| PPO_Simple_25705_00000 | RUNNING  | 172.16.0.199:4223 |     52 |            14728 | 208000 |  269.281 |
+------------------------+----------+-------------------+--------+------------------+--------+----------+


Traceback (most recent call last):
  File "run.py", line 186, in <module>
    cluster=args.cluster,
  File "run.py", line 139, in main
    analysis = tune.run(**config["run"])
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/tune.py", line 334, in run
    runner.step()
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 339, in step
    self._process_events()  # blocking
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 444, in _process_events
    self._process_trial_save(trial)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 555, in _process_trial_save
    self._execute_action(trial, decision)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 605, in _execute_action
    self.trial_executor.export_trial_if_needed(trial)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 699, in export_trial_if_needed
    DEFAULT_GET_TIMEOUT)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::PPO.export_model() (pid=4223, ip=172.16.0.199)
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/tune/trainable.py", line 526, in export_model
    return self._export_model(export_formats, export_dir)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1141, in _export_model
    self.export_policy_checkpoint(path)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1001, in export_policy_checkpoint
    export_dir, filename_prefix, policy_id)
  File "/home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 925, in export_policy_checkpoint
    self.policy_map[policy_id].export_checkpoint(export_dir,
KeyError: 'default_policy'
(pid=4223) /home/nics/venv/python37_smarts_rllib_gpu/lib/python3.7/site-packages/ray/rllib/agents/trainer.py:678: ResourceWarning: unclosed file <_io.BufferedWriter name='/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/checkpoint_52/checkpoint-52'>
(pid=4223)   pickle.dump(self.__getstate__(), open(checkpoint_path, "wb"))
(pid=4223) ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib/python3.7/subprocess.py:883: ResourceWarning: subprocess 4141 is still running
  ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/progress.csv' mode='a' encoding='UTF-8'>
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/result.json' mode='a' encoding='UTF-8'>
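Reading the last frame of the traceback, the failure is the policy-map lookup in export_policy_checkpoint(): when no policy_id is given it defaults to "default_policy", but a multi-agent run only registers the ids declared under config["multiagent"]["policies"], and the benchmark presumably uses named policies. A rough sketch of that config shape (the policy id and spaces below are hypothetical, not the benchmark's actual ones):

import gym

# Hypothetical multi-agent PPO config shape (not the benchmark's actual config).
# The rollout worker's policy_map will contain only the ids listed here, so a
# lookup of the implicit "default_policy" id raises KeyError on export.
obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(8,))
act_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))

config = {
    "multiagent": {
        # policy_id -> (policy_cls or None, obs_space, act_space, policy_config)
        "policies": {
            "ppo_policy_0": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: "ppo_policy_0",
    },
}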

Thanks!

@Gamenot (Collaborator) commented Dec 20, 2020

@JianmingTONG, thanks again for the report. I seem to remember this coming up before as a troublesome issue from ray. I looked around, and it appears we have seen it before: ray-project/ray#5339.

I am going to test whether I can find a way around the issue.
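One workaround that could be tested in the meantime (a sketch only; `config` is assumed to be the same trainer config that benchmark/run.py builds from the YAML file): restore the trainer from the Tune checkpoint and export each registered policy explicitly, so the export never falls back to the missing "default_policy" id.

from ray.rllib.agents.ppo import PPOTrainer

# Assumption: `config` is the trainer config built by benchmark/run.py; the
# checkpoint path is the one reported in the log above.
trainer = PPOTrainer(config=config)
trainer.restore(
    "/home/nics/work/RL/SMARTS/benchmark/log/results/run/bid-4/"
    "PPO_Simple_0_2020-12-19_15-45-21kzsmnot1/checkpoint_52/checkpoint-52"
)

# Export every registered policy by id instead of relying on the implicit
# "default_policy" id that Trainer.export_model() uses.
for policy_id in config.get("multiagent", {}).get("policies", {}):
    trainer.export_policy_checkpoint("exported/" + policy_id, policy_id=policy_id)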

@Gamenot Gamenot self-assigned this Dec 23, 2020
@Gamenot Gamenot added this to the 0.5 milestone Dec 23, 2020
@Gamenot (Collaborator) commented Dec 23, 2020

Potentially, solving #167 by upgrading ray to 1.X.X will also solve this issue.

@Gamenot Gamenot linked a pull request Jan 4, 2021 that will close this issue
@Gamenot Gamenot reopened this Jan 13, 2021
@Gamenot Gamenot linked a pull request Jan 13, 2021 that will close this issue
@Gamenot (Collaborator) commented Jan 27, 2021

This needs to be tested again before closing.

@Gamenot Gamenot modified the milestones: 0.5, Backlog Jan 27, 2021
@Gamenot Gamenot added the bug Something isn't working label Feb 1, 2021
@Gamenot (Collaborator) commented Feb 1, 2021

This is fixed in v0.4.13.

@Gamenot Gamenot closed this as completed Feb 1, 2021
@Meta-YZ commented Jun 4, 2021

> Potentially, solving #167 by upgrading ray to 1.X.X will also solve this issue.

I have updated ray to 1.0.1post, but the error (KeyError: 'default_policy') still happens.
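If upgrading ray does not help, one way to sidestep the failure is to stop passing export_formats to tune.run() (that is what routes through export_trial_if_needed in the traceback above) and rely on regular checkpointing instead. A minimal runnable sketch with a placeholder config, not the benchmark's actual run config:

from ray import tune

# Placeholder config; in the benchmark this is built by run.py from the YAML.
config = {"env": "CartPole-v0", "num_workers": 1}

analysis = tune.run(
    "PPO",
    config=config,
    stop={"training_iteration": 2},
    checkpoint_freq=1,
    checkpoint_at_end=True,
    # export_formats=["checkpoint"],  # omitting this avoids PPO.export_model(),
    #                                 # where KeyError: 'default_policy' is raised
)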
