[RLlib] Cleanup examples folder vol 32: Enable RLlib + Serve example in CI and translate to new API stack. #48687
Merged: sven1977 merged 4 commits into ray-project:master from sven1977:enable_rllib_plus_serve_example_in_ci on Nov 12, 2024.
Changes from all commits (4 commits):
@@ -1,55 +1,105 @@
-"""This example script shows how one can use Ray Serve in combination with RLlib.
+"""Example on how to run RLlib in combination with Ray Serve.

-Here, we serve an already trained PyTorch RLModule to provide action computations
-to a Ray Serve client.
-"""
-import argparse
-import atexit
-import os
+This example trains an agent with PPO on the CartPole environment, then creates
+an RLModule checkpoint and returns its location. After that, it sends the checkpoint
+to the Serve deployment for serving the trained RLModule (policy).

-import requests
-import subprocess
-import time
+This example:
+    - shows how to set up a Ray Serve deployment for serving an already trained
+    RLModule (policy network).
+    - shows how to request new actions from the Ray Serve deployment while actually
+    running through episodes in an environment (on which the RLModule that's served
+    was trained).

-import gymnasium as gym
-from pathlib import Path

-import ray
-from ray.rllib.algorithms.algorithm import AlgorithmConfig
-from ray.rllib.algorithms.ppo import PPOConfig
+How to run this script
+----------------------
+`python [script file name].py --enable-new-api-stack --stop-reward=200.0`

-parser = argparse.ArgumentParser()
-parser.add_argument("--train-iters", type=int, default=3)
-parser.add_argument("--serve-episodes", type=int, default=2)
-parser.add_argument("--no-render", action="store_true")
+Use the `--stop-iters`, `--stop-reward`, and/or `--stop-timesteps` options to
+determine how long to train the policy for. Use the `--num-episodes-served` option to
+set the number of episodes to serve (after training) and the `--no-render` option
+to NOT render the environment during the serving phase.

+For debugging, use the following additional command line options
+`--no-tune --num-env-runners=0`
+which should allow you to set breakpoints anywhere in the RLlib code and
+have the execution stop there for inspection and debugging.

-def train_rllib_rl_module(config: AlgorithmConfig, train_iters: int = 1):
-    """Trains a PPO (RLModule) on ALE/MsPacman-v5 for n iterations.
+For logging to your WandB account, use:
+`--wandb-key=[your WandB API key] --wandb-project=[some project name]
+--wandb-run-name=[optional: WandB run name (within the defined project)]`

-    Saves the trained Algorithm to disk and returns the checkpoint path.
+You can visualize experiment results in ~/ray_results using TensorBoard.

-    Args:
-        config: The algo config object for the Algorithm.
-        train_iters: For how many iterations to train the Algorithm.

-    Returns:
-        str: The saved checkpoint to restore the RLModule from.
-    """
-    # Create algorithm from config.
-    algo = config.build()
+Results to expect
+-----------------

-    # Train for n iterations, then save, stop, and return the checkpoint path.
-    for _ in range(train_iters):
-        print(algo.train())
+You should see something similar to the following on the command line when using the
+options: `--stop-reward=250.0`, `--num-episodes-served=2`, and `--port=12345`:

-    # TODO (sven): Change this example to only storing the RLModule checkpoint, NOT
-    # the entire Algorithm.
-    checkpoint_result = algo.save()
+[First, the RLModule is trained through PPO]

-    algo.stop()
++-----------------------------+------------+-----------------+--------+
+| Trial name                  | status     | loc             |   iter |
+|                             |            |                 |        |
+|-----------------------------+------------+-----------------+--------|
+| PPO_CartPole-v1_84778_00000 | TERMINATED | 127.0.0.1:40411 |      1 |
++-----------------------------+------------+-----------------+--------+
++------------------+-----------------------+------------------------+
+|   total time (s) |   episode_return_mean |   num_env_steps_sample |
+|                  |                       |             d_lifetime |
+|------------------+-----------------------+------------------------|
+|          2.87052 |                 253.2 |                  12000 |
++------------------+-----------------------+------------------------+

-    return checkpoint_result.checkpoint
+[The RLModule is deployed through Ray Serve on port 12345]

+Started Ray Serve with PID: 40458

+[A few episodes are played through using the policy service (w/ greedy, non-exploratory
+actions)]

+Episode R=500.0
+Episode R=500.0
+"""
+import atexit
+import os

+import requests
+import subprocess
+import time

+import gymnasium as gym
+from pathlib import Path

+from ray.rllib.algorithms.ppo import PPOConfig
+from ray.rllib.core import (
+    COMPONENT_LEARNER_GROUP,
+    COMPONENT_LEARNER,
+    COMPONENT_RL_MODULE,
+    DEFAULT_MODULE_ID,
+)
+from ray.rllib.utils.metrics import (
+    ENV_RUNNER_RESULTS,
+    EPISODE_RETURN_MEAN,
+)
+from ray.rllib.utils.test_utils import (
+    add_rllib_example_script_args,
+    run_rllib_example_script_experiment,
+)

+parser = add_rllib_example_script_args()
+parser.set_defaults(
+    enable_new_api_stack=True,
+    checkpoint_freq=1,
+    checkpoint_at_end=True,
+)
+parser.add_argument("--num-episodes-served", type=int, default=2)
+parser.add_argument("--no-render", action="store_true")
+parser.add_argument("--port", type=int, default=12345)


 def kill_proc(proc):
@@ -64,18 +114,23 @@ def kill_proc(proc):
 if __name__ == "__main__":
     args = parser.parse_args()

-    ray.init(num_cpus=8)

-    # Config for the served RLlib RLModule/Algorithm.
-    config = (
-        PPOConfig()
-        .api_stack(enable_rl_module_and_learner=True)
-        .environment("CartPole-v1")
-    )
+    base_config = PPOConfig().environment("CartPole-v1")

+    results = run_rllib_example_script_experiment(base_config, args)
+    algo_checkpoint = results.get_best_result(
+        f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}"
+    ).checkpoint.path
+    # We only need the RLModule component from the algorithm checkpoint. It's located
+    # under "[algo checkpoint dir]/learner_group/learner/rl_module/[default policy ID]".
+    rl_module_checkpoint = (
+        Path(algo_checkpoint)
+        / COMPONENT_LEARNER_GROUP
+        / COMPONENT_LEARNER
+        / COMPONENT_RL_MODULE
+        / DEFAULT_MODULE_ID
+    )

-    # Train the Algorithm for some time, then save it and get the checkpoint path.
-    checkpoint = train_rllib_rl_module(config, train_iters=args.train_iters)
-
     path_of_this_file = Path(__file__).parent
     os.chdir(path_of_this_file)
     # Start the serve app with the trained checkpoint.
@@ -84,7 +139,9 @@ def kill_proc(proc):
             "serve",
             "run",
             "classes.cartpole_deployment:rl_module",
-            f"checkpoint={checkpoint.path}",
+            f"rl_module_checkpoint={rl_module_checkpoint}",
+            f"port={args.port}",
+            "route_prefix=/rllib-rlmodule",
         ]
     )
     # Register our `kill_proc` function to be called on exit to stop Ray Serve again.
@@ -97,35 +154,34 @@ def kill_proc(proc):
     # Create the environment that we would like to receive
     # served actions for.
     env = gym.make("CartPole-v1", render_mode="human")
-    obs, info = env.reset()
+    obs, _ = env.reset()

     num_episodes = 0
     episode_return = 0.0

-    while num_episodes < args.serve_episodes:
+    while num_episodes < args.num_episodes_served:
         # Render env if necessary.
         if not args.no_render:
             env.render()

-        # print("-> Requesting action for obs ...")
+        # print(f"-> Requesting action for obs={obs} ...", end="")
         # Send a request to serve.
         resp = requests.get(
-            "http://localhost:8000/rllib-rlmodule",
+            f"http://localhost:{args.port}/rllib-rlmodule",
             json={"observation": obs.tolist()},
             # timeout=5.0,
         )
         response = resp.json()
-        # print("<- Received response {}".format(response))
+        # print(f"  received: action={response['action']}")

         # Apply the action in the env.
         action = response["action"]
-        obs, reward, done, _, _ = env.step(action)
+        obs, reward, terminated, truncated, _ = env.step(action)
         episode_return += reward

         # If episode done -> reset to get initial observation of new episode.
-        if done:
+        if terminated or truncated:
             print(f"Episode R={episode_return}")
-            obs, info = env.reset()
+            obs, _ = env.reset()
             num_episodes += 1
             episode_return = 0.0

Inline review comment (Contributor), on the `if terminated or truncated:` line: Awesome example!!
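The Serve-side target that the script launches (`classes.cartpole_deployment:rl_module`) lives in a separate `classes/cartpole_deployment.py` file that is not part of this diff. For orientation only, here is a minimal, hypothetical sketch of what such a target could look like: an application-builder function that receives the `key=value` arguments passed to `serve run` and binds a deployment that restores the RLModule from `rl_module_checkpoint` and returns greedy actions for posted observations. The class name and argument handling below are assumptions, not the repository's actual implementation, and the sketch ignores the `port` and `route_prefix` arguments that the script also passes.

```python
# Hypothetical sketch -- NOT the actual classes/cartpole_deployment.py from the repo.
from typing import Dict

import numpy as np
import torch
from starlette.requests import Request

from ray import serve
from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.rl_module import RLModule


@serve.deployment
class ServeRLModule:
    def __init__(self, rl_module_checkpoint: str):
        # Restore the (torch) RLModule from the checkpoint directory the example
        # script builds (.../learner_group/learner/rl_module/[default policy ID]).
        self.rl_module = RLModule.from_checkpoint(rl_module_checkpoint)

    async def __call__(self, request: Request) -> Dict:
        body = await request.json()
        # Batch of one observation -> [1, obs_dim] float tensor.
        obs = torch.from_numpy(np.array([body["observation"]], dtype=np.float32))
        # Greedy (non-exploratory) action: argmax over the action-dist logits.
        fwd_out = self.rl_module.forward_inference({Columns.OBS: obs})
        logits = fwd_out[Columns.ACTION_DIST_INPUTS]
        return {"action": int(torch.argmax(logits, dim=-1)[0])}


def rl_module(args: Dict[str, str]) -> serve.Application:
    # `serve run classes.cartpole_deployment:rl_module key=value ...` passes the
    # key=value pairs from the CLI to this builder function as a dict of strings.
    return ServeRLModule.bind(args["rl_module_checkpoint"])
```

With a deployment of this shape running, the client loop in the diff above only needs to send JSON observations to `http://localhost:<port>/rllib-rlmodule` and step its local environment with the returned action.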
Review comment: For better practical usability, we might need to extend this example with a connector run. That will likely be the default setup we see in practice.
Reply: Yeah, makes sense! Maybe add another example where we restore the EnvToModule pipeline + the RLModule, like we do in the `ray.rllib.examples.inference.policy_inference_after_training_w_connector.py` script.
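In the spirit of that suggestion, and purely as a rough, untested sketch loosely following `policy_inference_after_training_w_connector.py`, a connector-based local inference loop could look roughly like the code below. The checkpoint path, the use of the default env-to-module pipeline built from the training config, and the greedy argmax (instead of a module-to-env pipeline) are assumptions here, and exact ConnectorV2 call signatures can differ between Ray versions.

```python
# Rough, untested sketch of connector-based local inference (not part of this PR).
import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.rl_module import RLModule
from ray.rllib.env.single_agent_episode import SingleAgentEpisode
from ray.rllib.utils.numpy import convert_to_numpy

# Assumed: the RLModule checkpoint path produced by the script above.
rl_module_checkpoint = "/tmp/algo_ckpt/learner_group/learner/rl_module/default_policy"

config = PPOConfig().environment("CartPole-v1")
env = gym.make("CartPole-v1")

# Restore the trained RLModule and build the default env-to-module pipeline.
rl_module = RLModule.from_checkpoint(rl_module_checkpoint)
env_to_module = config.build_env_to_module_connector(env)

obs, _ = env.reset()
# ConnectorV2 pipelines operate on (lists of) ongoing episodes, not raw observations.
episode = SingleAgentEpisode(
    observations=[obs],
    observation_space=env.observation_space,
    action_space=env.action_space,
)

terminated = truncated = False
while not (terminated or truncated):
    # Let the connector pipeline build the forward batch from the episode.
    batch = env_to_module(
        batch={},
        episodes=[episode],
        rl_module=rl_module,
        explore=False,
        shared_data={},
    )
    fwd_out = rl_module.forward_inference(batch)
    # Greedy action: argmax over the Categorical logits.
    logits = convert_to_numpy(fwd_out[Columns.ACTION_DIST_INPUTS])
    action = int(np.argmax(logits[0]))

    obs, reward, terminated, truncated, _ = env.step(action)
    episode.add_env_step(
        obs, action, reward, terminated=terminated, truncated=truncated
    )

print(f"Episode return: {episode.get_return()}")
```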