Tasks
The task properties are listed below:
| Task Name | Observation Space Shape | Action Space Shape | Has Done Signal | Max Timesteps | Create Task |
|---|---|---|---|---|---|
| HalfCheetah-v3 | 18 | 6 | False | 1000 | `env = neorl.make("HalfCheetah-v3")` |
| Walker2d-v3 | 18 | 6 | True | 1000 | `env = neorl.make("Walker2d-v3")` |
| Hopper-v3 | 12 | 3 | True | 1000 | `env = neorl.make("Hopper-v3")` |
| IB | 180 | 3 | False | 1000 | `env = neorl.make("Ib")` |
| FinRL | 181 | 30 | False | 2516 | `env = neorl.make("Finance")` |
| CL | 74 | 14 | False | 1000 | `env = neorl.make("Citylearn")` |
| SP | 4 | 2 | False | 50 (train) / 30 (test) | `env = neorl.make("sp")` |
| WW | 14 | 4 | False | 287 | `env = neorl.make("ww")` |
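For reference, here is a minimal sketch of creating one of the tasks above and interacting with it. The `get_dataset` call and its `data_type`/`train_num` arguments are assumptions about this repo's data-loading API and may differ in detail.

```python
import neorl

# Create one of the benchmark tasks listed in the table above.
env = neorl.make("Hopper-v3")

# Standard Gym-style interaction loop.
obs = env.reset()
done, steps = False, 0
while not done and steps < 1000:        # 1000 is Hopper-v3's max timesteps
    action = env.action_space.sample()  # replace with a trained policy
    obs, reward, done, info = env.step(action)
    steps += 1

# Assumed data-loading call: the exact signature may differ in this repo.
train_data, val_data = env.get_dataset(data_type="high", train_num=100)
```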
The Gym-MuJoCo continuous control tasks are standard testbeds for online reinforcement learning algorithms. They are challenging because of their high-dimensional state and action spaces, especially for offline reinforcement learning, even though the physics engines are deterministic. We select three environments to construct the offline RL tasks: HalfCheetah-v3, Walker2d-v3, and Hopper-v3. The subtle difference from the original environments is that we augment the observation space with one extra dimension to record the position. Because part of the reward function in these three environments is the distance moved forward, adding the location information simplifies the reward calculation for the current step. We have also found that the impact of this position information on policy training is negligible.
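As a quick illustration of the augmented observation, a check like the following (a sketch, using the `neorl.make` id from the table above) would show 18 dimensions for HalfCheetah-v3 instead of the 17 in the original Gym version:

```python
import neorl

# HalfCheetah-v3 here reports 18 observation dims: the 17 original Gym dims
# plus 1 extra dimension recording the agent's forward position.
env = neorl.make("HalfCheetah-v3")
obs = env.reset()
assert obs.shape[0] == 18
```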
The industrial benchmark (IB) is a reinforcement learning benchmark environment that simulates the characteristics found in various industrial control tasks, such as wind or gas turbines and chemical reactors. It includes issues commonly encountered in real-world industrial settings, such as high-dimensional continuous state and action spaces, delayed rewards, complex noise patterns, and the high stochasticity of multiple reactive targets. We have also augmented the original industrial benchmark by adding two dimensions of the system state to the observation space so that the instant reward can be computed at each step. Since the industrial benchmark is already a high-dimensional and highly stochastic environment, no explicit noise is added to the actions when sampling data from it.
The FinRL environment provides a way to build a trading simulator that replicates the real stock market and provides backtesting support with important market frictions such as transaction costs, market liquidity, and investor risk aversion. In the Finance environment, one trade can be made per trading day for each stock in the pool. The reward function is the difference in total asset value between the end of the current day and the previous day. The environment evolves on its own as time elapses.
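A toy sketch of the reward described above; the variable names and the decomposition of total asset value into cash plus holdings are illustrative assumptions, not the environment's actual implementation.

```python
import numpy as np

def total_asset_value(cash: float, holdings: np.ndarray, prices: np.ndarray) -> float:
    # Total asset value: remaining cash plus the market value of all held shares.
    return cash + float(np.dot(holdings, prices))

def daily_reward(asset_value_today: float, asset_value_yesterday: float) -> float:
    # Reward is the day-over-day change in total asset value.
    return asset_value_today - asset_value_yesterday
```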
The CityLearn environment is an OpenAI Gym-like environment that reshapes the aggregate electricity demand curve by controlling energy storage in different types of buildings. High electricity demand raises the price of electricity and the overall cost of the distribution network. Flattening, smoothing, and narrowing the electricity demand curve help to reduce the operating and capital costs of generation, transmission, and distribution. The optimization goal is to coordinate the control of domestic hot water and chilled water storage by the electricity consumers (i.e., buildings) so as to reshape the overall electricity demand curve.
UPDATE (2022/12/29): We add a CityLearn-v1 environment in which the action is clipped to the range [-2/3, 2/3] to alleviate the impact of out-of-range actions. Note that the action spaces differ across cities, so we simply use the looser bound [-2/3, 2/3] to clip each dimension of the action. The policy used to collect the CityLearn-v1 datasets is the deterministic version, and thus its performance will be slightly lower than in CityLearn-v0 (we notice that when an action is beyond the given range, the environment can still respond to it but will give higher rewards).
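A minimal sketch of that clipping step (illustrative only; CityLearn-v1 performs it internally):

```python
import numpy as np

def clip_action(action: np.ndarray) -> np.ndarray:
    # Clip each action dimension to the loose common range [-2/3, 2/3]
    # before it is passed to the simulator.
    return np.clip(action, -2.0 / 3.0, 2.0 / 3.0)
```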
Disclaimer: This environment is partly built on our real-world sales promotion projects. All the offline datasets have gone through the data masking process.
The SalesPromotion environment simulates a real-world sales promotion platform, where the platform operator (a human with some data analysis tools) delivers different discount coupons to each user to promote sales. The number of discount coupons delivered to a user ranges from 0 to 5 each day, and each coupon carries a discount within a fixed range.
To build this environment, the user models are trained from the real-world platform interaction data, which were collected from over 10,000 users between 19/03/2021 and 17/05/2021 (60 days). Each state (the user state) contains the total orders, the average orders since the first day, the average fees since the first day, and the day of the week. The user model takes the first three dimensions of the user state as input and outputs the user action, which consists of the number of orders and the average fees for a single day.
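A sketch of the user state and user action described above; the field names are hypothetical and do not necessarily match the environment's internal representation.

```python
from dataclasses import dataclass

@dataclass
class UserState:
    total_orders: float   # total orders accumulated so far
    avg_orders: float     # average orders since the first day
    avg_fees: float       # average fees since the first day
    day_of_week: int      # 0-6; not consumed by the user model

@dataclass
class UserAction:
    num_orders: float     # number of orders on a single day
    avg_fees: float       # average fees on that day

def user_model(state: UserState) -> UserAction:
    """Placeholder for the learned user model, which only consumes the
    first three dimensions of the user state."""
    ...
```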
We sample 10,000 users to build the offline training dataset and another 1,000 users to build the offline test dataset. The delivered discount coupons and the user actions come from the real human operator and real users on the platform. We merge the first 10 days, so the first day in the offline datasets is 29/03/2021 and its state contains the statistics of those first 10 days. After training, the operator's policy is tested over the next 30 days starting from 18/05/2021 with the same users. That is, the trajectory horizon is 50 for training and 30 for testing. This setting follows the real-world scenario (and is also akin to the backtesting in FinRL). However, the environment can simulate different sales promotion periods by resetting the initial states to any day between 19/03/2021 and 17/05/2021.
This model simulates a waterworks pump control problem in a small city. The city contains 5 stations, 4 of which can be controlled by the policy. The action controls the pressure of these stations so that the pressure of the critical station (the 5th dimension of the observation) stays under control. The control policy responds every 5 minutes, and a trajectory lasts one day (1440/5 - 1 = 287 time steps). This is a classic industrial control task.
The state is designed to contain the water flows and pressures of multiple stations, and the external variables include the temperature, day of the week, a holiday indicator, and a time embedding. The transition dynamics map (obs, ex_var, action) to next_obs, where ex_var comes from static data. This transition model is trained from another batch of real-world data.
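A sketch of the transition interface described above; the names are hypothetical and only indicate the shape of the mapping.

```python
import numpy as np

def transition(obs: np.ndarray, ex_var: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Placeholder for the learned waterworks dynamics model: maps
    (obs, ex_var, action) to next_obs, where ex_var (temperature, day of week,
    holiday flag, time embedding) comes from static data rather than the policy."""
    ...
```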
Note that one trajectory corresponds to one day in the real world, so we collect at most 1,000 trajectories, which cover about 2.7 years. More trajectories would have little additional practical value, as the surrounding environment changes considerably over such long time spans.