CORL (Clean Offline Reinforcement Learning)

🧵 CORL is an Offline Reinforcement Learning (ORL) library that provides high-quality, easy-to-follow single-file implementations of state-of-the-art ORL algorithms. Each implementation is backed by a research-friendly codebase, allowing you to run or tune thousands of experiments. CORL is heavily inspired by cleanrl for online RL; check it out too!

📜 Single-file implementation
📈 Benchmarked Implementation for N algorithms
🖼 Weights and Biases integration

⭐ If you're interested in discrete control, make sure to check out our new library, Katakomba. It provides both discrete control algorithms augmented with recurrence and an offline RL benchmark for the NetHack Learning Environment.

Getting started

git clone https://github.com/tinkoff-ai/CORL.git && cd CORL
pip install -r requirements/requirements_dev.txt

# alternatively, you could use docker
docker build -t <image_name> .
docker run --gpus=all -it --rm --name <container_name> <image_name>

Algorithms Implemented

Algorithm	Variants Implemented	Wandb Report
Offline and Offline-to-Online
✅ Conservative Q-Learning for Offline Reinforcement Learning (CQL)	`offline/cql.py` `finetune/cql.py`	`Offline` `Offline-to-online`
✅ Accelerating Online Reinforcement Learning with Offline Datasets (AWAC)	`offline/awac.py` `finetune/awac.py`	`Offline` `Offline-to-online`
✅ Offline Reinforcement Learning with Implicit Q-Learning (IQL)	`offline/iql.py` `finetune/iql.py`	`Offline` `Offline-to-online`
Offline-to-Online only
✅ Supported Policy Optimization for Offline Reinforcement Learning (SPOT)	`finetune/spot.py`	`Offline-to-online`
✅ Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning (Cal-QL)	`finetune/cal_ql.py`	`Offline-to-online`
Offline only
✅ Behavioral Cloning (BC)	`offline/any_percent_bc.py`	`Offline`
✅ Behavioral Cloning-10% (BC-10%)	`offline/any_percent_bc.py`	`Offline`
✅ A Minimalist Approach to Offline Reinforcement Learning (TD3+BC)	`offline/td3_bc.py`	`Offline`
✅ Decision Transformer: Reinforcement Learning via Sequence Modeling (DT)	`offline/dt.py`	`Offline`
✅ Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (SAC-N)	`offline/sac_n.py`	`Offline`
✅ Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (EDAC)	`offline/edac.py`	`Offline`
✅ Revisiting the Minimalist Approach to Offline Reinforcement Learning (ReBRAC)	`offline/rebrac.py`	`Offline`
✅ Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size (LB-SAC)	`offline/lb_sac.py`	`Offline Gym-MuJoCo`
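
BC and BC-10% are listed under the same file, `offline/any_percent_bc.py`, which suggests that the two differ only in how much of the dataset is kept. The sketch below shows one common way to do this kind of filtering: keep the top fraction of trajectories ranked by episode return. It is an illustration only, not the code from the repository; the function name `top_fraction_indices` and the example returns are hypothetical.

```python
import math


def top_fraction_indices(returns, frac):
    """Return sorted indices of the top `frac` of trajectories by return.

    Hypothetical helper: a sketch of percentile filtering for BC-X%,
    not the implementation used in `offline/any_percent_bc.py`.
    """
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    # Always keep at least one trajectory.
    n_keep = max(1, math.ceil(frac * len(returns)))
    # Rank trajectory indices from highest to lowest return.
    order = sorted(range(len(returns)), key=lambda i: returns[i], reverse=True)
    return sorted(order[:n_keep])


# Made-up episode returns for ten trajectories.
episode_returns = [10.0, 50.0, 30.0, 90.0, 70.0, 20.0, 60.0, 40.0, 80.0, 100.0]

bc_10_percent = top_fraction_indices(episode_returns, 0.1)  # only the best trajectory
plain_bc = top_fraction_indices(episode_returns, 1.0)       # the full dataset
print(bc_10_percent, plain_bc)
```

With `frac=1.0` the filter keeps everything, which corresponds to plain BC on the full dataset.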

D4RL Benchmarks

See the reports linked above for learning curves and details. Here, we report reproduced final ("last") and best scores. The two differ by a significant margin, and papers do not always state explicitly which reporting methodology they use. If you want to re-collect our results in a more structured or nuanced manner, see the results directory.
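
To make the distinction between the two reporting methodologies concrete, here is a minimal sketch. It assumes that each seed produces a curve of evaluation scores, that the "last" score is the final point of each curve, that the "best" score is the per-seed maximum, and that both are summarized as mean ± population standard deviation across seeds. These conventions are assumptions made for illustration, and the curves below are made up.

```python
from statistics import mean, pstdev


def last_and_best(curves):
    """Summarize per-seed evaluation curves as (mean, std) of last and best scores.

    Assumed conventions (not confirmed by the source): "last" is the final
    evaluation of each seed, "best" is the per-seed maximum.
    """
    last = [curve[-1] for curve in curves]
    best = [max(curve) for curve in curves]
    return (mean(last), pstdev(last)), (mean(best), pstdev(best))


# Hypothetical evaluation curves for four seeds (one list per seed).
curves = [
    [10.0, 40.0, 55.0, 50.0],
    [12.0, 45.0, 60.0, 58.0],
    [8.0, 38.0, 52.0, 30.0],
    [11.0, 42.0, 57.0, 54.0],
]

(last_mean, last_std), (best_mean, best_std) = last_and_best(curves)
print(f"last: {last_mean:.2f} ± {last_std:.2f}")
print(f"best: {best_mean:.2f} ± {best_std:.2f}")
```

Under these assumptions the best score can never be lower than the last score for any seed, and a seed that degrades late in training (the third curve here) widens the gap between the two.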

Offline

Last Scores

Gym-MuJoCo

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
halfcheetah-medium-v2	42.40 ± 0.19	42.46 ± 0.70	48.10 ± 0.18	49.46 ± 0.62	47.04 ± 0.22	48.31 ± 0.22	64.04 ± 0.68	68.20 ± 1.28	67.70 ± 1.04	42.20 ± 0.26
halfcheetah-medium-replay-v2	35.66 ± 2.33	23.59 ± 6.95	44.84 ± 0.59	44.70 ± 0.69	45.04 ± 0.27	44.46 ± 0.22	51.18 ± 0.31	60.70 ± 1.01	62.06 ± 1.10	38.91 ± 0.50
halfcheetah-medium-expert-v2	55.95 ± 7.35	90.10 ± 2.45	90.78 ± 6.04	93.62 ± 0.41	95.63 ± 0.42	94.74 ± 0.52	103.80 ± 2.95	98.96 ± 9.31	104.76 ± 0.64	91.55 ± 0.95
hopper-medium-v2	53.51 ± 1.76	55.48 ± 7.30	60.37 ± 3.49	74.45 ± 9.14	59.08 ± 3.77	67.53 ± 3.78	102.29 ± 0.17	40.82 ± 9.91	101.70 ± 0.28	65.10 ± 1.61
hopper-medium-replay-v2	29.81 ± 2.07	70.42 ± 8.66	64.42 ± 21.52	96.39 ± 5.28	95.11 ± 5.27	97.43 ± 6.39	94.98 ± 6.53	100.33 ± 0.78	99.66 ± 0.81	81.77 ± 6.87
hopper-medium-expert-v2	52.30 ± 4.01	111.16 ± 1.03	101.17 ± 9.07	52.73 ± 37.47	99.26 ± 10.91	107.42 ± 7.80	109.45 ± 2.34	101.31 ± 11.63	105.19 ± 10.08	110.44 ± 0.33
walker2d-medium-v2	63.23 ± 16.24	67.34 ± 5.17	82.71 ± 4.78	66.53 ± 26.04	80.75 ± 3.28	80.91 ± 3.17	85.82 ± 0.77	87.47 ± 0.66	93.36 ± 1.38	67.63 ± 2.54
walker2d-medium-replay-v2	21.80 ± 10.15	54.35 ± 6.34	85.62 ± 4.01	82.20 ± 1.05	73.09 ± 13.22	82.15 ± 3.03	84.25 ± 2.25	78.99 ± 0.50	87.10 ± 2.78	59.86 ± 2.73
walker2d-medium-expert-v2	98.96 ± 15.98	108.70 ± 0.25	110.03 ± 0.36	49.41 ± 38.16	109.56 ± 0.39	111.72 ± 0.86	111.86 ± 0.43	114.93 ± 0.41	114.75 ± 0.74	107.11 ± 0.96

locomotion average	50.40	69.29	76.45	67.72	78.28	81.63	89.74	83.52	92.92	73.84
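
The average row is consistent with a plain arithmetic mean of the nine per-task means in each column. As a quick check, the sketch below recomputes it for the BC and EDAC columns using the values from the table above; the helper name `domain_average` is ours.

```python
def domain_average(task_means):
    """Arithmetic mean of per-task mean scores (standard deviations are ignored)."""
    return sum(task_means) / len(task_means)


# Per-task means copied from the Gym-MuJoCo "Last Scores" table above.
bc_last = [42.40, 35.66, 55.95, 53.51, 29.81, 52.30, 63.23, 21.80, 98.96]
edac_last = [67.70, 62.06, 104.76, 101.70, 99.66, 105.19, 93.36, 87.10, 114.75]

print(f"BC:   {domain_average(bc_last):.2f}")    # 50.40
print(f"EDAC: {domain_average(edac_last):.2f}")  # 92.92
```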

Maze2d

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
maze2d-umaze-v1	0.36 ± 8.69	12.18 ± 4.29	29.41 ± 12.31	82.67 ± 28.30	-8.90 ± 6.11	42.11 ± 0.58	106.87 ± 22.16	130.59 ± 16.52	95.26 ± 6.39	18.08 ± 25.42
maze2d-medium-v1	0.79 ± 3.25	14.25 ± 2.33	59.45 ± 36.25	52.88 ± 55.12	86.11 ± 9.68	34.85 ± 2.72	105.11 ± 31.67	88.61 ± 18.72	57.04 ± 3.45	31.71 ± 26.33
maze2d-large-v1	2.26 ± 4.39	11.32 ± 5.10	97.10 ± 25.41	209.13 ± 8.19	23.75 ± 36.70	61.72 ± 3.50	78.33 ± 61.77	204.76 ± 1.19	95.60 ± 22.92	35.66 ± 28.20

maze2d average	1.13	12.58	61.99	114.89	33.65	46.23	96.77	141.32	82.64	28.48

Antmaze

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
antmaze-umaze-v2	55.25 ± 4.15	65.75 ± 5.26	70.75 ± 39.18	57.75 ± 10.28	92.75 ± 1.92	77.00 ± 5.52	97.75 ± 1.48	0.00 ± 0.00	0.00 ± 0.00	57.00 ± 9.82
antmaze-umaze-diverse-v2	47.25 ± 4.09	44.00 ± 1.00	44.75 ± 11.61	58.00 ± 7.68	37.25 ± 3.70	54.25 ± 5.54	83.50 ± 7.02	0.00 ± 0.00	0.00 ± 0.00	51.75 ± 0.43
antmaze-medium-play-v2	0.00 ± 0.00	2.00 ± 0.71	0.25 ± 0.43	0.00 ± 0.00	65.75 ± 11.61	65.75 ± 11.71	89.50 ± 3.35	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
antmaze-medium-diverse-v2	0.75 ± 0.83	5.75 ± 9.39	0.25 ± 0.43	0.00 ± 0.00	67.25 ± 3.56	73.75 ± 5.45	83.50 ± 8.20	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
antmaze-large-play-v2	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	20.75 ± 7.26	42.00 ± 4.53	52.25 ± 29.01	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
antmaze-large-diverse-v2	0.00 ± 0.00	0.75 ± 0.83	0.00 ± 0.00	0.00 ± 0.00	20.50 ± 13.24	30.25 ± 3.63	64.00 ± 5.43	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00

antmaze average	17.21	19.71	19.33	19.29	50.71	57.17	78.42	0.00	0.00	18.12

Adroit

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
pen-human-v1	71.03 ± 6.26	26.99 ± 9.60	-3.88 ± 0.21	81.12 ± 13.47	13.71 ± 16.98	78.49 ± 8.21	103.16 ± 8.49	6.86 ± 5.93	5.07 ± 6.16	67.68 ± 5.48
pen-cloned-v1	51.92 ± 15.15	46.67 ± 14.25	5.13 ± 5.28	89.56 ± 15.57	1.04 ± 6.62	83.42 ± 8.19	102.79 ± 7.84	31.35 ± 2.14	12.02 ± 1.75	64.43 ± 1.43
pen-expert-v1	109.65 ± 7.28	114.96 ± 2.96	122.53 ± 21.27	160.37 ± 1.21	-1.41 ± 2.34	128.05 ± 9.21	152.16 ± 6.33	87.11 ± 48.95	-1.55 ± 0.81	116.38 ± 1.27
door-human-v1	2.34 ± 4.00	-0.13 ± 0.07	-0.33 ± 0.01	4.60 ± 1.90	5.53 ± 1.31	3.26 ± 1.83	-0.10 ± 0.01	-0.38 ± 0.00	-0.12 ± 0.13	4.44 ± 0.87
door-cloned-v1	-0.09 ± 0.03	0.29 ± 0.59	-0.34 ± 0.01	0.93 ± 1.66	-0.33 ± 0.01	3.07 ± 1.75	0.06 ± 0.05	-0.33 ± 0.00	2.66 ± 2.31	7.64 ± 3.26
door-expert-v1	105.35 ± 0.09	104.04 ± 1.46	-0.33 ± 0.01	104.85 ± 0.24	-0.32 ± 0.02	106.65 ± 0.25	106.37 ± 0.29	-0.33 ± 0.00	106.29 ± 1.73	104.87 ± 0.39
hammer-human-v1	3.03 ± 3.39	-0.19 ± 0.02	1.02 ± 0.24	3.37 ± 1.93	0.14 ± 0.11	1.79 ± 0.80	0.24 ± 0.24	0.24 ± 0.00	0.28 ± 0.18	1.28 ± 0.15
hammer-cloned-v1	0.55 ± 0.16	0.12 ± 0.08	0.25 ± 0.01	0.21 ± 0.24	0.30 ± 0.01	1.50 ± 0.69	5.00 ± 3.75	0.14 ± 0.09	0.19 ± 0.07	1.82 ± 0.55
hammer-expert-v1	126.78 ± 0.64	121.75 ± 7.67	3.11 ± 0.03	127.06 ± 0.29	0.26 ± 0.01	128.68 ± 0.33	133.62 ± 0.27	25.13 ± 43.25	28.52 ± 49.00	117.45 ± 6.65
relocate-human-v1	0.04 ± 0.03	-0.14 ± 0.08	-0.29 ± 0.01	0.05 ± 0.03	0.06 ± 0.03	0.12 ± 0.04	0.16 ± 0.30	-0.31 ± 0.01	-0.17 ± 0.17	0.05 ± 0.01
relocate-cloned-v1	-0.06 ± 0.01	-0.00 ± 0.02	-0.30 ± 0.01	-0.04 ± 0.04	-0.29 ± 0.01	0.04 ± 0.01	1.66 ± 2.59	-0.01 ± 0.10	0.17 ± 0.35	0.16 ± 0.09
relocate-expert-v1	107.58 ± 1.20	97.90 ± 5.21	-1.73 ± 0.96	108.87 ± 0.85	-0.30 ± 0.02	106.11 ± 4.02	107.52 ± 2.28	-0.36 ± 0.00	71.94 ± 18.37	104.28 ± 0.42

adroit average	48.18	42.69	10.40	56.75	1.53	53.43	59.39	12.43	18.78	49.21

Best Scores

Gym-MuJoCo

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
halfcheetah-medium-v2	43.60 ± 0.14	43.90 ± 0.13	48.93 ± 0.11	50.06 ± 0.50	47.62 ± 0.03	48.84 ± 0.07	65.62 ± 0.46	72.21 ± 0.31	69.72 ± 0.92	42.73 ± 0.10
halfcheetah-medium-replay-v2	40.52 ± 0.19	42.27 ± 0.46	45.84 ± 0.26	46.35 ± 0.29	46.43 ± 0.19	45.35 ± 0.08	52.22 ± 0.31	67.29 ± 0.34	66.55 ± 1.05	40.31 ± 0.28
halfcheetah-medium-expert-v2	79.69 ± 3.10	94.11 ± 0.22	96.59 ± 0.87	96.11 ± 0.37	97.04 ± 0.17	95.38 ± 0.17	108.89 ± 1.20	111.73 ± 0.47	110.62 ± 1.04	93.40 ± 0.21
hopper-medium-v2	69.04 ± 2.90	73.84 ± 0.37	70.44 ± 1.18	97.90 ± 0.56	70.80 ± 1.98	80.46 ± 3.09	103.19 ± 0.16	101.79 ± 0.20	103.26 ± 0.14	69.42 ± 3.64
hopper-medium-replay-v2	68.88 ± 10.33	90.57 ± 2.07	98.12 ± 1.16	100.91 ± 1.50	101.63 ± 0.55	102.69 ± 0.96	102.57 ± 0.45	103.83 ± 0.53	103.28 ± 0.49	88.74 ± 3.02
hopper-medium-expert-v2	90.63 ± 10.98	113.13 ± 0.16	113.22 ± 0.43	103.82 ± 12.81	112.84 ± 0.66	113.18 ± 0.38	113.16 ± 0.43	111.24 ± 0.15	111.80 ± 0.11	111.18 ± 0.21
walker2d-medium-v2	80.64 ± 0.91	82.05 ± 0.93	86.91 ± 0.28	83.37 ± 2.82	84.77 ± 0.20	87.58 ± 0.48	87.79 ± 0.19	90.17 ± 0.54	95.78 ± 1.07	74.70 ± 0.56
walker2d-medium-replay-v2	48.41 ± 7.61	76.09 ± 0.40	91.17 ± 0.72	86.51 ± 1.15	89.39 ± 0.88	89.94 ± 0.93	91.11 ± 0.63	85.18 ± 1.63	89.69 ± 1.39	68.22 ± 1.20
walker2d-medium-expert-v2	109.95 ± 0.62	109.90 ± 0.09	112.21 ± 0.06	108.28 ± 9.45	111.63 ± 0.38	113.06 ± 0.53	112.49 ± 0.18	116.93 ± 0.42	116.52 ± 0.75	108.71 ± 0.34

locomotion average	70.15	80.65	84.83	85.92	84.68	86.28	93.00	95.60	96.36	77.49

Maze2d

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
maze2d-umaze-v1	16.09 ± 0.87	22.49 ± 1.52	99.33 ± 16.16	136.61 ± 11.65	92.05 ± 13.66	50.92 ± 4.23	162.28 ± 1.79	153.12 ± 6.49	149.88 ± 1.97	63.83 ± 17.35
maze2d-medium-v1	19.16 ± 1.24	27.64 ± 1.87	150.93 ± 3.89	131.50 ± 25.38	128.66 ± 5.44	122.69 ± 30.00	150.12 ± 4.48	93.80 ± 14.66	154.41 ± 1.58	68.14 ± 12.25
maze2d-large-v1	20.75 ± 6.66	41.83 ± 3.64	197.64 ± 5.26	227.93 ± 1.90	157.51 ± 7.32	162.25 ± 44.18	197.55 ± 5.82	207.51 ± 0.96	182.52 ± 2.68	50.25 ± 19.34

maze2d average	18.67	30.65	149.30	165.35	126.07	111.95	169.98	151.48	162.27	60.74

Antmaze

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
antmaze-umaze-v2	68.50 ± 2.29	77.50 ± 1.50	98.50 ± 0.87	78.75 ± 6.76	94.75 ± 0.83	84.00 ± 4.06	100.00 ± 0.00	0.00 ± 0.00	42.50 ± 28.61	64.50 ± 2.06
antmaze-umaze-diverse-v2	64.75 ± 4.32	63.50 ± 2.18	71.25 ± 5.76	88.25 ± 2.17	53.75 ± 2.05	79.50 ± 3.35	96.75 ± 2.28	0.00 ± 0.00	0.00 ± 0.00	60.50 ± 2.29
antmaze-medium-play-v2	4.50 ± 1.12	6.25 ± 2.38	3.75 ± 1.30	27.50 ± 9.39	80.50 ± 3.35	78.50 ± 3.84	93.50 ± 2.60	0.00 ± 0.00	0.00 ± 0.00	0.75 ± 0.43
antmaze-medium-diverse-v2	4.75 ± 1.09	16.50 ± 5.59	5.50 ± 1.50	33.25 ± 16.81	71.00 ± 4.53	83.50 ± 1.80	91.75 ± 2.05	0.00 ± 0.00	0.00 ± 0.00	0.50 ± 0.50
antmaze-large-play-v2	0.50 ± 0.50	13.50 ± 9.76	1.25 ± 0.43	1.00 ± 0.71	34.75 ± 5.85	53.50 ± 2.50	68.75 ± 13.90	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
antmaze-large-diverse-v2	0.75 ± 0.43	6.25 ± 1.79	0.25 ± 0.43	0.50 ± 0.50	36.25 ± 3.34	53.00 ± 3.00	69.50 ± 7.26	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00

antmaze average	23.96	30.58	30.08	38.21	61.83	72.00	86.71	0.00	7.08	21.04

Adroit

Task-Name	BC	10% BC	TD3+BC	AWAC	CQL	IQL	ReBRAC	SAC-N	EDAC	DT
pen-human-v1	99.69 ± 7.45	59.89 ± 8.03	9.95 ± 8.19	121.05 ± 5.47	58.91 ± 1.81	106.15 ± 10.28	127.28 ± 3.22	56.48 ± 7.17	35.84 ± 10.57	77.83 ± 2.30
pen-cloned-v1	99.14 ± 12.27	83.62 ± 11.75	52.66 ± 6.33	129.66 ± 1.27	14.74 ± 2.31	114.05 ± 4.78	128.64 ± 7.15	52.69 ± 5.30	26.90 ± 7.85	71.17 ± 2.70
pen-expert-v1	128.77 ± 5.88	134.36 ± 3.16	142.83 ± 7.72	162.69 ± 0.23	14.86 ± 4.07	140.01 ± 6.36	157.62 ± 0.26	116.43 ± 40.26	36.04 ± 4.60	119.49 ± 2.31
door-human-v1	9.41 ± 4.55	7.00 ± 6.77	-0.11 ± 0.06	19.28 ± 1.46	13.28 ± 2.77	13.52 ± 1.22	0.27 ± 0.43	-0.10 ± 0.06	2.51 ± 2.26	7.36 ± 1.24
door-cloned-v1	3.40 ± 0.95	10.37 ± 4.09	-0.20 ± 0.11	12.61 ± 0.60	-0.08 ± 0.13	9.02 ± 1.47	7.73 ± 6.80	-0.21 ± 0.10	20.36 ± 1.11	11.18 ± 0.96
door-expert-v1	105.84 ± 0.23	105.92 ± 0.24	4.49 ± 7.39	106.77 ± 0.24	59.47 ± 25.04	107.29 ± 0.37	106.78 ± 0.04	0.05 ± 0.02	109.22 ± 0.24	105.49 ± 0.09
hammer-human-v1	12.61 ± 4.87	6.23 ± 4.79	2.38 ± 0.14	22.03 ± 8.13	0.30 ± 0.05	6.86 ± 2.38	1.18 ± 0.15	0.25 ± 0.00	3.49 ± 2.17	1.68 ± 0.11
hammer-cloned-v1	8.90 ± 4.04	8.72 ± 3.28	0.96 ± 0.30	14.67 ± 1.94	0.32 ± 0.03	11.63 ± 1.70	48.16 ± 6.20	12.67 ± 15.02	0.27 ± 0.01	2.74 ± 0.22
hammer-expert-v1	127.89 ± 0.57	128.15 ± 0.66	33.31 ± 47.65	129.66 ± 0.33	0.93 ± 1.12	129.76 ± 0.37	134.74 ± 0.30	91.74 ± 47.77	69.44 ± 47.00	127.39 ± 0.10
relocate-human-v1	0.59 ± 0.27	0.16 ± 0.14	-0.29 ± 0.01	2.09 ± 0.76	1.03 ± 0.20	1.22 ± 0.28	3.70 ± 2.34	-0.18 ± 0.14	0.05 ± 0.02	0.08 ± 0.02
relocate-cloned-v1	0.45 ± 0.31	0.74 ± 0.45	-0.02 ± 0.04	0.94 ± 0.68	-0.07 ± 0.02	1.78 ± 0.70	9.25 ± 2.56	0.10 ± 0.04	4.11 ± 1.39	0.34 ± 0.09
relocate-expert-v1	110.31 ± 0.36	109.77 ± 0.60	0.23 ± 0.27	111.56 ± 0.17	0.03 ± 0.10	110.12 ± 0.82	111.14 ± 0.23	-0.07 ± 0.08	98.32 ± 3.75	106.49 ± 0.30

adroit average	58.92	54.58	20.51	69.42	13.65	62.62	69.71	27.49	33.88	52.60

Offline-to-Online

Scores

Task-Name	AWAC	CQL	IQL	SPOT	Cal-QL
antmaze-umaze-v2	52.75 ± 8.67 → 98.75 ± 1.09	94.00 ± 1.58 → 99.50 ± 0.87	77.00 ± 0.71 → 96.50 ± 1.12	91.00 ± 2.55 → 99.50 ± 0.50	76.75 ± 7.53 → 99.75 ± 0.43
antmaze-umaze-diverse-v2	56.00 ± 2.74 → 0.00 ± 0.00	9.50 ± 9.91 → 99.00 ± 1.22	59.50 ± 9.55 → 63.75 ± 25.02	36.25 ± 2.17 → 95.00 ± 3.67	32.00 ± 27.79 → 98.50 ± 1.12
antmaze-medium-play-v2	0.00 ± 0.00 → 0.00 ± 0.00	59.00 ± 11.18 → 97.75 ± 1.30	71.75 ± 2.95 → 89.75 ± 1.09	67.25 ± 10.47 → 97.25 ± 1.30	71.75 ± 3.27 → 98.75 ± 1.64
antmaze-medium-diverse-v2	0.00 ± 0.00 → 0.00 ± 0.00	63.50 ± 6.84 → 97.25 ± 1.92	64.25 ± 1.92 → 92.25 ± 2.86	73.75 ± 7.29 → 94.50 ± 1.66	62.00 ± 4.30 → 98.25 ± 1.48
antmaze-large-play-v2	0.00 ± 0.00 → 0.00 ± 0.00	28.75 ± 7.76 → 88.25 ± 2.28	38.50 ± 8.73 → 64.50 ± 17.04	31.50 ± 12.58 → 87.00 ± 3.24	31.75 ± 8.87 → 97.25 ± 1.79
antmaze-large-diverse-v2	0.00 ± 0.00 → 0.00 ± 0.00	35.50 ± 3.64 → 91.75 ± 3.96	26.75 ± 3.77 → 64.25 ± 4.15	17.50 ± 7.26 → 81.00 ± 14.14	44.00 ± 8.69 → 91.50 ± 3.91

antmaze average	18.12 → 16.46	48.38 → 95.58	56.29 → 78.50	52.88 → 92.38	53.04 → 97.33

pen-cloned-v1	88.66 ± 15.10 → 86.82 ± 11.12	-2.76 ± 0.08 → -1.28 ± 2.16	84.19 ± 3.96 → 102.02 ± 20.75	6.19 ± 5.21 → 43.63 ± 20.09	-2.66 ± 0.04 → -2.68 ± 0.12
door-cloned-v1	0.93 ± 1.66 → 0.01 ± 0.00	-0.33 ± 0.01 → -0.33 ± 0.01	1.19 ± 0.93 → 20.34 ± 9.32	-0.21 ± 0.14 → 0.02 ± 0.31	-0.33 ± 0.01 → -0.33 ± 0.01
hammer-cloned-v1	1.80 ± 3.01 → 0.24 ± 0.04	0.56 ± 0.55 → 2.85 ± 4.81	1.35 ± 0.32 → 57.27 ± 28.49	3.97 ± 6.39 → 3.73 ± 4.99	0.25 ± 0.04 → 0.17 ± 0.17
relocate-cloned-v1	-0.04 ± 0.04 → -0.04 ± 0.01	-0.33 ± 0.01 → -0.33 ± 0.01	0.04 ± 0.04 → 0.32 ± 0.38	-0.24 ± 0.01 → -0.15 ± 0.05	-0.31 ± 0.05 → -0.31 ± 0.04

adroit average	22.84 → 21.76	-0.72 → 0.22	21.69 → 44.99	2.43 → 11.81	-0.76 → -0.79
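
Each cell in this table has the form "a ± b → c ± d", which we read as the score after offline pretraining followed by the score after online fine-tuning. The sketch below parses such cells and recomputes the antmaze average for the Cal-QL column. The regular expression and helper names are ours, and the reading of the arrow is an assumption.

```python
import re

# Matches cells such as "76.75 ± 7.53 → 99.75 ± 0.43".
CELL = re.compile(r"(-?\d+\.\d+) ± (-?\d+\.\d+) → (-?\d+\.\d+) ± (-?\d+\.\d+)")


def parse_cell(cell):
    """Return (mean_before, mean_after) from one 'a ± b → c ± d' cell."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    before, _before_std, after, _after_std = map(float, match.groups())
    return before, after


# Cal-QL column of the antmaze rows, copied from the table above.
cal_ql_antmaze = [
    "76.75 ± 7.53 → 99.75 ± 0.43",
    "32.00 ± 27.79 → 98.50 ± 1.12",
    "71.75 ± 3.27 → 98.75 ± 1.64",
    "62.00 ± 4.30 → 98.25 ± 1.48",
    "31.75 ± 8.87 → 97.25 ± 1.79",
    "44.00 ± 8.69 → 91.50 ± 3.91",
]

pairs = [parse_cell(cell) for cell in cal_ql_antmaze]
avg_before = sum(before for before, _ in pairs) / len(pairs)
avg_after = sum(after for _, after in pairs) / len(pairs)
print(f"{avg_before:.2f} → {avg_after:.2f}")  # 53.04 → 97.33
```

The result matches the "antmaze average" row for Cal-QL, so the averages here are also plain means over tasks.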

Regrets

Task-Name	AWAC	CQL	IQL	SPOT	Cal-QL
antmaze-umaze-v2	0.04 ± 0.01	0.02 ± 0.00	0.07 ± 0.00	0.02 ± 0.00	0.01 ± 0.00
antmaze-umaze-diverse-v2	0.88 ± 0.01	0.09 ± 0.01	0.43 ± 0.11	0.22 ± 0.07	0.05 ± 0.01
antmaze-medium-play-v2	1.00 ± 0.00	0.08 ± 0.01	0.09 ± 0.01	0.06 ± 0.00	0.04 ± 0.01
antmaze-medium-diverse-v2	1.00 ± 0.00	0.08 ± 0.00	0.10 ± 0.01	0.05 ± 0.01	0.04 ± 0.01
antmaze-large-play-v2	1.00 ± 0.00	0.21 ± 0.02	0.34 ± 0.05	0.29 ± 0.07	0.13 ± 0.02
antmaze-large-diverse-v2	1.00 ± 0.00	0.21 ± 0.03	0.41 ± 0.03	0.23 ± 0.08	0.13 ± 0.02

antmaze average	0.82	0.11	0.24	0.15	0.07

pen-cloned-v1	0.46 ± 0.02	0.97 ± 0.00	0.37 ± 0.01	0.58 ± 0.02	0.98 ± 0.01
door-cloned-v1	1.00 ± 0.00	1.00 ± 0.00	0.83 ± 0.03	0.99 ± 0.01	1.00 ± 0.00
hammer-cloned-v1	1.00 ± 0.00	1.00 ± 0.00	0.65 ± 0.10	0.98 ± 0.01	1.00 ± 0.00
relocate-cloned-v1	1.00 ± 0.00	1.00 ± 0.00	1.00 ± 0.00	1.00 ± 0.00	1.00 ± 0.00

adroit average	0.86	0.99	0.71	0.89	0.99
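
The README does not define regret. One reading that is consistent with the numbers above (a policy that stays at a score of 0 throughout fine-tuning has a regret of 1.00) is one minus the average normalized score over the online fine-tuning evaluations. The sketch below implements that reading. The definition, the clipping to [0, 100], and the function name are all assumptions for illustration, not the repository's code.

```python
def cumulative_regret(scores, max_score=100.0):
    """One minus the mean of scores rescaled to [0, 1].

    Assumed definition: scores are clipped to [0, max_score] before rescaling,
    so the regret always lies in [0, 1].
    """
    rescaled = [min(max(s, 0.0), max_score) / max_score for s in scores]
    return 1.0 - sum(rescaled) / len(rescaled)


# Hypothetical evaluation scores collected during online fine-tuning.
never_solves = [0.0, 0.0, 0.0, 0.0]
improves_quickly = [50.0, 75.0, 100.0, 100.0]

print(cumulative_regret(never_solves))      # 1.0
print(cumulative_regret(improves_quickly))  # 0.1875
```

Under this reading, lower is better: a low regret means the policy performed well throughout fine-tuning, not only at the end.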

Citing CORL

If you use CORL in your work, please cite it using the following BibTeX entry:

@inproceedings{
tarasov2022corl,
  title={{CORL}: Research-oriented Deep Offline Reinforcement Learning Library},
  author={Denis Tarasov and Alexander Nikulin and Dmitry Akimov and Vladislav Kurenkov and Sergey Kolesnikov},
  booktitle={3rd Offline RL Workshop: Offline RL as a ''Launchpad''},
  year={2022},
  url={https://openreview.net/forum?id=SyAS49bBcv}
}

