68 changes: 68 additions & 0 deletions docs/algo/repexp.md
# Recipe: Representation-Based Exploration (RepExp)

Last updated: 11/14/2025.

<div align="center">

Representation-Based Exploration for Language Models: <br> From Test-Time to Post-Training

[📄 arXiv](https://arxiv.org/abs/2510.11686) &nbsp; &nbsp; [🌐 Website](https://rep-exp.github.io) &nbsp; &nbsp; [🐦 Twitter / X ](https://x.com/JensTuyls/status/1978244454617128993)

</div>


## Installation 🔌

Besides the base verl installation, which you can find [here](https://verl.readthedocs.io/en/latest/start/install.html), the only additional package you need is scikit-learn (`pip install scikit-learn`).

## Running the Experiments 🚀

You can reproduce or extend our experiments by running the following commands:

```bash
# General format
sh recipe/rep_exp/train_elliptical.sh $TASK $SPARSE_DIM $BETA $SEED

# MATH
sh recipe/rep_exp/train_elliptical.sh math 32 0.01 42

# GSM8K
sh recipe/rep_exp/train_elliptical.sh gsm8k 32 0.01 42

# DAPO-WITH-AIME
sh recipe/rep_exp/train_elliptical.sh dapo-with-aime24 128 0.01 42
```
where `$TASK` is the task name, `$SPARSE_DIM` is the dimension of the sparse random projection, `$BETA` is the coefficient on the exploration bonus, and `$SEED` is the random seed.
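For example, a small sweep over seeds can be scripted by building the same positional argument list programmatically. This is a convenience sketch, not part of the recipe; the helper name is ours, and only the script path and argument order come from the commands above.

```python
import subprocess  # noqa: F401  (used only if you uncomment the launch line)


def make_train_cmd(task, sparse_dim, beta, seed):
    """argv for train_elliptical.sh: TASK SPARSE_DIM BETA SEED (positional)."""
    return ["sh", "recipe/rep_exp/train_elliptical.sh", task, str(sparse_dim), str(beta), str(seed)]


for seed in (42, 43, 44):
    cmd = make_train_cmd("math", 32, 0.01, seed)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch each run
```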

## Evaluation 📊
Once training is done, you can evaluate the model on the test set in two steps.
1. Merge the model checkpoint.

This is necessary because the model checkpoint is saved in multiple shards (depending on the number of GPUs), and they must be merged into a single checkpoint.

```bash
sh recipe/rep_exp/model_merge.sh /path/to/global_step_X/actor # where X is the global step of the checkpoint with the best pass@1 on dev
```

2. Evaluate the merged model.

```bash
sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```

The results are saved as a JSON file in a folder named `eval`.
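Checkpoint selection above relies on pass@1 on dev, estimated from the sampled rollouts (`val_kwargs.n` rollouts per problem). As a reference for how such numbers are typically computed, here is the standard unbiased pass@k estimator; this is background math, not code from this recipe:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n rollouts (c of them correct) is correct,
    i.e. 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect rollouts than k, so some sample must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = 128 rollouts and 32 correct ones, pass@1 reduces to c / n:
print(pass_at_k(128, 32, 1))  # 0.25
```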

## Citation 📝

```bibtex
@article{tuyls2025representation,
title={Representation-Based Exploration for Language Models: From Test-Time to Post-Training},
author={Tuyls, Jens and Foster, Dylan J and Krishnamurthy, Akshay and Ash, Jordan T},
journal={arXiv preprint arXiv:2510.11686},
year={2025}
}
```

## Contact 📬

If you have any questions or suggestions, feel free to reach out at [jtuyls@princeton.edu](mailto:jtuyls@princeton.edu).
1 change: 1 addition & 0 deletions docs/index.rst
algo/spin.md
algo/sppo.md
algo/entropy.md
algo/repexp.md
algo/opo.md
algo/baseline.md
algo/gpg.md
66 changes: 66 additions & 0 deletions recipe/rep_exp/README.md
<div align="center">

# Representation-Based Exploration for Language Models: <br> From Test-Time to Post-Training

[📄 arXiv](https://arxiv.org/abs/2510.11686) &nbsp; &nbsp; [🌐 Website](https://rep-exp.github.io) &nbsp; &nbsp; [🐦 Twitter / X ](https://x.com/JensTuyls/status/1978244454617128993)

</div>

## Installation 🔌

Besides the base verl installation, which you can find [here](https://verl.readthedocs.io/en/latest/start/install.html), the only package to install is scikit-learn.
```bash
pip install scikit-learn
```

## Running the Experiments 🚀

You can reproduce or extend our experiments by running the following commands:

```bash
# General format
sh recipe/rep_exp/train_elliptical.sh $TASK $SPARSE_DIM $BETA $SEED

# MATH
sh recipe/rep_exp/train_elliptical.sh math 32 0.01 42

# GSM8K
sh recipe/rep_exp/train_elliptical.sh gsm8k 32 0.01 42

# DAPO-WITH-AIME
sh recipe/rep_exp/train_elliptical.sh dapo-with-aime24 128 0.01 42
```
where `$TASK` is the task name, `$SPARSE_DIM` is the dimension of the sparse random projection, `$BETA` is the coefficient on the exploration bonus, and `$SEED` is the random seed.

## Evaluation 📊
Once training is done, you can evaluate the model on the test set in two steps.
1. Merge the model checkpoint.

This is necessary because the model checkpoint is saved in multiple shards (depending on the number of GPUs), and they must be merged into a single checkpoint.

```bash
sh recipe/rep_exp/model_merge.sh /path/to/global_step_X/actor # where X is the global step of the checkpoint with the best pass@1 on dev
```

2. Evaluate the merged model.

```bash
sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```

The results are saved as a JSON file in a folder named `eval`.
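Once the JSON results are written, a quick way to inspect them could look like the following. The file name and result schema here are assumptions for illustration only, not the recipe's actual output format; adapt the key filter to whatever metrics your eval run produces.

```python
import json
from pathlib import Path


def summarize_eval(path):
    """Load an eval JSON and keep only pass@k metrics, assuming a flat
    {"pass@1": float, ...} layout (hypothetical schema)."""
    results = json.loads(Path(path).read_text())
    return {k: v for k, v in results.items() if k.startswith("pass@")}


# Demonstration with a stand-in results file:
Path("eval_demo.json").write_text(json.dumps({"pass@1": 0.42, "n_problems": 500}))
print(summarize_eval("eval_demo.json"))  # {'pass@1': 0.42}
```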

## Citation 📝

```bibtex
@article{tuyls2025representation,
title={Representation-Based Exploration for Language Models: From Test-Time to Post-Training},
author={Tuyls, Jens and Foster, Dylan J and Krishnamurthy, Akshay and Ash, Jordan T},
journal={arXiv preprint arXiv:2510.11686},
year={2025}
}
```

## Contact 📬

If you have any questions or suggestions, feel free to reach out at [jtuyls@princeton.edu](mailto:jtuyls@princeton.edu).
33 changes: 33 additions & 0 deletions recipe/rep_exp/config/rep_exp_trainer.yaml
hydra:
  searchpath:
    - file://verl/trainer/config

defaults:
  - ppo_trainer
  - _self_

reward_model:
  elliptical:
    enable: True
    lamb: 0.01
    normalization: none # none, rnd, z_score
    reward_type: leave_one_out # leave_one_out, leverage
    sparse_dim: 512
    randomize_sparse_matrix: True
    persist_covariance: False

reward_kwargs:
  elliptical:
    alpha: 1.0
    beta: 1.0
    turn_off_elliptical_if_none_correct: True
    turn_off_elliptical_if_some_correct: False
    turn_off_elliptical_if_all_correct: False
    turn_off_elliptical_if_rollout_incorrect: False

actor_rollout_ref:
  rollout:
    val_kwargs:
      temperature: 1.0
      n: 128
      do_sample: True
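The `elliptical` settings above configure a leverage-score-style exploration bonus computed on projected representations. As a rough illustration of what such a bonus looks like, here is a minimal numpy sketch under our own assumptions about the config semantics (`sparse_dim` as the projected feature dimension, `lamb` as the ridge regularizer); it is not the verl implementation, and a dense Gaussian projection stands in for the sparse random matrix the `randomize_sparse_matrix` flag suggests.

```python
import numpy as np


def elliptical_bonus(features, sparse_dim=32, lamb=0.01, seed=0):
    """Leverage-score bonus b_i = phi_i^T (lamb*I + Phi^T Phi)^{-1} phi_i."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    # Random projection down to sparse_dim (dense Gaussian stand-in).
    proj = rng.standard_normal((d, sparse_dim)) / np.sqrt(sparse_dim)
    phi = features @ proj                          # (n, sparse_dim)
    cov = lamb * np.eye(sparse_dim) + phi.T @ phi  # regularized covariance
    # Per-row quadratic form phi_i^T cov^{-1} phi_i.
    return np.einsum("ij,jk,ik->i", phi, np.linalg.inv(cov), phi)


feats = np.random.default_rng(1).standard_normal((8, 64))
print(elliptical_bonus(feats))  # 8 scores, each in [0, 1)
```

Rows whose projected representation is poorly covered by the batch covariance receive a larger bonus, which is the sense in which the bonus rewards exploring under-visited regions of representation space.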
104 changes: 104 additions & 0 deletions recipe/rep_exp/data_preprocess/dapo_with_aime.py
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess DAPO dataset to parquet format
"""

import argparse
import os

import datasets
import numpy as np

from verl.utils.hdfs_io import copy, makedirs

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default="~/data/dapo-with-aime24")
    parser.add_argument("--hdfs_dir", default=None)
    parser.add_argument("--dapo_dataset_path", type=str, default="ftajwar/deduplicated_dapo_dataset")
    parser.add_argument("--aime24_part_1_dataset_path", type=str, default="MathArena/aime_2024_I")
    parser.add_argument("--aime24_part_2_dataset_path", type=str, default="MathArena/aime_2024_II")
    parser.add_argument("--train_size", type=int, default=4096)

    args = parser.parse_args()

    data_source = "math_dapo"

    # Load DAPO dataset for training
    dapo_dataset_path = args.dapo_dataset_path
    dapo_dataset = datasets.load_dataset(dapo_dataset_path, trust_remote_code=True)

    # Load AIME 2024 part 1 dataset for testing
    aime24_dataset_path_part_1 = args.aime24_part_1_dataset_path
    aime24_dataset_part_1 = datasets.load_dataset(aime24_dataset_path_part_1, trust_remote_code=True)

    # Load AIME 2024 part 2 dataset for testing
    aime24_dataset_path_part_2 = args.aime24_part_2_dataset_path
    aime24_dataset_part_2 = datasets.load_dataset(aime24_dataset_path_part_2, trust_remote_code=True)

    train_dataset = dapo_dataset["train"]
    train_dataset = train_dataset.select(np.random.choice(len(train_dataset), size=args.train_size, replace=False))

    dev_dataset_aime24_part_1 = aime24_dataset_part_1["train"]
    dev_dataset_aime24_part_2 = aime24_dataset_part_2["train"]
    dev_dataset = datasets.concatenate_datasets([dev_dataset_aime24_part_1, dev_dataset_aime24_part_2])

    instruction_following = "Let's think step by step and output the final answer within \\boxed{}."

    # add a row to each data item that represents a unique id
    def make_map_fn(split):
        def process_fn(example, idx):
            if "prompt" in example:
                question = example.pop("prompt")
            elif "problem" in example:
                question = example.pop("problem")
            else:
                raise ValueError(f"Unknown question type: {example}")

            question = question + " " + instruction_following

            if "answer" in example:
                solution = example.pop("answer")
            else:
                raise ValueError(f"Unknown answer type: {example}")
            solution = str(solution)

            data = {
                "data_source": data_source,
                "prompt": [{"role": "user", "content": question}],
                "ability": "math",
                "reward_model": {
                    "style": "rule",
                    "ground_truth": solution,
                },
                "extra_info": {"split": split, "index": idx},
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
    dev_dataset = dev_dataset.map(function=make_map_fn("test"), with_indices=True)

    # Expand a leading "~" in the path; argparse does not do this automatically.
    local_dir = os.path.expanduser(args.local_dir)
    hdfs_dir = args.hdfs_dir

    train_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))
    dev_dataset.to_parquet(os.path.join(local_dir, "dev.parquet"))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)

        copy(src=local_dir, dst=hdfs_dir)
112 changes: 112 additions & 0 deletions recipe/rep_exp/data_preprocess/gsm8k.py
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the GSM8k dataset to parquet format
"""

import argparse
import os
import re

import datasets
import numpy as np

from verl.utils.hdfs_io import copy, makedirs


def extract_solution(solution_str):
    solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default=None, help="Deprecated; use --local_save_dir instead.")
    parser.add_argument("--hdfs_dir", default=None)
    parser.add_argument("--local_dataset_path", default=None, help="The local path to the raw dataset, if it exists.")
    parser.add_argument(
        "--local_save_dir", default="~/data/gsm8k", help="The save directory for the preprocessed dataset."
    )

    args = parser.parse_args()
    local_dataset_path = args.local_dataset_path

    data_source = "openai/gsm8k"

    if local_dataset_path is not None:
        dataset = datasets.load_dataset(local_dataset_path, "main")
    else:
        dataset = datasets.load_dataset(data_source, "main")

    train_dataset = dataset["train"]
    test_dataset = dataset["test"]

    instruction_following = 'Let\'s think step by step and output the final answer after "####".'

    # add a row to each data item that represents a unique id
    def make_map_fn(split):
        def process_fn(example, idx):
            question_raw = example.pop("question")

            question = question_raw + " " + instruction_following

            answer_raw = example.pop("answer")
            solution = extract_solution(answer_raw)
            data = {
                "data_source": data_source,
                "prompt": [
                    {
                        "role": "user",
                        "content": question,
                    }
                ],
                "ability": "math",
                "reward_model": {"style": "rule", "ground_truth": solution},
                "extra_info": {
                    "split": split,
                    "index": idx,
                    "answer": answer_raw,
                    "question": question_raw,
                },
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
    test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True)
    # split test into dev and test by picking a random subset of 512 examples
    all_test_indices = list(range(len(test_dataset)))
    np.random.shuffle(all_test_indices)
    dev_dataset = test_dataset.select(all_test_indices[:512])
    test_dataset = test_dataset.select(all_test_indices[512:])

    hdfs_dir = args.hdfs_dir
    local_save_dir = args.local_dir
    if local_save_dir is not None:
        print("Warning: Argument 'local_dir' is deprecated. Please use 'local_save_dir' instead.")
    else:
        local_save_dir = args.local_save_dir
    # Expand a leading "~" in the path; argparse does not do this automatically.
    local_save_dir = os.path.expanduser(local_save_dir)

    train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
    test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)

        copy(src=local_save_dir, dst=hdfs_dir)