68 changes: 68 additions & 0 deletions docs/algo/repexp.md
# Recipe: Representation-Based Exploration (RepExp)

Last updated: 11/14/2025.

<div align="center">

Representation-Based Exploration for Language Models: <br> From Test-Time to Post-Training

[📄 arXiv](https://arxiv.org/abs/2510.11686) &nbsp; &nbsp; [🌐 Website](https://rep-exp.github.io) &nbsp; &nbsp; [🐦 Twitter / X ](https://x.com/JensTuyls/status/1978244454617128993)

</div>


## Installation 🔌

Besides the base verl installation, which you can find [here](https://verl.readthedocs.io/en/latest/start/install.html), the only additional package you need is scikit-learn (`pip install scikit-learn`).

## Running the Experiments 🚀

You can reproduce or extend our experiments by running the following commands:

```bash
# General format
sh recipe/rep_exp/train_elliptical.sh $TASK $SPARSE_DIM $BETA $SEED

# MATH
sh recipe/rep_exp/train_elliptical.sh math 32 0.01 42

# GSM8K
sh recipe/rep_exp/train_elliptical.sh gsm8k 32 0.01 42

# DAPO-WITH-AIME
sh recipe/rep_exp/train_elliptical.sh dapo-with-aime24 128 0.01 42
```
where `$TASK` is the task name, `$SPARSE_DIM` is the dimension of the sparse random projection, `$BETA` is the coefficient on the exploration bonus, and `$SEED` is the random seed.
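For example, a small sweep over seeds can be scripted by building the same positional argument list programmatically. This is a convenience sketch, not part of the recipe; the helper name is ours, and only the script path and argument order come from the commands above.

```python
import subprocess  # noqa: F401  (used only if you uncomment the launch line)


def make_train_cmd(task, sparse_dim, beta, seed):
    """argv for train_elliptical.sh: TASK SPARSE_DIM BETA SEED (positional)."""
    return ["sh", "recipe/rep_exp/train_elliptical.sh", task, str(sparse_dim), str(beta), str(seed)]


for seed in (42, 43, 44):
    cmd = make_train_cmd("math", 32, 0.01, seed)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch each run
```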

## Evaluation 📊
Once training is done, you can evaluate the model on the test set in two steps.
1. Merge the model checkpoint.

This is necessary because the model checkpoint is saved in multiple shards (depending on the number of GPUs), and they must be merged into a single checkpoint.

```bash
sh recipe/rep_exp/model_merge.sh /path/to/global_step_X/actor # where X is the global step of the checkpoint with the best pass@1 on dev
```

2. Evaluate the merged model.

```bash
sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```

The results are saved as a JSON file in a folder named `eval`.
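Checkpoint selection above relies on pass@1 on dev, estimated from the sampled rollouts (`val_kwargs.n` rollouts per problem). As a reference for how such numbers are typically computed, here is the standard unbiased pass@k estimator; this is background math, not code from this recipe:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n rollouts (c of them correct) is correct,
    i.e. 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect rollouts than k, so some sample must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = 128 rollouts and 32 correct ones, pass@1 reduces to c / n:
print(pass_at_k(128, 32, 1))  # 0.25
```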

## Citation 📝

```bibtex
@article{tuyls2025representation,
title={Representation-Based Exploration for Language Models: From Test-Time to Post-Training},
author={Tuyls, Jens and Foster, Dylan J and Krishnamurthy, Akshay and Ash, Jordan T},
journal={arXiv preprint arXiv:2510.11686},
year={2025}
}
```

## Contact 📬

If you have any questions or suggestions, feel free to reach out at [jtuyls@princeton.edu](mailto:jtuyls@princeton.edu).
1 change: 1 addition & 0 deletions docs/index.rst
algo/spin.md
algo/sppo.md
algo/entropy.md
algo/repexp.md
algo/opo.md
algo/baseline.md
algo/gpg.md
66 changes: 66 additions & 0 deletions recipe/rep_exp/README.md
<div align="center">

# Representation-Based Exploration for Language Models: <br> From Test-Time to Post-Training

[📄 arXiv](https://arxiv.org/abs/2510.11686) &nbsp; &nbsp; [🌐 Website](https://rep-exp.github.io) &nbsp; &nbsp; [🐦 Twitter / X ](https://x.com/JensTuyls/status/1978244454617128993)

</div>

## Installation 🔌

Besides the base verl installation, which you can find [here](https://verl.readthedocs.io/en/latest/start/install.html), the only package to install is scikit-learn.
```bash
pip install scikit-learn
```

## Running the Experiments 🚀

You can reproduce or extend our experiments by running the following commands:

```bash
# General format
sh recipe/rep_exp/train_elliptical.sh $TASK $SPARSE_DIM $BETA $SEED

# MATH
sh recipe/rep_exp/train_elliptical.sh math 32 0.01 42

# GSM8K
sh recipe/rep_exp/train_elliptical.sh gsm8k 32 0.01 42

# DAPO-WITH-AIME
sh recipe/rep_exp/train_elliptical.sh dapo-with-aime24 128 0.01 42
```
where `$TASK` is the task name, `$SPARSE_DIM` is the dimension of the sparse random projection, `$BETA` is the coefficient on the exploration bonus, and `$SEED` is the random seed.

## Evaluation 📊
Once training is done, you can evaluate the model on the test set in two steps.
1. Merge the model checkpoint.

This is necessary because the model checkpoint is saved in multiple shards (depending on the number of GPUs), and they must be merged into a single checkpoint.

```bash
sh recipe/rep_exp/model_merge.sh /path/to/global_step_X/actor # where X is the global step of the checkpoint with the best pass@1 on dev
```

2. Evaluate the merged model.

```bash
sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```

The results are saved as a JSON file in a folder named `eval`.
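Once the JSON results are written, a quick way to inspect them could look like the following. The file name and result schema here are assumptions for illustration only, not the recipe's actual output format; adapt the key filter to whatever metrics your eval run produces.

```python
import json
from pathlib import Path


def summarize_eval(path):
    """Load an eval JSON and keep only pass@k metrics, assuming a flat
    {"pass@1": float, ...} layout (hypothetical schema)."""
    results = json.loads(Path(path).read_text())
    return {k: v for k, v in results.items() if k.startswith("pass@")}


# Demonstration with a stand-in results file:
Path("eval_demo.json").write_text(json.dumps({"pass@1": 0.42, "n_problems": 500}))
print(summarize_eval("eval_demo.json"))  # {'pass@1': 0.42}
```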

## Citation 📝

```bibtex
@article{tuyls2025representation,
title={Representation-Based Exploration for Language Models: From Test-Time to Post-Training},
author={Tuyls, Jens and Foster, Dylan J and Krishnamurthy, Akshay and Ash, Jordan T},
journal={arXiv preprint arXiv:2510.11686},
year={2025}
}
```

## Contact 📬

If you have any questions or suggestions, feel free to reach out at [jtuyls@princeton.edu](mailto:jtuyls@princeton.edu).
33 changes: 33 additions & 0 deletions recipe/rep_exp/config/rep_exp_trainer.yaml
hydra:
  searchpath:
    - file://verl/trainer/config

defaults:
  - ppo_trainer
  - _self_

reward_model:
  elliptical:
    enable: True
    lamb: 0.01
    normalization: none # none, rnd, z_score
    reward_type: leave_one_out # leave_one_out, leverage
    sparse_dim: 512
    randomize_sparse_matrix: True
    persist_covariance: False

reward_kwargs:
  elliptical:
    alpha: 1.0
    beta: 1.0
    turn_off_elliptical_if_none_correct: True
    turn_off_elliptical_if_some_correct: False
    turn_off_elliptical_if_all_correct: False
    turn_off_elliptical_if_rollout_incorrect: False

actor_rollout_ref:
  rollout:
    val_kwargs:
      temperature: 1.0
      n: 128
      do_sample: True
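The `elliptical` settings above configure a leverage-score-style exploration bonus computed on projected representations. As a rough illustration of what such a bonus looks like, here is a minimal numpy sketch under our own assumptions about the config semantics (`sparse_dim` as the projected feature dimension, `lamb` as the ridge regularizer); it is not the verl implementation, and a dense Gaussian projection stands in for the sparse random matrix the `randomize_sparse_matrix` flag suggests.

```python
import numpy as np


def elliptical_bonus(features, sparse_dim=32, lamb=0.01, seed=0):
    """Leverage-score bonus b_i = phi_i^T (lamb*I + Phi^T Phi)^{-1} phi_i."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    # Random projection down to sparse_dim (dense Gaussian stand-in).
    proj = rng.standard_normal((d, sparse_dim)) / np.sqrt(sparse_dim)
    phi = features @ proj                          # (n, sparse_dim)
    cov = lamb * np.eye(sparse_dim) + phi.T @ phi  # regularized covariance
    # Per-row quadratic form phi_i^T cov^{-1} phi_i.
    return np.einsum("ij,jk,ik->i", phi, np.linalg.inv(cov), phi)


feats = np.random.default_rng(1).standard_normal((8, 64))
print(elliptical_bonus(feats))  # 8 scores, each in [0, 1)
```

Rows whose projected representation is poorly covered by the batch covariance receive a larger bonus, which is the sense in which the bonus rewards exploring under-visited regions of representation space.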
104 changes: 104 additions & 0 deletions recipe/rep_exp/data_preprocess/dapo_with_aime.py
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess DAPO dataset to parquet format
"""

import argparse
import os

import datasets
import numpy as np

from verl.utils.hdfs_io import copy, makedirs

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default="~/data/dapo-with-aime24")
    parser.add_argument("--hdfs_dir", default=None)
    parser.add_argument("--dapo_dataset_path", type=str, default="ftajwar/deduplicated_dapo_dataset")
    parser.add_argument("--aime24_part_1_dataset_path", type=str, default="MathArena/aime_2024_I")
    parser.add_argument("--aime24_part_2_dataset_path", type=str, default="MathArena/aime_2024_II")
    parser.add_argument("--train_size", type=int, default=4096)

    args = parser.parse_args()

    data_source = "math_dapo"

    # Load DAPO dataset for training
    dapo_dataset_path = args.dapo_dataset_path
    dapo_dataset = datasets.load_dataset(dapo_dataset_path, trust_remote_code=True)

    # Load AIME 2024 part 1 dataset for testing
    aime24_dataset_path_part_1 = args.aime24_part_1_dataset_path
    aime24_dataset_part_1 = datasets.load_dataset(aime24_dataset_path_part_1, trust_remote_code=True)

    # Load AIME 2024 part 2 dataset for testing
    aime24_dataset_path_part_2 = args.aime24_part_2_dataset_path
    aime24_dataset_part_2 = datasets.load_dataset(aime24_dataset_path_part_2, trust_remote_code=True)

    train_dataset = dapo_dataset["train"]
    train_dataset = train_dataset.select(np.random.choice(len(train_dataset), size=args.train_size, replace=False))

    dev_dataset_aime24_part_1 = aime24_dataset_part_1["train"]
    dev_dataset_aime24_part_2 = aime24_dataset_part_2["train"]
    dev_dataset = datasets.concatenate_datasets([dev_dataset_aime24_part_1, dev_dataset_aime24_part_2])

    instruction_following = "Let's think step by step and output the final answer within \\boxed{}."

    # add a row to each data item that represents a unique id
    def make_map_fn(split):
        def process_fn(example, idx):
            if "prompt" in example:
                question = example.pop("prompt")
            elif "problem" in example:
                question = example.pop("problem")
            else:
                raise ValueError(f"Unknown question type: {example}")

            question = question + " " + instruction_following

            if "answer" in example:
                solution = example.pop("answer")
            else:
                raise ValueError(f"Unknown answer type: {example}")
            solution = str(solution)

            data = {
                "data_source": data_source,
                "prompt": [{"role": "user", "content": question}],
                "ability": "math",
                "reward_model": {
                    "style": "rule",
                    "ground_truth": solution,
                },
                "extra_info": {"split": split, "index": idx},
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
    dev_dataset = dev_dataset.map(function=make_map_fn("test"), with_indices=True)

    # Expand a leading "~" in the path; argparse does not do this automatically.
    local_dir = os.path.expanduser(args.local_dir)
    hdfs_dir = args.hdfs_dir

    train_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))
    dev_dataset.to_parquet(os.path.join(local_dir, "dev.parquet"))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)

        copy(src=local_dir, dst=hdfs_dir)
112 changes: 112 additions & 0 deletions recipe/rep_exp/data_preprocess/gsm8k.py
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the GSM8k dataset to parquet format
"""

import argparse
import os
import re

import datasets
import numpy as np

from verl.utils.hdfs_io import copy, makedirs


def extract_solution(solution_str):
    solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default=None, help="Deprecated; use --local_save_dir instead.")
    parser.add_argument("--hdfs_dir", default=None)
    parser.add_argument("--local_dataset_path", default=None, help="The local path to the raw dataset, if it exists.")
    parser.add_argument(
        "--local_save_dir", default="~/data/gsm8k", help="The save directory for the preprocessed dataset."
    )

    args = parser.parse_args()
    local_dataset_path = args.local_dataset_path

    data_source = "openai/gsm8k"

    if local_dataset_path is not None:
        dataset = datasets.load_dataset(local_dataset_path, "main")
    else:
        dataset = datasets.load_dataset(data_source, "main")

    train_dataset = dataset["train"]
    test_dataset = dataset["test"]

    instruction_following = 'Let\'s think step by step and output the final answer after "####".'

    # add a row to each data item that represents a unique id
    def make_map_fn(split):
        def process_fn(example, idx):
            question_raw = example.pop("question")

            question = question_raw + " " + instruction_following

            answer_raw = example.pop("answer")
            solution = extract_solution(answer_raw)
            data = {
                "data_source": data_source,
                "prompt": [
                    {
                        "role": "user",
                        "content": question,
                    }
                ],
                "ability": "math",
                "reward_model": {"style": "rule", "ground_truth": solution},
                "extra_info": {
                    "split": split,
                    "index": idx,
                    "answer": answer_raw,
                    "question": question_raw,
                },
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
    test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True)
    # split test into dev and test by picking a random subset of 512 examples
    all_test_indices = list(range(len(test_dataset)))
    np.random.shuffle(all_test_indices)
    dev_dataset = test_dataset.select(all_test_indices[:512])
    test_dataset = test_dataset.select(all_test_indices[512:])

    hdfs_dir = args.hdfs_dir
    local_save_dir = args.local_dir
    if local_save_dir is not None:
        print("Warning: Argument 'local_dir' is deprecated. Please use 'local_save_dir' instead.")
    else:
        local_save_dir = args.local_save_dir
    # Expand a leading "~" in the path; argparse does not do this automatically.
    local_save_dir = os.path.expanduser(local_save_dir)

    train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
    test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)

        copy(src=local_save_dir, dst=hdfs_dir)