# [recipe, algo] feat: Representation-based Exploration (RepExp) #4278
Status: Closed
**Commits (10):**
- `a9fd825` add RepExp to recipe (jens321)
- `99dc1c2` remove test script (jens321)
- `2bbfd57` add RepExp to docs (jens321)
- `8f0faca` change explicit cuda calls to use device api (jens321)
- `b77cbe6` rearrange init_workers (jens321)
- `df1eb6a` update scripts and main setup and trainer to latest verl changes (jens321)
- `4e1833a` add scikit-learn installation in readme (jens321)
- `db3d17f` aime dataset process bug fix (jens321)
- `86d081d` fix TASK to $TASK in sample script (jens321)
- `ed1722a` fix TASK to $TASK in docs (jens321)
**New file** (68 lines added):

# Recipe: Representation-Based Exploration (RepExp)

Last updated: 11/14/2025.

<div align="center">

Representation-Based Exploration for Language Models: <br> From Test-Time to Post-Training

[📄 arXiv](https://arxiv.org/abs/2510.11686) [🌐 Website](https://rep-exp.github.io) [🐦 Twitter / X](https://x.com/JensTuyls/status/1978244454617128993)

</div>

## Installation 🔌

Our algorithm doesn't require anything beyond the base verl installation, which you can find [here](https://verl.readthedocs.io/en/latest/start/install.html).

## Running the Experiments 🚀

You can reproduce or extend our experiments by running the following commands:

```bash
# General format
sh recipe/rep_exp/train_elliptical.sh $TASK $SPARSE_DIM $BETA $SEED

# MATH
sh recipe/rep_exp/train_elliptical.sh math 32 0.01 42

# GSM8K
sh recipe/rep_exp/train_elliptical.sh gsm8k 32 0.01 42

# DAPO-WITH-AIME
sh recipe/rep_exp/train_elliptical.sh dapo-with-aime24 128 0.01 42
```

where `$TASK` is the task name, `$SPARSE_DIM` is the sparse dimension, `$BETA` is the beta parameter, and `$SEED` is the random seed.

## Evaluation 📊

Once training is done, you can evaluate the model on the test set in two steps.

1. Merge the model checkpoint.

   This is necessary because the checkpoint is saved in multiple shards (depending on the number of GPUs), and they need to be merged into a single checkpoint.

   ```bash
   sh recipe/rep_exp/model_merge.sh /path/to/global_step_X/actor  # where X is the global step of the checkpoint with the best pass@1 on dev
   ```

2. Evaluate the merged model.

   ```bash
   sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf  # where X is the global step of the checkpoint with the best pass@1 on dev
   ```

The results are saved as a JSON file in a folder named `eval`.

## Citation 📝

```bibtex
@article{tuyls2025representation,
  title={Representation-Based Exploration for Language Models: From Test-Time to Post-Training},
  author={Tuyls, Jens and Foster, Dylan J and Krishnamurthy, Akshay and Ash, Jordan T},
  journal={arXiv preprint arXiv:2510.11686},
  year={2025}
}
```

## Contact 📬

If you have any questions or suggestions, feel free to reach out at [jtuyls@princeton.edu](mailto:jtuyls@princeton.edu).
**New file** (66 lines added):

<div align="center">

# Representation-Based Exploration for Language Models: <br> From Test-Time to Post-Training

[📄 arXiv](https://arxiv.org/abs/2510.11686) [🌐 Website](https://rep-exp.github.io) [🐦 Twitter / X](https://x.com/JensTuyls/status/1978244454617128993)

</div>

## Installation 🔌

Besides the base verl installation, which you can find [here](https://verl.readthedocs.io/en/latest/start/install.html), the only package to install is scikit-learn:

```bash
pip install scikit-learn
```

## Running the Experiments 🚀

You can reproduce or extend our experiments by running the following commands:

```bash
# General format
sh recipe/rep_exp/train_elliptical.sh $TASK $SPARSE_DIM $BETA $SEED

# MATH
sh recipe/rep_exp/train_elliptical.sh math 32 0.01 42

# GSM8K
sh recipe/rep_exp/train_elliptical.sh gsm8k 32 0.01 42

# DAPO-WITH-AIME
sh recipe/rep_exp/train_elliptical.sh dapo-with-aime24 128 0.01 42
```

where `$TASK` is the task name, `$SPARSE_DIM` is the sparse dimension, `$BETA` is the beta parameter, and `$SEED` is the random seed.

## Evaluation 📊

Once training is done, you can evaluate the model on the test set in two steps.

1. Merge the model checkpoint.

   This is necessary because the checkpoint is saved in multiple shards (depending on the number of GPUs), and they need to be merged into a single checkpoint.

   ```bash
   sh recipe/rep_exp/model_merge.sh /path/to/global_step_X/actor  # where X is the global step of the checkpoint with the best pass@1 on dev
   ```

2. Evaluate the merged model.

   ```bash
   sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf  # where X is the global step of the checkpoint with the best pass@1 on dev
   ```

The results are saved as a JSON file in a folder named `eval`.

## Citation 📝

```bibtex
@article{tuyls2025representation,
  title={Representation-Based Exploration for Language Models: From Test-Time to Post-Training},
  author={Tuyls, Jens and Foster, Dylan J and Krishnamurthy, Akshay and Ash, Jordan T},
  journal={arXiv preprint arXiv:2510.11686},
  year={2025}
}
```

## Contact 📬

If you have any questions or suggestions, feel free to reach out at [jtuyls@princeton.edu](mailto:jtuyls@princeton.edu).
**New file** (33 lines added):

```yaml
hydra:
  searchpath:
    - file://verl/trainer/config

defaults:
  - ppo_trainer
  - _self_

reward_model:
  elliptical:
    enable: True
    lamb: 0.01
    normalization: none  # none, rnd, z_score
    reward_type: leave_one_out  # leave_one_out, leverage
    sparse_dim: 512
    randomize_sparse_matrix: True
    persist_covariance: False

  reward_kwargs:
    elliptical:
      alpha: 1.0
      beta: 1.0
      turn_off_elliptical_if_none_correct: True
      turn_off_elliptical_if_some_correct: False
      turn_off_elliptical_if_all_correct: False
      turn_off_elliptical_if_rollout_incorrect: False

actor_rollout_ref:
  rollout:
    val_kwargs:
      temperature: 1.0
      n: 128
      do_sample: True
```
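The `elliptical` block above configures an elliptical exploration bonus. As a rough, hypothetical sketch (not the recipe's actual implementation — names and shapes here are illustrative assumptions), an elliptical bonus scores a representation `phi` by how novel it is under the inverse covariance of previously seen representations, `b(phi) = beta * sqrt(phi^T Sigma^{-1} phi)`:

```python
import math

def mat_vec(m, v):
    # Multiply a square matrix (list of rows) by a vector.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def elliptical_bonus(phi, cov_inv, beta=1.0):
    # b(phi) = beta * sqrt(phi^T Sigma^{-1} phi): large when phi points in a
    # direction the running covariance of past representations has rarely seen.
    # `cov_inv` stands in for (Sigma + lamb * I)^{-1}; `beta` scales the bonus,
    # mirroring the config fields above. All names here are illustrative.
    u = mat_vec(cov_inv, phi)
    return beta * math.sqrt(sum(p * x for p, x in zip(phi, u)))

# With Sigma^{-1} = I (no history yet), the bonus reduces to the L2 norm of phi.
print(elliptical_bonus([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
```

In practice the covariance would be accumulated over batch representations (the `sparse_dim` field suggests representations are first projected to a lower-dimensional sparse space), but that bookkeeping is omitted here.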
**New file** (104 lines added):

```python
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess DAPO dataset to parquet format
"""

import argparse
import os

import datasets
import numpy as np

from verl.utils.hdfs_io import copy, makedirs

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default="~/data/dapo-with-aime24")
    parser.add_argument("--hdfs_dir", default=None)
    parser.add_argument("--dapo_dataset_path", type=str, default="ftajwar/deduplicated_dapo_dataset")
    parser.add_argument("--aime24_part_1_dataset_path", type=str, default="MathArena/aime_2024_I")
    parser.add_argument("--aime24_part_2_dataset_path", type=str, default="MathArena/aime_2024_II")
    parser.add_argument("--train_size", type=int, default=4096)

    args = parser.parse_args()

    data_source = "math_dapo"

    # Load DAPO dataset for training
    dapo_dataset_path = args.dapo_dataset_path
    dapo_dataset = datasets.load_dataset(dapo_dataset_path, trust_remote_code=True)

    # Load AIME 2024 part 1 dataset for testing
    aime24_dataset_path_part_1 = args.aime24_part_1_dataset_path
    aime24_dataset_part_1 = datasets.load_dataset(aime24_dataset_path_part_1, trust_remote_code=True)

    # Load AIME 2024 part 2 dataset for testing
    aime24_dataset_path_part_2 = args.aime24_part_2_dataset_path
    aime24_dataset_part_2 = datasets.load_dataset(aime24_dataset_path_part_2, trust_remote_code=True)

    train_dataset = dapo_dataset["train"]
    train_dataset = train_dataset.select(np.random.choice(len(train_dataset), size=args.train_size, replace=False))

    dev_dataset_aime24_part_1 = aime24_dataset_part_1["train"]
    dev_dataset_aime24_part_2 = aime24_dataset_part_2["train"]
    dev_dataset = datasets.concatenate_datasets([dev_dataset_aime24_part_1, dev_dataset_aime24_part_2])

    instruction_following = "Let's think step by step and output the final answer within \\boxed{}."

    # add a row to each data item that represents a unique id
    def make_map_fn(split):
        def process_fn(example, idx):
            if "prompt" in example:
                question = example.pop("prompt")
            elif "problem" in example:
                question = example.pop("problem")
            else:
                raise ValueError(f"Unknown question type: {example}")

            question = question + " " + instruction_following

            if "answer" in example:
                solution = example.pop("answer")
            else:
                raise ValueError(f"Unknown answer type: {example}")
            solution = str(solution)

            data = {
                "data_source": data_source,
                "prompt": [{"role": "user", "content": question}],
                "ability": "math",
                "reward_model": {
                    "style": "rule",
                    "ground_truth": solution,
                },
                "extra_info": {"split": split, "index": idx},
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
    dev_dataset = dev_dataset.map(function=make_map_fn("test"), with_indices=True)

    local_dir = args.local_dir
    hdfs_dir = args.hdfs_dir

    train_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))
    dev_dataset.to_parquet(os.path.join(local_dir, "dev.parquet"))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)

        copy(src=local_dir, dst=hdfs_dir)
```
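For reference, each row emitted by `process_fn` in the script above is a plain dict with a fixed schema. A toy instance (the question and answer values here are made up) looks like:

```python
# Illustrative record in the shape produced by the DAPO preprocessing script.
# The question/answer content is invented; only the schema matches the script.
row = {
    "data_source": "math_dapo",
    "prompt": [
        {
            "role": "user",
            "content": "What is 2 + 2? Let's think step by step and output "
                       "the final answer within \\boxed{}.",
        }
    ],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "4"},
    "extra_info": {"split": "train", "index": 0},
}

# The reward function keys off `reward_model.style` and `ground_truth`.
print(sorted(row.keys()))
```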
**New file** (112 lines added):

```python
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the GSM8k dataset to parquet format
"""

import argparse
import os
import re

import datasets
import numpy as np

from verl.utils.hdfs_io import copy, makedirs


def extract_solution(solution_str):
    solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default=None, help="The save directory for the preprocessed dataset.")
    parser.add_argument("--hdfs_dir", default=None)
    parser.add_argument("--local_dataset_path", default=None, help="The local path to the raw dataset, if it exists.")
    parser.add_argument(
        "--local_save_dir", default="~/data/gsm8k", help="The save directory for the preprocessed dataset."
    )

    args = parser.parse_args()
    local_dataset_path = args.local_dataset_path

    data_source = "openai/gsm8k"

    if local_dataset_path is not None:
        dataset = datasets.load_dataset(local_dataset_path, "main")
    else:
        dataset = datasets.load_dataset(data_source, "main")

    train_dataset = dataset["train"]
    test_dataset = dataset["test"]

    instruction_following = 'Let\'s think step by step and output the final answer after "####".'

    # add a row to each data item that represents a unique id
    def make_map_fn(split):
        def process_fn(example, idx):
            question_raw = example.pop("question")

            question = question_raw + " " + instruction_following

            answer_raw = example.pop("answer")
            solution = extract_solution(answer_raw)
            data = {
                "data_source": data_source,
                "prompt": [
                    {
                        "role": "user",
                        "content": question,
                    }
                ],
                "ability": "math",
                "reward_model": {"style": "rule", "ground_truth": solution},
                "extra_info": {
                    "split": split,
                    "index": idx,
                    "answer": answer_raw,
                    "question": question_raw,
                },
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
    test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True)
    # split test into dev and test by picking random subset of 512 examples
    all_test_indices = range(len(test_dataset))
    all_test_indices = list(all_test_indices)
    np.random.shuffle(all_test_indices)
    dev_dataset = test_dataset.select(all_test_indices[:512])
    test_dataset = test_dataset.select(all_test_indices[512:])

    hdfs_dir = args.hdfs_dir
    local_save_dir = args.local_dir
    if local_save_dir is not None:
        print("Warning: Argument 'local_dir' is deprecated. Please use 'local_save_dir' instead.")
    else:
        local_save_dir = args.local_save_dir

    train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
    test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)

        copy(src=local_save_dir, dst=hdfs_dir)
```
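The `extract_solution` helper in the script above pulls the final answer from GSM8K's `#### <number>` suffix and strips thousands separators. A quick standalone check of its behavior:

```python
import re

def extract_solution(solution_str):
    # Mirrors the helper above: grab the '#### <number>' suffix GSM8K uses,
    # then drop the marker and any comma thousands separators.
    solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution

# Commas in the matched number are removed; the sign is preserved.
print(extract_solution("She earns 18 * 68 = 1,224 dollars.\n#### 1,224"))  # 1224
print(extract_solution("#### -72"))  # -72
```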
The default value for
--local_dircontains a tilde (~), which is not automatically expanded byargparse. This will result in creating a directory named~in the current working directory, instead of using the user's home directory. To fix this, you should expand the user's home directory path.