[recipe, algo] feat: Representation-based Exploration (RepExp) #4278
jens321 wants to merge 10 commits into verl-project:main
Conversation
Code Review
This PR introduces the RepExp algorithm. The changes are extensive, adding a new recipe with multiple components. The implementation looks mostly correct, but I've found several issues that need to be addressed. There are critical bugs in the data preprocessing scripts that will cause them to crash. Some scripts use ~ in default file paths, which is not robust. There are brittle path manipulations in the logging utilities that can lead to IndexError. The documentation contains copy-paste errors in example commands. Finally, there's a performance issue in the EllipticalRewardModelWorker where a tensor is repeatedly moved to the GPU.
```python
dataset = datasets.load_dataset(
    "parquet",
)
```
docs/algo/repexp.md (outdated diff)
2. Evaluate the merged model.

```bash
sh recipe/rep_exp/eval.sh TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```
The example evaluation command is incorrect. TASK is used as a literal string, but it should be the variable $TASK to match the shell command format. This will cause the command to fail if a user copies and pastes it.
```diff
- sh recipe/rep_exp/eval.sh TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
+ sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```
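To see why this matters, compare what the shell actually passes in each case (the `TASK=math` value below is just an illustration, not a value from the recipe):

```shell
TASK=math
# Without '$', the literal word TASK is handed to eval.sh as its first argument:
echo "sh recipe/rep_exp/eval.sh TASK"
# With '$', the shell substitutes the variable's value before running the command:
echo "sh recipe/rep_exp/eval.sh $TASK"
```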
recipe/rep_exp/README.md (outdated diff)
2. Evaluate the merged model.

```bash
sh recipe/rep_exp/eval.sh TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```
The example evaluation command is incorrect. TASK is used as a literal string, but it should be the variable $TASK to match the shell command format. This will cause the command to fail if a user copies and pastes it.
```diff
- sh recipe/rep_exp/eval.sh TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
+ sh recipe/rep_exp/eval.sh $TASK /path/to/global_step_X/actor/hf # where X is the global step of the checkpoint with the best pass@1 on dev
```
```python
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
dev_dataset = dev_dataset.map(function=make_map_fn("test"), with_indices=True)

local_dir = args.local_dir
```
The default value for --local_dir contains a tilde (~), which is not automatically expanded by argparse. This will result in creating a directory named ~ in the current working directory, instead of using the user's home directory. To fix this, you should expand the user's home directory path.
```diff
- local_dir = args.local_dir
+ local_dir = os.path.expanduser(args.local_dir)
```
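A minimal sketch of the failure and the fix (the default path below is hypothetical, not the script's actual default): argparse stores default strings verbatim, so the `~` survives until something expands it.

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Hypothetical default; argparse keeps the string verbatim, tilde included.
parser.add_argument("--local_dir", default="~/data/rep_exp")
args = parser.parse_args([])

# os.makedirs(args.local_dir) here would create a directory literally named '~'
# under the current working directory. Expand the home directory first:
local_dir = os.path.expanduser(args.local_dir)
```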
```python
test_dataset = test_dataset.filter(filter_dev_indices, with_indices=True)

local_dir = args.local_dir
```
The default value for --local_dir contains a tilde (~), which is not automatically expanded. This will lead to incorrect path resolution. You should use os.path.expanduser to correctly resolve the home directory.
```diff
- local_dir = args.local_dir
+ local_dir = os.path.expanduser(args.local_dir)
```
```python
if save_path is not None and save_path != "":
    self.experiment_name = save_path.split("/")[-2]
    self.checkpoint_type = save_path.split("/")[-1]
```
This path splitting logic is not robust. If save_path does not have the expected directory structure (e.g., it's a top-level directory or a file), save_path.split("/")[-2] will raise an IndexError. You should add checks to handle different path structures gracefully.
```diff
- if save_path is not None and save_path != "":
-     self.experiment_name = save_path.split("/")[-2]
-     self.checkpoint_type = save_path.split("/")[-1]
+ if save_path is not None and save_path != "":
+     parts = save_path.rstrip("/").split("/")
+     self.experiment_name = parts[-2] if len(parts) > 1 else f"{task}_untrained"
+     self.checkpoint_type = parts[-1]
```
```python
if model_path.endswith("actor/hf"):
    # Case where the model path is a saved checkpoint
    save_path = model_path.split("/")[-4:-2]
    save_path = "/".join(save_path)
```
This path parsing is brittle and assumes a fixed directory structure. If model_path does not match it, model_path.split("/")[-4:-2] fails silently (slicing never raises), yielding an incorrect or empty save_path that only surfaces later as a bad path. Consider using pathlib or more robust path parsing to avoid this.
```diff
  if model_path.endswith("actor/hf"):
      # Case where the model path is a saved checkpoint
-     save_path = model_path.split("/")[-4:-2]
-     save_path = "/".join(save_path)
+     path_parts = model_path.split("/")
+     if len(path_parts) >= 4:
+         save_path = "/".join(path_parts[-4:-2])
+     else:
+         save_path = ""
```
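A small sketch of the guarded parsing with `pathlib` (the checkpoint layout shown is hypothetical, not taken from the recipe):

```python
from pathlib import Path

model_path = "ckpts/my_exp/seed_0/global_step_100/actor/hf"  # hypothetical layout
parts = Path(model_path).parts
save_path = "/".join(parts[-4:-2]) if len(parts) >= 4 else ""
# Picks out the experiment/run components two levels above 'actor/hf'.
assert save_path == "seed_0/global_step_100"

# A too-shallow path now yields an explicit empty string instead of garbage:
shallow = Path("actor/hf").parts
assert ("/".join(shallow[-4:-2]) if len(shallow) >= 4 else "") == ""
```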
```python
# Parse task from config
train_file = config["data"]["train_files"][0]
task = train_file.split("/")[-2]
```
Hardcoding the index [-2] to extract the task name makes the code fragile. If the train_file path structure changes, this will either fail with an IndexError or extract the wrong directory name. Using pathlib would make this more robust. You would need to add from pathlib import Path at the top of the file.
```diff
- task = train_file.split("/")[-2]
+ task = Path(train_file).parent.name
```
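A quick sketch of why `Path.parent.name` is the safer extraction (the file paths here are made up for illustration):

```python
from pathlib import Path

train_file = "data/rep_exp/math/train.parquet"  # hypothetical train_files entry
task = Path(train_file).parent.name
assert task == "math"

# split("/")[-2] raises IndexError on a bare filename; pathlib degrades gracefully:
assert Path("train.parquet").parent.name == ""
```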
```python
if self.sparse_matrix is None:
    d = data.batch["mean_hidden_states"].shape[-1]
    sparse_matrix = self._construct_sparse_matrix(torch.randn(1, d), self.sparse_dim)
    if not self.randomize_sparse_matrix:
        self.sparse_matrix = sparse_matrix
else:
    sparse_matrix = self.sparse_matrix

mean_hidden_states = data.batch["mean_hidden_states"].to(get_device_id()).float()

# sparse project
mean_hidden_states = mean_hidden_states @ sparse_matrix.to(get_device_id())
```
There is a performance issue here. The sparse_matrix is moved to the GPU on every call to compute_rm_score inside the mean_hidden_states @ sparse_matrix.to(get_device_id()) operation. When randomize_sparse_matrix is False, the matrix is cached but remains on the CPU. To improve efficiency, the matrix should be moved to the GPU only once when it is created.
```diff
  if self.sparse_matrix is None:
      d = data.batch["mean_hidden_states"].shape[-1]
-     sparse_matrix = self._construct_sparse_matrix(torch.randn(1, d), self.sparse_dim)
+     sparse_matrix = self._construct_sparse_matrix(torch.randn(1, d), self.sparse_dim).to(get_device_id())
      if not self.randomize_sparse_matrix:
          self.sparse_matrix = sparse_matrix
  else:
      sparse_matrix = self.sparse_matrix

  mean_hidden_states = data.batch["mean_hidden_states"].to(get_device_id()).float()
  # sparse project
- mean_hidden_states = mean_hidden_states @ sparse_matrix.to(get_device_id())
+ mean_hidden_states = mean_hidden_states @ sparse_matrix
```
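`_construct_sparse_matrix` isn't shown in this diff, so its exact construction is an assumption, but Achlioptas-style sparse random projections are a standard way to cheaply compress hidden states before computing elliptical scores. A NumPy sketch of that idea, with the construction done once rather than per call; all names here are hypothetical:

```python
import numpy as np

def sparse_random_projection(d, k, s=3, seed=0):
    """Achlioptas-style projection: entries are +/-sqrt(s/k) with prob 1/(2s) each, else 0."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(d, k), p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return signs * np.sqrt(s / k)

# Build the projection once and cache it; per call, only the matmul remains.
P = sparse_random_projection(d=1024, k=128)
hidden = np.random.default_rng(1).standard_normal((4, 1024))  # batch of mean hidden states
projected = hidden @ P
assert projected.shape == (4, 128)
```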
@jens321 Hi, thanks for your contribution. We're moving recipes to a separate project, verl-project/verl-recipe; could you submit a PR to that project instead? #4283

@wuxibin89 Thanks for the quick reply! Sounds good, will go ahead and close this one then.
What does this PR do?
Add support for the training and evaluation of the RepExp method introduced in section 5 of the paper Representation-Based Exploration for Language Models: From Test-Time to Post-Training.
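For intuition on the method: elliptical bonuses score each sample by how novel its (projected) representation is relative to the batch, typically b_i = phi_i^T (lam*I + Phi^T Phi)^(-1) phi_i. Whether the paper uses exactly this form is an assumption here; the sketch below is illustrative only, with made-up shapes.

```python
import numpy as np

def elliptical_bonus(phi, lam=1.0):
    """phi: (n, k) matrix of per-sample features.
    Returns b_i = phi_i^T (lam*I + Phi^T Phi)^{-1} phi_i, a common exploration bonus."""
    k = phi.shape[1]
    cov = lam * np.eye(k) + phi.T @ phi
    inv = np.linalg.inv(cov)
    return np.einsum("ik,kl,il->i", phi, inv, phi)

phi = np.random.default_rng(0).standard_normal((16, 8))
bonus = elliptical_bonus(phi)
assert bonus.shape == (16,) and (bonus > 0).all() and (bonus < 1).all()

# Duplicating a sample makes its direction less novel, so its bonus shrinks:
bonus_dup = elliptical_bonus(np.vstack([phi, phi[:1]]))
assert bonus_dup[0] < bonus[0]
```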
Checklist Before Starting

- Title format: `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - Prepend `[BREAKING]` to the title for breaking changes, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
Please refer to Figure 2 in https://arxiv.org/abs/2510.11686 for evaluation results.
API and Usage Example
The general format for training with our method is as follows
where
`$TASK` is the task name, `$SPARSE_DIM` is the sparse dimension, `$BETA` is the beta parameter, and `$SEED` is the seed. For example, for training on MATH with the original parameters from the paper, one would do
Once training is done, one can evaluate the model on the test set in two steps.
This is necessary because the model checkpoint is saved in multiple shards (depending on the number of GPUs), and we need to merge them into a single checkpoint.
```bash
sh recipe/rep_exp/model_merge.sh /path/to/global_step_X/actor # where X is the global step of the checkpoint with the best pass@1 on dev
```

Design & Code Changes
All changes are contained in the `recipe/rep_exp` folder and summarized below:

- `rep_exp/main_rep_exp.py`: copy of `verl/trainer/main_ppo.py` but imports `EllipticalRewardModelWorker` instead of the standard `RewardModelWorker`
- `rep_exp/rep_exp_trainer.py`: copy of `verl/trainer/ppo/ray_trainer.py` but adds computing of hidden states and elliptical reward scores. In addition, a few lines of code help keep track of the best pass@1 seen so far on validation.
- `rep_exp/metric_utils.py`: adds and extends utility functions that provide additional metrics for our method that could be helpful for debugging.
- `rep_exp/workers/elliptical_reward_model_worker.py`: adds the `EllipticalRewardModelWorker` class, which provides functionality for (1) computing hidden states and (2) computing elliptical reward scores based on the hidden states.
- `rep_exp/reward_manager`: adds the `EllipticalRewardManager` class that handles combining the external reward (from a verifier) and the elliptical reward.
- `rep_exp/utils/tracking.py` and `rep_exp/utils/aggregate_logger.py`: adds the `JsonEvalLogger`, which can be used to log final evaluation results to a JSON file.
- `rep_exp/data_preprocess.py`: contains a script for each of MATH, GSM8K, and AIME 2024 that provides the logic for getting the dataset splits (train, dev, test).
- `rep_exp/reward_score/__init__.py`: copy of `verl/utils/reward_score/__init__.py` but uses `math_verify` for dapo training
- `rep_exp/plot_pass_at_k.py`: sample script with basic plotting code for a pass@k curve based on the logged JSON files saved after running the evaluation script.
- `rep_exp/config/rep_exp_trainer.yaml`: overwrites and adds any RepExp-specific configuration parameters.
- `rep_exp/train_elliptical.sh`: training script
- `rep_exp/model_merge.sh`: script to merge model checkpoints
- `rep_exp/eval.sh`: evaluation script

Checklist Before Submitting
Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI through the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)