
What: Evaluate exploration algorithms

The goal of explore_eval is to evaluate different exploration algorithms using data logged by an existing policy. The eval policy does not learn from all of the logged examples; instead, rejection sampling is used to perform counterfactual simulation:

- for each example in the logged data
  - get the PMF of the prediction of the policy being evaluated (eval policy)
  - for the action that was logged (which has logged probability `p_log`), find the eval policy's probability for that action, `p_eval`
  - calculate a threshold `p_eval / p_log`
  - flip a biased coin using that threshold
  - depending on the outcome, either have the eval policy learn from this example or skip to the next example (a sketch of this loop follows the list)
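
A minimal sketch of this loop in Python, for illustration only. The names `eval_policy`, `logged_examples`, and the fields `context`, `logged_action`, `p_log` are assumptions, not VW's API (the actual implementation is a C++ reduction inside VW):

```python
import random

def explore_eval(eval_policy, logged_examples, multiplier=1.0):
    """Counterfactual simulation via rejection sampling (sketch)."""
    update_count = 0
    violation_count = 0
    for ex in logged_examples:
        pmf = eval_policy.predict(ex.context)       # PMF over actions
        p_eval = pmf[ex.logged_action]              # eval prob of the logged action
        threshold = (p_eval / ex.p_log) * multiplier
        if threshold > 1.0:
            violation_count += 1                    # coin flip always accepts
        if random.random() < threshold:             # biased coin flip
            eval_policy.learn(ex)                   # accept: learn from this example
            update_count += 1
        # otherwise reject: skip to the next example
    return update_count, violation_count
```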

The `--multiplier` CLI argument can be provided; it is applied to the threshold (`threshold *= multiplier`) and affects the rejection sampling rate.

For all examples, the average loss is calculated using the IPS (inverse propensity scoring) technique, `logged_cost * (p_eval / p_log)`, and reported at the end.
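
For reference, a sketch of that estimate under the same assumed example fields as above (`logged_cost` is also an assumed field name):

```python
def ips_average_loss(eval_policy, logged_examples):
    """Inverse propensity scoring (IPS) estimate of the eval policy's loss."""
    total = 0.0
    for ex in logged_examples:
        pmf = eval_policy.predict(ex.context)
        p_eval = pmf[ex.logged_action]
        total += ex.logged_cost * (p_eval / ex.p_log)  # importance-weighted cost
    return total / len(logged_examples)
```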

Once run, explore_eval reports some information about the sampling rate:

```
update count = <N>
violation count = <V>
final multiplier = <M>
```

where:

- update count is the number of examples that were used to update the policy being evaluated
- violation count is the number of examples whose threshold was > 1, meaning the eval policy assigned a larger probability to the logged action than the logging policy did; such examples are always used to update the eval policy
- final multiplier is the multiplier in effect at the end of the run

We can see that for eval policies that are similar to the logged policy, the rejection rate will be lower than for eval policies that are very different from it. This can result in very different confidence intervals for different eval policies.

One way to tackle this is to tune the multiplier separately for each policy. Another is to use the `--block_size` CLI argument.

The examples are processed in blocks of `block_size`. Once an example in a block is accepted for an update, no other example in that block is used to update the policy. If no example in a block is accepted, the quota rolls over and the next block may update more than one example. This keeps the acceptance count from exceeding the target while sampling evenly from the entire example set (not just the first N examples); see the sketch below.
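
A rough sketch of this quota logic, again with hypothetical names (`accept_fn` stands in for the biased coin flip described earlier; the real logic lives inside VW's explore_eval reduction):

```python
def block_rejection_sample(examples, accept_fn, block_size):
    """Allow at most one update per block, rolling unused quota forward."""
    quota = 0
    updates = []
    for i, ex in enumerate(examples):
        if i % block_size == 0:
            quota += 1                   # each new block grants one update
        if quota > 0 and accept_fn(ex):  # coin flip, gated by remaining quota
            updates.append(ex)
            quota -= 1                   # quota spent: rest of block is skipped
    return updates
```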

The best way to use this argument is to first run explore_eval (without `--block_size` set) on all of the exploration policies being evaluated, then find the smallest update count and set the block size as `num_of_logged_examples / smallest_update_count`.

This way, all policies should be evaluated with a similar rejection rate.
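
For example, with hypothetical numbers: if a first pass over 1,000,000 logged examples yields update counts of 250,000, 100,000, and 50,000 for three candidate policies, the smallest is 50,000, so:

```python
num_logged_examples = 1_000_000   # assumed values for illustration
smallest_update_count = 50_000
block_size = num_logged_examples // smallest_update_count  # -> 20
```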
