What: Evaluate exploration algorithms

The goal of explore eval is to evaluate different exploration algorithms using data from a logged policy. The eval policy does not learn from every logged example; instead, rejection sampling is applied in order to run a counterfactual simulation:

  • for each example in the logged data
    • get the pmf predicted by the policy being evaluated (eval policy)
    • for the action that was logged (with logged probability p_log), find the eval policy's probability for that action, p_eval
    • calculate a threshold p_eval / p_log
    • flip a biased coin using that threshold
    • depending on the outcome, either have the eval policy learn from this example or skip to the next example (see the sketch after this list)
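A minimal sketch of that loop in Python, assuming a hypothetical `eval_policy` object with `predict` (returning a pmf over the actions) and `learn` methods; these names are illustrative and do not come from the VW codebase:

```python
import random

def process_logged_example(eval_policy, context, actions,
                           logged_action, p_log, logged_cost):
    """One step of the counterfactual simulation sketched above (hypothetical helper)."""
    pmf = eval_policy.predict(context, actions)   # pmf of the eval policy over the actions
    p_eval = pmf[logged_action]                   # eval policy probability of the logged action
    threshold = p_eval / p_log                    # acceptance threshold

    # biased coin flip: accept with probability min(threshold, 1);
    # if threshold > 1 the example is always accepted (a "violation")
    if random.random() < threshold:
        eval_policy.learn(context, actions, logged_action, logged_cost, p_log)
        return True    # example used to update the eval policy
    return False       # example rejected, move on to the next one
```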

The --multiplier CLI argument can be provided; it is applied to the threshold (threshold *= multiplier) and affects the rejection sampling rate.
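For illustration, the same coin flip with the multiplier applied (`--multiplier` is the real CLI flag; the helper and its argument names are made up):

```python
import random

def accept_example(p_eval, p_log, multiplier=1.0):
    """Coin flip from the sketch above, with the multiplier applied to the threshold.

    A smaller multiplier lowers the acceptance probability and so raises the rejection rate."""
    threshold = multiplier * (p_eval / p_log)
    return random.random() < threshold
```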

For all examples, the average loss is calculated with the IPS (inverse propensity scoring) estimator, logged_cost * (p_eval / p_log), and reported at the end.
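As a rough illustration of the IPS estimate (assuming one (logged_cost, p_eval, p_log) tuple per example; this is not VW's internal data layout):

```python
def ips_average_loss(records):
    """IPS estimate of the eval policy's average loss.

    records: iterable of (logged_cost, p_eval, p_log) tuples (hypothetical layout)."""
    total, n = 0.0, 0
    for logged_cost, p_eval, p_log in records:
        total += logged_cost * (p_eval / p_log)   # importance-weighted cost
        n += 1
    return total / n if n else 0.0
```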

Once run, explore eval reports some information about the sampling rate:

update count = <N>
violation count = <V>
final multiplier = <M>

where:

  • update count is the number of examples that were used to update the policy being evaluated
  • violation count is the number of examples whose threshold was > 1, meaning the eval policy assigned the logged action a larger probability than the logging policy did; such examples are always used to update the eval policy
  • final multiplier is the multiplier value in effect at the end of the run

We can see that for eval policies that are similar to the logged policy, the rejection rate will be lower than for eval policies that are very different from the logged policy. This can result in very different confidence intervals for different eval policies.

One way to tackle this is to tune the multiplier per policy so that all policies are evaluated with a similar rejection rate, as in the sketch below.
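One rough heuristic for picking the multiplier, sketched as a hypothetical helper that rescales it toward a shared acceptance rate (the proportional update is an assumption for illustration, not something taken from explore eval):

```python
def suggest_multiplier(update_count, total_examples,
                       target_accept_rate, current_multiplier=1.0):
    """Rescale the multiplier so the observed acceptance rate
    (update_count / total_examples) moves toward the shared target rate."""
    observed_rate = update_count / total_examples
    if observed_rate == 0:
        return current_multiplier   # no accepted examples yet; leave the multiplier alone
    return current_multiplier * (target_accept_rate / observed_rate)
```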
