
Interaction Grounded Learning


Note: this reduction is experimental and is subject to change

Paper: https://arxiv.org/pdf/2106.04887.pdf

VW's learning algorithm minimizes loss, and the contextual bandit input format specifically calls for a cost. However, in reinforcement learning and contextual bandit settings it is common to label data points with rewards, which the agent wishes to maximize. Accidentally supplying a reward in place of a cost in a contextual bandit label will result in incorrect learning, since minimizing that value is the opposite of what is intended.

This reduction tracks incoming labels and determines whether they are rewards or costs. Note that positive values are assumed to be rewards and negative values to be costs, so if your dataset is labelled such that positive values are costs used to penalize the learner, this automatic translation will not work.
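
For example, using the standard contextual bandit label format (action:value:probability) with some hypothetical features, the following example carries a positive value, so this reduction would treat 0.7 as a reward for action 1; a value of -0.7 would instead be treated as a cost:

```
1:0.7:0.5 | user_age=25 item=sports
```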

Internally it trains two models, one for rewards and one for costs. More specifically, it trains one model with the labels as supplied and the other with the negated values. When making a prediction, the model used is selected based on whether more rewards or more costs have been seen so far. A sketch of this selection logic is given below.
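
This is illustrative only and not VW's actual code; `make_cost_learner` and the `learn`/`predict` methods are hypothetical stand-ins for the underlying cost-minimizing contextual bandit learner.

```python
# Illustrative sketch (not VW's implementation): two copies of the same
# cost-minimizing learner are trained, one on the supplied label and one on
# its negation. Prediction uses the copy matching the majority label sign.

class SignAdaptivePolicy:
    def __init__(self, make_cost_learner):
        # Both learners minimize cost; they differ only in the label they see.
        self.cost_view = make_cost_learner()    # sees labels unchanged
        self.reward_view = make_cost_learner()  # sees labels negated
        self.rewards_seen = 0  # positive labels observed so far
        self.costs_seen = 0    # negative labels observed so far

    def learn(self, example, label_value):
        if label_value > 0:
            self.rewards_seen += 1
        elif label_value < 0:
            self.costs_seen += 1
        self.cost_view.learn(example, label_value)
        self.reward_view.learn(example, -label_value)

    def predict(self, example):
        # If rewards dominate, the negated-label copy is the correctly
        # trained cost minimizer; otherwise use the as-is copy.
        if self.rewards_seen > self.costs_seen:
            return self.reward_view.predict(example)
        return self.cost_view.predict(example)
```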

To enable this reduction, use the --igl option. It expects contextual bandit data as input and produces action scores.
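
A minimal invocation might look like the following, where cb_train.dat is a hypothetical file of contextual bandit examples; depending on your VW version, --igl may need to be combined with your usual contextual bandit options:

```
vw --igl -d cb_train.dat
```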

One important thing to note is that VW's reported loss calculation is not aware that this IGL reduction dynamically selects between two models, so if the labels are rewards the reported loss will be incorrect.
