
Non Stationary Bandits

Posted: Wed Jan 13, 2021 4:48 pm
by jiyer
I reviewed the material on Non Stationary Bandits, and I understand that the formula below is a running calculation of an exponentially weighted average that favors recent data:

new_mean = (1 - alpha) * old_mean + alpha * x
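
To make sure I have the formula right, here is a minimal Python sketch of that update. The function name update_mean, the starting value of 0.0, and alpha = 0.1 are just illustrative choices of mine, not something from the course.

```python
def update_mean(old_mean, x, alpha=0.1):
    """One step of the exponentially weighted running mean.

    alpha close to 1 weights the newest reward heavily;
    alpha close to 0 changes the estimate slowly.
    """
    return (1 - alpha) * old_mean + alpha * x


# Feed a short stream of 0/1 rewards through the update, one at a time.
mean = 0.0  # assumed starting estimate
for reward in [1, 0, 1, 1, 0, 1]:
    mean = update_mean(mean, reward)
```

Because each step multiplies the old estimate by (1 - alpha), a reward seen k steps ago ends up with weight alpha * (1 - alpha)^k, which is why recent data dominates.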

I was trying to understand how I could adapt this formula to perform a running calculation of the parameters "a" and "b" of the beta distribution.
I haven't been able to figure this out yet. The closest I could get was to maintain a buffer containing the last N rewards and then use that to estimate "a" and "b".

The problem with this approach is that my posterior is always roughly the same width - it doesn't get "skinnier" or "fatter".
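
Here is a rough Python sketch of the buffer approach I described, assuming 0/1 (Bernoulli) rewards and a uniform Beta(1, 1) prior. The class name WindowedBeta and the helper beta_std are hypothetical names I made up for illustration.

```python
from collections import deque


class WindowedBeta:
    """Beta parameters estimated from only the last n 0/1 rewards."""

    def __init__(self, n=100):
        # deque with maxlen silently drops the oldest reward once full
        self.rewards = deque(maxlen=n)

    def update(self, x):
        self.rewards.append(x)

    def params(self):
        successes = sum(self.rewards)
        failures = len(self.rewards) - successes
        # Assumed Beta(1, 1) prior: a = 1 + successes, b = 1 + failures
        return 1 + successes, 1 + failures


def beta_std(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
```

Once the buffer is full, a + b is always N + 2 no matter what the rewards are. Since the standard deviation depends only on a + b and the mean a / (a + b), the width can only move with the mean, which I think is why the posterior never gets noticeably "skinnier" or "fatter".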

Ideally, if the rewards of my bandit are changing over time, I was hoping the posteriors would shift from "skinny" (old stable click-through rate) to "fat" (period of uncertainty) and then finally "skinny" again (new stable click-through rate).

It would be great if the training material could touch on this subject.
BTW - I just hope I didn't miss it if you already covered this in the training.