## Epsilon-Greedy, Optimistic Initial Values

ppsev
Posts: 1
Joined: Wed Jun 23, 2021 2:50 am

### Epsilon-Greedy, Optimistic Initial Values

Hi, I'm doing the "Return of the multi-armed bandit" videos and I have some doubts regarding the videos related to Epsilon-Greedy and Optimistic Initial Values.

At first, I thought that EG should include an updating Epsilon, so we can start exploring and then just exploiting through time, but I noticed that this wasn't the case. An updating epsilon was proposed afterwards, though (and I really enjoyed that!).

Then, using optimistic initial values removed every part that included Epsilon. Why? I thought the techniques would be cumulative, building on what we saw in the EG videos (for example, combining optimistic initial values with a decaying epsilon).

And maybe I got something wrong, but are any of these actually used in real world problems? Or more complex variants are the ones that are used and we should just keep this as part of our learning curve? And if they are used, how? Because, in a real world problem we won't know the BANDIT_PROBABILITIES variables. In that case, are these algorithms used just to estimate those probabilities, or to actually make the decisions that try to maximize the total reward?

Thank you! I'm really loving all the content

lazyprogrammer
Posts: 85
Joined: Sat Jul 28, 2018 3:46 am

### Re: Epsilon-Greedy, Optimistic Initial Values

Thanks for your questions.

> so we can start exploring and then just exploiting through time, but I noticed that this wasn't the case

Actually there is a lecture that demonstrates this. I believe it may be in the Legacy bandit lectures (if you are taking the deeplearningcourses.com version).

However, note that the epsilon-greedy algorithm is normally presented with a constant epsilon, so that's the version covered first.
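
As a rough sketch of the difference (not the course's code), the only thing that changes between the two variants is how epsilon is computed at step `t`. The win rates in `PROBS`, the `1/sqrt(t)` schedule, and the function and parameter names are assumptions made up for illustration.

```python
import random

def epsilon_greedy(true_probs, n_trials, epsilon_at, seed=0):
    """Run epsilon-greedy on Bernoulli arms; return (estimates, pull counts)."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    estimates = [0.0] * n_arms  # sample-mean reward per arm
    counts = [0] * n_arms       # number of pulls per arm
    for t in range(1, n_trials + 1):
        if rng.random() < epsilon_at(t):
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental sample-mean update.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

PROBS = [0.2, 0.5, 0.75]  # hypothetical win rates, known only to the simulator

# Constant epsilon: explores 10% of the time forever.
const_est, const_counts = epsilon_greedy(PROBS, 10_000, lambda t: 0.1)

# Decaying epsilon (assumed schedule): explores a lot early, rarely later.
decay_est, decay_counts = epsilon_greedy(PROBS, 10_000, lambda t: 1.0 / t ** 0.5)

print("constant:", const_counts, [round(e, 3) for e in const_est])
print("decaying:", decay_counts, [round(e, 3) for e in decay_est])
```

With a constant epsilon the agent keeps spending a fixed fraction of pulls on random arms; with a decaying one that fraction shrinks over time, which is the "explore first, exploit later" behavior you described.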

> Then, using optimistic initial values removed every part that included Epsilon. Why?

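It's a totally separate algorithm. Notice that epsilon is not needed because the exploration mechanism is different: every arm starts with an estimate that is too high, so a purely greedy policy keeps trying arms until their estimates fall toward their true means.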
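
A minimal sketch of that mechanism, under some assumptions: Bernoulli rewards, an initial value of 5.0 (anything well above the maximum reward of 1 would do), and the initial value counted as one pseudo-observation. The names are hypothetical and this is not the course's code.

```python
import random

def optimistic_greedy(true_probs, n_trials, initial_value=5.0, seed=0):
    """Purely greedy bandit with optimistic initial estimates (no epsilon)."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    estimates = [initial_value] * n_arms  # deliberately too high
    counts = [1] * n_arms  # assumption: initial value counts as one sample
    pulls = [0] * n_arms   # real pulls only
    for _ in range(n_trials):
        # Always greedy: an arm that has not been tried much still looks
        # great, so it gets picked without any random exploration.
        arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        pulls[arm] += 1
        # Each real reward drags the inflated estimate down toward the truth.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, pulls

oiv_est, oiv_pulls = optimistic_greedy([0.2, 0.5, 0.75], 10_000)
print(oiv_pulls, [round(e, 3) for e in oiv_est])
```

Nothing stops you from adding an epsilon on top of this, but the optimism alone already forces every arm to be tried, which is why the algorithm is usually presented without one.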

> but are any of these actually used in real world problems?

Yes, there is a lecture stating exactly this with various examples. I assure you that "online advertising", which is the main example used, is a billion-dollar industry.

It's very real.

If you have any questions about it, I'd be happy to answer them.

In fact, the applications are discussed in the 2nd lecture in the section: "Applications of the Explore-Exploit Dilemma".

It's mentioned again at the end: "Bandit Summary, Real Data, and Online Learning".

> Or more complex variants are the ones that are used and we should just keep this as part of our learning curve?

In addition to the applications already mentioned (via the 2 lectures above), it's applied in the subsequent sections of the course (Monte Carlo, TD).

> Because, in a real world problem we won't know the BANDIT_PROBABILITIES variables. In this case, these algorithms are used to just estimate those probabilities or are used to actually take the decisions to try to maximize the total reward?

This is discussed extensively in the lecture "Bandit Summary, Real Data, and Online Learning (06:29)", in the 2nd half.

Remember that the purpose of studying synthetic data is to write the algorithm and to confirm it converges to the correct answer.

Recall that in the "real world", you don't know the correct answer. Therefore, it's not possible to verify.

It's only possible to verify with synthetic data or data for which you know the answer.

Furthermore, remember that "all data is the same".

Therefore, the code does not change whether you are using synthetic data, online advertising data, or any other kind of data.

So you have learned everything you need to know about how to use these algorithms in the real world, whether it's in biology, finance, or online advertising.

The same code works, no matter what data you plug in.
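
To make that concrete, here is a hedged sketch (hypothetical class and function names, not the course's code) in which the agent never sees the true probabilities. It only chooses an arm and is told the reward, so it both makes the decisions and estimates the means as a by-product. `synthetic_reward` is a stand-in; in a live system that one call would be replaced by whatever produces the real reward, for example showing an ad and recording whether it was clicked.

```python
import random

class EpsilonGreedyAgent:
    """Chooses arms and learns from rewards; knows nothing about the data source."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.estimates = [0.0] * n_arms  # learned mean reward per arm
        self.counts = [0] * n_arms       # pulls per arm

    def choose(self):
        """Make the decision: explore with probability epsilon, else exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.estimates))
        return max(range(len(self.estimates)), key=lambda a: self.estimates[a])

    def update(self, arm, reward):
        """Refine the estimate for the arm that was just played."""
        self.counts[arm] += 1
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]

def synthetic_reward(arm, rng, probs=(0.2, 0.5, 0.75)):
    """Stand-in data source with made-up win rates, hidden from the agent."""
    return 1.0 if rng.random() < probs[arm] else 0.0

agent = EpsilonGreedyAgent(n_arms=3)
env_rng = random.Random(1)
for _ in range(5000):
    arm = agent.choose()
    reward = synthetic_reward(arm, env_rng)  # swap this line for real data
    agent.update(arm, reward)

print(agent.counts, [round(e, 3) for e in agent.estimates])
```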