# C2.1 Multi Armed Bandits

# Intro

  • The particular non-associative, evaluative feedback problem that we explore is a simple version of the k-armed bandit problem.

# A k-armed bandit problem

  • You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
  • Each of the k actions has an expected or mean reward given that it is selected, called the value of that action. Denote the action selected on time step t as A_t and the corresponding reward as R_t. The value of an arbitrary action a, denoted q_*(a), is the expected reward given that a is selected: q_*(a) = \mathbb E[R_t \mid A_t = a] (a minimal environment sketch follows this list).
  • We do not know q_*(a) and thus we denote the estimated value of action a at time step t as Q_t(a)
  • At any time step there is at least one action whose estimated value is greatest; these are the greedy actions, and all others are non-greedy. Selecting a greedy action is exploiting; selecting a non-greedy one is exploring.
  • Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.
  • In any specific case, whether it is better to explore or exploit depends in a complex way on the precise values of the estimates, uncertainties, and the number of remaining steps.
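A minimal sketch of such a stationary bandit environment. The Gaussian choices for q_*(a) and for the rewards, and the class name `KArmedBandit`, are illustrative assumptions rather than anything fixed by the notes:

```python
import numpy as np

class KArmedBandit:
    """Stationary k-armed bandit: each action a pays a noisy reward around a fixed mean q_*(a)."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # True action values q_*(a); drawn once from N(0, 1) here (an assumption for illustration).
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def step(self, action):
        # Reward drawn from a stationary distribution that depends only on the selected action,
        # here N(q_*(action), 1) (again an illustrative assumption).
        return self.rng.normal(self.q_star[action], 1.0)
```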

# Action-value methods

  • Methods that estimate the values of actions and use those estimates to make action-selection decisions are collectively called action-value methods.
  • One natural estimate is the sum of rewards received when a was taken prior to t, divided by the number of times a was taken prior to t: Q_t(a)=\frac{\sum_{i=1}^{t-1}R_i \cdot \mathbb 1_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb 1_{A_i=a}}
  • As the denominator grows to infinity, Q_t(a) converges to q_*(a) by the law of large numbers. We call this the sample-average method for estimating action values, because each estimate is an average of the sample of relevant rewards.
  • The greedy action-selection method is A_t = \underset a{argmax}\;Q_t(a)
  • An alternative that forces exploration is \epsilon-greedy: behave greedily most of the time, but with small probability \epsilon select an action uniformly at random from all the actions (see the sketch after this list).
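A minimal sketch of the sample-average estimate combined with \epsilon-greedy selection on the environment above. The function name, the incremental form of the average, and breaking ties via np.argmax are illustrative choices; setting epsilon = 0 recovers pure greedy selection:

```python
import numpy as np

def epsilon_greedy_run(bandit, steps=1000, epsilon=0.1, seed=None):
    """Run epsilon-greedy with sample-average action-value estimates; return (Q, rewards)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.k)             # Q_t(a): estimates, initialised to 0 before any data
    N = np.zeros(bandit.k, dtype=int)  # number of times each action has been selected
    rewards = np.empty(steps)

    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(bandit.k))  # explore: uniform random action
        else:
            a = int(np.argmax(Q))            # exploit: greedy action A_t = argmax_a Q_t(a)
        r = bandit.step(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental form of the sample average
        rewards[t] = r
    return Q, rewards
```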

# 10 armed testbed

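Assuming this heading refers to the standard testbed from Sutton and Barto (a suite of 2000 randomly generated 10-armed bandit problems, with q_*(a) drawn from N(0, 1) and rewards from N(q_*(a), 1)), here is a minimal sketch that reuses the KArmedBandit and epsilon_greedy_run sketches above to produce an average-reward curve:

```python
import numpy as np

def run_testbed(n_problems=2000, steps=1000, epsilon=0.1, seed=0):
    """Average reward per step of epsilon-greedy over many independent 10-armed bandit problems."""
    avg_reward = np.zeros(steps)
    for p in range(n_problems):
        bandit = KArmedBandit(k=10, seed=seed + p)
        _, rewards = epsilon_greedy_run(bandit, steps=steps, epsilon=epsilon, seed=seed + p)
        avg_reward += rewards
    return avg_reward / n_problems  # curves for different epsilon values can then be compared
```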