## Multiarmed Bandit Problem

• Example of the exploration vs. exploitation dilemma
• $K$ independent gambling machines (armed bandits)
• Each machine has an unknown stationary probability distribution for generating the reward
• Observed rewards when playing machine $i$ : $X_{i, 1}, X_{i, 2}, \ldots$
• Policy $A$ chooses next machine based on previous sequence of plays and rewards

#### Regret of policy $A$

$$\mu^* n-\sum_{j=1}^K \mu_j\, \mathbb{E}\left[T_j(n)\right]$$
$\mu_i \quad$ expectation of machine $i$
$\mu^* \quad$ expectation of optimal machine
$T_j(n) \quad$ number of times machine $j$ was played during the first $n$ plays
The regret is the expected loss after $n$ plays due to the fact that the policy does not always play the optimal machine.
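As a minimal sketch of this definition, the following computes the regret for hypothetical (assumed) true means and expected play counts on a 3-armed bandit:

```python
# Hypothetical example: regret after n = 1000 plays on a 3-armed bandit.
# mu are assumed true expected rewards; counts play the role of E[T_j(n)].
mu = [0.3, 0.5, 0.7]        # mu* = 0.7
counts = [100, 200, 700]    # expected number of plays per machine
n = sum(counts)

mu_star = max(mu)
regret = mu_star * n - sum(m * c for m, c in zip(mu, counts))
print(regret)  # ≈ 80.0: the policy loses 0.4 per play on arm 0 and 0.2 on arm 1
```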

#### A learning algorithm using the empirical average rewards obtained for each arm

Empirical average
$$\hat{r}_{t, i} \triangleq \frac{1}{N_{t, i}} \sum_{k=1}^t r_{k, i}\, \mathbb{I}\left\{a_k=i\right\}, \quad \text{where } N_{t, i} \triangleq \sum_{k=1}^t \mathbb{I}\left\{a_k=i\right\}$$
and $r_{k, i}$ denotes the (random) reward the learner receives upon choosing arm $i$ at step $k$.
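The two quantities $\hat{r}_{t,i}$ and $N_{t,i}$ can be computed from a play history with running sums; a small sketch (the history values are made up):

```python
# Sketch: empirical average reward per arm from a play history.
# actions[k] is the arm chosen at step k+1, rewards[k] the observed reward.
actions = [0, 1, 0, 1, 1]
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]

K = 2
N = [0] * K       # N_{t,i}: number of times arm i was chosen so far
S = [0.0] * K     # running sum of rewards per arm
for a, r in zip(actions, rewards):
    N[a] += 1
    S[a] += r

# Empirical averages r_hat_{t,i} (0.0 for arms never played).
r_hat = [S[i] / N[i] if N[i] > 0 else 0.0 for i in range(K)]
print(r_hat)  # [0.5, 0.666...]
```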

Simply always choosing the arm with the best empirical average reward so far is not a good idea, because you might get stuck with a suboptimal arm: if the optimal arm underperforms at the beginning, so that its empirical average falls far below the true mean of a suboptimal arm, it will never be chosen again. A better strategy is to choose arms optimistically. Intuitively, as long as an arm has a significant chance of being the best, you play it every now and then. One simple way to implement this is the following UCB1 algorithm.

\begin{tabular}{l}
Algorithm UCB1 \\
\hline
input $\mathcal{A}$ \\
Choose each arm once to obtain an initial estimate. \\
for $t=1, \ldots$ do \\
$\quad$ Choose arm $a_t=\arg \max_{i \in \mathcal{A}}\left\{\hat{r}_{t-1, i}+\sqrt{\frac{2 \ln t}{N_{t-1, i}}}\right\}$. \\
end for
\end{tabular}
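A runnable sketch of UCB1 on Bernoulli arms (the arm means and horizon below are made-up illustration values, not from the text):

```python
import math
import random

def ucb1(arm_means, T, seed=0):
    """Run UCB1 for T steps on Bernoulli arms with the given (assumed) means.

    Returns the play counts N_{T,i} per arm."""
    rng = random.Random(seed)
    K = len(arm_means)
    counts = [0] * K    # N_{t,i}: plays per arm
    sums = [0.0] * K    # cumulative reward per arm

    def pull(i):
        # Bernoulli reward with success probability arm_means[i].
        return 1.0 if rng.random() < arm_means[i] else 0.0

    # Initialization: choose each arm once to obtain an initial estimate.
    for i in range(K):
        sums[i] += pull(i)
        counts[i] = 1

    for t in range(K + 1, T + 1):
        # UCB index: empirical mean plus exploration bonus sqrt(2 ln t / N).
        ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(K)]
        a = max(range(K), key=lambda i: ucb[i])
        sums[a] += pull(a)
        counts[a] += 1
    return counts
```

With a clear gap between the arms, the suboptimal arm is played only on the order of $\ln T$ times, e.g. `ucb1([0.2, 0.8], 2000)` concentrates almost all plays on the second arm.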

## Upper bound on the regret of UCB1

Theorem 1.

The expected regret of UCB1 after $T$ time steps is at most
$$\mathbb{E} L_T(\mathrm{UCB1}) \leq \sum_{i: r(i)<r^*} \frac{8 \ln T}{r^*-r(i)}+5 \sum_i\left(r^*-r(i)\right).$$

Proof By Wald’s identity the expected regret can be written as
$$\mathbb{E} L_T=\mathbb{E} \sum_{t=1}^T\left(r^*-r_t\right)=\sum_i\left(r^*-r(i)\right) \mathbb{E} N_{T, i},$$
so that we focus on bounding $\mathbb{E} N_{T, i}$. Thus, let $i$ be an arbitrary suboptimal arm, and consider when it will be chosen by the algorithm. Write $B_{t, s}=\sqrt{(2 \ln t) / s}$ for the bonus value at step $t$ after $s$ observations. Note that for fixed values $t, s, s_i \in \mathbb{N}$, under the assumption that $N_{t, i}=s_i$ and (the count of the optimal arm) $N_{t, *}=s$, we have by the Hoeffding bound that
$$\begin{aligned} \mathbb{P}\left(\hat{r}_{t, i} \geq r(i)+B_{t, s_i}\right) &\leq e^{-4 \ln t}=t^{-4}, \\ \mathbb{P}\left(\hat{r}_{t, *} \leq r^*-B_{t, s}\right) &\leq e^{-4 \ln t}=t^{-4}. \end{aligned}$$
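To spell out the Hoeffding step: for $s$ i.i.d. rewards in $[0,1]$, Hoeffding's inequality bounds the deviation probability by $\exp(-2 s \varepsilon^2)$, and plugging in $\varepsilon = B_{t,s} = \sqrt{(2\ln t)/s}$ gives

```latex
\mathbb{P}\bigl(\hat{r}_{t,i} \geq r(i) + B_{t,s}\bigr)
  \leq \exp\bigl(-2 s\, B_{t,s}^2\bigr)
  = \exp\Bigl(-2 s \cdot \frac{2\ln t}{s}\Bigr)
  = e^{-4\ln t} = t^{-4}.
```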
Accordingly we may assume that (taking care of the contribution of the error probabilities to $\mathbb{E} N_{T, i}$ below)

$$\begin{aligned} \hat{r}_{t, i} &< r(i)+B_{t, N_{t, i}}, \\ r^* &< \hat{r}_{t, *}+B_{t, N_{t, *}}. \end{aligned}$$
Now note that for $s \geq\left\lceil(8 \ln T) /\left(r^*-r(i)\right)^2\right\rceil$ it holds that
$$2 B_{t, s} \leq r^*-r(i),$$
so that after arm $i$ has been chosen $\left\lceil(8 \ln T) /\left(r^*-r(i)\right)^2\right\rceil$ times we get that
$$\begin{aligned} \hat{r}_{t, i}+B_{t, N_{t, i}} &< r(i)+2 B_{t, N_{t, i}} \leq r^* \\ &< \hat{r}_{t, *}+B_{t, N_{t, *}}. \end{aligned}$$
This shows that once both inequalities hold, after $\left\lceil(8 \ln T) /\left(r^*-r(i)\right)^2\right\rceil$ samples from arm $i$ the algorithm won't choose it again. Taking the error probabilities into account, arm $i$ may be played once at each step $t$ at which either inequality fails. Summing over all possible values of $t$, $N_{t, i}$, and $N_{t, *}$, this shows that
$$\mathbb{E} N_{T, i} \leq\left\lceil\frac{8 \ln T}{\left(r^*-r(i)\right)^2}\right\rceil+\sum_{\tau \geq 1} \sum_{s \leq \tau} \sum_{s_i \leq \tau} 2 \tau^{-4} .$$
Noting that the sum converges to a value $<4$ proves the regret bound.
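The final convergence claim can be checked directly: for each $\tau$ the inner double sum has $\tau^2$ identical terms, so the triple sum equals $\sum_{\tau \geq 1} 2\tau^{-2} = \pi^2/3 \approx 3.29 < 4$. A quick numerical sanity check:

```python
import math

# Each tau contributes tau^2 terms of size 2*tau^(-4), i.e. 2/tau^2 in total,
# so the triple sum collapses to sum_{tau>=1} 2/tau^2 = 2 * pi^2/6 = pi^2/3.
total = sum(2.0 / tau**2 for tau in range(1, 10**6 + 1))
print(total, math.pi**2 / 3)  # both ≈ 3.2899, comfortably below 4
```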

The UCB1 algorithm is actually not the first algorithm employing optimism in the face of uncertainty to deal with the exploration-exploitation dilemma, nor the first that uses confidence intervals for that purpose.

There are also so-called distribution-independent bounds that do not depend on problem parameters like the "gaps" $r^*-r(i)$. In general, such bounds cannot be logarithmic in $T$ anymore, as the gaps may be of order $1 / \sqrt{T}$, resulting in bounds that are $O(\sqrt{K T})$.

