• ST455 Reinforcement Learning, LSE
• CS 285: Deep Reinforcement Learning
• IE 3186 – Approximate Dynamic Programming, University of Pittsburgh
• CS 7642: Reinforcement Learning | OMSCS – Georgia Tech

## Operation of the Algorithm

The Q-Learning algorithm is one of the most widely used Reinforcement Learning algorithms. The following demonstrates, step by step, how it works.

This algorithm was first proposed by Chris Watkins in 1989 and further developed by Watkins and Peter Dayan, who proved its convergence in 1992. The authors' work was a significant advance in Reinforcement Learning research.
In widespread use, Q-Learning works by successively improving its estimates of the quality of certain actions in certain states. See the figure for the formula that represents how this algorithm works:

To predict the next actions when learning in a complex system, it is not possible to rely only on the immediate rewards, as this would give a limited view. Watkins and Dayan's (1992) proposal is therefore to look at the quality of the action: the new quality value results from the immediate reward added to the discounted future reward. Thus, we have:

• $Q^{\prime}(s, a)$ : the new quality value to be obtained.
• $Q(s, a)$ : the current quality value of the state–action pair.
• $\alpha$ : learning rate (how much weight is given to what will be learned). For example, if $\alpha$ is 1, the maximum value, the agent replaces its old estimate entirely with the new information. However, there is a trade-off between maximum and minimum learning: establishing maximum learning quickly, in a single operation, tends to be less effective than repeating smaller updates several times.
• $R(s, a)$ : reward for the current action.
• $\gamma$ : discount factor.
• $\max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)$ : the greatest value of $Q$ among the actions available in the next state.
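Assembling the terms defined above, the standard Q-Learning update rule can be written as:

```latex
Q'(s, a) = (1 - \alpha)\, Q(s, a) + \alpha \left( R(s, a) + \gamma \max_{a'} Q(s', a') \right)
```

The $(1-\alpha)$/$\alpha$ weighting makes explicit that the new estimate is a blend of the old quality value and the newly observed (reward plus discounted future) quality.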
The pseudocode of the Q-Learning algorithm is schematized below to simplify its understanding.
The algorithm can be interpreted as follows:
1. Initialize the Q-value table (i.e., the action-quality table).
2. Observe the current state $(s)$.
3. Based on the selection policy, choose an action $(a)$ to be performed.
4. Take action $(a)$, reach the new state $\left(s^{\prime}\right)$, and obtain the reward $(r)$.
5. Update the $Q$ value for the current state–action pair, using the observed reward and the maximum possible reward for the next state.
6. Repeat the process until a terminal state is reached.
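The six steps above can be sketched in Python. The chain-shaped toy environment, the reward values, and the ε-greedy selection policy below are illustrative assumptions, not part of the original text; only the tabular update itself follows the formula given earlier:

```python
import random

# Hypothetical toy environment: a 1-D chain of 5 states.
# The agent starts at state 0; reaching state 4 ends the episode.
N_STATES = 5
ACTIONS = [0, 1]   # 0 = move left, 1 = move right
ALPHA = 0.5        # learning rate
GAMMA = 0.9        # discount factor
EPSILON = 0.3      # exploration probability (assumed for this sketch)

def step(state, action):
    """Deterministic transition: reward 1 only on reaching the goal state."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

# Step 1: initialize the Q-value table.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for episode in range(200):
    s = 0                      # Step 2: observe the current state.
    done = False
    while not done:
        # Step 3: choose an action using an epsilon-greedy selection policy.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        # Step 4: take the action, reach the new state s', obtain the reward.
        s2, r, done = step(s, a)
        # Step 5: update Q using the reward and the best value of the next state.
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
        s = s2                 # Step 6: repeat until a terminal state is reached.

# After training, moving right should be valued above moving left in state 0.
print(Q[(0, 1)] > Q[(0, 0)])
```

Note that the update line is exactly the $(1-\alpha)$/$\alpha$ blend of the old estimate and the new observation described in the text.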
The idea of the Q-Learning algorithm is that an agent interacts with a given environment to obtain data that were not previously available. The agent then maps the set of states, actions, and rewards obtained into a table (the Q-Table). The value of each combination of state, action, and reward is called its "quality" (Q-Value).
The construction of the Q-Table occurs during the training phase, in which the agent's actions vary between Exploration and Exploitation. Once the Q-Table is learned, it becomes the agent's policy. In other words, the data contained in the Q-Table will dictate the policy of actions. Later, in the test step, the agent will choose the best action from this policy based on the values of $Q$.
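The test-step behavior described above amounts to a greedy lookup in the learned table. A minimal sketch, assuming a Q-Table stored as a dictionary keyed by (state, action) with purely illustrative numbers:

```python
# Hypothetical learned Q-Table for a 3-state, 2-action problem
# (the values are illustrative, not taken from the text).
Q_table = {
    (0, "left"): 0.2, (0, "right"): 0.8,
    (1, "left"): 0.1, (1, "right"): 0.9,
    (2, "left"): 0.5, (2, "right"): 0.3,
}

def policy(state, actions=("left", "right")):
    """Test-time policy: choose the action with the highest Q value."""
    return max(actions, key=lambda a: Q_table[(state, a)])

print(policy(0))  # the greedy choice in state 0
```

Unlike the training phase, no exploration is performed here: the table itself is the policy.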

## Construction of the Q-Table

To understand the construction of the Q-Table, let us illustrate, as in the figure, the behavior of a reward-seeking agent in an unknown environment.

$$\text{Quality} = (1 - \text{LearningRate}) \cdot \text{Current}\,Q(s, a) + \text{LearningRate} \cdot \bigl(\text{CurrentReward} + \text{DiscountRate} \cdot \max Q(s', a')\bigr)$$

Therefore, with the current values, you have:

Quality = (1 − C1)·B8 + C1·(B6 + C2·B7)
Quality = (1 − 0.5)·0 + 0.5·(−1 + 0.9·0)
Quality = −0.5
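The arithmetic can be checked directly. The cell meanings below are inferred from matching the spreadsheet formula to the update rule (C1 = learning rate, B8 = current Q value, B6 = current reward, C2 = discount rate, B7 = max Q of the next state):

```python
# Values inferred from the worked example above:
learning_rate = 0.5   # C1
current_q = 0.0       # B8: current Q(s, a)
reward = -1.0         # B6: reward for the current action
discount = 0.9        # C2
max_next_q = 0.0      # B7: max Q(s', a')

new_q = (1 - learning_rate) * current_q \
    + learning_rate * (reward + discount * max_next_q)
print(new_q)  # -0.5
```

With both Q values still at zero, the update simply moves the estimate halfway toward the −1 reward, giving −0.5.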

