An Introduction to Reinforcement Learning

by Thomas Simonini

Reinforcement learning is an important type of Machine Learning where an agent learns how to behave in an environment by performing actions and observing the results.

In recent years, we’ve seen a lot of improvements in this fascinating area of research. Examples include DeepMind and the Deep Q-learning architecture in 2014, beating the champion of the game of Go with AlphaGo in 2016, and OpenAI and the PPO in 2017, amongst others.

In this series of articles, we will focus on learning the different architectures used today to solve Reinforcement Learning problems. These will include Q-learning, Deep Q-learning, Policy Gradients, Actor Critic, and PPO.

In this first article, you’ll learn:

  • What Reinforcement Learning is, and how rewards are the central idea
  • The three approaches of Reinforcement Learning
  • What the “Deep” in Deep Reinforcement Learning means

It’s really important to master these elements before diving into implementing Deep Reinforcement Learning agents.

The idea behind Reinforcement Learning is that an agent will learn from the environment by interacting with it and receiving rewards for performing actions.

Learning from interaction with the environment comes from our natural experiences. Imagine you’re a child in a living room. You see a fireplace, and you approach it.

It’s warm, it’s positive, you feel good (Positive Reward +1). You understand that fire is a positive thing.

But then you try to touch the fire. Ouch! It burns your hand (Negative reward -1). You’ve just understood that fire is positive when you are a sufficient distance away, because it produces warmth. But get too close to it and you will be burned.

That’s how humans learn, through interaction. Reinforcement Learning is just a computational approach of learning from action.

The Reinforcement Learning Process

Let’s imagine an agent learning to play Super Mario Bros as a working example. The Reinforcement Learning (RL) process can be modeled as a loop that works like this:

  • Our Agent receives state S0 from the Environment (in our case we receive the first frame of our game (state) from Super Mario Bros (environment))

  • Based on that state S0, the agent takes an action A0 (our agent will move right)

  • The Environment transitions to a new state S1 (new frame)

  • The Environment gives some reward R1 to the agent (not dead: +1)

This RL loop outputs a sequence of state, action and reward.

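To make this loop concrete, here is a minimal sketch in Python. It uses the classic Gym API (reset returning an observation, step returning a 4-tuple) and a random agent as stand-ins; CartPole is used only because a Super Mario Bros environment needs an extra package, so treat these details as illustrative assumptions rather than the article’s code:

```python
import gym  # classic Gym API (pre-0.26); newer Gymnasium versions differ slightly

env = gym.make("CartPole-v1")   # stand-in environment
state = env.reset()             # the agent receives the initial state S0
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()            # the agent picks an action At
    state, reward, done, info = env.step(action)  # environment returns St+1, Rt+1
    total_reward += reward                        # accumulate the rewards

print("Cumulative reward for this episode:", total_reward)
```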

The goal of the agent is to maximize the expected cumulative reward.

The central idea of the Reward Hypothesis

Why is the goal of the agent to maximize the expected cumulative reward?

Well, Reinforcement Learning is based on the idea of the reward hypothesis. All goals can be described by the maximization of the expected cumulative reward.

That’s why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.

The cumulative reward at each time step t can be written as the sum of all the rewards collected from that step onward, which is equivalent to a summation over future rewards.
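
In standard reinforcement-learning notation this return is usually written G_t; reconstructing the textbook form of the two expressions referred to above (the article shows them only as images):

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots = \sum_{k=0}^{T} R_{t+k+1}$$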

However, in reality, we can’t just add the rewards like that. The rewards that come sooner (in the beginning of the game) are more probable to happen, since they are more predictable than the long term future reward.

Let’s say your agent is this small mouse and your opponent is the cat. Your goal is to eat the maximum amount of cheese before being eaten by the cat.

As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

As a consequence, the reward near the cat, even if it is bigger (more cheese), will be discounted. We’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

We define a discount rate called gamma. It must be between 0 and 1.

  • The larger the gamma, the smaller the discount. This means the learning agent cares more about the long term reward.
  • On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).

Our discounted cumulative expected reward is:
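
(Reconstructed in standard notation, since the article shows this formula as an image; gamma is the discount rate defined above.)

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$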

Put simply, each reward is discounted by gamma raised to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.
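
As a small numerical illustration of this discounting (a sketch in Python, not code from the article), here is how a discounted return could be computed from a list of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, discounting each one by gamma raised to its time step."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Later rewards contribute less and less to the total:
print(discounted_return([1, 1, 1, 1]))       # 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_return([1, 1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
```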

Episodic or Continuing tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuous.

Episodic task

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and New States.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario and ends when you’re killed or you reach the end of the level.

Continuous tasks

These are tasks that continue forever (no terminal state). In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.

For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state. The agent keeps running until we decide to stop it.

Monte Carlo vs TD Learning methods

We have two ways of learning:

  • Collecting the rewards at the end of the episode and then calculating the maximum expected future reward: Monte Carlo Approach

  • Estimating the rewards at each step: Temporal Difference Learning

Monte Carlo

When the episode ends (the agent reaches a “terminal state”), the agent looks at the total cumulative reward to see how well it did. In the Monte Carlo approach, rewards are only received at the end of the game.

Then, we start a new game with the added knowledge. The agent makes better decisions with each iteration.

Let’s take an example:

If we take the maze environment:

  • We always start at the same starting point.
  • We terminate the episode if the cat eats us or if we move > 20 steps.
  • At the end of the episode, we have a list of States, Actions, Rewards, and New States.
  • The agent will sum the total rewards Gt (to see how well it did).
  • It will then update V(St) using the Monte Carlo update rule (sketched just below).
  • Then it starts a new game with this new knowledge.

By running more and more episodes, the agent will learn to play better and better.

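A minimal sketch of that Monte Carlo value update in Python (the toy episode, the learning rate alpha, and the state names are illustrative assumptions, not code from the article):

```python
from collections import defaultdict

def monte_carlo_update(V, episode, alpha=0.1, gamma=0.9):
    """Update state-value estimates V after a finished episode.

    `episode` is a list of (state, reward) pairs gathered during play.
    Each visited state is nudged towards the return G observed from it:
    V(St) <- V(St) + alpha * (G - V(St)).
    """
    G = 0.0
    # Walk backwards so G accumulates the discounted future rewards.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

V = defaultdict(float)                       # value estimates, all start at 0
episode = [("s0", 0), ("s1", 0), ("s2", 1)]  # a toy episode ending with +1 cheese
monte_carlo_update(V, episode)
```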

Temporal Difference Learning: learning at each time step

TD Learning, on the other hand, will not wait until the end of the episode to update the maximum expected future reward estimation: it will update its value estimation V for the non-terminal states St occurring at that experience.

This method is called TD(0) or one step TD (update the value function after any individual step).

TD methods only wait until the next time step to update the value estimates. At time t+1 they immediately form a TD target using the observed reward Rt+1 and the current estimate V(St+1).

The TD target is an estimate: in fact, you update the previous estimate V(St) by moving it towards a one-step target.
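
In the usual notation, the TD(0) update described here is (standard textbook form, with learning rate alpha):

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

The bracketed term is the difference between the TD target R_{t+1} + \gamma V(S_{t+1}) and the current estimate, which is where the name “temporal difference” comes from.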

Exploration/Exploitation trade-off

Before looking at the different strategies to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.

  • Exploration is finding more information about the environment.
  • Exploitation is exploiting known information to maximize the reward.

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze there is a gigantic sum of cheese (+1000).

However, if we only focus on reward, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).

But if our agent does a little bit of exploration, it can find the big reward.

This is what we call the exploration/exploitation trade off. We must define a rule that helps to handle this trade-off. We’ll see in future articles different ways to handle it.

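One common rule, shown here only as an illustrative sketch ahead of those later articles, is epsilon-greedy: exploit the best-known action most of the time, but pick a random action with probability epsilon so the agent keeps exploring. The Q-value table and action names below are assumptions for the example:

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon explore a random action,
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))

actions = ["left", "right", "up", "down"]
q_values = {"left": 0.2, "right": 1.0, "up": 0.0, "down": -0.5}
print(epsilon_greedy(q_values, actions))  # usually "right", occasionally random
```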

Three approaches to Reinforcement Learning

Now that we defined the main elements of Reinforcement Learning, let’s move on to the three approaches to solve a Reinforcement Learning problem. These are value-based, policy-based, and model-based.

Value Based

In value-based RL, the goal is to optimize the value function V(s).

The value function is a function that tells us the maximum expected future reward the agent will get at each state.

The value of each state is the total amount of the reward an agent can expect to accumulate over the future, starting at that state.

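In standard notation (a reconstruction of the textbook definition, since the article shows it as a figure), the value of a state s under a policy π is the expected discounted return from that state onward:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_t = s \right]$$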

The agent will use this value function to select which state to choose at each step. The agent takes the state with the biggest value.

In the maze example, at each step we will take the biggest value: -7, then -6, then -5 (and so on) to attain the goal.

Policy Based

In policy-based RL, we want to directly optimize the policy function π(s) without using a value function.

The policy is what defines the agent behavior at a given time.

We learn a policy function. This lets us map each state to the best corresponding action.

We have two types of policy:

  • Deterministic: a policy at a given state will always return the same action.
  • Stochastic: outputs a probability distribution over actions.

As we can see here, the policy directly indicates the best action to take at each step.
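
Written formally (standard notation, added here for reference), the two kinds of policy above are:

$$a = \pi(s) \quad \text{(deterministic)} \qquad \pi(a \mid s) = \mathbb{P}\left[ A_t = a \mid S_t = s \right] \quad \text{(stochastic)}$$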

Model Based

In model-based RL, we model the environment. This means we create a model of the behavior of the environment.

The problem is each environment will need a different model representation. That’s why we will not speak about this type of Reinforcement Learning in the upcoming articles.

Introducing Deep Reinforcement Learning

Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems — hence the name “deep.”

For instance, in the next article we’ll work on Q-Learning (classic Reinforcement Learning) and Deep Q-Learning.

You’ll see the difference is that in the first approach, we use a traditional algorithm to create a Q table that helps us find what action to take for each state.

In the second approach, we will use a Neural Network (to approximate the reward based on the state: the Q value).
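
To preview the difference (a rough sketch under assumed state and action counts, not the course code): a Q table is simply an array indexed by state and action, while Deep Q-Learning replaces that table with a function approximator that maps a state to one Q value per action.

```python
import numpy as np

n_states, n_actions = 16, 4

# Classic Q-learning: a lookup table with one Q value per (state, action) pair.
q_table = np.zeros((n_states, n_actions))
best_action = int(np.argmax(q_table[3]))  # greedy action for state 3

# Deep Q-learning replaces the table with a learned function. A minimal
# stand-in for the neural network: a single linear layer on a one-hot state.
weights = np.random.randn(n_states, n_actions) * 0.01

def q_network(state_index):
    one_hot = np.eye(n_states)[state_index]
    return one_hot @ weights  # vector of estimated Q values, one per action

print(q_network(3))
```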

Congrats! There was a lot of information in this article. Be sure to really grasp the material before continuing. It’s important to master these elements before entering the fun part: creating AI that plays video games.

Important: this article is the first part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus.

Next time we’ll work on a Q-learning agent that learns to play the Frozen Lake game.

If you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Cheers!

Deep Reinforcement Learning Course:

We’re making a video version of the Deep Reinforcement Learning Course with TensorFlow, where we focus on the implementation part with TensorFlow here.

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

Translated from: https://www.freecodecamp.org/news/an-introduction-to-reinforcement-learning-4339519de419/

