网页升级紧急通知中_咨询公司排名前十_品牌传播推广方案_百度搜索关键词技巧

在强化学习中，时序差分（TD）、蒙特卡洛（MC）与动态规划（DP）这三个方法是强化学习（RL）中最常见的话题，其中，TD 与 DP 一样都采用自举的思想，故无需等待交互的最终结果，而可以基于已得到的其他状态的估计值来更新当前状态的价值函数。在这篇文章中，我就将介绍TD，具体将包括其基本概念、相关算法伪代码以及TD与DP的异同对比。

一、基本概念

时序差分学习（Temporal Difference Learning, TD Learning）是强化学习中一种重要的价值函数估计方法，它结合了 MC 与 DP 的优点，简单来说就是 TD 通过在每个时间步利用即时奖励和对未来奖励的估计之间的差异来逐步更新对状态价值的预测。

1.1 时序与差分

时序（temporal）与差分（difference）直接反应了TD的核心思想，其中时序即表示了算法时间时间这一维度上进行的，并且是一种逐时间步进行而非等到整个幕（有时也称为“剧集”）结束才进行更新。

而差分是指当前估计值与新估计值之间的差异，具体来说，TD学习通过计算即时奖励加上后续状态的价值估计（未来的预测），与当前状态价值估计之间的差异来更新价值函数。

1.2 自举

TD 使用了自举的思想，即可以用一个估计值去更新另一个估计值，也就是说，它可以根据后续它会根据后续状态的价值估计来调整当前状态的价值估计。所以这样的方法允许TD在每个时间步后立即利用部分信息去更新，从而提高时间效率。

1.3 在线学习与离线学习

在线学习是指模型在接收到数据后立即进行更新。这样的学习对于那些需要得到及时响应的模型十分重要，比如金融市场方面的数据模型，也因此，这样的学习方法对于内存较友好，换言之，即在得到数据后即利用而不需要将之存储而占用大量存储资源，此外，该方法还可以有较高的计算效率。

离线学习是指模型基于预先收集的批量数据进行训练与更新，即当有新数据来时先存储着，然后等待某一时机再去将之进行批量更新。如此一来，离线学习与上述的在线学习便在优点部分有较大的差异，离线学习对于内存不友好，且学习效率低，但相对的，离线学习可以提供较稳定的性能以及可以进行深度训练并且容易实现并行化处理。

1.4 同轨策略与离轨策略

同轨策略（On-policy）是指学习和行为遵循相同的策略。这意味着 agent 根据当前正在评估和改进的策略来采取行动，并基于这些交互产生的数据进行学习。

离轨策略（Off-policy）是指学习和行为可以遵循不同的策略。agent 可以按照一个行为策略（Behavior Policy）采取行动，但同时学习另一个目标策略（Target Policy）。这种分离允许智能体从任何类型的行为中学习，即使这些行为不是由目标策略产生的。

1.5 双学习

首先我们知道在之前的一些算法中会产生正偏差，而为了避免它就诞生了双学习，对于双学习的思想简单来说就是：我们首先将样本划分为两个集合，并用它们学习两个独立的对真实价值 q(a) 的估计 $Q_1(a)$ 和 $Q_2(a)$ ，然后任用一个来估计，比如 $Q_1(a)$ ，来确定最大的动作 $A^*=argmax_a Q_1(a)$ ，然后再用另一个来估计得到 $Q_2(A^*)=Q_2(argmax_aQ_1(a))$ 并且由于 $E[Q_2(A^*)|=]q(A^*)$ ，所以这个估计是无偏的。如果我们将 $Q_1(a)$ 与 $Q_2(a)$ 交换也是同理。

二、相关算法

2.1 TD(0)算法

最简单的 TD 算法会在状态转移到 $S_{t+1}$ 并收到 $R_{t+1}$ 的收益时做出如下的更新：

$V(S_t)\leftarrow V(S_t)+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_t)]$

也就是说，TD 更新的目标是 $R_{t+1}+\gamma V(S_{t+1})$ 。而这种TD算法被称为 TD（0），或单步 TD。

如果用伪代码来表示就是如下这样：

input: π (policy)
output: V(s) (state-value function)initialize V(s) for all states s to some initial value (e.g., 0)while true:s <- initial state  # Start from the initial statewhile not terminal state:A_n <- action(π, s)  # Choose an action based on the policy πR, s' <- A(A_n)  # Take the action and observe the reward and next stateV(s) <- V(s) + α[R + γ * V(s') - V(s)]  # Update the value functions <- s'  # Move to the next state

2.2 Sarsa算法

在同轨策略下，并且仍然遵循广义策略迭代（GPI）的模式，在评估和预测部分使用时序差分方法。然后我们定义一个五元组，即 $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ ，然后利用这个五元组对于如下这个算法命名即得到 Sarsa 算法。具体关于这个算法，同样用如上的伪代码来描述：

input: π (policy)
output: Q(s, a) (state-action value function)initialize Q(s, a) for all states s and actions a to some initial value (e.g., 0)while true:s <- initial state  # Start from the initial statea <- action(π, s)  # Choose an initial action based on the policy πwhile not terminal state:a' <- action(π, s')  # Choose the next action based on the policy πR, s' <- A(a)  # Take the action a and observe the reward R and next state s'Q(s, a) <- Q(s, a) + α[R + γ * Q(s', a') - Q(s, a)]  # Update the Q-values <- s'  # Move to the next statea <- a'  # Update the current action to the next action# Optionally, update the policy π based on the learned Q-values

2.3 Q学习

在离轨策略下的时序差分控制算法即为 Q 学习，其定义为：

$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha[R_{t+1}+\gamma max_a Q(S_{t+1,a})-Q(S_t,A_t)]$

具体关于这个算法可以通过如下的伪代码来理解：

input: π (exploration policy, e.g., ε-greedy)
output: Q(s, a) (state-action value function)initialize Q(s, a) for all states s and actions a to some initial value (e.g., 0)while true:s <- initial state  # Start from the initial statewhile not terminal state:a <- action(π, s)  # Choose an action based on the exploration policy πR, s' <- A(a)  # Take the action a and observe the reward R and next state s'a_max <- argmax_a' Q(s', a')  # Choose the action with the highest Q-value for the next stateQ(s, a) <- Q(s, a) + α[R + γ * Q(s', a_max) - Q(s, a)]  # Update the Q-values <- s'  # Move to the next state# Optionally, update the policy π based on the learned Q-values

2.4 期望Sarsa

期望 Sarsa 是 Sarsa 算法的一个变种，它与Q学习十分类似，但把对于下一个状态动作二元组最大值换成取期望的学习算法，具体可以通过如下的伪代码来理解：

input: π (policy)
output: Q(s, a) (state-action value function)initialize Q(s, a) for all states s and actions a to some initial value (e.g., 0)while true:s <- initial state  # Start from the initial statea <- action(π, s)  # Choose an initial action based on the policy πwhile not terminal state:a' <- action(π, s')  # Choose the next action based on the policy πR, s' <- A(a)  # Take the action a and observe the reward R and next state s'# Compute the expected value of the next stateexpected_value <- 0for a'' in actions:expected_value <- expected_value + π(a'' | s') * Q(s', a'')Q(s, a) <- Q(s, a) + α[R + γ * expected_value - Q(s, a)]  # Update the Q-values <- s'  # Move to the next statea <- a'  # Update the current action to the next action# Optionally, update the policy π based on the learned Q-values

2.5 双Q学习

双 Q 学习是指将双学习的思想运用到 Q 学习后得到的新算法，同样地，还可以将双学习的思想运行用到 Sarsa 算法中以及期望 Sarsa 算法中去。在这里展示了双 Q 学习的伪代码，具体如下：

input: π (exploration policy, e.g., ε-greedy)
output: Q1(s, a), Q2(s, a) (two state-action value functions)initialize Q1(s, a) and Q2(s, a) for all states s and actions a to some initial value (e.g., 0)while true:s <- initial state  # Start from the initial statewhile not terminal state:a <- action(π, s)  # Choose an action based on the exploration policy πR, s' <- A(a)  # Take the action a and observe the reward R and next state s'# Choose which Q function to use for the updateif random() < 0.5:Q_target <- Q1Q_online <- Q2else:Q_target <- Q2Q_online <- Q1# Choose the action with the highest Q value from the other Q functiona_max <- argmax_a' Q_online(s', a')# Update the selected Q functionQ_target(s, a) <- Q_target(s, a) + α[R + γ * Q_target(s', a_max) - Q_target(s, a)]s <- s'  # Move to the next state# Optionally, update the policy π based on the learned Q-values

三、DP与MC与TD的对比

我们可以通过如下的表格来理解这三者的区别，表格如下：

方法名称	是否需要环境模型	学习方法	是否应用自举思想
TD	不需要	在线学习	是
MC	不需要	离线学习	否
DP	需要	离线学习	是

此上