Reinforcement Learning Notes: Classic Papers (Miscellaneous)

七月de风, 2023-03-28 (original post)

# Model-Free RL: Distributional RL

1. C51 (Categorical DQN)

2017: A Distributional Perspective on Reinforcement Learning

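Traditional RL methods model the expected return, i.e. they learn a value function. This paper instead models the distribution of the random return, i.e. it learns a value distribution.
The two are linked by a simple fact: the expectation of the random return's distribution is the expected return.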

the main object of our study is the random return $Z$ whose expectation is the value $Q$.

The Bellman equations for the value function and for the value distribution, side by side:
$Q(x, a) = \mathbb{E} R(x, a) + \gamma \mathbb{E} Q(X', A') \\ Z(x, a) \overset{D}{=} R(x, a) + \gamma Z(X', A')$

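After a series of theoretical results, the authors propose modelling the value distribution with a discrete distribution parametrized by $N \in \mathbb{N}$ and $V_{MIN}, V_{MAX} \in \mathbb{R}$. Its support is the set of atoms $z_i = V_{MIN} + i \Delta z$ for $0 \le i < N$, with $\Delta z = (V_{MAX} - V_{MIN}) / (N - 1)$, and the atom probabilities come from a parametric model $\theta$ through a softmax:
$Z_\theta(x, a) = z_i \quad \text{w.p.} \quad p_i(x, a) = \frac{e^{\theta_i(x, a)}}{\sum_j e^{\theta_j(x, a)}}$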

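The paper proposes projecting the sample Bellman update $\hat{\mathcal{T}} Z_{\theta}$ onto the support of $Z_\theta$, which effectively turns the Bellman update into multiclass classification. The paper illustrates one update with the distributional Bellman operator, in which $\Phi \hat{\mathcal{T}} Z_{\hat{\theta}}(x, a)$ is the projected update.

The training loss $\mathcal{L}_{x, a}(\theta)$ is the KL divergence $D_{KL} (\Phi \hat{\mathcal{T}} Z_{\hat{\theta}}(x, a) || Z_\theta(x, a))$, where the sample update shifts and discounts each atom $z_j$:
$\hat{\mathcal{T}} z_j := r + \gamma z_j$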

The algorithm built on the above is called the Categorical Algorithm.
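The projection step is the part of the Categorical Algorithm that is easiest to get wrong, so here is a minimal sketch of it in plain Python for a single transition. The function names `project_categorical` and `cross_entropy` and the three-atom example are illustrative assumptions, not code from the paper; a real agent would do this in batch on tensors.

```python
import math


def project_categorical(next_probs, reward, gamma, v_min, v_max):
    """Project the sample update r + gamma * z_j back onto the fixed atoms."""
    n = len(next_probs)
    delta_z = (v_max - v_min) / (n - 1)
    target = [0.0] * n
    for j, p_j in enumerate(next_probs):
        z_j = v_min + j * delta_z
        # Shift and discount the atom, then clip it to [v_min, v_max].
        tz_j = min(max(reward + gamma * z_j, v_min), v_max)
        # Fractional index of tz_j on the atom grid (clamped against rounding).
        b = min(max((tz_j - v_min) / delta_z, 0.0), n - 1.0)
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            target[lower] += p_j  # tz_j sits exactly on an atom
        else:
            # Split the mass between the two neighbouring atoms by proximity.
            target[lower] += p_j * (upper - b)
            target[upper] += p_j * (b - lower)
    return target


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy term of KL(target || predicted), used as the loss."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, predicted))


# Three atoms at 0, 1, 2; all next-state mass sits on the atom z = 1.
target = project_categorical([0.0, 1.0, 0.0], reward=0.5, gamma=1.0,
                             v_min=0.0, v_max=2.0)
print(target)  # [0.0, 0.5, 0.5]
```

The shifted atom lands at 1.5, halfway between the atoms 1 and 2, so its probability is split evenly between them. The projected target is treated as a constant, which is why minimizing the cross-entropy is enough to minimize the KL divergence.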

With $N = 51$ atoms the Categorical Algorithm performs markedly well, hence the name C51.

As in DQN, we use a simple $\epsilon$-greedy policy over the expected action-values; we leave as future work the many ways in which an agent could select actions on the basis of the full distribution.

2. QR-DQN

2017: Distributional Reinforcement Learning with Quantile Regression

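C51 first performs a heuristic projection step and then minimizes the KL divergence between the projected Bellman update and the prediction. This leaves a sizeable gap between the Wasserstein-metric theory and the practical algorithm.
This paper proposes quantile regression as a more thorough approach to distributional RL.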

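The paper replaces the parametrized distribution of C51 with a parametrized quantile distribution, a uniform mixture of $N$ Diracs whose locations $\theta_i(x, a)$ are learned:
$Z_\theta(x, a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(x, a)}$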

Compared with the original parametrization (C51), the parametrized quantile distribution has three benefits:

First, we are not restricted to prespecified bounds on the support, or a uniform resolution, potentially leading to significantly more accurate predictions when the range of returns vary greatly across states.

This also lets us do away with the unwieldy projection step present in C51, as there are no issues of disjoint supports. Together, these obviate the need for domain knowledge about the bounds of the return distribution when applying the algorithm to new tasks.

Finally, this reparametrization allows us to minimize the Wasserstein loss, without suffering from biased gradients, specifically, using quantile regression.

The quantile regression loss is defined as:
$\mathcal{L}_{QR}^{\tau}(\theta) := \mathbb{E}_{\hat{Z} \sim Z} [ \rho_{\tau} (\hat{Z} - \theta)], ~ \text{where} \\ \rho_{\tau} (u) = u(\tau - \delta_{\{u < 0\}}), \forall u \in \mathbb{R}$

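Because the quantile regression loss is not smooth at zero, the paper uses a modified quantile Huber loss. It acts as an asymmetric squared loss on the interval $[-k, k]$ around zero and reverts to the standard quantile loss outside that interval.
The Huber loss is defined as:
$\mathcal{L}_k (u) = \begin{cases} \frac{1}{2}u^2 ,& \text{ if } |u| \leq k \\ k(|u| - \frac{1}{2}k) ,& \text{ otherwise } \end{cases}$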

The quantile Huber loss is the asymmetric variant of the Huber loss:
$\rho_{\tau}^k (u) = |\tau - \delta_{\{u<0\}}| \mathcal{L}_k (u)$

The resulting algorithm is Quantile Regression Q-learning (QR-DQN).
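To make the loss concrete, the sketch below computes the QR-DQN loss for one transition in plain Python. It assumes the quantile midpoints $\hat{\tau}_i = (2i - 1) / (2N)$ used in the paper and the standard Huber form $k(|u| - \frac{1}{2}k)$ outside the threshold; the function names and the toy numbers are illustrative, not the authors' code.

```python
def huber(u, k=1.0):
    """Standard Huber loss with threshold k (continuous at |u| = k)."""
    return 0.5 * u * u if abs(u) <= k else k * (abs(u) - 0.5 * k)


def quantile_huber(u, tau, k=1.0):
    """Asymmetric Huber loss: weight tau for u >= 0 and 1 - tau for u < 0."""
    indicator = 1.0 if u < 0 else 0.0
    return abs(tau - indicator) * huber(u, k)


def qr_dqn_loss(theta, theta_next, reward, gamma, k=1.0):
    """Loss for one transition (x, a, r, x').

    theta:      N quantile estimates for (x, a)
    theta_next: N quantile estimates for (x', a*), a* being the greedy action
    """
    n = len(theta)
    # Quantile midpoints tau_hat_i = (2i - 1) / (2N) for i = 1..N.
    taus = [(2 * i + 1) / (2 * n) for i in range(n)]
    targets = [reward + gamma * t for t in theta_next]
    loss = 0.0
    for tau_i, theta_i in zip(taus, theta):
        # Average over target samples j, sum over predicted quantiles i.
        loss += sum(quantile_huber(t - theta_i, tau_i, k)
                    for t in targets) / len(targets)
    return loss


print(qr_dqn_loss([0.0, 0.0], [1.0, 1.0], reward=0.0, gamma=1.0))  # 0.5
```

Every target exceeds every prediction by exactly 1, so each pair contributes $\hat{\tau}_i \cdot \frac{1}{2}$, and the two midpoints 0.25 and 0.75 give 0.125 + 0.375. Note that no projection and no support bounds appear anywhere.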

3. IQN

2018: Implicit Quantile Networks for Distributional Reinforcement Learning

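IQN can be seen as a distributional generalization of DQN that combines the strengths of C51 and QR-DQN.
QR-DQN learns a discrete set of quantiles, whereas IQN aims to learn the full quantile function, a continuous map from probabilities to returns. Combined with a base distribution such as $\tau \sim U([0, 1])$, this yields an implicit distribution that can approximate any return distribution given sufficient network capacity.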

IQN has three main advantages:

First, the approximation error for the distribution is no longer controlled by the number of quantiles output by the network, but by the size of the network itself, and the amount of training.

Second, IQN can be used with as few, or as many, samples per update as desired, providing improved data efficiency with increasing number of samples per training update.

Third, the implicit representation of the return distribution allows us to expand the class of policies to more fully take advantage of the learned distribution. Specifically, by taking the base distribution to be non-uniform, we expand the class of policies to $\epsilon$-greedy policies on arbitrary distortion risk measures.

The quantile function is simply the inverse of the cumulative distribution function (the inverse CDF), so its domain is $[0, 1]$ and its range is the set of values the random variable can take.
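A small example may help here. The sketch below uses an exponential distribution, chosen only because its CDF inverts in closed form, to show that the quantile function undoes the CDF and that pushing uniform samples through it produces samples of the random variable. The function names are illustrative.

```python
import math
import random


def exp_cdf(x, rate):
    """CDF of an exponential distribution: maps a value to a probability."""
    return 1.0 - math.exp(-rate * x)


def exp_quantile(tau, rate):
    """Quantile function (inverse CDF): maps a probability to a value."""
    return -math.log(1.0 - tau) / rate


# The two functions are inverses of each other.
round_trip = exp_cdf(exp_quantile(0.3, rate=2.0), rate=2.0)

# Inverse transform sampling: tau ~ U([0, 1)) pushed through the quantile
# function gives samples whose mean approaches 1 / rate = 0.5.
rng = random.Random(0)
samples = [exp_quantile(rng.random(), rate=2.0) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
```

This is the same reparameterization trick IQN relies on: a network that maps $\tau$ to a return value represents the whole return distribution implicitly.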

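Recall the key ingredients of QR-DQN from the previous section: the quantile regression loss $\mathcal{L}_{QR}^{\tau}$ and its Huber variant $\rho_{\tau}^k$.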

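IQN is a deterministic parametric function trained to reparameterize samples from a base distribution (e.g. $\tau \sim U([0, 1])$) to the corresponding quantile values of a target distribution.
Let $F_Z^{-1}(\tau)$ denote the quantile function of the random variable $Z$ at $\tau \in [0, 1]$, abbreviated $Z_{\tau} := F_Z^{-1}(\tau)$, and let $\beta : [0, 1] \rightarrow [0, 1]$ be a distortion risk measure.
The distorted expectation under $\beta$ is defined as: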

$Q_{\beta}(x, a) := \mathop{\mathbb{E}}_{\tau \sim U([0, 1])} [Z_{\beta(\tau)}(x, a)] = \int_{0}^{1}F_Z^{-1}(\tau) d \beta(\tau)$
Any distorted expectation can be represented as a weighted sum over the quantiles!

The risk-sensitive greedy policy $\pi_{\beta}$ is:
$\pi_{\beta}(x) = \mathop{\mathrm{arg\,max}}_{a \in \mathcal{A}} Q_{\beta}(x, a)$

For two samples $\tau, \tau' \sim U([0, 1])$ and policy $\pi_{\beta}$, the sampled TD error at step $t$ is:
$\delta_t^{\tau, \tau'} = r_t + \gamma Z_{\tau'} (x_{t+1}, \pi_{\beta}(x_{t+1})) - Z_{\tau}(x_t, a_t)$

The IQN loss is given below, where $N$ and $N'$ are the numbers of i.i.d. samples $\tau_i, \tau_j' \sim U([0, 1])$ used for the prediction and the target respectively:
$\mathcal{L} (x_t, a_t, r_t, x_{t+1}) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho_{\tau_i}^{k} (\delta_t^{\tau_i, \tau_j'})$
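The double sum can be sketched in a few lines of plain Python. Here `z_online` and `z_target` are hypothetical stand-ins for the network outputs $Z_{\tau}(x_t, a_t)$ and $Z_{\tau'}(x_{t+1}, \pi_{\beta}(x_{t+1}))$, passed as ordinary functions of $\tau$; the quantile Huber loss follows the definition given in the QR-DQN section, without any extra normalization by $k$.

```python
def huber(u, k=1.0):
    """Standard Huber loss with threshold k."""
    return 0.5 * u * u if abs(u) <= k else k * (abs(u) - 0.5 * k)


def quantile_huber(u, tau, k=1.0):
    """Asymmetric Huber loss: weight tau for u >= 0 and 1 - tau for u < 0."""
    indicator = 1.0 if u < 0 else 0.0
    return abs(tau - indicator) * huber(u, k)


def iqn_loss(z_online, z_target, reward, gamma, taus, taus_prime, k=1.0):
    """Sum over the N prediction samples, average over the N' target samples."""
    total = 0.0
    for tau_i in taus:
        for tau_j in taus_prime:
            # Sampled TD error for the pair (tau_i, tau_j').
            delta = reward + gamma * z_target(tau_j) - z_online(tau_i)
            total += quantile_huber(delta, tau_i, k)
    return total / len(taus_prime)


# Toy check with constant quantile functions, so every TD error equals 1.
loss = iqn_loss(lambda tau: 0.0, lambda tau: 0.0, reward=1.0, gamma=1.0,
                taus=[0.2, 0.6], taus_prime=[0.1, 0.5, 0.9])
```

In training the two lists of $\tau$ values would be drawn fresh from $U([0, 1])$ at every update; they are fixed here only to keep the example deterministic. The difference from QR-DQN is that the quantile levels are sampled inputs rather than a fixed grid.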

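A sample-based risk-sensitive policy, which approximates $\pi_{\beta}$ by estimating $Q_{\beta}$ from $K$ samples $\tilde{\tau}_k \sim U([0, 1])$, is computed as:
$\tilde{\pi}_{\beta}(x) = \mathop{\mathrm{arg\,max}}_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^{K} Z_{\beta(\tilde{\tau}_k)} (x, a)$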
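The effect of the distortion is easiest to see on a toy problem. The sketch below assumes two hypothetical actions whose quantile functions are known in closed form and compares a risk-neutral choice ($\beta(\tau) = \tau$) with a risk-averse one using CVaR, $\beta(\tau) = \eta \tau$. All names and numbers are illustrative.

```python
import random


def cvar(eta):
    """Distortion beta(tau) = eta * tau, i.e. CVaR at level eta."""
    return lambda tau: eta * tau


def distorted_q(quantile_fn, beta, num_samples, rng):
    """Monte Carlo estimate of Q_beta = E_tau[Z_{beta(tau)}]."""
    taus = [rng.random() for _ in range(num_samples)]
    return sum(quantile_fn(beta(t)) for t in taus) / num_samples


def risk_sensitive_action(quantile_fns, beta, num_samples=1000, seed=0):
    """Greedy action with respect to the sample-based distorted expectation."""
    rng = random.Random(seed)
    q = [distorted_q(f, beta, num_samples, rng) for f in quantile_fns]
    return max(range(len(q)), key=q.__getitem__)


# Action 0: return uniform on [0, 1] (mean 0.5). Action 1: a sure 0.4.
quantile_fns = [lambda tau: tau, lambda tau: 0.4]
neutral = risk_sensitive_action(quantile_fns, beta=lambda tau: tau)
averse = risk_sensitive_action(quantile_fns, beta=cvar(0.25))
print(neutral, averse)  # 0 1
```

The risk-neutral agent prefers the uncertain action because its mean is higher, while the CVaR agent only looks at the worst quarter of outcomes (average about 0.125) and takes the sure 0.4 instead.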

For the implementation details of IQN (its modifications on top of DQN), refer to the paper.

4. Dopamine (code repository)

2018: Dopamine: A Research Framework for Deep Reinforcement Learning

Dopamine contains implementations of DQN, C51, Rainbow, and IQN.

# Model-Free RL: Path-Consistency Learning

5. PCL

2017: Bridging the Gap Between Value and Policy Based Reinforcement Learning

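The paper establishes a connection between value-based and policy-based RL, built on the relationship between softmax temporal value consistency and policy optimality under entropy regularization. It shows that softmax-consistent action values correspond to optimal entropy-regularized policy probabilities along any action sequence.
Policy-based methods are generally on-policy: training is stable but sample efficiency is low. Value-based methods are generally off-policy: sample efficiency is high but training is less stable.
The expected discounted objective $O_{ER}(s, \pi)$ can be defined recursively, with $s'$ the state reached from $s$ by taking action $a$: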
$O_{ER}(s, \pi) = \sum_a \pi (a | s) [r(s, a) + \gamma O_{ER} (s', \pi)]$

The regularized expected reward is defined as the sum of the expected reward and a discounted entropy term:
$O_{ENT} (s, \pi) = O_{ER} (s, \pi) + \tau \mathbb{H}(s, \pi)$

$\mathbb{H}(s, \pi)$ and $O_{ENT}(s, \pi)$ can be written recursively as:
$\mathbb{H}(s, \pi) = \sum_a \pi(a|s) [-\log \pi(a|s) + \gamma \mathbb{H}(s', \pi)] \\ O_{ENT} (s, \pi) = \sum_a \pi(a|s) [r(s, a) - \tau \log \pi(a|s) + \gamma O_{ENT}(s', \pi)]$

The key formula derived in the paper is:
$V^*(s_1) - \gamma^{t-1} V^*(s_t) = \sum_{i=1}^{t-1}\gamma^{i-1} [r(s_i, a_i) - \tau \log \pi^*(a_i|s_i)]$

The PCL algorithm (policy $\pi_{\theta}$ with parameters $\theta$; value function $V_{\phi}$ with parameters $\phi$):
$C(s_{i:i+d}, \theta, \phi) = -V_{\phi}(s_i) + \gamma^d V_{\phi}(s_{i+d}) + \sum_{j=0}^{d-1}\gamma^j [r(s_{i+j}, a_{i+j}) - \tau \log \pi_{\theta}(a_{i+j}|s_{i+j})] \\ O_{PCL} (\theta, \phi) = \sum_{s_{i:i+d} \in E} \frac{1}{2} C(s_{i:i+d}, \theta, \phi)^2 \\ \Delta \theta = \eta_{\pi} C(s_{i: i+d}, \theta, \phi) \sum_{j=0}^{d-1} \gamma^j \nabla_{\theta} \log \pi_{\theta} (a_{i+j} | s_{i+j}) \\ \Delta \phi= \eta_v C(s_{i: i+d}, \theta, \phi) (\nabla_{\phi} V_{\phi}(s_i) - \gamma^d \nabla_{\phi} V_{\phi}(s_{i+d}))$
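As a sanity check on the consistency error $C$, the sketch below computes it in plain Python and verifies that it vanishes at the optimum of a tiny problem. The two-step deterministic chain, its rewards, and the function names are assumptions made for illustration; the optimal values are obtained from the softmax backup $V^*(s) = \tau \log \sum_a \exp\{(r(s, a) + \gamma V^*(s')) / \tau\}$.

```python
import math


def soft_value(q_values, tau):
    """tau * log sum_a exp(q_a / tau), computed in a numerically stable way."""
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))


def path_consistency(v_start, v_end, rewards, log_probs, gamma, tau):
    """C(s_{i:i+d}) for a sub-trajectory of length d = len(rewards)."""
    d = len(rewards)
    soft_return = sum(gamma ** j * (r - tau * lp)
                      for j, (r, lp) in enumerate(zip(rewards, log_probs)))
    return -v_start + gamma ** d * v_end + soft_return


# Hypothetical chain s1 -> s2 -> terminal with two actions in each state.
gamma, tau = 0.9, 0.5
r1, r2 = [1.0, 0.0], [0.5, 2.0]
v2 = soft_value(r2, tau)                             # V*(s2); V*(terminal) = 0
v1 = soft_value([r + gamma * v2 for r in r1], tau)   # V*(s1)
log_pi1 = [(r + gamma * v2 - v1) / tau for r in r1]  # log pi*(a | s1)
log_pi2 = [(r - v2) / tau for r in r2]               # log pi*(a | s2)

# The inconsistency is zero along every action sequence, likely or not.
errors = [path_consistency(v1, 0.0, [r1[a], r2[b]],
                           [log_pi1[a], log_pi2[b]], gamma, tau)
          for a in range(2) for b in range(2)]
```

Because the error is zero on all four paths, including the ones the optimal policy rarely takes, the squared error $\frac{1}{2} C^2$ can be minimized on off-policy sub-trajectories from a replay buffer.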

The Unified PCL algorithm (a single model $Q_{\rho}$ with parameters $\rho$ defines both the value and the policy):
$V_{\rho} (s) = \tau \log \sum_a \exp \left\{Q_{\rho} (s, a) / \tau \right\} \\ \pi_{\rho} (a|s) = \exp \left\{ (Q_{\rho}(s, a) - V_{\rho}(s)) / \tau \right\} \\ \Delta \rho = \eta_{\pi} C(s_{i: i+d}, \rho) \sum_{j=0}^{d-1} \gamma^j \nabla_{\rho} \log \pi_{\rho} (a_{i+j} | s_{i+j}) + \eta_v C(s_{i: i+d}, \rho) (\nabla_{\rho} V_{\rho}(s_i) - \gamma^d \nabla_{\rho} V_{\rho}(s_{i+d}))$
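The first two equations say that one set of action values determines both the state value and the policy. A minimal sketch, assuming the action values for one state are given as a plain list (the function name is illustrative):

```python
import math


def value_and_policy(q_values, tau):
    """Derive V(s) and pi(. | s) from the action values of a single state."""
    m = max(q_values)  # subtract the max for numerical stability
    v = m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))
    pi = [math.exp((q - v) / tau) for q in q_values]
    return v, pi


# Three equally good actions: the policy is uniform and V exceeds Q by
# tau * log(3), the entropy bonus.
v, pi = value_and_policy([1.0, 1.0, 1.0], tau=0.5)
```

Since $V_{\rho}$ is the log-normalizer of $Q_{\rho} / \tau$, the probabilities sum to one by construction, so no separate policy network is needed.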

6. Trust-PCL

2017: Trust-PCL: An Off-Policy Trust Region Method for Continuous Control

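Trust-PCL combines the strengths of TRPO and PCL, using both relative entropy and entropy regularization for policy optimization. Entropy regularization helps exploration, while relative entropy improves training stability and permits a higher learning rate.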

$V^*(s_t) = \mathop{\mathbb{E}}_{r_{t+i}, s_{t+i}} \left[ \gamma^d V^*(s_{t+d}) + \sum_{i=0}^{d-1} \gamma^i (r_{t+i} - (\tau+\lambda) \log \pi^*(a_{t+i} | s_{t+i}) + \lambda \log \tilde{\pi} (a_{t+i} | s_{t+i})) \right] \\ C(s_{t: t+d}, \theta, \phi) = -V_{\phi}(s_t) + \gamma^d V_{\phi}(s_{t+d}) + \sum_{i=0}^{d-1} \gamma^i (r_{t+i} - (\tau+\lambda) \log \pi_{\theta}(a_{t+i} | s_{t+i}) + \lambda \log \pi_{\tilde{\theta}} (a_{t+i} | s_{t+i})) \\ \mathcal{L} (S, \theta, \phi) = \sum_{k=1}^B \sum_{t=1}^{T_k-1} C(s_{t: t+d}^{(k)}, \theta, \phi)^2$
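Compared with PCL, the only change in the consistency error is the extra relative-entropy term against the prior policy $\pi_{\tilde{\theta}}$. A minimal sketch for one sub-trajectory in plain Python, with illustrative argument names:

```python
import math


def trust_pcl_consistency(v_start, v_end, rewards, log_probs, log_probs_prior,
                          gamma, tau, lam):
    """C(s_{t:t+d}) with entropy weight tau and relative-entropy weight lam."""
    d = len(rewards)
    total = -v_start + gamma ** d * v_end
    for i, (r, lp, lp_prior) in enumerate(
            zip(rewards, log_probs, log_probs_prior)):
        # Reward, entropy bonus, and penalty for leaving the prior policy.
        total += gamma ** i * (r - (tau + lam) * lp + lam * lp_prior)
    return total


# With lam = 0 the prior drops out and the plain PCL error is recovered.
c_pcl = trust_pcl_consistency(0.0, 0.0, [1.0], [math.log(0.5)],
                              [math.log(0.1)], gamma=0.9, tau=1.0, lam=0.0)
```

The loss is then the sum of squared errors over all sub-trajectories in the batch, and the log-probabilities under the prior are treated as constants.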

The paper is concise and gives a good review of TRPO and PCL; it is worth reading more than once.

# MARL

7. HATRPO/HAPPO

2022: Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

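Applying trust-region policy optimization methods to the multi-agent setting has long been an active research topic. Some earlier methods assume the agents are homogeneous and apply TRPO/PPO with parameter sharing, but they lack a theoretical guarantee of monotonic improvement and may lead to suboptimal policies.
The paper extends trust-region methods to cooperative MARL with heterogeneous agents. Its main contributions are the multi-agent advantage decomposition lemma and the sequential policy update scheme, together with HATRPO and HAPPO, the algorithms built on TRPO and PPO.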
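For reference, the decomposition lemma can be written roughly as below. This is a restatement from memory of the paper's notation, so treat the exact symbols as an assumption: $i_{1:m}$ is an ordered subset of agents, and $A_{\pi}^{i_j}$ is the advantage of agent $i_j$ given the actions already chosen by agents $i_{1:j-1}$.
$A_{\pi}^{i_{1:m}}(s, a^{i_{1:m}}) = \sum_{j=1}^{m} A_{\pi}^{i_j}(s, a^{i_{1:j-1}}, a^{i_j})$
Read this way, the joint advantage is a sum of per-agent advantages evaluated in sequence, which is what motivates updating the agents one at a time rather than all at once.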
