Research on Multi-Agent Reinforcement Learning Algorithms and Their Applications

Format: PDF | Pages: 66 | Size: 3.78 MB | Uploaded: 2018-06-27

National University of Defense Technology, Master's Thesis
Title: Research on Multi-Agent Reinforcement Learning Algorithms and Their Applications
Author: 连传强
Degree sought: Master
Major: Control Science and Engineering
Advisor: 徐昕
Date: November 2010

摘要 (Abstract)

With the growing popularity of physical robots and software agents, the demand for and applications of multi-agent systems, such as robot soccer and search-and-rescue, are becoming increasingly common. Reinforcement learning (RL) in multi-agent systems has attracted growing research attention in recent years. Because multi-agent systems often face huge or continuous state-action spaces as well as greater environmental uncertainty and randomness, efficient multi-agent reinforcement learning algorithms remain a difficult and active research topic.

Supported by the National Natural Science Foundation of China project "Kernel-Based Reinforcement Learning and Approximate Dynamic Programming Methods", this thesis studies multi-agent reinforcement learning algorithms. First, the Dual Heuristic Programming (DHP) algorithm, a member of the adaptive critic family of reinforcement learning methods, is improved. Then, for two typical multi-agent cooperative control problems, multi-robot formation control and network resource allocation, two multi-agent reinforcement learning algorithms are proposed: IL-DHP (Individually Learning Dual Heuristic Programming) and Q-CF (Q-Chain Feedback). The main results of this work include:

(1) In the DHP method, the actor and critic modules are usually built with neural networks, which were previously trained with a fixed learning rate. This limits the convergence speed of the networks and hence the convergence speed and learning success rate of DHP. To address this problem, the Delta-Bar-Delta learning rule is introduced into the DHP algorithm so that the two network modules adjust their learning rates dynamically during learning, improving convergence speed and learning success rate. Simulation experiments verify the effectiveness of this change.

(2) For the multi-robot formation control problem, the IL-DHP method, based on the idea of independent reinforcement learning (Reinforcement Learning Individually, RLI), is proposed. IL-DHP is a distributed multi-agent reinforcement learning algorithm in which each agent learns with DHP independently, without depending on the states and actions of other agents. Building on the distance-angle-based l-φ feedback control method, each agent uses DHP to optimize its feedback control parameters. Simulation results show that when the formation or the leader robot's speed changes, IL-DHP outperforms pure l-φ feedback control.

(3) For the network resource allocation problem, the Q-learning algorithm is combined with the Chain Feedback (CF) learning algorithm to obtain the Q-CF multi-agent reinforcement learning algorithm, which achieves efficient coordination among agents through a mechanism called information chain feedback. Simulation results show that, compared with existing multi-agent Q-learning algorithms, the proposed method converges faster while ensuring the performance of the cooperative policy.

关键词 (Keywords): reinforcement learning, multi-agent systems, multi-robot formation control, Delta-Bar-Delta learning rule, DHP, neural networks, resource allocation, cooperative control

ABSTRACT

As physical robots and software agents become increasingly widespread, the demand for and applications of multi-agent systems, such as robot soccer and search-and-rescue, are becoming more and more common.

In recent years, reinforcement learning in multi-agent systems has attracted more and more attention. However, multi-agent systems often face huge or continuous state and action spaces and greater uncertainty and randomness, so multi-agent reinforcement learning remains a difficult and active research topic. Supported by the National Natural Science Foundation of China (NSFC) project on kernel-based reinforcement learning and approximate dynamic programming, this thesis studies multi-agent reinforcement learning algorithms: the DHP (Dual Heuristic Programming) algorithm, one of the ACD (Adaptive Critic Design) methods, was improved, and the IL-DHP (Individually Learning Dual Heuristic Programming) and Q-CF (Q-Chain Feedback) multi-agent reinforcement learning algorithms were presented for two representative multi-agent cooperation problems: multi-robot formation control and network resource allocation. The main contributions of this thesis can be summarized as follows:

(1) In the DHP algorithm, the action module and the critic module are usually built with neural networks, which in traditional training used a constant learning rate. This limited the learning convergence speed, and hence the convergence speed and success rate of the DHP algorithm. In this thesis, the Delta-Bar-Delta learning rule is introduced into the DHP algorithm so that the two network modules adjust their learning rates dynamically, which increases the learning convergence speed and success rate. The effectiveness of the proposed method is verified by experimental results.

(2) For the multi-robot formation control problem, the IL-DHP algorithm, based on reinforcement learning individually (RLI), is proposed. IL-DHP is a distributed multi-agent reinforcement learning algorithm: each agent learns with the DHP algorithm independently, without depending on the states and actions of other agents. Based on the l-φ feedback control method, each agent optimizes its feedback control parameters using DHP. Simulation results show that when the formation or the speed of the leading robot changes, the proposed method performs better than pure l-φ feedback control.

(3) For network resource allocation, a novel Q-CF multi-agent reinforcement learning algorithm is proposed by combining the Q-learning algorithm with the chain feedback learning mechanism. In the Q-CF algorithm, multi-agent cooperation is realized through an information chain feedback mechanism. Simulation results show that, compared with existing multi-agent Q-learning algorithms, the proposed algorithm converges faster while ensuring the performance of the cooperative policy.

Keywords: Reinforcement Learning, Multi-agent System, Formation Control of Multi-robot, Delta-Bar-Delta Learning Rule, DHP, Neural Networks, Resource Allocation, Cooperation Control

List of Figures
Figure 2.1 Structure of the ACD method ... 17
Figure 2.2 Neural network structure in the DHP method ... 19
Figure 2.3 The inverted pendulum system ... 24
Figure 2.4 Comparison of convergence speed ... 27
Figure 2.5 Comparison of learning success rate ...
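The Delta-Bar-Delta rule mentioned in contribution (1) keeps a separate learning rate for each weight: the rate grows additively when successive gradients agree in sign and shrinks multiplicatively when they disagree, with agreement judged against an exponential average of past gradients. A minimal sketch of the rule is given below; the constants kappa, phi, theta and the quadratic toy objective are illustrative assumptions, not values taken from the thesis, and the thesis applies the rule inside DHP's actor and critic networks rather than to a bare quadratic.

```python
import numpy as np

def delta_bar_delta_step(w, grad, lr, bar_delta,
                         kappa=0.01, phi=0.1, theta=0.7):
    """One Delta-Bar-Delta update with per-weight adaptive learning rates.

    w, grad, lr, bar_delta are same-shape arrays; bar_delta is an
    exponential average of past gradients used to detect sign agreement.
    """
    agree = grad * bar_delta
    lr = np.where(agree > 0, lr + kappa,           # consistent sign: grow additively
         np.where(agree < 0, lr * (1 - phi), lr))  # sign flip: shrink multiplicatively
    w = w - lr * grad                              # gradient step, per-weight rates
    bar_delta = (1 - theta) * grad + theta * bar_delta
    return w, lr, bar_delta

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([2.0, -3.0])
lr = np.full_like(w, 0.05)
bar = np.zeros_like(w)
for _ in range(200):
    w, lr, bar = delta_bar_delta_step(w, w, lr, bar)
```

The per-weight rates let dimensions with steady gradient directions accelerate while oscillating dimensions are damped, which is the property the thesis exploits to speed up DHP's network training.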
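In the l-φ formation scheme underlying contribution (2), each follower maintains a desired separation l_d and a desired relative bearing φ_d with respect to its leader; the feedback controller drives the measured (l, φ) toward these set-points. A minimal geometric sketch of the follower's desired position is shown below; the pose convention (bearing measured from the leader's heading) is an assumption for illustration, since conventions vary, and the thesis's actual feedback law and DHP-tuned gains are not reproduced here.

```python
import math

def desired_follower_position(leader_x, leader_y, leader_heading, l_d, phi_d):
    """Desired follower position under an l-phi formation specification.

    The follower should sit at distance l_d from the leader, at bearing
    phi_d measured relative to the leader's heading (assumed convention).
    """
    angle = leader_heading + phi_d
    return (leader_x + l_d * math.cos(angle),
            leader_y + l_d * math.sin(angle))

# Example: leader at the origin heading along +x; a follower with
# (l_d, phi_d) = (2, pi) should sit 2 m directly behind the leader.
x, y = desired_follower_position(0.0, 0.0, 0.0, l_d=2.0, phi_d=math.pi)
```

Tracking this moving set-point is what makes the problem sensitive to changes in formation shape or leader speed, which is where the thesis reports IL-DHP improving on the fixed-gain controller.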
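Contribution (3) builds on the standard tabular Q-learning update, which the thesis extends with chain feedback between agents. Only the base single-agent update is sketched below; the toy two-state task and hyperparameters are illustrative assumptions, and the information chain feedback mechanism itself is not reproduced.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard Q-learning update: Q(s,a) += alpha * (TD target - Q(s,a))."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Toy two-state chain: "go" moves state 0 -> 1 with no reward;
# "stay" in state 1 yields reward 1 each step.
Q = {0: {"go": 0.0, "stay": 0.0}, 1: {"go": 0.0, "stay": 0.0}}
for _ in range(100):
    q_update(Q, 0, "go", 0.0, 1)
    q_update(Q, 1, "stay", 1.0, 1)
```

After training, the table correctly values "stay" in state 1 above the untried alternatives, and state 0 inherits discounted value through the bootstrap term. In the multi-agent setting of the thesis, each resource-allocating agent runs such an update while the chain feedback mechanism propagates coordination information between agents.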
