Deep Learning Lecture Slides: Deep Reinforcement Learning


Introduction to Deep Reinforcement Learning
Yen-Chen Wu, 2015/12/11

Outline
- Reinforcement Learning
- Markov Decision Process
- How to Solve MDPs: DP, MC, TD, Q-learning (DQN)
- Paper Review

Reinforcement Learning

Branches of Machine Learning

What makes RL different?
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives

Goal: Maximize Cumulative Reward
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward

Agent & Environment
- Example actions: Defense, Attack, Jump
- Full observability vs. partial observability
- Learning and planning
- Exploration and exploitation
- Prediction and control

Markov Decision Process
- Markov Processes
- Markov Reward Processes
- Markov Decision Processes

Markov Process

Markov Reward Processes
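The slides under these two headings evidently carried their defining formulas as images that were lost in extraction; a standard reconstruction in the usual notation (assumed, not recovered verbatim from the deck):

\[ \Pr[S_{t+1} \mid S_t] = \Pr[S_{t+1} \mid S_1, \dots, S_t] \qquad \text{(Markov property)} \]

\[ G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad v(s) = \mathbb{E}[\,G_t \mid S_t = s\,] \]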

Markov Decision Process (MDP)
- S : finite set of states (observations)
- A : finite set of actions
- P : transition probabilities
- R : immediate reward
- γ : discount factor
- Goal: choose a policy π that maximizes the expected return
  \[ \max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\Big] \]

How to Solve MDPs
- Dynamic Programming (DP), sketched below
- Monte-Carlo (MC)
- Temporal-Difference (TD)
- Q-Learning
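The deck itself contains no code; as an illustration of the dynamic-programming route named above, here is a minimal value-iteration sketch for a tabular MDP. The function name and array layout are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Solve a tabular MDP by value iteration.

    P: transition probabilities, shape (S, A, S)
    R: immediate rewards, shape (S, A)
    Returns the optimal state values V and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) * V(s')
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)
```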

Model-based: Dynamic Programming
- Evaluate the policy
- Update the policy

Model-free
- Transition probabilities and rewards are unknown
- MC vs. TD

Model-free: Q-learning
- Instead of a table, learn the optimal action-value function given by the Bellman optimality equation:
  \[ Q^{*}(s,a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q^{*}(s',a') \,\big|\, s, a \,\big] \]
- Basic idea: iterative update (lacks generalization; see the tabular sketch below):
  \[ Q(s,a) \leftarrow Q(s,a) + \alpha\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big) \]
- In practice: use a function approximator. Linear? Use a DNN!
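To make the iterative update concrete, a minimal tabular Q-learning loop with epsilon-greedy exploration. The environment interface and all names here are hypothetical (a Gym-like reset()/step() API is assumed):

```python
import numpy as np

def q_learning(env, num_states, num_actions,
               episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning; env.step(a) is assumed to return (s_next, r, done)."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # TD target uses the max over next actions (off-policy)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```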

Deep Q-Network (DQN)

Video

Deep Q-Network
- Computes Q-values for all actions in one forward pass
- Input: 84x84x4 stacked frames
- Convolves 32 filters of 8x8 with stride 4
- Convolves 64 filters of 4x4 with stride 2
- Convolves 64 filters of 3x3 with stride 1
- Fully-connected layer with 512 nodes
- Output: one node per action
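A sketch of this architecture in PyTorch. The slide predates PyTorch (the original DQN used a different framework), so this is an illustrative reimplementation; the layer sizes follow the slide exactly.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network from the slide (Mnih et al. 2015)."""

    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),  # one output node per action
        )

    def forward(self, x):
        # x: batch of stacked frames, shape (N, 4, 84, 84), scaled to [0, 1]
        return self.head(self.features(x))
```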

Update DQN
- Loss function:
  \[ L(\theta) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\big)^{2}\big] \]
- Gradient:
  \[ \nabla_{\theta} L(\theta) = -2\, \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\big)\, \nabla_{\theta} Q(s,a;\theta)\big] \]

Two Techniques (sketched below)
- Experience Replay: a pooled memory of past experience
  - Data efficiency (transitions are reused, bootstrapping)
  - Avoids correlation between samples (reduces variance between batches)
  - Off-policy, which suits Q-learning
  - Randomly sampled mini-batches
  - Prioritized sweeping (active learning)
- Separate Target Network: more stable than purely online learning
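A minimal sketch of the two techniques; class and function names are hypothetical, not from the slides.

```python
import random
from collections import deque

class ReplayBuffer:
    """Pooled memory of (s, a, r, s_next, done) transitions.
    Sampling uniformly at random breaks the correlation between
    consecutive frames and lets each transition be reused."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def sync_target(online_net, target_net):
    """Separate target network: every C steps, copy the online weights
    into a frozen copy used only to compute TD targets."""
    target_net.load_state_dict(online_net.state_dict())
```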

DEMO

Paper Review

Paper list
- Massively Parallel Methods for Deep Reinforcement Learning
- Continuous control with deep reinforcement learning
- Deep Reinforcement Learning with Double Q-learning
- Policy Distillation
- Dueling Network Architectures for Deep Reinforcement Learning
- Multiagent Cooperation and Competition with Deep Reinforcement Learning

Massively Parallel Methods for Deep Reinforcement Learning
- Arun Nair et al., arXiv:1507.04296

DDPG (Deep Deterministic Policy Gradient)

DDAC (Deep Deterministic Actor-Critic)

Continuous control with deep reinforcement learning
- Timothy P. Lillicrap et al., arXiv:1509.02971
- https://goo.gl/J4PIAz

Double Q-learning

Policy Distillation
- Soft target

Dueling Network

Multiagent
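These per-paper slides survive only as headings; for two of them the central equations are standard and worth restating (reconstructed from the cited papers, not recovered from the deck). Double Q-learning decouples action selection (online parameters θ) from evaluation (target parameters θ⁻):

\[ y = r + \gamma\, Q\big(s',\, \arg\max_{a'} Q(s',a';\theta);\ \theta^{-}\big) \]

The dueling network splits the state-value and advantage streams and recombines them:

\[ Q(s,a;\theta) = V(s;\theta) + \Big(A(s,a;\theta) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a';\theta)\Big) \]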
