Introduction to Deep Reinforcement Learning
Yen-Chen Wu, 2015/12/11

Outline
- Reinforcement Learning
- Markov Decision Process
- How to Solve MDPs
  - DP
  - MC
  - TD
  - Q-learning (DQN)
- Paper Review

Reinforcement Learning

Branches of Machine Learning

What makes reinforcement learning different?
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives

Goal: Maximize Cumulative Reward
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward

Agent & Environment
- Defense
- Attack
- Jump

- Full observability vs. partial observability
- Learning and planning
- Exploration and exploitation
- Prediction and control

Markov Decision Process
- Markov Processes
- Markov Reward Processes
- Markov Decision Processes

Markov Process

Markov Reward Processes
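The trade-off between immediate and long-term reward noted under "Goal: Maximize Cumulative Reward" can be made concrete with a discounted sum of rewards, the quantity a Markov reward process accumulates. The sketch below is illustrative only: the function name `discounted_return`, the symbol `gamma` for the discount factor, and the two reward sequences are assumptions chosen for the example, not taken from the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last step backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequences for two behaviours over three steps.
greedy = [1.0, 0.0, 0.0]    # take a small reward immediately
patient = [0.0, 0.0, 4.0]   # give up the early reward for a larger late one

for gamma in (0.5, 0.9):
    print(gamma, discounted_return(greedy, gamma), discounted_return(patient, gamma))
```

With a small discount factor the two behaviours are worth the same, while a discount factor closer to 1 favours the patient sequence. This is one way to read "sacrifice immediate reward to gain more long-term reward".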
Markov Decision Process

Markov Decision Process (MDP)
- $S$: finite set of states (observations)
- $A$: finite set of actions
- $P$: transition probability
- $R$: immediate reward
- $\gamma$: discount factor
- Goal: choose a policy $\pi$ that maximizes the expected return $\mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_{t}\right]$

How to Solve MDP
- Dynamic Programming
- Monte-Carlo
- Temporal-Difference
- Q-Learning

Model-based
- Dynamic Programming
  - Evaluate policy
  - Update policy

Model Free
- Unknown transition probability and reward
- MC vs. TD

Model Free: Q-learning
- Optimal action-value function (Q-learning), which satisfies the Bellman equation: $Q^{*}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a\right]$
- Basic idea: iterative update (a tabular representation lacks generalization)
- In practice: use a function approximator instead of a table
- Linear? Use a DNN!

Deep Q-network (DQN)

Video

Deep Q-Network
- Computes Q-values for all actions
- Input: 84x84x4
- Convolution: 32 filters of 8x8 with stride 4
- Convolution: 64 filters of 4x4 with stride 2
- Convolution: 64 filters of 3x3 with stride 1
- Fully connected: 512 nodes
- Output: a node for each action

Update DQN
- Loss function
- Gradient

Two Techniques
- Experience Replay
  - Experience is pooled in memory
  - Data efficiency (bootstrap)
  - Avoids correlation between samples (variance between batches)
  - Off-policy: suitable for Q-learning
  - Randomly sampled mini-batches
  - Prioritized sweeping (active learning)
- Separate Target Network
  - More stable than online learning

DEMO

Paper review

Paper list
- Massively Parallel Methods for Deep Reinforcement Learning
- Continuous control with deep reinforcement learning
- Deep Reinforcement Learning with Double Q-learning
- Policy Distillation
- Dueling Network Architectures for Deep Reinforcement Learning
- Multiagent Cooperation and Competition with Deep Reinforcement Learning

Massively Parallel Methods for Deep Reinforcement Learning
Arun Nair, arXiv:1507.04296

DDPG (Deterministic Policy Gradient)

DDAC (Deep Deterministic Actor-Critic)

Continuous control with deep reinforcement learning
Timothy P. Lillicrap, arXiv:1509.02971
https://goo.gl/J4PIAz

Double Q-learning

Policy Distillation

Soft target

Dueling Network

Multiagent
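The "Update DQN" and "Separate Target Network" slides, together with the Double Q-learning entry in the paper list, can be related through the training target each one uses. The following is a minimal sketch under stated assumptions, not the presenter's code: Q-values are passed in as plain Python lists standing in for network outputs, the function names are hypothetical, and the loss is assumed to be the mean squared TD error, since the slides name a loss function without writing it out.

```python
def dqn_targets(rewards, dones, next_q_target, gamma=0.99):
    """TD targets r + gamma * max_a' Q_target(s', a'); no bootstrap on terminal steps."""
    targets = []
    for r, done, q_next in zip(rewards, dones, next_q_target):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Double Q-learning targets: the online network picks a', the target network scores it."""
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, next_q_online, next_q_target):
        if done:
            targets.append(r)
            continue
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        targets.append(r + gamma * q_tg[best_action])
    return targets


def squared_td_loss(q_taken, targets):
    """Mean squared TD error between Q(s, a) for the actions taken and the fixed targets."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Hypothetical mini-batch of two transitions with two actions each.
rewards = [1.0, 0.0]
dones = [False, True]
next_q_online = [[0.2, 0.9], [0.5, 0.1]]   # assumed online-network outputs for s'
next_q_target = [[1.0, 0.4], [0.3, 0.7]]   # assumed target-network outputs for s'

print(dqn_targets(rewards, dones, next_q_target, gamma=0.5))
print(double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.5))
```

In the first transition the plain target takes the target network's own maximum, while the double variant evaluates the action preferred by the online network, which yields a smaller target here. In a full training loop the batch would be sampled at random from the experience replay memory, and the target network would be refreshed from the online network only periodically.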