Artificial Intelligence (Nilsson edition, English slides), Chapter 3 (27 slides)
Neural Networks
Chapter 3

Outline
  3.1 Introduction
  3.2 Training Single TLUs
    Gradient Descent
    Widrow-Hoff Rule
    Generalized Delta Procedure
  3.3 Neural Networks
    The Backpropagation Method
    Derivation of the Backpropagation Learning Rule
  3.4 Generalization, Accuracy, and Overfitting
  3.5 Discussion

3.1 Introduction
  TLU (threshold logic unit): the basic unit of neural networks
    Based on some properties of biological neurons
  Training set
    Input: vectors $X$ of real or Boolean values
    Output: $d_i$, the associated action (label, class)
  Target of training
    Find an $f(X)$ that corresponds "acceptably" to the members of the training set.
    Supervised learning: labels are given along with the input vectors.

3.2.1 TLU Geometry
  Training a TLU: adjusting its variable weights
  A single TLU: Perceptron, Adaline (adaptive linear element) [Rosenblatt 1962, Widrow 1962]
  Elements of a TLU
    Weight: $W = (w_1, \ldots, w_n)$
    Threshold: $\theta$
  Output of the TLU, using the weighted sum $s = W \cdot X$:
    1 if $s - \theta \ge 0$
    0 if $s - \theta < 0$
  Hyperplane: $W \cdot X - \theta = 0$

3.2.2 Augmented Vectors
  Adopt the convention that the threshold is fixed at 0.
  Arbitrary thresholds are handled with (n + 1)-dimensional vectors: $X = (x_1, \ldots, x_n, 1)$ and $W = (w_1, \ldots, w_n, w_{n+1})$ with $w_{n+1} = -\theta$
  Output of the TLU:
    1 if $W \cdot X \ge 0$
    0 if $W \cdot X < 0$

3.2.3 Gradient Descent Methods
  Training a TLU: minimize the error function by adjusting the weight values.
  Two ways: batch learning vs. incremental learning
  Commonly used error function: squared error, $\varepsilon = (d - f)^2$ for a single input vector
  Gradient: $\frac{\partial \varepsilon}{\partial W} = \left[ \frac{\partial \varepsilon}{\partial w_1}, \ldots, \frac{\partial \varepsilon}{\partial w_{n+1}} \right]$
  Chain rule: $\frac{\partial \varepsilon}{\partial W} = \frac{\partial \varepsilon}{\partial s} \frac{\partial s}{\partial W} = -2 (d - f) \frac{\partial f}{\partial s} X$
  Handling the nonlinearity in $\partial f / \partial s$ (the threshold function is not differentiable):
    Ignore the threshold function: $f = s$
    Replace the threshold function with a differentiable nonlinear function

3.2.4 The Widrow-Hoff Procedure
  Weight update procedure: use $f = s = W \cdot X$
  Targets: data labeled 1 → 1, data labeled 0 → −1
  Gradient: $\frac{\partial \varepsilon}{\partial W} = -2 (d - f) X$
  New weight vector, the Widrow-Hoff (delta) rule: $W \leftarrow W + c (d - f) X$, where $c$ is the learning rate (the factor 2 is absorbed into $c$)
    $(d - f) > 0$ → increasing $s$ → decreasing $(d - f)$
    $(d - f) < 0$ → decreasing $s$ → increasing $(d - f)$

The Generalized Delta Procedure
  Sigmoid function (differentiable): $f(s) = \frac{1}{1 + e^{-s}}$ [Rumelhart et al. 1986]

The Generalized Delta Procedure (II)
  Gradient: $\frac{\partial \varepsilon}{\partial W} = -2 (d - f) f (1 - f) X$
  Generalized delta procedure: $W \leftarrow W + c (d - f) f (1 - f) X$
    Target outputs: 1, 0
    Output $f$ = output of the sigmoid function
    $f (1 - f) = 0$ when $f = 0$ or $f = 1$
    Weight change can occur only within the "fuzzy" region surrounding the hyperplane (near the point $f(s) = 1/2$).
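The two single-unit update rules above can be compared in a short sketch. This is a minimal illustration, not code from the slides: the helper names (`dot`, `train`, `widrow_hoff_step`, `generalized_delta_step`), the AND toy data, the learning rates, and the epoch counts are all assumptions chosen for the example. Inputs are augmented with a constant 1 so that the threshold is folded into the weight vector.

```python
import math


def dot(w, x):
    """Weighted sum s = W . X."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(s):
    """Differentiable replacement for the threshold function."""
    return 1.0 / (1.0 + math.exp(-s))


def widrow_hoff_step(w, x, d, c):
    """W <- W + c (d - f) X, with the threshold ignored so that f = s."""
    f = dot(w, x)
    return [wi + c * (d - f) * xi for wi, xi in zip(w, x)]


def generalized_delta_step(w, x, d, c):
    """W <- W + c (d - f) f (1 - f) X, with f the sigmoid output."""
    f = sigmoid(dot(w, x))
    return [wi + c * (d - f) * f * (1.0 - f) * xi for wi, xi in zip(w, x)]


def train(step, samples, c, epochs):
    """Incremental learning: update after every sample, in a fixed order."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            w = step(w, x, d, c)
    return w


# Hypothetical toy data: logical AND, inputs augmented with a constant 1.
AND_INPUTS = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
AND_LABELS = [0, 0, 0, 1]

# Widrow-Hoff targets: label 1 -> 1, label 0 -> -1.
wh_samples = [(x, 1 if y == 1 else -1) for x, y in zip(AND_INPUTS, AND_LABELS)]
w_wh = train(widrow_hoff_step, wh_samples, c=0.05, epochs=1000)

# Generalized delta targets: 1 and 0.
gd_samples = list(zip(AND_INPUTS, AND_LABELS))
w_gd = train(generalized_delta_step, gd_samples, c=1.0, epochs=5000)

# Classify with the trained weights: s >= 0 for the linear unit,
# f >= 1/2 for the sigmoid unit (both test the side of the hyperplane).
wh_pred = [1 if dot(w_wh, x) >= 0 else 0 for x in AND_INPUTS]
gd_pred = [1 if sigmoid(dot(w_gd, x)) >= 0.5 else 0 for x in AND_INPUTS]
print(wh_pred, gd_pred)
```

Both rules share the same loop and differ only in the step function: the generalized delta step carries the extra factor f(1 − f), which shrinks the update when the sigmoid output is close to 0 or 1.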
The Error-Correction Procedure
  Using a threshold unit: $(d - f)$ is either 1 or −1 whenever the output is wrong.
  In the linearly separable case, $W$ converges to a solution after finitely many iterations.
  When the data are not linearly separable, $W$ never converges.
  The Widrow-Hoff and generalized delta procedures find minimum-squared-error solutions even when the minimum error is not zero.

Training Process
  [Figure: the data supply input X(k) to the network (NN); its output f(k) is subtracted from the target d(k), and the difference drives the update rule.]

3.3 Neural Networks
  Need for multiple TLUs
  Feedforward network: no cycles
  Recurrent network: cycles (treated in a later chapter)
  Layered feedforward network: the j-th layer can receive input only from the (j − 1)-th layer.
  Example: [Figure]

Notation
  Hidden units: neurons in all but the last layer
  Output of the j-th layer: $X^{(j)}$ → input of the (j+1)-th layer
  Input vector: $X^{(0)}$
  Final output: $f$
  Weight vector of the i-th sigmoid unit in the j-th layer: $W_i^{(j)}$
  Weighted sum of the i-th sigmoid unit in the j-th layer: $s_i^{(j)}$
  Number of sigmoid units in the j-th layer: $m_j$

[Figure 3.5]

3.3.3 The Backpropagation Method
  Gradient with respect to $W_i^{(j)}$: $\frac{\partial \varepsilon}{\partial W_i^{(j)}} = -2 (d - f) \frac{\partial f}{\partial s_i^{(j)}} X^{(j-1)}$
  Local gradient: $\delta_i^{(j)} = (d - f) \frac{\partial f}{\partial s_i^{(j)}}$
  Weight update: $W_i^{(j)} \leftarrow W_i^{(j)} + c_i^{(j)} \delta_i^{(j)} X^{(j-1)}$

Weight Changes in the Final Layer
  Local gradient (final layer $k$, a single sigmoid unit): $\delta^{(k)} = (d - f) f (1 - f)$
  Weight update: $W^{(k)} \leftarrow W^{(k)} + c^{(k)} (d - f) f (1 - f) X^{(k-1)}$

3.3.5 Weights in Intermediate Layers
  Local gradient: $\delta_i^{(j)} = (d - f) \frac{\partial f}{\partial s_i^{(j)}} = \sum_{l=1}^{m_{j+1}} \delta_l^{(j+1)} \frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}}$
  The final output $f$ depends on $s_i^{(j)}$ through each of the summed inputs to the sigmoids in the (j+1)-th layer.
  Need to compute $\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}}$.

Weight Update in Hidden Layers (cont.)
  $\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} = \sum_{v} w_{vl}^{(j+1)} \frac{\partial f_v^{(j)}}{\partial s_i^{(j)}}$
  $v \ne i$: $\frac{\partial f_v^{(j)}}{\partial s_i^{(j)}} = 0$
  $v = i$: $\frac{\partial f_v^{(j)}}{\partial s_v^{(j)}} = f_v^{(j)} (1 - f_v^{(j)})$
  Consequently: $\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} = w_{il}^{(j+1)} f_i^{(j)} (1 - f_i^{(j)})$

Weight Update in Hidden Layers (cont.)
  Attention to the recursive equation for the local gradient: $\delta_i^{(j)} = f_i^{(j)} (1 - f_i^{(j)}) \sum_{l=1}^{m_{j+1}} \delta_l^{(j+1)} w_{il}^{(j+1)}$
  Backpropagation: the error is back-propagated from the output layer toward the input layer.
  The local gradient of the later layer is used to calculate the local gradient of the earlier layer.

3.3.5 (cont.)
  Example (even parity function)
  Learning rate: 1.0

Generalization, Accuracy, & Overfitting
  Generalization ability: the NN appropriately classifies vectors not in the training set.
    Measurement = accuracy
  Curve fitting
    The number of training input vectors should be larger than the number of degrees of freedom of the network.
    Given m data points, is an (m − 1)-degree polynomial the best model? No: it passes through every point but cannot capture any special information about the data.
  Overfitting
    Extra degrees of freedom are essentially just fitting the noise.
    Given sufficient data, Occam's razor dictates choosing the lowest-degree polynomial that adequately fits the data.

Overfitting
  [Figure]

Generalization (contd)
  Out-of-sample-set error rate: the error rate on data drawn from the same underlying distribution as the training set.
  Divide the available data into a training set and a validation set.
    Usually 2/3 is used for training and 1/3 for validation.
  k-fold cross validation
    Split the data into k disjoint subsets (called folds).
    Repeat training k times, each time with one fold as the validation set and the other k − 1 folds combined as the training set.
    Take the average of the k validation error rates as the out-of-sample error.
    Empirically 10-fol