Deep learning with multiplicative interactions


Overview
- Background: how to learn a multilayer generative model of unlabeled data using a Restricted Boltzmann Machine
- How to fine-tune for better discrimination
- A speech recognition example (Dahl & Mohamed)
- The new idea: RBMs with factored, 3-way interactions
- Why generative models need 3-way interactions
- Factorizing 3-way interactions to save on parameters
- Inference and learning in the factored 3-way model
- Memisevic: learning how images transform over time
- Taylor: transforming a model of human motion
- Ranzato: creating a pixel covariance matrix on the fly; applied to object recognition in tiny color images

Restricted Boltzmann Machines
- We restrict the connectivity to make learning easier.
- Only one layer of stochastic binary hidden units.
- No connections between hidden units.
- In an RBM, the hidden units are conditionally independent given the visible states, so we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
- (Figure: bipartite graph of hidden units j and visible units i; bias terms are left out to simplify the math.)

The energy of a joint configuration (ignoring terms to do with biases)
- The energy is defined over a binary vector v on the visible units and a binary vector h on the hidden units; the slide's equation annotates the binary state of visible unit i, the binary state of hidden unit j, and the weight between units i and j (reconstructed below).

Using energies to define probabilities
- The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations (normalized by the partition function).
- The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
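
The equations on the energy and probability slides did not survive extraction; reconstructed in the standard notation for a binary RBM (bias terms omitted, as the slide notes):

```latex
% Energy of a joint configuration of visible vector v and hidden vector h:
E(\mathbf{v},\mathbf{h}) = -\sum_{i \in \mathrm{vis}} \sum_{j \in \mathrm{hid}} v_i \, h_j \, w_{ij}
% Probability of a joint configuration (Z is the partition function):
p(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z}, \qquad
Z = \sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}
% Probability of a visible configuration (sum over all hidden configurations):
p(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}
```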

A picture of the maximum likelihood learning algorithm for an RBM
- (Figure: alternating Gibbs chains between visible units i and hidden units j at t = 0, t = 1, t = 2, ..., t = infinity; the sample at t = infinity is a "fantasy".)
- Start with a training vector on the visible units.
- Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

A quick way to learn an RBM
- (Figure: the same chain truncated at t = 0 and t = 1, comparing "data" with "reconstruction".)
- Start with a training vector on the visible units.
- Update all the hidden units in parallel.
- Update all the visible units in parallel to get a "reconstruction".
- Update the hidden units again.
- This is not following the gradient of the log likelihood, but it works well. It is approximately following the gradient of another objective function called contrastive divergence (Hinton, 2002).
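
A minimal numpy sketch of this one-step contrastive divergence (CD-1) update for a binary RBM; the array names, learning rate and helper functions are illustrative, not taken from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01, rng=None):
    """One CD-1 weight update for a binary RBM.
    v0: (batch, n_vis) binary data; W: (n_vis, n_hid);
    a: visible biases (n_vis,); b: hidden biases (n_hid,)."""
    rng = rng or np.random.default_rng()
    # Positive phase: hidden probabilities and a sample, driven by the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One step of Gibbs sampling: reconstruct the visibles, then the hiddens.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # Approximate gradient: <v h>_data minus <v h>_reconstruction.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```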

Training a deep network (the main reason RBMs are interesting)
- First train a layer of features that receive input directly from the pixels.
- Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
- This creates a multi-layer generative model.
- It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
- The proof is complicated.
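
A sketch of this greedy layer-wise stacking; layer sizes are placeholders, and train_rbm stands in for any RBM trainer (for example, a loop over the CD-1 update sketched above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_stack(pixels, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training.
    train_rbm(data, n_hid) -> (W, a, b) trains one RBM on 'data'."""
    layers, data = [], pixels
    for n_hid in layer_sizes:
        W, a, b = train_rbm(data, n_hid)
        layers.append((W, a, b))
        # Treat the feature activations as if they were pixels for the next layer.
        data = sigmoid(data @ W + b)
    return layers
```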

Fine-tuning for discrimination
- First learn one layer of features at a time without using label information.
- Then add a final layer of label units.
- Then use backpropagation from the label units to fine-tune the features that were learned in the unsupervised "pre-training" phase (sketched below).
- This overcomes many of the limitations of standard backpropagation.
- The label information is used to adjust decision boundaries, not to discover features.
- It finds much deeper minima that generalize much better (Bengio lab).

Why unsupervised pre-training makes sense
- (Figure: two generative stories relating "stuff", image and label, with a high-bandwidth pathway to the image and a low-bandwidth pathway to the label.)
- If image-label pairs were generated by a direct mapping from image to label, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?
- If image-label pairs are generated by hidden "stuff" that causes both, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
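
A compact sketch of the fine-tuning recipe from the "Fine-tuning for discrimination" slide: take the pre-trained stack, bolt a softmax label layer on top, and backpropagate the cross-entropy gradient through every layer. All names, shapes and the learning rate are illustrative assumptions, not the slides' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fine_tune_step(x, y_onehot, layers, W_lab, b_lab, lr=0.1):
    """One backprop step. layers: list of (W, b) pairs from pre-training;
    W_lab, b_lab: the added label (softmax) layer."""
    # Forward pass through the pre-trained feature layers.
    acts = [x]
    for W, b in layers:
        acts.append(sigmoid(acts[-1] @ W + b))
    probs = softmax(acts[-1] @ W_lab + b_lab)
    # Cross-entropy gradient at the softmax output.
    delta = (probs - y_onehot) / len(x)
    grad_W_lab = acts[-1].T @ delta
    grad_b_lab = delta.sum(axis=0)
    delta = delta @ W_lab.T * acts[-1] * (1 - acts[-1])
    W_lab -= lr * grad_W_lab
    b_lab -= lr * grad_b_lab
    # Propagate through the feature layers, nudging the pre-trained weights.
    for i in reversed(range(len(layers))):
        W, b = layers[i]
        grad_W = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:
            delta = delta @ W.T * acts[i] * (1 - acts[i])
        W -= lr * grad_W
        b -= lr * grad_b
    return layers, W_lab, b_lab
```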

A neat application of deep learning
- A very deep belief net beats the record at phone recognition on the very well-studied TIMIT database.
- The task: predict the probabilities of 183 context-dependent phone labels for the central frame of a short window of speech.
- The training procedure: train lots of big layers, one at a time, without using the labels; add a 183-way softmax of context-specific phone labels; fine-tune with backprop on a big GPU board for several days.
- The performance: after the standard post-processing using a bi-phone model, this gets a 23.0% phone error rate. Our speech experts believe that this beats all previous recognition methods that use the standard decoder.
- For TIMIT, the classification task is a bit easier than the recognition task. Deep networks are the best at classification too (Honglak Lee).

One very deep belief net for phone recognition
- (Figure, bottom to top: 11 frames of 39 MFCCs -> 2000 binary hidden units -> 2000 binary hidden units -> 2000 binary hidden units -> 2000 binary hidden units -> 128 units -> 183 labels; one part of the stack is annotated "not pre-trained".)
- Mohamed, Dahl & Hinton: poster in the NIPS speech workshop on Saturday.
- The Mel Cepstrum Coefficients (MFCCs) are a standard representation for speech.

A simple real-valued visible unit
- We model MFCC coefficients as Gaussian variables that are independent given the hidden states.
- Alternating Gibbs sampling is still easy, but learning needs to be much slower.
- (Figure: a parabolic containment energy for each visible unit, whose minimum is shifted by the energy gradient due to top-down input to unit i.)
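
The slide's energy formula is not in the extracted text; a standard Gaussian visible-unit energy that matches the description (a parabolic containment term per visible unit, shifted by the top-down input) is the following, where a_i is the visible bias and sigma_i the standard deviation of visible unit i:

```latex
E(\mathbf{v},\mathbf{h}) =
    \sum_{i \in \mathrm{vis}} \frac{(v_i - a_i)^2}{2\sigma_i^2}
  - \sum_{j \in \mathrm{hid}} b_j h_j
  - \sum_{i,j} \frac{v_i}{\sigma_i} \, h_j \, w_{ij}
% Given h, each v_i is Gaussian with mean a_i + sigma_i * sum_j h_j w_ij:
% the quadratic term is the parabolic containment, and the top-down input
% shifts where the parabola sits.
```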

The new idea
- The basic RBM module is flawed: it is no good at dealing with multiplicative interactions.
- Multiplicative interactions are ubiquitous:
  - Style and content (Freeman and Tenenbaum)
  - Image transformations (TensorFaces)
  - Heavy-tailed distributions caused by multiplying together two Gaussian-distributed variables.

Generating the parts of an object: why multiplicative interactions are useful
- One way to maintain the constraints between the parts is for the level above to specify the location of each part very accurately, but this would require a lot of communication bandwidth.
- Sloppy top-down specification of the parts is less demanding, but it messes up the relationships between parts.
- So use redundant features and specify lateral interactions to sharpen up the mess: each part helps to locate the others. This allows a noisy top-down channel.

Generating the parts of an object
- (Figure: pose parameters for a "square" produce sloppy top-down activation of parts; the parts with top-down support are then cleaned up using lateral interactions specified by the layer above.)
- It's like soldiers on a parade ground.

Towards a more powerful, multi-linear stackable learning module
- We want the states of the units in one layer to modulate the pair-wise interactions in the layer below (not just the biases).
- Can we do this without losing the nice property that the hidden units are conditionally independent given the visible states?
- To modulate pair-wise interactions we need higher-order Boltzmann machines. These have far too many parameters, but we have a trick for fixing that.

Higher-order Boltzmann machines (Sejnowski, 1986)
- The usual energy function is quadratic in the states, but we could use higher-order interactions (the two energy functions are reconstructed below).
- Unit k acts as a switch: when unit k is on, it switches in the pairwise interaction between unit i and unit j.
- Units i and j can also be viewed as switches that control the pairwise interactions between the other two units.
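
The two energy functions shown on this slide were images; in standard notation they are:

```latex
% Quadratic (second-order) energy of an ordinary Boltzmann machine:
E(\mathbf{s}) = -\sum_{i<j} s_i \, s_j \, w_{ij}
% Third-order energy, in which each weight couples a triple of units,
% so any one unit gates the pairwise interaction between the other two:
E(\mathbf{s}) = -\sum_{i<j<k} s_i \, s_j \, s_k \, w_{ijk}
```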

Using higher-order Boltzmann machines to model image transformations (the unfactored version, Memisevic & Hinton, CVPR 2007)
- A global transformation specifies which pixel goes to which other pixel.
- Conversely, each pair of similar-intensity pixels, one in each image, votes for a particular global transformation.
- (Figure: three-way connections between image(t), image(t+1) and the image-transformation units.)

Factoring three-way multiplicative interactions
- Unfactored: cubically many parameters.
- Factored: linearly many parameters per factor.

A picture of the rank-1 tensor contributed by factor f
- It's a 3-way outer product.
- Each layer is a scaled version of the same rank-1 matrix.
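
The unfactored and factored energies were shown as equations on these slides; in standard notation (biases omitted, with x the pre-image, y the post-image and h the hidden transformation units) they take the form:

```latex
% Unfactored: one parameter per (input pixel, output pixel, hidden unit) triple.
E = -\sum_{i,j,k} x_i \, y_j \, h_k \, w_{ijk}
% Factored: each factor f contributes a rank-1 tensor, the 3-way outer product
% of its three weight vectors, so the parameter count grows linearly with the
% number of factors.
E = -\sum_{f} \Big(\sum_i x_i \, w^{x}_{if}\Big)
              \Big(\sum_j y_j \, w^{y}_{jf}\Big)
              \Big(\sum_k h_k \, w^{h}_{kf}\Big)
```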

Inference with factored three-way multiplicative interactions
- What unit h needs to know in order to do Gibbs sampling is how changing its binary state changes the energy contributed by each factor f.
- (The slide shows the energy contributed by factor f and the resulting input to unit h.)

Belief propagation
- The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.

Learning with factored three-way multiplicative interactions
- (The slide shows the learning rule in terms of the message from factor f to unit h, with example receptive fields in the pre-image and in the post-image.)

Showing what a factor learns by alternating between its pre- and post-fields
- (Figure: each factor's pre-image and post-image receptive fields shown side by side.)
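
A small numpy sketch of this inference rule: project both images onto each factor, multiply the two projections, and send the weighted products to the hidden units. Array names and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_probs(x, y, Wx, Wy, Wh, b_h):
    """Factored 3-way inference.
    x: pre-image pixels (n_x,); y: post-image pixels (n_y,);
    Wx: (n_x, n_factors); Wy: (n_y, n_factors); Wh: (n_hid, n_factors)."""
    fx = x @ Wx          # each factor's weighted sum over the pre-image
    fy = y @ Wy          # each factor's weighted sum over the post-image
    # Belief propagation: the message to the hidden vertex of factor f is the
    # product of the weighted sums at the other two vertices.
    msg = fx * fy        # (n_factors,)
    return sigmoid(Wh @ msg + b_h)   # hiddens stay conditionally independent
```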

The factor receptive fields
- (Figures over several slides: the learned pre- and post-image receptive fields when the network is trained on translated random dot patterns, and when it is trained on rotated random dot patterns.)

How does it perceive two overlaid sparse dot patterns moving in different directions?
- First we train a second hidden layer. Each of these units prefers motion in a different direction.
- Then we compute the perceived motion by adding up the preferences of the active units in the second hidden layer.
- If the two motions are within about 30 degrees, it sees a single average motion.
- If they are further apart, it sees two separate motions. The separate motions are slightly further apart than the real ones.
- This is just like human perception, and it was not trained on transparent motion. The training is entirely unsupervised.

Time series models
- Inference is difficult in directed models of time series if we use non-linear, distributed representations in the hidden units.
- It is hard to fit directed graphical models to high-dimensional sequences (e.g. motion capture data), so people tend to use methods with much less representational power:
  - HMMs give up on distributed representations.
  - Linear dynamical systems give up on non-linearity.

The conditional RBM model (a partially observed bipartite CRF)
- Start with a generic RBM and add two types of conditioning connections.
- Given the data, the hidden units at time t are conditionally independent.
- The autoregressive weights can model most short-term temporal structure very well, leaving the hidden units to model nonlinear irregularities.
- (Figure: visible frames at t-2, t-1 and t, with hidden units h conditioned on the earlier visible frames v.)
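
A numpy sketch of the two kinds of conditioning connections: the recent visible history adds a dynamic bias both to the current visible units (the autoregressive part) and to the hidden units. Names and shapes are illustrative, not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def crbm_hidden_probs(v_t, v_hist, W, A, B, a, b):
    """Conditional RBM at time t.
    v_t: current visible frame (n_vis,);
    v_hist: concatenated previous frames, e.g. [v(t-1); v(t-2)] (n_hist,);
    W: (n_vis, n_hid) RBM weights;
    A: (n_hist, n_vis) autoregressive history -> visible weights;
    B: (n_hist, n_hid) history -> hidden weights."""
    a_dyn = a + v_hist @ A   # dynamic visible bias (used when reconstructing v_t)
    b_dyn = b + v_hist @ B   # dynamic hidden bias
    p_h = sigmoid(v_t @ W + b_dyn)
    return p_h, a_dyn, b_dyn
```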

Causal generation from a learned model
- Keep the previous visible states fixed. They provide a time-dependent bias for the hidden units.
- Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
- This picks new hidden and visible states that are compatible with each other and with the recent history.

Higher level models
- Once we have trained the model, we can add more layers.
- Treat the hidden activities of the first CRBM as data for training the next CRBM.
- Add "autoregressive" connections to a layer when it becomes the visible layer.
- Adding a second layer makes it generate more realistic sequences.
- (Figure: a two-layer stack over visible frames at t-2, t-1 and t.)

An application to modeling motion capture data
- Human motion can be captured by placing reflective markers on the joints.
- Use lots of infrared cameras to track the 3-D positions of the markers.
- Given a skeletal model, the 3-D positions of the markers can be converted into:
  - the joint angles,
  - the 3-D translation of the pelvis,
  - the roll, pitch and delta yaw of the pelvis.

Using a style variable to modulate the interactions (there is additional weight sharing: Taylor & Hinton, ICML 2009)
- (Figure: 6 earlier visible frames and the current visible frame drive 600 hidden units through 200 factors; 100 style features, set by a 1-of-N style variable, modulate the interactions.)

Show demos of multiple styles of walking
- These can be found at www.cs.toronto.edu/gwtaylor/

Modeling the covariance structure of a static image by using two copies of the image
- Each factor sends the squared output of a linear filter to the hidden units.
- It is exactly the standard model of simple and complex cells. It allows complex cells to extract oriented energy.
- The standard model drops out of doing belief propagation for a factored third-order energy function.
- (Figure: Copy 1 and Copy 2 of the image feed the same factors.)

An advantage of modeling covariances between pixels rather than pixels
- During generation, a hidden "vertical edge" unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
- This gives some translational invariance, and it also gives a lot of invariance to brightness and contrast.
- The "vertical edge" unit acts like a complex cell.
- By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.

Using linear filters to model the inverse covariance matrix of two pixel intensities
- (Figure: the joint distribution of 2 pixels; each factor creates a parabolic energy trough, shallow for a small weight and steep for a big weight.)

Modulating the precision matrix by using additive contributions that can be switched off
- Use the squared outputs of a set of linear filters to create an energy function. The energy function represents the negative log probability of the data under a full-covariance Gaussian.
- Adapt the precision matrix to each datapoint by switching off the energy contributions from some of the linear filters.
- This is good for modeling smoothness constraints that almost always apply, but sometimes fail catastrophically (e.g. at edges).

Using binary hidden units to remove violated smoothness constraints
- When the negative input from the squared filter exceeds the positive bias, the hidden unit turns off.
- (Figure: free energy as a function of the filter output y.)
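
A numpy sketch of one such gating unit under the usual squared-filter formulation; the filter c, the positive pooling weight p and the positive bias b are illustrative assumptions. The unit stays on while the image satisfies the smoothness constraint and switches off once the squared filter output outweighs its bias:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def constraint_unit_on_prob(v, c, p, b):
    """Probability that one covariance hidden unit stays ON.
    v: image patch (n_pix,); c: linear filter (n_pix,);
    p: positive weight from the factor to this unit; b: positive bias."""
    violation = p * (c @ v) ** 2     # squared filter output, weighted
    # The squared term enters as negative input to the unit, so the unit
    # turns off when the violation exceeds the positive bias.
    return sigmoid(b - 0.5 * violation)
```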

Inference with hidden units that represent active smoothness constraints
- The hidden units are all independent given the pixel intensities: the factors do not create dependencies between hidden units.
- Given the states of the hidden units, the pixel intensity distribution is a full-covariance Gaussian that is adapted for that particular image.
- The hidden states do create dependencies between the pixels.

Learning with an adaptive precision matrix
- Since the pixel intensities are no longer independent given the hidden states, it is much harder to produce reconstructions.
- We could invert the precision matrix for each training example, but this is slow.
- Instead, we produce reconstructions using Hybrid Monte Carlo, starting at the data.
- The rest of the learning algorithm is the same as before.

Hybrid Monte Carlo
- Given the pixel intensities, we can integrate out the hidden states to get a free energy that is a deterministic function of the image.
- Backpropagation can then be used to get the derivatives of the free energy with respect to the pixel intensities.
- Hybrid Monte Carlo simulates a particle that starts at the datapoint with a random initial momentum and then moves over the free energy surface.
- 20 leapfrog steps work well for our networks.
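
A generic sketch of the leapfrog dynamics described above; the free_energy_grad callable, step size and step count are placeholders, and a full HMC sampler would also apply a Metropolis accept/reject test on the total energy:

```python
import numpy as np

def leapfrog(v0, free_energy_grad, n_steps=20, eps=0.01, rng=None):
    """Simulate a particle that starts at the datapoint v0 with a random
    momentum and moves over the free-energy surface for n_steps leapfrog steps.
    free_energy_grad(v) -> dF/dv, e.g. obtained by backpropagation."""
    rng = rng or np.random.default_rng()
    v = v0.copy()
    mom = rng.standard_normal(v.shape)          # random initial momentum
    mom -= 0.5 * eps * free_energy_grad(v)      # half step for the momentum
    for _ in range(n_steps - 1):
        v += eps * mom                          # full step for the position
        mom -= eps * free_energy_grad(v)        # full step for the momentum
    v += eps * mom
    mom -= 0.5 * eps * free_energy_grad(v)      # final half step
    return v                                    # the "reconstruction"
```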

mcRBM (mean and covariance RBM)
- Use one set of binary hidden units to model the means of the real-valued pixels. These hidden units learn blurry patterns for coloring in regions.
- Use a separate set of binary hidden units to model the image-specific precision matrix. These hidden units get their input from factors.
- The factors learn sharp edge filters for representing breakdowns in smoothness.
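
A numpy sketch of how the two kinds of hidden units see an image in this sort of model; the names are illustrative, and the precision-unit expression follows the same squared-filter form as the smoothness-constraint sketch above rather than claiming to be the paper's exact formula (which also normalizes the pixel vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcrbm_hidden_probs(v, W_mean, b_mean, C, P, b_cov):
    """v: pixel vector (n_pix,);
    W_mean: (n_pix, n_mean) weights of the 'mean' hidden units;
    C: (n_pix, n_factors) filters feeding the factors;
    P: (n_factors, n_cov) non-negative pooling weights to the 'covariance' units."""
    p_mean = sigmoid(v @ W_mean + b_mean)        # blurry colouring-in units
    filt_sq = (v @ C) ** 2                       # squared factor filter outputs
    p_cov = sigmoid(b_cov - 0.5 * filt_sq @ P)   # edge / precision units
    return p_mean, p_cov
```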

Receptive fields of the hidden units that represent the means
- Trained on 16x16 patches of natural images.

Receptive fields of the factors that are used to represent precisions
- Notice the color blobs with low-frequency red-green and yellow-blue filters.

Why is the map topographic?
- We laid out the factors in a 2-D grid and then connected each hidden unit to a small set of nearby factors.
- If two factors get activated at the same time, it pays to connect them to the same hidden unit: you only lose once by turning off that hidden unit.

Multiple reconstructions from the same hidden state of an mcRBM
- The mcRBM hidden states are the same for each row.
- The hidden states should reflect human similarity judgements much better than the squared difference of pixel intensities.

Test examples from the CIFAR-10 dataset
- Classes: plane, car, bird, cat, deer, dog, frog, horse, ship, truck.

Application to the CIFAR-10 labeled subset of the TINY images dataset (Marc'Aurelio Ranzato)
- There are 5000 32x32 training images and 1000 32x32 testing images for each of the 10 classes. In addition, there are 80 million unlabeled images.
- Train the mcRBM model on a very large number of 8x8 color patches: 81 hiddens for the mean, 144 hiddens and 900 factors for the precision.
- Replicate the patches across the 32x32 color images: 49 patches with a stride of 4. This gives 49 x 225 = 11025 hidden units.

How well does it discriminate?
- Compare with a Gaussian-binary RBM model that has the same number of hidden units, but only models the means of the pixel intensities.
- Use multinomial logistic regression directly on the hidden units representing the means and the hidden units representing the precisions.
- We can probably do better, but the aim is to evaluate the mcRBM idea.
- Also try unsupervised learning of extra hidden layers with a standard RBM to see if this gives even better features for discrimination.
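
A sketch of the patch replication described above: slide an 8x8 window over a 32x32 image with a stride of 4 (7 x 7 = 49 positions) and concatenate the per-patch hidden activations into one feature vector for the classifier. The encode_patch callable stands in for the trained mcRBM:

```python
import numpy as np

def image_features(img, encode_patch, patch=8, stride=4):
    """img: (32, 32, 3) color image.
    encode_patch(flat_patch) -> hidden activation vector (e.g. 81 + 144 = 225).
    Returns the concatenation over all 49 patch positions (49 * 225 = 11025)."""
    feats = []
    for r in range(0, img.shape[0] - patch + 1, stride):   # rows 0, 4, ..., 24
        for c in range(0, img.shape[1] - patch + 1, stride):
            flat = img[r:r + patch, c:c + patch].reshape(-1)
            feats.append(encode_patch(flat))
    # The resulting vectors are fed to multinomial logistic regression.
    return np.concatenate(feats)
```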

Percent correct on CIFAR-10 test data
- Gaussian RBM (only models the means); 49 x 225 = 11025 hiddens: 59.7%
- 3-way RBM (only models the covariances); 49 x 225 = 11025 hiddens, 225 filters per patch: 62.3%
- 3-way RBM (only models the covariances); 49 x 225 = 11025 hiddens, 900 filters per patch (extra factors allow pooling of similar filters): 67.8%
- mcRBM (models means & covariances); 49 x (81 + 144) = 11025 hiddens, 900 filters per patch: 69.1%
- mcRBM followed by an extra hidden layer of 8096 units; 49 x (81 + 144) = 11025 hiddens, 900 filters per patch: 72.1%

Summary
- It is easy to learn deep generative models of unlabeled data by stacking RBMs.
- RBMs can be modified to allow factored multiplicative interactions. Inference is still easy.
- Learning is still easy if we condition on one set of inputs (the pre-image for learning image transformations; the style for learning mocap).
- Multiplicative interactions allow an RBM to model pixel covariances in an image-specific way. This gives good hidden representations for object recognition.

THE END

A reading list on Deep Belief nets: www.cs.toronto.edu/hinton/deeprefs.html

First learn with all the weights tied
- This is exactly equivalent to learning an RBM.
- Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers.

Learning a deep directed network
- (Figure: an infinite directed net with tied weights, with layers v0, h0, v1, h1, v2, h2, ...)
- Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
- This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.

The hybrid generative model after learning 3 layers
- To generate data (see the sketch below):
  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
  2. Perform a top-down pass to get states for all the other layers.
- So the lower-level bottom-up connections are not part of the generative model; they are just used for inference.
- (Figure: a stack with data at the bottom, then h1, h2 and h3, with the top two layers forming the RBM.)
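
A sketch of that two-step generation procedure for a 3-layer stack; the weight names, orientations and Gibbs step count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def generate(W1, W2, W3, b_v, b_h1, b_h2, b_h3, n_gibbs=1000, rng=None):
    """W3 (with biases b_h2, b_h3) is the top-level RBM between h2 and h3;
    W2 and W1 supply the top-down (generative) weights for h2 -> h1 -> data."""
    rng = rng or np.random.default_rng()
    # 1. Alternating Gibbs sampling in the top-level RBM for a long time.
    h2 = sample(np.full(b_h2.shape, 0.5), rng)
    for _ in range(n_gibbs):
        h3 = sample(sigmoid(h2 @ W3 + b_h3), rng)
        h2 = sample(sigmoid(h3 @ W3.T + b_h2), rng)
    # 2. A single top-down pass through the directed part of the model.
    h1 = sample(sigmoid(h2 @ W2.T + b_h1), rng)
    v = sigmoid(h1 @ W1.T + b_v)    # visible probabilities: the generated data
    return v
```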

Learning with unreliable labels
- We can infer the true hidden label by combining evidence. This allows us to get surprisingly low error rates with very bad labels:
  - Perfect labels: 1%
  - Labels 50% wrong: 2%
  - Labels 80% wrong: 5%
- It's the mutual information that matters.
- (Figure: 28 x 28 pixel image -> 500 neurons -> 500 neurons -> 2000 top-level neurons, with 10 hidden labels related to 10 noisy labels by a confusion matrix.)
