计算机体系结构chapter6-4

资源描述

《计算机体系结构chapter6-4》由会员分享，可在线阅读，更多相关《计算机体系结构chapter6-4（26页珍藏版）》请在金锄头文库上搜索。

1、4 脉动阵列处理机n为要求计算量很大的信号/图像处理及科学计算的特定算法需要n卡内基-梅隆大学的美籍华人H.T.Kung于1978 年提出脉动阵列处理（Systolic Array）机n具有较高的计算并行性n脉动阵列结构原理n通用脉动阵列结构Systolic ArchitecturesnOrchestrate（编制、合成） data flow for high throughput with less memory accessnDifferent from pipeliningnNonlinear array structure, multidirection data flow, eac

2、h PE may have (small) local instruction and data memorynDifferent from SIMDnEach PE may do something differentnInitial motivationnVLSI enables inexpensive special-purpose chipsnRepresent algorithms directly by chips connected in regular patternSystolic ArchitecturesMPEPEPEMPEConventionalSystolic arr

3、aysReplace a processing element(PE) with an array of PEs without increasing I/O bandwidthTwo Communication StylesCPUCPUCPULocal MemoryLocal MemoryLocal MemorySystolic communicationMemory communicationCPULocal MemoryCPULocal MemoryCPULocal MemoryCharacteristicsnPractical realizations (e.g. Intel iWAR

4、P) use quite general processorsnEnable variety of algorithms on same hardwarenBut dedicated interconnect channelsnData transfer directly from register to register across channelnSpecialized, and same problems as SIMDnGeneral purpose systems work well for same algorithms (locality etc.)脉动阵列结构的构形n一维线形

5、n二维矩形n二维六边形n二维二叉树性n二维三角形n三维。举例n在一个脉动式二维阵列结构上进行两个3*3 矩阵相乘n每一个处理单元PE含有一个乘法器和一个加法器，完成一个内积运算Matrix Multiplicationa11 a12 a13 a21 a22 a23 a31 a32 a33*b11 b12 b13 b21 b22 b23 b31 b32 b33=c11 c12 c13 c21 c22 c23 c31 c32 c33Conventional Method: N3For I = 1 to NFor J = 1 to NFor K = 1 to NCI,J = CI,J + AJ,K

6、 * BK,J;Systolic MethodThis will run in O(n) time!To run in N time we need N x N processing units, in this case we need 9. P9P8P7P6P5P4P1P2P3We need to modify the input data, like so:a13 a12 a11 a23 a22 a21 a33 a32 a31b31 b32 b33 b21 b22 b23 b11 b12 b13Flip columns 1 & 3 Flip rows 1 & 3 and finally

7、stagger the data sets for input.P9P8P7P6P5P4P1P2P3a13 a12 a11a23 a22 a21a33 a32 a31b31 b21 b11b32 b22 b12b33 b23 b13At every tick of the global system clock data is passed to each processor from two different directions, then it is multiplied and the result is saved in a register. 3 4 2 2 5 3 3 2 5*

8、=3 4 2 2 5 3 3 2 523 36 28 25 39 34 28 32 37Lets try this using a systolic array.P9P8P7P6P5P4P1P2P32 4 33 5 23 2 35 2 35 3 22 5 43*32 43 5 23 25 2 35 3 22 5 4Clock tick: 1900000000P1P2P3P4P6P5P7P8P92*34*23*423 535 2 35 3 22 5Clock tick: 217120600000P1P2P3P4P6P5P7P8P93*32*45*22*34*53*235 25 32Clock t

9、ick: 3233261680900P1P2P3P4P6P5P7P8P93*42*22*25*53*32*24*355Clock tick: 42336182533413120P1P2P3P4P6P5P7P8P93*25*25*35*33*22*5Clock tick: 523362825391928226P1P2P3P4P6P5P7P8P92*35*23*5Clock tick: 6233628253934283212P1P2P3P4P6P5P7P8P95*5Clock tick: 7233628253934283237P1P2P3P4P6P5P7P8P9373228343925233628

10、233628253934283237P1P2P3P4P6P5P7P8P9Same answer! In 2n + 1 time, can we do better? The answer is yes, there is an optimization.脉动阵列结构特点n结构简单、规整，模块化强，可扩充，非常适合用超大规模集成电路实现；nPE间数据通信距离短、规则，使数据流和控制流的设计、同步控制等均简单规整；n脉动阵列中所有的PE能同时运算，具有较高的计算并行性，可通过流水获得很高的运算效率和吞吐率。输入数据能被多个处理单元重复使用，大大减轻了阵列与外界的I/O通信量，降低了对系统

11、主存和I/O系统频宽的要求；n脉动阵列的构形于特定计算任务和算法密切相关，具有某种专用性，限制了应用范围，这对VLSI不利；通用脉动阵列结构n关键因素：受阵列结构的通用性及I/O带宽约束所限制的阵列结构的规模大小n不同算法要求有不同的阵列结构，以及大小不同的阵列发展通用阵列结构的途径（1）n通过增设附加的硬件，对阵列的拓扑结构和互连方式用可编程开关进行重构，即经程序重新配置阵列的结构n美国Purdue大学，Chip (Configurable Highly Parallel Computer)可重构高度并行计算机发展通用阵列结构的途径（2）n用软件把不同的算法映像但固定的阵列结构上n美国卡内基-梅隆大学的WARP机发展通用阵列结构的途径（3）n探寻与问题大小无关的脉动处理方法，以及VLSI运算系统的分割矩阵算法，使他们可以克服阵列只能求解固定大小题目的缺陷，同时探寻发展适合一类计算问题的通用算法和相应的设置方案

展开阅读全文

计算机体系结构chapter6-4

最新文档