Computer Architecture

Main Contents
  Chapter 12: CPU Structure and Function
  Chapter 13: Reduced Instruction Set Computers
  Chapter 14: Instruction-Level Parallelism and Superscalar Processors
  Chapter 18: Parallel Processing

William Stallings, Computer Organization and Architecture, 7th Edition
Chapter 13: Reduced Instruction Set Computers
  Major Advances in Computers
  Instruction Execution Characteristics
  Use of a Large Register File
  Compiler-Based Register Optimization
  Reduced Instruction Set Architecture
  RISC Pipelining
  RISC vs. CISC Controversy

Major Advances in Computers (1)
  The family concept
    IBM System/360, 1964
    DEC PDP-8
    Separates architecture from implementation
  Microprogrammed control unit
    Idea by Wilkes, 1951
    Produced by IBM S/360, 1964
  Cache memory
    IBM S/360 Model 85, 1969

Major Advances in Computers (2)
  Solid-state RAM (see memory notes)
  Microprocessors
    Intel 4004, 1971
  Pipelining
    Introduces parallelism into the fetch-execute cycle
  Multiple processors

The Next Step: RISC
  RISC: Reduced Instruction Set Computer
  Key features
    Large number of general-purpose registers, or use of compiler technology to optimize register use
    Limited and simple instruction set
    Emphasis on optimizing the instruction pipeline

Comparison of Processors
  [table not reproduced]

Driving Force for CISC (1)
  CISC: Complex Instruction Set Computer
  Why CISC?
    Software costs far exceed hardware costs
    Increasingly complex high-level languages (HLLs)
    Semantic gap: the difference between the operations provided in HLLs and those provided in the computer architecture

Driving Force for CISC (2)
  Attempts to close the semantic gap lead to:
    Large instruction sets
    More addressing modes
    Hardware implementations of HLL statements, e.g. CASE (switch) on the VAX

Intention of CISC
  Ease compiler writing
  Improve execution efficiency
    Complex operations can be implemented in microcode
  Support more complex HLLs
  A totally different approach: simpler architecture

Origins of the RISC Design Approach
  A number of studies have been done to determine the characteristics and patterns of execution of machine instructions generated from HLL programs.
  The results of these studies inspired some researchers to look for an approach quite different from the complex-instruction-set one.
  A totally different approach: simpler architecture

Execution Characteristics
  The development of RISC was based on studies of instruction execution characteristics:
  Operations performed
    Determine the functions the CPU must perform and its interaction with memory
  Operands used (types and frequencies)
    Determine the memory organization and addressing modes
  Execution sequencing
    Determines the control and pipeline organization

Execution Characteristics: Dynamic and Static Measurements
  Dynamic measurements are collected while the program executes.
  Static measurements merely perform these counts on the source text of a program.
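
The difference between the two kinds of measurement can be illustrated with a minimal Python sketch. The toy program, its statement kinds, and the execution counts below are all hypothetical and chosen only for illustration; they are not taken from the studies cited in these notes.

```python
from collections import Counter

# Hypothetical toy program: (statement kind, number of times it executes).
PROGRAM = [
    ("assign", 1),    # total = 0
    ("loop", 1),      # loop statement, entered once
    ("assign", 100),  # total = total + a[i], inside the loop body
    ("if", 100),      # if total > limit, inside the loop body
    ("call", 3),      # report(total), reached only occasionally
]


def static_counts(program):
    """Count each statement kind once per appearance in the source text."""
    return Counter(kind for kind, _ in program)


def dynamic_counts(program):
    """Count each statement kind once per execution."""
    counts = Counter()
    for kind, executions in program:
        counts[kind] += executions
    return counts


if __name__ == "__main__":
    print("static: ", dict(static_counts(PROGRAM)))
    print("dynamic:", dict(dynamic_counts(PROGRAM)))
```

In this made-up program the IF statement is one of five statements in the source text, yet it accounts for nearly half of the statements actually executed (100 of 205), which is the kind of information only a dynamic measurement reveals.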
  Static counts of this kind give little useful performance information.

Relative Dynamic Frequency of HLL Operations (percent)

              Dynamic            Machine-Instruction   Memory-Reference
              Occurrence         Weighted              Weighted
              Pascal    C        Pascal    C           Pascal    C
  Assign        45      38         13      13            14      15
  Loop           5       3         42      32            33      26
  Call          15      12         31      33            44      45
  If            29      43         11      21             7      13
  GoTo           -       3          -       -             -       -
  Other          6       1          3       1             2       1

Operations
  Assignment statements predominate
    Movement of data is of high importance
  Preponderance of conditional statements (IF, LOOP)
    Sequence control of the instruction set is important

Operations (cont.)
  Procedure call-return is very time-consuming
  Some HLL statements lead to many machine-code operations

Operands
  Mainly local scalar variables
  Optimization should concentrate on accessing local variables

  Operand type (percent)   Pascal    C    Average
  Integer constant           16      23     20
  Scalar variable            58      53     55
  Array/structure            26      24     25

Procedure Calls
  Very time-consuming: the most time-consuming operation in compiled HLL programs
  To implement calls efficiently, two aspects are significant:
    The number of parameters passed
    The level of nesting
  Most programs do not do a lot of calls followed by lots of returns
  Most variables are local

Implications
  Making the instruction set architecture close to the HLL is not the most effective approach
  The best support is given by optimizing the most-used and most time-consuming features

Implications (cont.)
  Generalizing from the work of a number of researchers, three elements emerge that, by and large, characterize RISC architectures:
  1. Large number of registers
       Optimizes operand referencing; given locality of references, memory references are reduced
  2. Careful design of pipelines
       Conditional branches and procedure calls
  3. Simplified (reduced) instruction set

Use of a Large Register File
  From the analysis:
    Large number of assignment statements
    Most accesses are to local scalars
  Heavy reliance on register storage
  Minimize memory access

Approaches
  Software solution: maximize register usage
    Requires the compiler to allocate registers to the variables used most in a given period of time
    Requires sophisticated program analysis
  Hardware solution
    Have more registers
    Thus more variables will be in registers

Registers for Local Variables
  Store local scalar variables in registers
  Reduces memory access
  Some problems:
    Every procedure (function) call changes locality
    On every call, local variables must be saved to memory
    Parameters must be passed
    On return, results must be returned and the calling program's variables must be restored

Register Windows
  Solution: register windows, an organization of registers designed to address the problems above
  From the analysis:
    Only a few parameters and local variables
    Limited range of depth of call
  Use multiple small sets of registers
  Calls switch automatically to a different set of registers
  Returns switch back to a previously used set of registers

Overlapping Register Windows
  Three areas within a register set:
    Parameter registers
    Local registers
    Temporary registers

Register Windows (cont.)
  Temporary registers from one set overlap the parameter registers of the next; they are used to exchange parameters and results with the next lower level (the procedure called by the current procedure)
  The temporary registers at one level are physically the same as the parameter registers at the next lower level
  This allows parameter passing without moving data

Circular Buffer Diagram
  The actual organization of the register file is as a circular buffer of overlapping windows.

Operation of Circular Buffer
  When a call is made, the current window pointer (CWP) is moved to show the currently active register window
  If all windows are in use, an interrupt is generated and the oldest window (the one furthest back in the call nesting) is saved to memory (only the .in and .loc portions, i.e. the parameter and local registers, need to be saved)
  A saved window pointer indicates where the next saved window should be restored to

Operation of Circular Buffer (2)
  Studies show that 8 windows are enough to handle up to 99% of calls/returns without a save/restore
  E.g., Berkeley RISC uses 8 windows of 16 registers each

Global Variables: 2 Options
  Allocated by the compiler to memory
    Straightforward
    Inefficient for frequently accessed variables
  Have a set of registers for global variables
    E.g., registers 0-7 are global; registers 8-31 are local to the current window
    Increased hardware burden
    The compiler must decide which global variables should be assigned to registers

Registers v Cache

  Large Register File                              Cache
  All local scalars                                Recently used local scalars
  Individual variables                             Blocks of memory
  Compiler-assigned global variables               Recently used global variables
  Save/restore based on procedure nesting depth    Save/restore based on cache replacement algorithm
  Register addressing                              Memory addressing

Registers v Cache (cont.)
  Large register file
    Holds all local scalar variables
    Low space utilization (large windows, few parameters)
    Data transfer between registers and memory is infrequent
  Cache
    Selectively holds local scalar variables
    Can use space efficiently (dynamic updating)
    But also wastes some space (block transfers include unneeded data)
    Data transfer between registers and memory may be more frequent (set-associative mapping)

Registers v Cache: Where Registers Are Superior
  To reference a local scalar in a window-based register file, a window number and a "virtual" register number are used. These pass through a relatively simple decoder to select one of the physical registers.
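
The window-number-plus-virtual-register-number lookup just described can be sketched in Python. This is only an illustration under stated assumptions: the layout (8 global registers, then parameter, local, and temporary areas of 8 registers each, with 8 windows arranged as a circular buffer) follows the "registers 0-7 global, 8-31 local to the current window" example in these notes, but the constant names, the function name `physical_register`, and the direction in which the window index advances on a call are hypothetical.

```python
N_WINDOWS = 8        # assumed, following the "8 windows" figure in the notes
N_GLOBALS = 8        # virtual registers 0-7 are global
REGS_PER_AREA = 8    # assumed size of the parameter, local and temporary areas
WINDOW_STEP = 2 * REGS_PER_AREA          # new registers contributed per window
N_WINDOWED = N_WINDOWS * WINDOW_STEP     # size of the circular windowed file
N_VIRTUAL = N_GLOBALS + 3 * REGS_PER_AREA


def physical_register(window, virtual):
    """Map (window number, virtual register number) to a physical index."""
    if not 0 <= window < N_WINDOWS:
        raise ValueError("window number out of range")
    if not 0 <= virtual < N_VIRTUAL:
        raise ValueError("virtual register number out of range")
    if virtual < N_GLOBALS:
        # Global registers are shared by every window.
        return virtual
    # Offset 0-7: parameters, 8-15: locals, 16-23: temporaries.
    offset = virtual - N_GLOBALS
    # Modulo arithmetic makes the windowed file behave as a circular buffer.
    return N_GLOBALS + (window * WINDOW_STEP + offset) % N_WINDOWED


if __name__ == "__main__":
    # Temporaries of window 2 are the same physical registers as the
    # parameters of window 3, so a call passes parameters without copying.
    print(physical_register(2, 24), physical_register(3, 8))
```

The mapping needs only an addition and a wrap-around, which is the sense in which the decoder is "relatively simple" compared with generating and checking a full memory address.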
  To reference a location in cache, a full-width memory address must be generated. The complexity of this operation depends on the addressing mode. In a set-associative cache, one portion of the address is used to read a number of words and tags equal to the set size; another portion of the address is compared with the tags, and one of the words that were read is selected. It should be clear that even if the cache is as fast as the register file, the access time will be considerably longer. Thus, from the point of view of performance, the window-based register file is superior for local scalars. Further performance improvement could be achieved by adding a cache for instructions only.

Referencing a Scalar: Window-Based Register File
  [figure not reproduced; labels: "virtual" register number, window number; access is fast]

Referencing a Scalar: Cache
  [figure not reproduced; access is slow]

Compiler-Based Register Optimization
  Assume only a small number of registers (16-32) is available
  Optimizing register use is then the responsibility of the compiler
  HLL programs have no explicit references to registers
  The objective of the compiler is to keep the operands for as many computations as possible in registers rather than main memory, and to minimize load-and-store operations

Compiler-Based Register Optimization (cont.)
  1. Each program quantity that is a candidate for residing in a register is assigned to a symbolic (virtual) register
  2. The compiler maps the unlimited number of symbolic registers onto a fixed number of real registers
  3. Symbolic registers whose usage does not overlap can share the same real register
  4. If, in a particular portion of the program, there are more quantities to deal with than real registers, some of the quantities are assigned to memory locations

Optimization
  The essence of the optimization task is to decide which quantities are to be assigned to registers
  The technique is known as graph coloring
    Used in RISC compilers
    Borrowed from the discipline of topology

Graph Coloring
  Given a graph of nodes and edges
  Assign a color to each node
  Adjacent nodes have different colors
  Use the minimum number of colors
  Nodes are symbolic registers

Graph Coloring (cont.)
  Two symbolic registers that are live in the same program fragment are joined by an edge to show that they interfere
  Try to color the graph with n colors, where n is the number of real registers
  Nodes that cannot be colored are placed in memory

Graph Coloring Approach
  Assume a program with six symbolic registers is to be compiled onto three actual registers
  Part (a): time sequence of active use of the symbolic registers
  Part (b): register interference graph

A Trade-Off
  There is a trade-off between using a large register file and compiler-based register optimization
  With even simple register optimization, there is little benefit to the use of more than 64 registers
  With reasonably sophisticated register optimization
  techniques, there is only marginal performance improvement with more than 32 registers
  Studies show:
    64 registers are enough with simple register optimization
    32 registers are enough with sophisticated register optimization

Reduced Instruction Set Architecture

Why CISC (1)?
  Why CISC?
    Ease compiler writing
    Improve execution efficiency
  Compiler simplification?
    Disputed
    Complex machine instructions are harder to exploit: the compiler must find the cases that exactly fit the construct
    Optimization is more difficult, e.g. minimizing code size or enhancing pipelining

Why CISC (2)?
  Smaller programs?
    Program takes up less memory
      But memory is now cheap
    Fewer instructions to be fetched, reducing page faults
    A program that is shorter in symbolic machine language may not occupy fewer bits of memory
      More instructions require longer opcodes
      RISC tends to emphasize register references, and register references require fewer bits

Why CISC? Code Size Relative to RISC I (11 C programs)
    RISC I        1.0
    VAX-11/780    0.8
    M68000        0.9
    Z8002         1.2
    PDP-11/70     0.9
  CISC saves little or nothing relative to RISC
  The VAX shows little reduction over the PDP-11, although VAX instructions are far more complex

Why CISC (3)?
  Faster programs?
    More complex control unit
    Larger microprogram control store
    Thus simple instructions take longer to execute
  It is far from clear that CISC is the appropriate solution

RISC Characteristics
  One instruction per cycle
  Register-to-register operations
  Few, simple addressing modes
  Few, simple instruction formats

One Instruction per Machine Cycle
  In one machine cycle:
    Fetch two operands from registers
    Perform an ALU operation
    Store the result in a register
  There is little or no need for microcode
  Machine instructions can be hardwired
  Such instructions should execute faster than comparable machine instructions on other machines

Register-to-Register Operations
  Most operations are register-to-register
  Only simple LOAD and STORE operations access memory
  Simplifies the instruction set and the control unit
    A RISC may include only 1 or 2 ADD instructions
    The VAX has 25 different ADD instructions
  Encourages the optimization of register use
    Frequently accessed operands remain in high-speed storage

Simple Addressing Modes
  Almost all RISC instructions use simple register addressing
  Several additional modes may be included, such as displacement and PC-relative
  Simplifies the instruction set and the control unit

Simple Instruction Formats
  Only one or a few formats are used
  Instruction length is fixed and aligned on word boundaries
    A single instruction does not cross page boundaries
  Field locations, especially the opcode, are fixed
    Opcode decoding and register operand accessing can occur simultaneously
  Simplifies the control unit

CISC v RISC: Typical Characteristics of a RISC
  A single instruction size (typically 4 bytes)
  A small number of data addressing modes (typically fewer than five)
  No indirect addressing
  No operations that combine load/store with arithmetic
  No more than one memory-addressed operand per instruction
  Does not support arbitrary alignment of data for load/store operations

CISC v RISC
  RISC designs may benefit from the inclusion of some CISC features
  CISC designs may benefit from the inclusion of some RISC features
  The PowerPC is no longer a pure RISC machine
  The Pentium also incorporates RISC features

RISC Pipelining
  Most instructions are register-to-register
  Two phases of execution:
    I: Instruction fetch
    E: Execute (ALU operation with register input and output)
  Load and store operations need three phases:
    I: Instruction fetch
    E: Execute (calculate memory address)
    D: Memory (register-to-memory or memory-to-register operation)

Effects of Pipelining
  [Timing diagrams not reproduced. Panels include "Only one memory access per phase" and "Permitting two memory accesses per phase"; the time axes run from 1 to 13, 1 to 10, and 1 to 8.]

Controversy
  Problems:
    No pair of RISC and CISC machines is directly comparable
    No definitive set of test programs
    Difficult to separate hardware effects from compiler effects
    Most comparisons were done on "toy" rather than production machines
    Most commercial devices are a mixture of RISC and CISC characteristics

Homework
  Problem 13.7: To improve pipeline efficiency, a RISC compiler may both map symbolic registers to actual registers and rearrange instructions. This raises an interesting question: in which order should these two operations be done? Consider the following program fragment:

    LD   SR1, A
    LD   SR2, B
    ADD  SR3, SR1, SR2
    LD   SR4, C
    LD   SR5, D
    ADD  SR6, SR4, SR5

  (a) First do the register mapping and then the instruction reordering. How many machine registers are used? Has there been any pipeline improvement?
  (b) First do the instruction reordering and then the register mapping. How many machine registers are used? Has there been any pipeline improvement?
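
When working through register-mapping questions like the one above, it may help to see the graph coloring idea from these notes as code. The following Python sketch builds an interference graph from live ranges and colors it greedily. It is a simple heuristic written for illustration, not the allocator of any particular compiler, and it does not guarantee the minimum number of colors. The function names and the six live ranges are hypothetical; they are not the example from the slides, whose figure is not reproduced here, and they are not the program in Problem 13.7.

```python
from itertools import combinations


def build_interference(live_ranges):
    """Join two symbolic registers by an edge if their live ranges overlap.

    live_ranges maps a name to a half-open interval (start, end).
    """
    graph = {name: set() for name in live_ranges}
    for a, b in combinations(live_ranges, 2):
        (start_a, end_a), (start_b, end_b) = live_ranges[a], live_ranges[b]
        if start_a < end_b and start_b < end_a:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def color(graph, n_real):
    """Greedily assign one of n_real registers to each node.

    Nodes that cannot be colored are returned as spilled (kept in memory).
    """
    assignment, spilled = {}, []
    # Visit the most constrained nodes (highest degree) first.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {assignment[m] for m in graph[node] if m in assignment}
        free = [r for r in range(n_real) if r not in used]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)
    return assignment, spilled


# Hypothetical live ranges for six symbolic registers.
LIVE = {"A": (0, 4), "B": (1, 6), "C": (2, 3),
        "D": (3, 8), "E": (5, 9), "F": (7, 10)}

if __name__ == "__main__":
    graph = build_interference(LIVE)
    print(color(graph, 3))
    print(color(graph, 2))
```

With three real registers this particular example colors without spilling; with only two, some symbolic registers have to be placed in memory, as the Graph Coloring slides describe.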