Chapter 05: Scalar Processors (eng1, revised version)


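Central South University. Copyright Yu Lasheng (余腊生); all rights reserved, violators will be prosecuted.

Scalar processor
A processor that has only scalar data representations and a scalar instruction set is called a scalar processor.
Main ways to increase instruction execution speed:
(1) Raise the processor's clock frequency
(2) Use better algorithms and design better functional units
(3) Use instruction-level parallelism techniques
Three kinds of instruction-level parallel processors:
(1) Pipelined and superpipelined processors
(2) Superscalar processors
(3) Very long instruction word (VLIW) processors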

Pipelining: It's Natural!
Laundry example
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
"Folder" takes 20 minutes

Sequential Laundry
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight; loads A to D in task order, run one after another, each taking 30 + 40 + 20 minutes]

Pipelined Laundry
Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
[Figure: timeline from 6 PM; loads A to D overlapped, with segments of 30, 40, 40, 40, 40, and 20 minutes]

Pipelining Lessons
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to "fill" the pipeline and time to "drain" it reduce speedup

Computer Pipelines
Execute billions of instructions, so throughput is what matters
What is desirable in instruction sets for pipelining?
Variable-length instructions vs. all instructions the same length?
Memory operands part of any operation vs. memory operands only in loads or stores?
Register operands in many places in the instruction format vs. registers located in the same place?

Key Definitions
Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time.
A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment.
The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.

Pipeline Stages
We can divide the execution of an instruction into the following 5 "classic" stages:
IF: Instruction Fetch
ID: Instruction Decode, register fetch
EX: Execution
MEM: Memory Access
WB: Register Write Back

Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM 10 ns, WB 4 ns
Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long it takes to execute a single instruction in the pipeline.

Pipeline Throughput and Latency (continued)
Pipeline throughput: how often an instruction is completed.
Pipeline latency: how long it takes to execute an instruction in the pipeline.
Is this right?

Pipeline Throughput and Latency (continued)
Simply adding the stage delays to compute the pipeline latency would work only for an isolated instruction. With consecutive instructions, each one waits for the 10 ns MEM stage to be released by its predecessor:
I1: L(I1) = 28 ns
I2: L(I2) = 33 ns
I3: L(I3) = 38 ns
I4: L(I4) = 43 ns
We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
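The latencies of 28, 33, 38, and 43 ns can be reproduced with a small simulation. This is only a sketch under an assumption the slides do not state explicitly: an instruction enters a stage as soon as it has left the previous stage and the preceding instruction has left that stage, and it may wait between stages. The function name `simulate` and its parameters are hypothetical, chosen for illustration.

```python
def simulate(stage_delays, n_tasks):
    """Return a list of (start, finish) times, one pair per task.

    Assumed model (not from the slides): a task enters a stage once it has
    left the previous stage AND the preceding task has left this stage.
    """
    prev_finish = [0] * len(stage_delays)  # when the preceding task left each stage
    results = []
    for _ in range(n_tasks):
        t = prev_finish[0]  # the first stage is free once the preceding task leaves it
        start = t
        finish = []
        for s, delay in enumerate(stage_delays):
            t = max(t, prev_finish[s]) + delay  # wait for the stage, then occupy it
            finish.append(t)
        prev_finish = finish
        results.append((start, t))
    return results


def latencies(stage_delays, n_tasks):
    """Per-task latency: time from entering the first stage to leaving the last."""
    return [finish - start for start, finish in simulate(stage_delays, n_tasks)]


unbalanced = [5, 4, 5, 10, 4]   # IF, ID, EX, MEM, WB delays in ns
balanced = [10] * 5             # every stage stretched to the longest one

print(latencies(unbalanced, 4))
print(latencies(balanced, 4))
```

Under this model the unbalanced pipeline shows latency growing by 5 ns per instruction, while the balanced one gives every instruction the same 50 ns latency. Calling `simulate([30, 40, 20], 4)` with the laundry stage times finishes the last load at 210 minutes, matching the 3.5 hours quoted for pipelined laundry.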

Speed Up Equation for Pipelining

\[
\mathrm{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}
\]

\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]

For RISC, CPI = 1:

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]

Example: pipelined vs. unpipelined machine
Using this architecture, we will compare the unpipelined version of the machine to a pipelined one
Unpipelined machine has a 1 ns clock cycle
Pipelined machine has a 1 ns clock cycle plus 0.2 ns overhead
Assume a benchmark with the following operation mix and CPI:
frequency: 40% ALU, 20% branches, 40% load/store
CPI for unpipelined machine: ALU and branches: 4 cycles, load/store: 5 cycles
Unpipelined computer: average instruction execution time = clock cycle * average CPI = 1 ns * (40% * 4 + 20% * 4 + 40% * 5) = 4.4 ns
Pipelined computer: each instruction takes 5 * (1.2 ns) = 6.0 ns, but since 5 are being executed at the same time, the average instruction execution time = 6.0 ns / 5 = 1.2 ns
Speedup from pipelining: 4.4 ns / 1.2 ns = 3.7 times
Note: we would expect the speedup to be roughly equal to the number of stages; here our speedup …
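The arithmetic of the pipelined vs. unpipelined example can be checked with a few lines of Python. This is a minimal sketch: the helper name `average_cpi` and the variable names are hypothetical, and the pipelined machine is assumed to complete one instruction per 1.2 ns cycle with no stalls, as in the example.

```python
def average_cpi(mix):
    """Weighted average CPI for (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(freq * cycles for freq, cycles in mix)


# Operation mix from the example: (frequency, unpipelined cycles)
mix = [
    (0.40, 4),  # ALU
    (0.20, 4),  # branches
    (0.40, 5),  # load/store
]

clock_unpipelined_ns = 1.0
clock_pipelined_ns = 1.0 + 0.2  # 1 ns cycle plus 0.2 ns pipelining overhead

# Unpipelined: average time per instruction = clock cycle * average CPI
time_unpipelined_ns = clock_unpipelined_ns * average_cpi(mix)

# Pipelined (assumed ideal, no stalls): one instruction completes per cycle
time_pipelined_ns = clock_pipelined_ns

speedup = time_unpipelined_ns / time_pipelined_ns
print(f"unpipelined: {time_unpipelined_ns:.1f} ns per instruction")
print(f"pipelined:   {time_pipelined_ns:.1f} ns per instruction")
print(f"speedup:     {speedup:.1f}x")
```

The speedup stays below the 5-stage ideal for two reasons visible in the numbers: the unpipelined machine averages 4.4 cycles per instruction rather than 5, and the pipelined clock carries 0.2 ns of overhead.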
