Professional English Course Assignment (Xidian University)


Design and FPGA-based Implementation of a High Performance 32-bit DSP Processor

Tasnim Ferdous
Department of Electrical and Electronic Engineering
American International University-Bangladesh (AIUB)

Abstract
To meet the faster processing demand in consumer electronics, performance efficient DSP processor design is important. This paper presents a novel design and FPGA-based implementation of a 32 bit DSP processor to achieve high performance gain for reduced instruction set DSP processors. The proposed design includes a hazard-optimized pipelined architecture and a dedicated single cycle integer MAC to enhance the processing speed. Performance of the designed processor is evaluated against an existing similar reduced instruction set DSP processor (MUN DSP-2000). Synthesis results and performance analysis of each system building component confirmed a significant performance improvement in the proposed DSP processor over the compared one.

Keywords
DSP processor; FPGA; Pipelined; Single cycle MAC; Hazard Handling.

I. INTRODUCTION
(1) With the advent of personal computer, smart phones, gaming and other multimedia devices, the demand for DSP processor is ever increasing. The information world is migrating from analog to DSP based systems to support high speed processing. In the past, successful research effort has been made to integrate complex signal processing modules with conventional processors to optimize the speed. This paper demonstrates a novel design and FPGA based implementation of a 32 bit pipelined Digital Signal Processor with a reduced instruction set. The design is modeled with behavioral VHSIC Hardware Description Language (VHDL). The processor is designed to support basic DSP operations like digital filtering (we have designed FIR filtering). (2) The processor is integrated with a two stage pipeline which optimizes the speed by reducing propagation delay. Reduced propagation delay is ensured by allocating every step of an operation into independent pieces of hardware and running all operations in parallel. This two stage pipeline also provides a better Cycle per Instruction (CPI) of 1 because all the instructions need only 2 cycles to complete an operation.

The computation speed of a DSP processor can be enhanced by incorporating General Purpose Processor (GPP) architectures into DSPs while retaining the functions critical to DSP [2-3]. To enhance the processing speed of the proposed processor, a subset of the complete instruction set of a multi cycle RISC processor is included in this design. In addition to this, by incorporating two pipeline stages, a better throughput is achieved for a smaller number of instructions of this processor. Moreover, the hazard optimization for the pipelined architecture ensures a better performance of the processor. A performance evaluation shows that by using two pipeline stages, the proposed design has achieved 12.06 MB/s of throughput against 8 MB/s for an existing similar reduced instruction set DSP processor, MUN DSP-2000, which has a five stage pipeline.

A DSP processor is also characterized by fast multiply-accumulate and multiple-access memory architecture. The memory of a DSP processor is designed to optimize the overall speed of the processor. Data and instructions must flow into the numeric and sequencing sections of the DSP on every instruction cycle. There can be no delays and everything about the design focuses on throughput. To ensure better throughput, Harvard architecture is used, in which memory typically uses two separate memory buses. By using Harvard architecture instead of Von Neumann architecture, it doubles the throughput of this processor because separation of data and instructions gives this DSP processor the ability to fetch multiple items on each cycle.

(3) FPGAs are well suited for reducing combinational path as well as employing parallel operations which can provide a better solution for manipulating speed. A design implemented on XILINX Spartan-3E has the ability to provide high throughput and avoid the lengthy development cycles and the inherent inflexibility of conventional ASICs. In addition to this, digital filter implementation on FPGAs allows higher sampling rates than are available from traditional DSP chips, at lower cost. FPGA programmability permits design upgrades in the field with no hardware replacement necessary, an impossibility with ASICs. This can help the designer to perform the basic processes faster.
These advantages are the key reasons for choosing FPGA to implement this design work.

For digital filter applications, an efficient MAC operation requires one single system clock cycle to compute a successful filter output. The proposed design is modeled and synthesized for a dedicated MAC unit so that FIR filtering computations can be done in one cycle.

The performance analysis of the proposed design confirmed a clock speed of about 13 MHz against about 10 MHz for the existing MUN DSP-2000. The details and extended version of the proposed design and implementation are published in the Proceedings of International Journal of Scientific and Engineering Research.

The rest of the paper is organized as follows: Next few sections (II through VII) describe the design of the proposed DSP processor. Section VIII presents a performance analysis and discussion. (4) Finally, this paper concludes in section IX with summarization of the design process and outlines to future works.

II. DEDICATED MAC
The proposed design contains a dedicated processing unit for multiply and accumulation. This design has implemented an 8 tap MAC. This MAC deals with sample values which are signed values. Signed numbers are converted to unsigned numbers before they go to the shift register file. The corresponding MAC is modified to operate in one single cycle. The simulation result of the single cycle operation is shown in Fig. 1. This is a dedicated processing unit for manipulating FIR filter operation. (5) The MAC datapath is kept apart from the general purpose datapath where ALU is free from filtering task to deal with other instructions. For this two stage pipelined DSP processor, ALU needs 10 cycles to complete a two bit multiplication. (6) The comparison between ALU and single cycle MAC is given below where the total required cycles are shown in Table I to complete 2*2 multiplications.

(7) Since computations are done by a large number of hardware components in only one cycle, the duration of the clock cycle must be longer (twice) than the summation of all propagation delays of individual hardware components in a single cycle implementation of MAC. To reduce the propagation delay, eight external multipliers work in parallel, which reduces the clock period. This enhances the speed of the processor, which improves overall system performance. Based on the timing analysis, the worst delay of one multiplier is 23.764 ns. Using one multiplier for the overall calculation, the delay would be 23.764*8 = 190.112 ns. By using eight traditional multipliers, the delay is now only 23.764 ns as all the multiplications are performed in parallel to compute the filter output. This improves the overall speed. Then all the multiplication results were added
by a dedicated adder. Again flexibility is kept in this design, where the MAC can operate on either four sample values or eight sample values. Four extra MUXes were used to pass zero to the multipliers when the number of taps is four. This helped to eliminate invalid data from the filter value. To conduct parallel operations by the multipliers, it is required to receive all the sample values simultaneously. This design performs the filtering operation only after all the shift registers are filled with sample values.

III. TWO STAGE PIPELINING
Two stage pipelined architecture is one important feature of the design. In this design, both data paths include a single pipeline with two stages: 1) Execution Stage1, and 2) Execution Stage2. The primary reason for separating the stages is to let the system clock operate faster with less combinational delay. To enhance the speed, the general purpose data path is modified by adding some dedicated registers. (8) For the load or store type operation, a special register is used which helped to complete any load or store type operation only in two clock cycles. These registers also help to execute all instructions of this processor within two clock cycles while handling any hazard.

IV. HAZARD HANDLING
This design is modeled with a hazard free pipelined architecture which can handle two instructions simultaneously as it is a two stage pipelined processor. (9) Since two instructions work in parallel, the FSM is designed such a way that helps to detect hazards and can control the data path to complete computation by resolving the hazard. The FSM of this processor can handle both data and structural hazards. For the structural hazard, a MUX is used instead of the ALU at the second stage of pipelining. The reason for using a MUX in this model is to protect the ALU calculation from unwanted data, because a new instruction is using the ALU at its first stage of calculation.

The improvement from the hazard free FSM can be identified by comparing the pipelined architecture with and without hazard handling capability. An example is illustrated here for the comparison. For easier identification, some values (R1 = 2, R2 = 5, R3 = 8, R4 = 9, R5 = 4) are assumed for the instructions (R3 ← R1 + R2, R5 ← R3 + R4). When the FSM is not handling the hazard, R3 ← 7 and R5 ← 17 (adding the previous value of R3 (8) and R4 (9)), since R3 needs one more cycle to update with the most recent value
of addition (7) of R1 and R2 due to the two stage pipelining. This hazard is avoided by the FSM of the proposed processor, which updated R5 with the result of addition (16) of the most recent value of R3 (7) and R4 (9). Table III is used here to illustrate the comparison.

Performance gain of the proposed DSP processor is demonstrated by comparing the proposed processor (two stage pipelined) with the five stage pipelined architecture, MUN DSP-2000 [4]. (10) This existing design can implement 16 instructions which require 20 cycles to complete execution whereas the proposed design has implemented 27 instructions which require 28 cycles to complete execution. The throughput of the proposed DSP processor is 27/(28*80 ns) = 12.06 MB/s whereas the compared similar processor's throughput is 16/(20*100 ns) = 8 MB/s. The larger throughput demonstrates the improved performance of the proposed DSP processor.

(11) Based on the timing specification from the synthesis result of each system building component as shown in Table IV, the execution time for all the instructions depending on the different routes is found. The performance of each item is evaluated by comparing the maximum delay with the MUN DSP-2000 processor.

From this table it is clearly visible that the worst case delay of individual components of the proposed design is less than that of the compared one. Based on the timing analysis, the multiplier has less delay than in the previous design. The use of separate pipeline stages for the multipliers and the ALU in the proposed design improves the performance over the compared processor design, as the proposed design does not need any internal pipeline to calculate the FIR filter output. The use of eight parallel multipliers reduces delay. Usually in a pipeline, the clock cycle is decided by the slowest stage running time. (12) By considering worst case delays, the processor speed is found around 13 MHz whereas the compared processor design contains around 10MHz clock speed.

V. CONCLUSIONS AND FUTURE WORK
(13) In this paper, a pipelined DSP processor with reduced instructions set is illustrated for performance optimization. Primary
focus of the design is to achieve better throughput and higher speed than the compared one (MUN DSP-2000). The design is defined in VHDL and simulated using Modelsim 6.5. Each system building component is synthesized using Xilinx 8.2i and then implemented on a Xilinx Spartan-3E FPGA, which proved to work properly. (14) Each of the operations has been verified with both functional and post fit simulations which have demonstrated that the FSM of this DSP processor can successfully manipulate two instructions at a time even if hazards are present and produce correct cycle by cycle timing. The improved performance of this processor is analyzed by comparing throughput with MUN DSP-2000 (another reduced instruction set processor). The comparison shows that a better throughput (12.06 MB/s) can be achieved with the new design. In addition to this, the maximum delay of the proposed design is also compared with the existing system, and it is found that the new design has less delay in each system building component.

(15) In the future, the design can be extended to perform more operations. Moreover, the calculation speed can be enhanced by providing support for floating point operations.

Grammar Analysis

(1) With the advent of personal computer, smart phones, gaming and other multimedia devices, the demand for DSP processor is ever increasing.
Analysis: "with" introduces an adverbial; at the start of a sentence it can express condition, time, or cause. "With the advent of ..." is rendered as "随着……的出现" (as ... emerged); here it expresses condition.

(2) The processor is integrated with a two stage pipeline which optimizes the speed by reducing propagation delay.
Analysis: "which" introduces an attributive clause modifying the antecedent "pipeline"; the preposition "by" + gerund generally means "by means of".

(3) FPGAs are well suited for reducing combinational path as well as employing parallel operations which can provide a better solution for manipulating speed.
Analysis: "be suited for" is a fixed collocation, generally meaning "be suitable for". "as well as" is a conjunction, and the two verbs it joins take the same form; in this sentence "employing" and "reducing" are both objects of the preposition "for". "which can provide a better solution for manipulating speed" is an attributive clause modifying the antecedent "operations".

(4) Finally, this paper concludes in section IX with summarization of the design process and outlines to future works.
Analysis: "with" used as an adverbial at the end of a sentence expresses additional explanation, manner, condition, and so on; here it expresses manner.

(5) The MAC datapath is kept apart from the general purpose datapath where ALU is free from filtering task to deal with other instructions.
Analysis: "keep apart from" is a fixed collocation meaning "separate from". "the datapath where ..." is an adverbial clause introduced by "where".

(6) The comparison between ALU and single cycle MAC is given below where the total required cycles are shown in Table I to complete 2*2 multiplications.
Analysis: the nouns joined by "between ... and ..." serve as a postmodifier of "comparison"; "where" introduces an adverbial clause; "be shown" is passive voice.

(7) Since computations are done by a large number of hardware components in only one cycle, the duration of the clock cycle must be longer (twice) than the summation of all propagation delays of individual hardware components in a single cycle implementation of MAC.
Analysis: "since" introduces an adverbial clause of cause, meaning "because, as, in view of". Its tone is weaker than "because", and a clause introduced by "since" is usually placed before the main clause. "must be longer than" is a comparative structure.

(8) For the load or store type operation, a special register is used which helped to complete any load or store type operation only in two clock cycles.
Analysis: the complex infinitive construction. When the logical subject of the infinitive is not the subject of the clause that follows, and the meaning "in order that ... (can) ..." is intended, the complex infinitive construction should be used as the adverbial, and the pattern "to make ..." is generally not used.

(9) Since two instructions work in parallel, the FSM is designed such a way that helps to detect hazards and can control the data path to complete computation by resolving the hazard.
Analysis: "Since" at the start of the sentence introduces an adverbial clause of cause. The pattern is "such + a/an + noun + that + attributive clause"; "and" is a conjunction joining two coordinate predicates.

(10) This existing design can implement 16 instructions which require 20 cycles to complete execution.
Analysis: "existing" is a present participle used as a premodifier, and "existing design" means a design that already exists. "which" introduces a restrictive attributive clause that modifies "instructions" in the main clause and acts as the subject of that clause.

(11) Based on the timing specification from the synthesis result of each system building component as shown in Table IV, the execution time for all the instructions depending on the different routes is found.
Analysis: "based on the timing specification" is an adjective phrase used as an adverbial; "depending on ..." is a participial phrase used as an attributive. A participle used as an attributive generally follows the rule "single participle before, participial phrase after": a single participle is usually placed before the word it modifies, while a participial phrase must be placed after it. Here it is a participial phrase, so it follows the modified word "instructions".

(12) By considering worst case delays, the processor speed is found around 13 MHz whereas the compared processor design contains around 10MHz clock speed.
Analysis: "by" + gerund (or a noun denoting an action) generally means "by means of". "whereas" is a conjunction; in formal documents, often at the start of a sentence, it means "in view of the fact that", and it can also mean "on the contrary, but, while", equivalent to "while" used as a conjunction expressing contrast.

(13) In this paper, a pipelined DSP processor with reduced instructions set is illustrated for performance optimization.
Analysis: the "with" complex construction used as an absolute construction. To express accompanying circumstances, either the absolute participle construction or the "with" complex construction can be used: with + noun (pronoun) + present participle / past participle / adjective / adverb / infinitive / prepositional phrase.

(14) Each of the operations has been verified with both functional and post fit simulations which have demonstrated that the FSM of this DSP processor can successfully manipulate two instructions at a time even if hazards are present and produce correct cycle by cycle timing.
Analysis: "has been verified with" is passive voice, meaning "has been confirmed by"; "which" introduces an attributive clause whose antecedent is "simulations"; after the verb "demonstrated" comes an object clause introduced by "that".

(15) In the future, the design can be extended to perform more operations.
Analysis: "can be extended" is a typical passive. Here it is the passive of a phrasal verb of the form "intransitive verb + preposition = transitive verb"; there is also a passive form of "transitive verb + noun + preposition = transitive verb".

Sentence Translation

1. With the advent of personal computer, smart phones, gaming and other multimedia devices, the demand for DSP processor is ever increasing.
Chinese translation: 随着个人电脑、智能手机、游戏机等多媒体设备的出现，对数字信号处理器的需求不断增加。

2. By using Harvard architecture instead of Von Neumann architecture, it doubles the throughput of this processor because separation of data and instructions gives this DSP processor the ability to fetch multiple items on each cycle.
Chinese translation: 通过使用哈佛架构代替冯·诺依曼架构，该处理器的吞吐量得以加倍，这是因为（哈佛架构中）数据和指令的分离使这个数字信号处理器能够在每个周期获取多个数据项。

3. For digital filter applications, an efficient MAC operation requires one single system clock cycle to compute a successful filter output.
Chinese translation: 对于数字滤波器应用，高效的乘累加（MAC）操作只需一个系统时钟周期即可计算出一个有效的滤波器输出。

4. The comparison between ALU and single cycle MAC is given below where the total required cycles are shown in Table I to complete 2*2 multiplications.
Chinese translation: ALU 和单周期 MAC 的比较如下，完成 2*2 乘法所需的总周期数如表 I 所示。

5. Based on the timing specification from the synthesis result of each system building component as shown in Table IV, the execution time for all the instructions depending on the different routes is found.
Chinese translation: 如表 IV 所示，根据各系统组成部件综合结果中的时序规格，可以得出所有指令在不同路径下的执行时间。

6. Each of the operations has been verified with both functional and post fit simulations which have demonstrated that the FSM of this DSP processor can successfully manipulate two instructions at a time even if hazards are present and produce correct cycle by cycle timing.
Chinese translation: 每项操作都已通过功能仿真和适配后（post fit）仿真的验证，这些仿真表明，该 DSP 处理器的 FSM 即使在存在冒险的情况下，也能成功地同时处理两条指令，并产生正确的逐周期时序。
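
As a rough check on the figures quoted in the paper, the Python sketch below recomputes three of them: the throughput comparison (27 instructions in 28 cycles at 80 ns against 16 instructions in 20 cycles at 100 ns), the serial versus parallel multiplier delay for the 8 tap MAC, and the register example from the hazard handling section. The function names and the simplified two-instruction model are assumptions made for illustration. The paper's FSM and datapath are not modelled here, only the arithmetic it reports.

```python
"""Recompute figures quoted in the paper. All names are illustrative assumptions."""


def throughput_mops(instructions, cycles, clock_ns):
    """Instructions completed per microsecond, i.e. millions per second."""
    return instructions / (cycles * clock_ns) * 1_000.0


def multiply_delay_ns(multiplier_delay_ns, taps, parallel):
    """Multiplication delay for one filter output.

    Serial: one multiplier is reused for every tap, so delays add up.
    Parallel: one multiplier per tap, so the delay is that of one multiplier.
    """
    return multiplier_delay_ns if parallel else multiplier_delay_ns * taps


def run_two_adds(regs, resolve_hazard):
    """Model R3 <- R1 + R2 followed by R5 <- R3 + R4.

    If the hazard is not resolved, the second add reads the stale R3.
    How the FSM resolves it is not modelled; only the outcome is.
    """
    out = dict(regs)
    new_r3 = regs["R1"] + regs["R2"]
    r3_seen_by_second_add = new_r3 if resolve_hazard else regs["R3"]
    out["R3"] = new_r3
    out["R5"] = r3_seen_by_second_add + regs["R4"]
    return out


if __name__ == "__main__":
    start = {"R1": 2, "R2": 5, "R3": 8, "R4": 9, "R5": 4}
    print(throughput_mops(27, 28, 80), throughput_mops(16, 20, 100))
    print(multiply_delay_ns(23.764, 8, parallel=False))
    print(run_two_adds(start, resolve_hazard=False)["R5"],
          run_two_adds(start, resolve_hazard=True)["R5"])
```

With the paper's numbers, the first throughput evaluates to roughly 12.05 million instructions per second, which the paper quotes as 12.06 MB/s, and the second to 8. The register model gives R5 = 17 without hazard handling and R5 = 16 with it, matching the example in the text.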
