PowerPC vs. DSP Comparison


I. Comparison of Main Performance Parameters

Parameter | TigerSHARC ADSP-TS101S | TigerSHARC ADSP-TS201S | PowerPC MPC7455 | PowerPC PPC476FP (IBM 45 nm SOI)
Core Clock | 250 MHz | 500 MHz | 1,000 MHz | 1,600 MHz
Peak Floating-pt Performance | 1,500 MFLOPS | 3,000 MFLOPS | 8,000 MFLOPS | 3,000 MFLOPS
Memory Bus Size/Speed | 64-bit/100 MHz | 64-bit/100 MHz | 64-bit/133 MHz | 128-bit/800 MHz
External Link Ports | 4 x 250 MB/s | 4 x 250 MB/s | None | User defined
I/O Bandwidth (inc. memory) | 1,800 MB/s | 1,800 MB/s | 1,064 MB/s | 6,400 MB/s
Bandwidth-to-Processing Ratio | 1.20 Bytes/FLOP | 1.20 Bytes/FLOP | 0.13 Bytes/FLOP | 2.1 Bytes/FLOP
1024-pt cFFT Benchmark | 39 µs | 19 µs | 13 µs (est.) | 83.2 µs (double precision)
Approx Cycles for 1024-pt cFFT | 9,750 cycles | 9,750 cycles | 13,000 cycles | -
Predicted 1024-pt cFFTs/chip | 25,641 per s | 12,821 per s | 64,941* per s | -

ADSP TigerSHARC main parameters

Part # | Clock Speed (MHz) | MMACS (Max) | On-Chip Memory | External Memory Supported | Operating Temp Range | Package | US Price (1000-4999)
ADSP-TS201S | 600 | 4800 | 24 Mbit | Async, SDRAM | - | 25 x 25 BGA | $252.25
ADSP-TS202S | 500 | 4000 | 12 Mbit | Async, SDRAM | - | 25 x 25 BGA | $209.51
ADSP-TS203S | 500 | 4000 | 4 Mbit | Async, SDRAM | - | 25 x 25 BGA | $184.49
ADSP-TS101S | 300 | 2400 | 6 Mbit | Async, SDRAM | -40 to +85 | 19 x 19 BGA, 27 x 27 BGA | $193.88

Parameter | C6701 | C6201 | C6203 | MPC7410* | PPC476
Clock (MHz) | 167 | 200 | 300 | 500 | 1600
Instruction Cycle (ns) | 6 | 5 | 3.33 | 2 | -
Instructions Per Cycle | 1 - 8 | 1 - 8 | 1 - 8 | 1 - 3 | 14
Million Instructions/Sec. | 1333 | 1600 | 2400 | 500 | -
Million Fixed-Point Ops/Sec. | 1333 | 1600 | 2400 | 8000 | -
Million Floating-Point Ops/Sec. | 1000 | - | - | 2000 | 3000

General-Purpose Algorithm Benchmarks on TI's C66x DSP Core at 1.25 GHz

Benchmark | Speed | Clock Cycles
32-bit algorithm | |
1k point FFT (Radix 4) | 5.47 µs | 6840
64k point FFT (Radix 4) | 0.58 ms | 696588
FIR filter (per real tap) | 0.2 ns | 0.25
8x8 matrix multiply (complex floating point) | 1.06 µs | 1327
16-bit algorithm | |
256 point complex FFT (Radix 4) | 0.6 µs | 752

Floating-point performance comparison of major DSPs: Speed Scores for floating-point packaged processors, BDTImark2000 (BDTI-certified results). BDTI benchmarks mainly target DSPs, so no data are available for the MPC7410 or other PowerPC parts.

Some algorithms, such as the FFT, can take full advantage of the 7410's vector math unit. A 1024-point complex floating-point FFT completes in 27 µs on the 7410, compared with 108 µs on the C6701. Other algorithms, such as the turbo decoders used in wireless applications, are handled more efficiently by a VLIW architecture.

The PowerPC G4 (74xx) with its AltiVec unit clearly has the higher core clock rate and performance. The PowerPC core clock is roughly 3.3 times that of the current TigerSHARC, 1 GHz versus 300 MHz (faster TigerSHARC versions are due to be released soon).

The AltiVec unit executes a single instruction per cycle, and each 128-bit vector holds four independent 32-bit data elements. This is the well-known SIMD (single instruction, multiple data) structure. Peak throughput is reached when executing a vector multiply-accumulate (MAC), which completes 8 floating-point operations per cycle. For the 1 GHz MPC7455, this gives a peak of 8,000 MFLOPS. AltiVec can also execute 8 integer or fixed-point operations per cycle, for a peak integer throughput of 8,000 MOPS (million operations per second).

In contrast, TigerSHARC has two independent 32-bit compute units, an arrangement described as MIMD (multiple instruction, multiple data). Each compute unit can perform one multiply plus a sum and a difference per cycle, so the 300 MHz ADSP-TS101S completes 6 floating-point operations per cycle, for a peak of 1,800 MFLOPS.

For 16-bit integer arithmetic, TigerSHARC can use its superscalar architecture to split the two independent 32-bit compute units into two separate 16-bit SIMD units. Each operation then works on two data elements, for a total of 12 operations per cycle. In addition, TigerSHARC has two further dedicated 16-bit integer engines that add another 12 operations per cycle, for a total of 24 integer operations per cycle, or 7,200 MOPS.

II. FFT Performance Evaluation of the IBM 476FPE

The FFT implementation is FFTW 3.3.3 (http://www.fftw.org), a well-optimized library whose performance is widely recognized. The test program is benchFFT 3.1 (http://www.fftw.org).

Three chips are compared: the IBM PPC476FPE, the PowerPC 7447A, and a four-processor 3.06 GHz Intel Pentium 4. Transform sizes of 512 and 1024 are used as the reference points.

Configurations:

1. PPC476FPE, Ubuntu 9.04, GCC-4.3.3.

2. Apple iBook G4. 1.06 GHz PowerPC 7447A, Linux 2.6.15, gcc-4.0.2, g++-4.0.2, g77-4.0.2. Has AltiVec (4-way single precision SIMD).
Compilers and flags (unless overridden):
C: gcc -O3 -fomit-frame-pointer -fstrict-aliasing -mcpu=7450
C++: g++ -O3 -fomit-frame-pointer -fstrict-aliasing -mcpu=7450
Fortran: gfortran -O3 -fomit-frame-pointer -fstrict-aliasing -mcpu=7450

3. Four-processor 3.06 GHz Intel Pentium 4, 512 KB L2. Linux 2.4.25, gcc-3.3.3, g++-3.3.3, g77-3.3.3, AMD Core Math Library (ACML) 3.0.0, Intel Math Kernel Library Version 8.0.1, Intel Integrated Performance Primitives v5.0. Has SSE (4-way single precision SIMD), SSE2 (2-way double precision SIMD). The benchmark uses one processor only.

How mflops is computed

To report FFT performance, we plot the mflops of each FFT, which is a scaled version of the speed, defined by:

mflops = 5 N log2(N) / (time for one FFT in microseconds) for complex transforms, and
mflops = 2.5 N log2(N) / (time for one FFT in microseconds) for real transforms,

where N is the number of data points (the product of the FFT dimensions). This is not an actual flop count; it is simply a convenient scaling, based on the fact that the radix-2 Cooley-Tukey algorithm asymptotically requires 5 N log2(N) floating-point operations. It allows us to compare the performance for many different sizes on the same graph, get a sense of the cache effects, and provide a rough measure of efficiency relative to the clock speed.

Transform types

transform-type is a four-character string consisting of precision (double/single = d/s), type (complex/real = c/r), in-place/out-of-place (= i/o), and forward/backward (= f/b). For example, transform-type = dcif denotes a double-precision in-

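The relationship between the cycle counts, FFT times, and FFTs-per-second figures in the parameter table, and the mflops scaling used by the benchmark, can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the sample inputs are the ADSP-TS101S values quoted in the table (9,750 cycles at 250 MHz).

```python
import math


def fft_time_us(cycles: float, clock_mhz: float) -> float:
    """Time for one FFT in microseconds: cycles divided by cycles-per-microsecond."""
    return cycles / clock_mhz


def ffts_per_second(time_us: float) -> float:
    """Number of back-to-back FFTs that fit into one second."""
    return 1e6 / time_us


def mflops(n: int, time_us: float, real: bool = False) -> float:
    """Scaled speed as defined in the text: 5 N log2(N) / t for complex
    transforms, 2.5 N log2(N) / t for real transforms (t in microseconds)."""
    scale = 2.5 if real else 5.0
    return scale * n * math.log2(n) / time_us


if __name__ == "__main__":
    t = fft_time_us(9750, 250)        # ADSP-TS101S figures from the table
    print(t)                          # 39.0
    print(round(ffts_per_second(t)))  # 25641
    print(round(mflops(1024, t), 2))  # 1312.82
```

As the text notes, the result is a convenient scaling rather than a true operation count, so it is best used for comparing processors at the same transform size, not as a measured FLOP rate.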