Compiler Parameter Tuning Methods

Uploaded by: xh****66 | Document ID: 62505879 | Upload date: 2018-12-21 | Format: PPT | Pages: 31 | Size: 241 KB

*All other brands and names are the property of their respective owners. Intel Confidential. IA64_Tools_Overview2.ppt

Compilers For Xeon Processor

Agenda
  General Xeon processor optimizations
  Loop level optimizations
  Multi-pass optimizations
  Other

General Optimizations
  /Od, -O0: Disable optimizations
  /Zi, -g: Create symbols
  /O1, -O1: Optimize for speed without increasing code size (i.e., disables library function inlining)
  /O2, -O2 (default): Optimize for speed
  /O3, -O3: High-level optimizations

Instruction Scheduling
  Schedule instructions to be optimal for specific processor instruction latencies and cache sizes.
  Note: default may change in future compilers.

Shift/Multiply Latency
  Pentium
    Shift has 1x latency of adds
    Multiply has 10x latency of adds
  Pentium Pro, II, and III
    Shift has 1x latency of adds
    Multiply has 3x latency of adds
  Pentium 4 (may change in future releases)
    Shift has 8x latency of adds
    Multiply has 26x latency of adds

Under the Covers: P4
  Compiler accounts for these differences for you!

  for (int i = 0; i < length; i++) p[i] = q[i] * 32;

  .B1.7:                            # -tpp6
      movl  (%ebx,%edx,4), %eax
      shll  $5, %eax
      movl  %eax, (%esi,%edx,4)
      incl  %edx
      cmpl  %ecx, %edx
      jl    .B1.7

  .B1.7:                            # -tpp7
      movl  (%ebx,%edx,4), %eax
      addl  %eax, %eax
      addl  %eax, %eax
      addl  %eax, %eax
      addl  %eax, %eax
      addl  %eax, %eax
      movl  %eax, (%esi,%edx,4)
      addl  $1, %edx
      cmpl  %ecx, %edx
      jl    .B1.7

Under the Covers: Xeon

Which Processor: ax?

Automatic Processor Dispatch
  Single executable: Pentium 4 target that runs on all x86 processors.
  For the target processor it uses:
    Processor-specific opcodes
    Prefetch (Pentium III only)
    Vectorization
  Low overhead
    Some increase in code size
  Can mix and match: -xK and -axW together make Xeon/Pentium 4 the target and Pentium III the default.

Vectorization
  Automatically converts loops to utilize MMX/SSE/SSE2 instructions and registers.
  Data types: char/short/int/float/double (but not mixed)
  Can use the Short Vector Math Library
  Enabled through -QxW, -QxK, -QaxW, -QaxK
  -vec_report3 tells you which loops were vectorized, and if not, why not.

High Level Optimizer
  Windows: /O3 or Linux: -O3
  Use with -xW, -xK, -QxW, -QxK, etc.
  Additional loop optimizations:
    More aggressive dependency analysis
    Scalar replacement
    Software prefetch (-xK on Pentium III)
  Loops must meet criteria related to those for vectorization.

Under the Covers: Xeon

SMP parallelism
  OpenMP
    Easy multithreading using directives
    Use KSL tools for development
    Use Intel tools to optimize for IA in tandem with OpenMP
  Auto-parallelization
    Simple loops threaded by compiler alone
    Loops must meet certain criteria

OpenMP* Support
  OpenMP 1.1 for Fortran and 1.0 for C/C++
  Debugger info support for OpenMP
  Assure for Threads supported with Intel Compiler
  OpenMP switches:
    -Qopenmp, -openmp (or -openmpP)
    -QopenmpS, -openmpS (serial, for debugging)
    -openmp_reportn (diagnostics)
  Works in conjunction with vectorization

Auto Parallelization
  Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.
  -Qparallel (Windows*), -parallel (Linux*)
  -Qpar_reportn, -par_reportn (diagnostics)
  Better to use OpenMP directives: the compiler can identify "easy" candidates for parallelization, but large applications are difficult to analyze.

Agenda
  General and processor optimization
  Loop level optimizations
  Multi-pass optimizations
    Inter Procedural Optimization
    Profile Guided Optimization
  Other

Inter-Procedural Optimizations (IPO)
  -Qip, -ip: Enables interprocedural optimizations for single file compilation.
  -Qipo, -ipo: Enables interprocedural optimizations across files.

Inter-Procedural Optimizations (IPO)
  More benefits than just inlining:
    Partial inlining
    Interprocedural constant propagation
    Passing arguments in registers
    Loop-invariant code motion
    Dead code elimination
  Helps vectorization, memory disambiguation

IPO Usage: 2 Step Process
  Pass 1 -> virtual .obj and .il files -> Pass 2 -> executable
  Compiling:
    Windows*: icl -c /Qipo main.c func1.c func2.c
    Linux*: icc -c -ipo main.c func1.c func2.c
  Linking:
    Windows*: icl /Qipo main.obj func1.obj func2.obj
    Linux*: icc -ipo main.obj func1.obj func2.obj
  Windows* hint: LINK=link.exe should be replaced with LINK=xilink.exe,
    i.e.: xilink /Qipo main.obj func1.obj func2.obj

Use execution-time feedback to guide optimization
  Helps I-cache, paging, branch prediction
  Enabled optimizations:
    Basic block ordering
    Better register allocation
    Better decision of functions to inline
    Function ordering
    Switch-statement optimization
    Better v
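
The strength reduction shown on the "Under the Covers: P4" slide can be sketched at the source level. The three functions below mirror the source loop, the single-shift form of the -tpp6 listing, and the five-doublings form of the -tpp7 listing. The function names and the unsigned int element type are assumptions made for illustration; the slide shows only the loop itself, and a real compiler makes this choice on its own during instruction selection.

```c
#include <stddef.h>

/* Illustrative sketch: names and element type are assumptions,
   not taken from the slides. Unsigned arithmetic is used so that
   all three forms wrap identically on overflow. */

/* Source-level form: multiply each element by 32. */
void scale_mul(unsigned int *p, const unsigned int *q, size_t length)
{
    for (size_t i = 0; i < length; i++)
        p[i] = q[i] * 32u;
}

/* Shape of the -tpp6 listing: one shift left by 5, since 2^5 == 32. */
void scale_shift(unsigned int *p, const unsigned int *q, size_t length)
{
    for (size_t i = 0; i < length; i++)
        p[i] = q[i] << 5;
}

/* Shape of the -tpp7 listing: five successive doublings (adds),
   chosen where a shift costs more than several adds. */
void scale_adds(unsigned int *p, const unsigned int *q, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        unsigned int v = q[i];
        v += v;  /* x2  */
        v += v;  /* x4  */
        v += v;  /* x8  */
        v += v;  /* x16 */
        v += v;  /* x32 */
        p[i] = v;
    }
}
```

All three functions compute the same result for every input, which is why the compiler is free to pick whichever instruction sequence has the lowest latency on the target processor. Writing the plain multiply and letting the scheduling option decide keeps the source portable across processor generations.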
