CUDA Programming (GPU Programming)

Instructor: Zhang Weizhe (张伟哲)
Computer Network and Information Security Technique Research Center,
School of Computer Science and Technology, Harbin Institute of Technology

Outline
- Motivation
- GPU Architecture
- Three ways to accelerate applications
- CUDA Programming Model
- CUDA Programming Basics

A Simple Example

Given three integer arrays A[N], B[N], and C[N], we want to calculate C[i] = A[i] + B[i].

Traditionally, on the CPU (serial):

    for (i = 0; i < N; ++i)
        C[i] = A[i] + B[i];

T(N) = O(N)

Traditionally, on the CPU (parallel): create N threads, and let thread i compute C[i] = A[i] + B[i].

T(N) = O(1)
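On a GPU, the same element-wise sum maps naturally onto one thread per element. A minimal sketch of the corresponding CUDA kernel (the kernel and variable names are illustrative, not from the slides):

```cuda
// One GPU thread computes one element of C = A + B.
__global__ void vecAdd(const int *A, const int *B, int *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < N)                                      // guard: grid may be larger than N
        C[i] = A[i] + B[i];
}
```

Launched as, e.g., vecAdd<<<(N + 255) / 256, 256>>>(dA, dB, dC, N), each of the N logical loop iterations runs as its own GPU thread, which is how the idealized O(1) parallel version is approximated in practice.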
GPU Computing

But there is a problem: applications like these need thousands of threads to execute.

GPU Architecture

A simple comparison between CPU and GPU.

A detailed description:
- Graphics Processing Clusters (GPCs)
- Texture Processing Clusters (TPCs)
- Streaming Multiprocessors (SMs)

See the Pascal architecture whitepaper for details.

Three ways to accelerate applications

Three methods:
- CUDA Optimized Libraries
- OpenACC Directives
- Programming Languages
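The architectural parameters described above, such as the number of SMs on a particular card, can be inspected at runtime through the CUDA runtime API; a minimal sketch (device 0 assumed, error handling omitted):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device name:               %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size:                 %d\n", prop.warpSize);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```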
Three methods: CUDA Optimized Libraries
- These libraries are written in CUDA.
- Simply replace your standard library functions with the corresponding CUDA library functions.
- Many math libraries are supported, but not all; a supported list can be found at https:/

Three methods: OpenACC Directives
- A directive-based programming model: you insert directives into your code and use an OpenACC compiler to compile it.

Three methods: Programming Languages
- Write your code directly in a GPU programming language such as CUDA C, the approach covered in the rest of this lecture.

CUDA Programming Model
- Prerequisites for CUDA Programming
- CUDA Execution Flow
- CUDA Threads
- CUDA Memory Model

Prerequisites for CUDA Programming: Hardware
- An NVIDIA graphics card. It can be a specialized computing card, like the Tesla Pascal GP100 (expensive), or a normal gaming graphics card, like a GT or GTX model.
- Check whether your GPU supports CUDA; you can check on this website: http:/

Prerequisites for CUDA Programming: Software
- CUDA Toolkit: supported on Windows, Mac, and most standard Linux distributions. Download from https:/
- Visual Studio (if on Windows): as far as I know, VS is the only IDE that can work with CUDA on Windows.
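Once the toolkit is installed, a quick way to confirm that CUDA can see your card is to query the device count and compute capability from a small program; a minimal sketch:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA-capable device found.\n");
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Found %d device(s); device 0: %s, compute capability %d.%d\n",
           count, prop.name, prop.major, prop.minor);
    return 0;
}
```

This can be compiled from the command line with NVCC, e.g. `nvcc check.cu -o check` (the file name is illustrative).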
If you don't want to install VS, you can use the CUDA compiler NVCC directly from the command line.

CUDA Execution Flow

A CUDA application alternates between serial code and parallel code: serial code runs on the host (CPU), and parallel code runs on the device (GPU). Host = CPU, Device = GPU.

1. Copy data from CPU memory to GPU memory.
2. Instruct the GPU to start computing.
3. Copy the results back to CPU memory.

CUDA Threads

Parallel portion of an application:

    float x = in[threadIdx.x];
    float y = func(x);
    out[threadIdx.x] = y;

Each thread handles one element: thread i reads in[i] and writes out[i].

A kernel is a function executed on the GPU as an array of threads in parallel, and it can be called from the CPU.
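The three-step execution flow above, together with a kernel launch, can be sketched end to end for the earlier array-sum example (a minimal sketch; error handling omitted, names illustrative):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const int *A, const int *B, int *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main(void)
{
    const int N = 1024;
    int hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2 * i; }

    int *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(int));
    cudaMalloc(&dB, N * sizeof(int));
    cudaMalloc(&dC, N * sizeof(int));

    // 1. Copy data from CPU memory to GPU memory.
    cudaMemcpy(dA, hA, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(int), cudaMemcpyHostToDevice);

    // 2. Instruct the GPU to start computing: launch the kernel.
    vecAdd<<<(N + 255) / 256, 256>>>(dA, dB, dC, N);

    // 3. Copy the results back to CPU memory.
    cudaMemcpy(hC, dC, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("hC[10] = %d\n", hC[10]);   // 10 + 2*10 = 30

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```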
All threads execute the same code, but they can take different paths. Each thread has an ID.

Warps
- 32 threads are grouped into a warp.
- A warp in CUDA is the minimum size of the data processed in SIMD fashion by a CUDA multiprocessor.
- Thread IDs within a warp are consecutive and increasing.
- The warp is the unit of thread scheduling in SMs.

Blocks
- One or more warps are grouped into a block.
- A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and by synchronizing their execution.
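Cooperation within a block through shared memory and synchronization can be sketched like this (a hypothetical array-reversal kernel, not from the slides):

```cuda
// Reverse a 256-element array within one block, using shared memory.
__global__ void reverse(int *data)
{
    __shared__ int tmp[256];   // visible to all threads in the block
    int t = threadIdx.x;

    tmp[t] = data[t];          // each thread stages one element
    __syncthreads();           // wait until the whole block has written

    data[t] = tmp[255 - t];    // read a value written by another thread
}
```

Launched as reverse<<<1, 256>>>(dData). Without the __syncthreads() barrier, some threads could read tmp entries before the threads responsible for them had written.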
- A block can contain at most 1024 threads because of hardware resource limits.
- Thread IDs are unique within a block and start from zero.

Grids
- Blocks are grouped into a grid, and a kernel will be executed as a grid.
- Blocks are distributed across the available SMs. For a kernel grid of 8 blocks (Block 0 … Block 7): on a device with 2 SMs, each SM runs 4 blocks; on a device with 4 SMs, each SM runs 2 blocks (e.g., SM 0 gets Block 0 and Block 4).

Branch divergence
- All threads within a warp must execute the same instruction at any given time, but this yields a problem: branch divergence.
- Example with divergence: if (threadIdx.x > 2) creates two different control paths for threads in the same warp.
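A kernel exhibiting the divergence described above can be sketched as follows (the branch condition follows the slide's example; the output array is illustrative):

```cuda
__global__ void diverge(int *out)
{
    // Threads 0..2 and threads 3..31 of each warp take different paths,
    // so the warp executes both branches one after the other, with the
    // threads on the inactive path masked off each time.
    if (threadIdx.x > 2)
        out[threadIdx.x] = 1;
    else
        out[threadIdx.x] = -1;
}
```

Because the two paths are serialized, a divergent warp pays roughly the cost of both branches combined; branching on a warp-aligned quantity such as (threadIdx.x / warpSize) keeps all threads of a warp on the same path.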