Parallel Programming 07: CUDA Programming

Uploader: w****i · Document ID: 92843049 · Upload date: 2019-07-13 · Format: PPT · Pages: 87 · Size: 3.87 MB

CUDA Programming (GPU Programming)
Instructor: Zhang Weizhe (张伟哲)
Computer Network and Information Security Technique Research Center,
School of Computer Science and Technology, Harbin Institute of Technology

Outline
- Motivation
- GPU Architecture
- Three ways to accelerate applications
- CUDA Programming Model
- CUDA Programming Basics

A Simple Example
- Three integer arrays A[N], B[N], C[N]
- We want to calculate C[i] = A[i] + B[i]

Traditionally, on the CPU (serial)

    for (i = 0; i < N; ++i)
        C[i] = A[i] + B[i];

- T(N) = O(N)

Traditionally, on the CPU (parallel)
- Create N threads; thread i computes C[threadid.i] = A[threadid.i] + B[threadid.i]
- T(N) = O(1)

GPU Computing
- But there is a problem: applications like these need thousands of threads to execute.

GPU Architecture
- A simple comparison between CPU and GPU
- A detailed description: Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs)
- Source: pascal-architecture-whitepaper

Three ways to accelerate applications
- CUDA Optimized Libraries
- OpenACC directives
- Programming Languages

Three methods: CUDA Optimized Libraries
- These libraries are written in CUDA
- Simply replace your standard library functions with the corresponding CUDA libraries
- Many math libraries are supported, but not all; a supported list can be found at https:/

Three methods: OpenACC
- A directive-based programming model
- You need to insert some directives in your code
- Use the OpenACC compiler to compile the code

Three methods: Programming Languages

CUDA Programming Model
- Prerequisites for CUDA Programming
- CUDA Execution Flow
- CUDA Threads
- CUDA Memory Model

Prerequisites for CUDA Programming
Hardware
- An Nvidia graphics card: either a specialized computing card, like the Tesla Pascal GP100 (too expensive), or a normal gaming graphics card, like a GT or GTX model
- Check whether your GPU supports CUDA on this website: http:/

Software
- CUDA Toolkit: supported on Windows, Mac, and most standard Linux distributions; download from https:/
- Visual Studio (if on Windows): as far as I know, VS is the only IDE that can work with CUDA on Windows; if you don't want to install VS, you can use the CUDA compiler NVCC directly from a command line

CUDA Execution Flow
- Host = CPU, Device = GPU
- A CUDA application alternates between serial code on the host and parallel code on the device
1. Copy data from CPU memory to GPU memory
2. Instruct the GPU to start computing
3. Copy the results back to CPU memory

CUDA Threads
- The parallel portion of an application runs as a kernel:

    float x = in[threadIdx.x];
    float y = func(x);
    out[threadIdx.x] = y;

  (each thread maps one element of in[] to the corresponding element of out[])
- A kernel is a function executed on the GPU as an array of threads in parallel, and can be called from the CPU
- All threads execute the same code, but can take different paths
- Each thread has an ID

CUDA Threads: warps
- 32 threads are grouped into a warp
- A warp in CUDA is the minimum size of the data processed in SIMD fashion by a CUDA multiprocessor
- Thread IDs within a warp are consecutive and increasing
- The warp is the unit of thread scheduling in SMs

CUDA Threads: blocks
- One or more warps are grouped into a block
- A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and synchronizing their execution
- A block can contain at most 1024 threads because of hardware resource limits
- Thread IDs are unique within a block and start from zero

CUDA Threads: grid
- A kernel will be executed as a grid of blocks

CUDA Threads: block scheduling
- Example: a kernel grid of 8 blocks (Block 0 … Block 7)
- On a device with 2 SMs, each SM runs four of the blocks
- On a device with 4 SMs, each SM runs two: SM 0 gets Blocks 0 and 4, SM 1 gets Blocks 1 and 5, SM 2 gets Blocks 2 and 6, SM 3 gets Blocks 3 and 7

CUDA Threads: branch divergence
- All threads within a warp must execute the same instruction at any given time, but this yields a problem: branch divergence
- Example with divergence: if (threadIdx.x < 2) … — this creates two different control paths for threads within the same warp
