krylov子空间方法的gpu加速算法研究

资源描述

《krylov子空间方法的gpu加速算法研究》由会员分享，可在线阅读，更多相关《krylov子空间方法的gpu加速算法研究（58页珍藏版）》请在金锄头文库上搜索。

1、国防科学技术大学硕士学位论文 Krylov子空间方法的GPU加速算法研究姓名：秦晋申请学位级别：硕士专业：计算机科学与技术指导教师：胡庆丰 2010-11 国防科学技术大学研究生院硕士学位论文第 i 页摘要近几年来，通用图形处理单元（GPGPU）在体系结构、编程模型等方面迅速发展，可编程性不断提高，同时 GPU 的单精度浮点峰值性能已经从几个 Gflops 增长到几个 Tflops。 GPU 开始越来越多的运用于通用计算，并且越来越多地应用到科学计算程序的加速研究当中。在科学计算领域中，迭代法求解大型稀疏线性系统在计算流体力学、数值天气预报、石油地震数据处理、材料模拟

2、与设计等实际应用中具有重要的意义，Krylov 子空间方法是求解大规模稀疏线性系统的有效算法，具有存储容量需求小和快速收敛等优点。然而在 CPU 上利用该算法求解稀疏线性系统时往往会耗费大量的时间，如何加速其求解具有重要的理论意义和实际应用价值。利用 GPU 加速 Krylov 子空间方法求解大型稀疏线性系统可以降低计算周期，提高计算效率，有利于加速多种实际问题的解决。本文通过对 GPU 体系结构和 CUDA 编程模型的分析，研究了 Krylov 子空间方法在 NVIDIA GPU 上移植的技术难点，包括稀疏矩阵中零元素对存储空间的浪费、稀疏矩阵向量乘在 GPU 上的实现、内积操作

3、在 GPU 上的实现、GPU 上实现算法中公式时产生的计算核心开销对性能的降低、NVIDIA GPU 不存在归约操作、在 NVIDIA GPU 上需要采用针对性的优化方法等问题。通过 CPU 和 GPU 协作的方式使用 CUDA 编程模型在 NVIDIA GPU 上对算法进行了加速，取得了如下的成果： 1，面向 CUDA 编程模型的算法优化。详细分析 NVIDIA GPU 的体系结构特点，首先 NVIDIA GPU 的一个重要特点就是具有复杂的存储层次结构，包括 registers、shared memory、local memory、global memory 以及两种只读存储器 te

4、xture memory 和 constant memory，各种存储器在容量、速度方面有较大的差别，优化算法需要充分利用 GPU 内的高速存储器，例如 shared memory、texture memory 甚至是 registers。CUDA 中的一个与硬件联系比较紧密的概念是 warp，在 NVIDIA GPU 实际运行程序时，warp 才是真正的执行单位，在我们的程序优化中，从优化访存入手，线程束能符合 warp 或者 half-warp 的对其访问要求。 2，提出稀疏对角矩阵的 CDIA 压缩存储格式。在 DIA 存储格式的基础上设计了一种新型的压缩存储格式 CDIA，在

5、该格式下能够获得较高的数据传输效率以及满足 CUDA 对数据的对齐访问。为满足 CUDA 对存储器的合并访问要求，CDIA 压缩存储格式中包含了对压缩矩阵做了相应的转置处理的操作，这大大提高了 GPU 的访存性能，从而在 Krylov 子空间方法的几个核心操作中起到了重要的作用。在稀疏对角矩阵向量乘中，使用 CDIA 格式比使用普通 DIA 格式性能提高了 50.3%。 3，稀疏对角矩阵向量乘在 GPU 上的加速实现。使用 CDIA 压缩存储格式对矩阵进行压缩存储，将计算任务进行细粒度的任务分配，在 NVIDIA C1060 平台上使用 CUDA 编程模型，数据试验结果表明矩阵向量乘的

6、性能有明显的提高，通过对访存、数据传输等的优国防科学技术大学研究生院硕士学位论文第 ii 页化，在测试数据中，最高获得了单精度 39.6Gflop/s 和双精度 19.6Gflop/s 的浮点计算性能，性能在 Nathan Bell 和 Michael Garland 的基础上分别提高了 7.6%和 17.4%。 4，内积运算在 GPU 上的实现及优化。内积运算的实现以归约操作的实现为基础，内积运算采用先乘法再归约的方式，借助 shared memory 在 GPU 上有良好的性能表现，单精度情况下，与 CPU 相比，达到了高达 36.2 倍的加速比。 5，提出了 Krylov 子

7、空间方法在 GPU 上的优化策略。对算法的优化主要从程序结构、主机与设备的通信以及 CUDA 编程模型下的存储层次结构等几方面入手，使用 NVIDIA GPU 上的高速存储部件，主要是 shared memory 来提高访存效率，同时使用 pinned memory 来提高通信带宽。这些措施大大提高了算法的整体性能。在 NVIDIA C1060 平台使用 CUDA 3.0 的测试结果中，双精度和单精度条件算法性能分别达到 11.3Gflop/s 和 16.7Gflop/s，相比 Intel 四核处理器 Q6600 使用三级优化编译，加速比分别达到 9.2 倍和 13.57 倍。测

8、试结果表明，在不同矩阵规模下算法的性能有较大的差异，这是由 GPU 的体系结构特点决定的，从测试数据可以看出，GPU 在加速 Krylov 子空间方法求解线性系统中具有较大的优势。关键词：GPU，CUDA，Krylov 子空间方法，稀疏对角矩阵向量乘，内积操作国防科学技术大学研究生院硕士学位论文第 iii 页 ABSTRACT In recent years, the architecture, programming model and related aspects of general purpose graphics processing unit (GPGPU) have

9、been developed rapidly, the programmability of GPU have been improving all the time, and the peak performance of GPU grows from several Gflops to several Tflops. The main application of GPU has shifted from image processing to gener- al-purpose computing, and more and more researchs are focusing on

10、accelerating scientific problems with GPU. In scientific problems, utilizing iterative method to solve the large sparse linear systems is the key in many fields including the fluid dynamics, numerical weather predic- tion, oil and seismic date processing, materials simulation and design. The Krylov

11、subspace me- thod is an efficient algorithm solving the large sparse system, which both has less storage re- quirement and stable convergence. However, solving these problems on CPU consumes lots of time which cannot meet the practical requirements. Accelerating iterative method that solving large s

12、parse linear system with GPU can not only reduce the time but also improve the compu- ting efficiency, which can be generalized to speed up a lot of practical problems. After analyzing the architecture and CUDA programming model of GPU, we discussed the difficulties in adapting Krylov subspace metho

13、d to NVIDIA GPU, including the wasting of the storage space introduced by sparse matrix; the implementation of sparse-matrix multiplication in GPU; the implementation of vector inner-product operation in GPU、the decreasing of the per- formance bringed by kernel overhead; the lack of reduction operat

14、ion on NVIDIA GPU; the need of special optimization methods on NVIDIA GPU and so on. Through the cooperation of CPU and GPU, we proposed some novel methods to realize the algorithm in NVIDIA GPU on CUDA programming model. The main contributions in this paper are as follows: We proposed some arithmet

15、ic optimization method faced on CUDA program model, deeply studied the features of NVIDIA GPU, at first, we noticed that the NVIDIA GPU has a complex storage hierarchy, including registers, shared memory, local memory, global memory and other two kinds of read-only memorytexture memory and constant

16、memory. These memorizers have great differences in speed and capacity, while optimize the algorithm we should make full use of those high speed memory, such as the shared memory, texture memory and even the registers. In CUDA program model, one of the concept that has keen relation of the hardware is warp, while run the program in NVIDIA GPU, the warp is the real unit of execute. In our optimization of Krylov subspace method, we try to fulfill the need of warp or ha

展开阅读全文