High-Performance Matrix Multiplication


*1 High-Performance Matrix Multiplication
Xia Fei

*2 Parallel algorithm optimization faces four challenges relative to traditional object-oriented serial algorithms:
- Synchronization: the process by which two or more threads coordinate their behavior
- Communication: the bandwidth and latency issues involved in exchanging data between threads
- Load balancing: how work is distributed among threads, so that each thread (execution core) receives an even share
- Scalability: a measure of whether software running on a more powerful system can make effective use of additional threads, e.g. whether an application scales linearly when moving from 4 to 8 cores

*3 Main decomposition patterns in multithreaded (multicore) design
- Task decomposition: decomposing a program according to the functions it performs
- Data decomposition: decomposing an application according to the data each task processes rather than by the tasks' inherent nature
- Data-flow decomposition: studying how data flows among the tasks and decomposing the problem according to the data-flow relationships between them

Pattern: decomposition mode
- Task-level parallelism pattern: task decomposition
- Divide and Conquer: task/data decomposition
- Geometric decomposition pattern: data decomposition
- Pipeline pattern: data-flow decomposition
- Wavefront pattern: data-flow decomposition

*4 Main decomposition patterns in multithreaded (multicore) design (continued)

Decomposition: design notes
- Task decomposition: different program behaviors are implemented in different threads; common in GUI applications
- Data decomposition: multiple threads perform the same operation on different blocks of data; common in audio, image-processing, and scientific-computing applications
- Data-flow decomposition: one thread's output is another thread's input; take particular care to eliminate start-up and drain latency

*5 Matrix multiplication algorithms
- In engineering and scientific computing, the matrix product is one of the most fundamental operations.
- The classic algorithm for multiplying dense n x n matrices has time complexity O(n^3).
- Large matrix products are currently handled mainly by divide and conquer, distributing the matrix across multiple nodes; each node, however, still performs a cubic number of multiplications on its small matrices.
- Two partitioning strategies based on this divide-and-conquer idea, striped partitioning and block (checkerboard) partitioning, yield the six common distributed parallel matrix multiplication algorithms.

*6 Matrix multiplication under different partitioning strategies
1. Parallel matrix multiplication with striped partitioning
- Row-striped and column-striped partitioning
- Pairwise combinations: row-column, row-row, column-column, column-row
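As a baseline for the parallel algorithms discussed below, the classic O(n^3) algorithm from slide 5 is just a triple loop. This is a minimal serial sketch in Python for illustration (the slides themselves describe MPI programs); `matmul` is an illustrative name:

```python
def matmul(A, B):
    """Naive dense product of two n x n matrices stored as nested lists.

    Three nested loops over n: O(n^3) scalar multiplications.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Every node in the distributed schemes below still runs essentially this loop on its local submatrices, which is why partitioning alone does not reduce the total multiplication count.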

*7 Matrix multiplication under different partitioning strategies
2. Parallel matrix multiplication with block partitioning (checkerboard partitioning)

Cannon: description of an MPI program that computes the matrix-matrix product using block checkerboard partitioning and Cannon's algorithm.

*8 Cannon
Objective
- Compute the matrix-matrix product on an SMP system.
- Use block checkerboard partitioning of the matrices and Cannon's algorithm.
Assumption
- The number of processors is a perfect square, p = q^2, and the size of the square matrices A and B is evenly divisible by q.
- The number of blocks is assumed to equal the number of processors.

*9 Cannon
- Cannon's algorithm is based on a Cartesian virtual topology.
- A and B are square matrices of size n, and C is the output matrix.
- These matrices are divided into blocks (submatrices) so that the matrix-matrix operations can be performed in parallel.
- The n x n matrix A can be regarded as a q x q array of blocks A[i,j] (0 <= i < q, 0 <= j < q), where each block is an (n/q) x (n/q) submatrix.
- We use p processors to implement the block version of matrix multiplication in parallel, choosing q as the square root of p and computing a distinct block C[i,j] on each processor.

*10 Traditional parallel algorithm
(figure)

*11 Traditional parallel algorithm
The matrices A and B are partitioned into p blocks A[i,j] and B[i,j] (0 <= i < q, 0 <= j < q) of size (n/q) x (n/q), one pair per process. These blocks are mapped onto a q x q logical mesh of processes, labeled P(0,0) through P(q-1,q-1).
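The q x q block view described above can be sketched as follows. This is an illustrative plain-Python helper, not part of the slides; it assumes matrices stored as nested lists, with n evenly divisible by q as required by slide 8:

```python
def to_blocks(M, q):
    """Partition an n x n matrix into a q x q array of (n/q) x (n/q) blocks.

    Returns blocks[i][j], itself a nested list, matching the A[i,j] notation.
    """
    n = len(M)
    assert n % q == 0, "matrix size must be evenly divisible by q"
    b = n // q  # side length of each block
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)]
            for i in range(q)]
```

In the MPI program each block would live on a different process of the q x q mesh; here the whole array stays in one address space purely for illustration.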

*12 Traditional parallel algorithm
Process P(i,j) initially stores the blocks A[i,j] and B[i,j] and computes block C[i,j] of the result matrix. To compute C[i,j] it needs all the submatrices A[i,k] and B[k,j] (0 <= k < q). To acquire the required blocks, an all-to-all broadcast of the A[i,j] blocks is performed within each row, and similarly of the B[i,j] blocks within each column. MPI collective communication is used to perform these operations.

*13 Traditional parallel algorithm
After P(i,j) has acquired A[i,0], A[i,1], ..., A[i,q-1] and B[0,j], B[1,j], ..., B[q-1,j], it performs the serial block-by-block multiplications and accumulates the partial block C[i,j] of matrix C. To obtain the resulting product matrix C, the process with rank 0 gathers all the blocks using the MPI_Gather collective communication operation.
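The broadcast scheme of slides 12-13 can be simulated serially: after the row and column broadcasts, the process that owns C[i,j] holds block row i of A and block column j of B and accumulates q block products. A sketch under the assumption of nested-list block matrices (all names are illustrative, and the MPI communication is replaced by direct indexing):

```python
def block_matmul_add(A, B, C):
    """C += A * B for square blocks stored as nested lists."""
    b = len(A)
    for i in range(b):
        for j in range(b):
            for k in range(b):
                C[i][j] += A[i][k] * B[k][j]

def traditional_parallel(Ablocks, Bblocks, q, b):
    """Each simulated process (i, j) computes C[i][j] = sum_k A[i][k] * B[k][j].

    q is the mesh side length, b the block side length.  The inner loop over k
    stands in for the blocks obtained via the row/column all-to-all broadcasts.
    """
    Cblocks = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for i in range(q):
        for j in range(q):
            for k in range(q):
                block_matmul_add(Ablocks[i][k], Bblocks[k][j], Cblocks[i][j])
    return Cblocks
```

Note the memory cost this simulation hides: in the real program every process must buffer all q blocks of its row of A and column of B at once, which is what Cannon's algorithm below avoids.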

*14 Cannon
The p processors are arranged in a q x q square grid, and the input matrices A and B are distributed among the processes in checkerboard fashion, producing p blocks of A and of B. The algorithm uses only point-to-point communication, circularly shifting the blocks of A and B among the p processes.

*15 Cannon: initial alignment
The algorithm proceeds in q stages. Its first step is an initial alignment of the blocks of A and B. The blocks of matrix A are circularly shifted i positions to the left within each row of the process grid, where i is the row number of the process in the mesh. The blocks of matrix B are circularly shifted j positions upwards, where j is the column number of the process in the process mesh.
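The initial alignment amounts to a left rotation of A's blocks within each row by the row index, and an upward rotation of B's blocks within each column by the column index. A serial sketch of just the shift pattern (illustrative only; arbitrary values stand in for blocks, and no MPI is involved):

```python
def align(Ablocks, Bblocks, q):
    """Cannon initial alignment on a q x q grid of blocks.

    Row i of A rotates left by i; column j of B rotates up by j.
    """
    A2 = [[Ablocks[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B2 = [[Bblocks[(i + j) % q][j] for j in range(q)] for i in range(q)]
    return A2, B2
```

After alignment, grid position (i, j) holds A[i, (i + j) mod q] and B[(i + j) mod q, j], so the two local blocks at every position already share a matching inner index k, and each subsequent single-step shift keeps them matched.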

*16 Cannon: initial alignment
(figure)

*17 Cannon: running
In each stage the algorithm performs the following steps:
1. Multiply the local blocks of A and B and add the product into the block of C, which is initially set to zero.
2. Circularly shift the blocks of A one position to the left within the rows of the process grid, and the blocks of B one position upwards within the columns, in wrap-around fashion.

*18 Cannon: running
(figure)

*19 Cannon: running
(figure)

*20 The Cannon bug in the book
(figure)

*21 MPI_Send and MPI_Recv are not used for the point-to-point communication, because if the processes call MPI_Send or MPI_Recv in different orders, a deadlock may arise. How to fix this? Assign a buffer and use MPI_Ire…
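Putting the alignment and the staged shifts together, the whole algorithm can be simulated serially and checked against a direct product. This is a plain-Python sketch of Cannon's data movement, not the MPI program the slides describe; for simplicity each "block" here is a single scalar (n/q = 1):

```python
def cannon(A, B, q):
    """Serially simulate Cannon's algorithm on a q x q grid of scalar blocks."""
    # Initial alignment: row i of A rotates left by i, column j of B up by j.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):  # q stages
        # Step 1: multiply local blocks and accumulate into C.
        for i in range(q):
            for j in range(q):
                C[i][j] += A[i][j] * B[i][j]
        # Step 2: shift A left by one within rows, B up by one within columns.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C
```

With scalar blocks this reproduces the ordinary matrix product; with real (n/q) x (n/q) submatrices, the scalar multiply in step 1 becomes a block matrix product and each shift becomes an MPI point-to-point transfer between neighboring processes in the grid.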
