并行程序设计10并行算法及性能分析

资源描述

《并行程序设计10并行算法及性能分析》由会员分享，可在线阅读，更多相关《并行程序设计10并行算法及性能分析（36页珍藏版）》请在金锄头文库上搜索。

1、Parallel Programming,Instructor: Zhang Weizhe (张伟哲) Computer Network and Information Security Technique Research Center , School of Computer Science and Technology, Harbin Institute of Technology,Communication Costs in Parallel Computers,Outline,Matrix-Vector Multiplication Matrix-Matrix Multiplicat

2、ion,Matix Algorithms: Introduction,Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition. Typical algorithms rely on input, output, or intermediate data decomposition. Most algorithms use one- and two-dimensional block, cyc

3、lic, and block-cyclic partitioning.,Matrix-Vector Multiplication,5,Matrix-Vector Multiplication,We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y. The serial algorithm requires n2 multiplications and additions.,Matrix-Vector Multiplication: Rowwise 1

4、-D Partitioning,The n x n matrix is partitioned among n processors, with each processor storing complete row of the matrix. The n x 1 vector x is distributed such that each process owns one of its elements.,Matrix-Vector Multiplication: Rowwise 1-D Partitioning,Multiplication of an n x n matrix with

5、 an n x 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.,Matrix-Vector Multiplication: Rowwise 1-D Partitioning,Since each process starts with only one element of x , an all-to-all broadcast is required to distribute all the elements to all the processes. Proce

6、ss Pi now computes .,Matrix-Vector Multiplication: Rowwise 1-D Partitioning,Consider now the case when p n and we use block 1D partitioning. Each process initially stores n=p complete rows of the matrix and a portion of the vector of size n=p. The all-to-all broadcast takes place among p processes a

7、nd involves messages of size n=p. This is followed by n=p local dot products. Thus, the parallel run time of this procedure is,Matrix-Vector Multiplication: 2-D Partitioning,The n x n matrix is partitioned among n2 processors such that each processor owns a single element. The n x 1 vector x is dist

8、ributed only in the last column of n processors.,Matrix-Vector Multiplication: 2-D Partitioning,Matrix-Vector Multiplication: 2-D Partitioning,We must first align the vector with the matrix appropriately. The first communication step for the 2-D partitioning aligns the vector x along the principal d

9、iagonal of the matrix. The second step copies the vector elements from each diagonal process to all the processes in the corresponding column using n simultaneous broadcasts among all processors in the column. Finally, the result vector is computed by performing an all-to-one reduction along the col

10、umns.,Matrix-Vector Multiplication: 2-D Partitioning,Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column all-to-one reduction in each ro

11、w.,Matrix-Vector Multiplication: 2-D Partitioning,When using fewer than n2 processors, each process owns an block of the matrix. The vector is distributed in portions of elements in the last process-column only. In this case, the message sizes for the alignment, broadcast, and reduction are all The

12、computation is a product of an submatrix with a vector of length .,Matrix-Vector Multiplication: 2-D Partitioning,The first alignment step takes time The broadcast and reductions take time Local matrix-vector products take time Total time is,Example: Code of MV,See Reference,17,Outline,Matrix-Matrix

13、 Multiplication Matrix-Matrix Multiplication,Matrix-Matrix Multiplication,Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C =A x B. The serial complexity is O(n3). A useful concept in this case is called block operations. In this view, an n x

14、n matrix A can be regarded as a q x q array of blocks Ai,j (0 i, j q) such that each block is an (n/q) x (n/q) submatrix. In this view, we perform q3 matrix multiplications, each involving (n/q) x (n/q) matrices.,Matrix-Matrix Multiplication,Consider two n x n matrices A and B partitioned into p blo

15、cks Ai,j and Bi,j (0 i, j ) of size each. Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 k . All-to-all broadcast blocks of A along rows and B along columns. Perform local submatrix multi

16、plication.,Matrix-Matrix Multiplication,The two broadcasts take time The computation requires multiplications of sized submatrices. The parallel run time is approximately Major drawback of the algorithm is that it is not memory optimal.,Matrix-Matrix Multiplication: Cannons Algorithm,In this algorithm, we schedule the computations of the processes of the ith row such that, at any given time, each p

展开阅读全文