Good Examples from the High-Performance Computing Forum


November 23

Combining CUDA code with C++ code

As a C++ fan, I write most of my programs in C++. However, at the time of writing a CUDA kernel can only be written in C (more precisely, the C subset supported in C++), not C++, so you cannot write C++ code in .cu files. What you can do is write C++ code that invokes the functions defined in .cu files. That is how we integrate CUDA code into C++ programs. For example, you can write code like this:

    //============================================
    // main.cpp
    // this is the C++ part of the program
    #include <iostream>
    using namespace std;

    // note that the functions defined in .cu files use C linkage, hence this extern "C"
    extern "C" void do_square(float* a, int N);

    int main(void)
    {
        const int N = 1000;
        float* h_a = new float[N];
        for (int i = 0; i < N; i++) h_a[i] = i;
        do_square(h_a, N);
        for (int i = 0; i < N; i++) cout << h_a[i] << "\t";
        delete[] h_a;   // array form, to match new float[N]
    }

    //============================================
    // do_square.cu
    // this is the C part of the program
    __global__ void square_array(float* a, int N)
    {
        // this is the kernel function
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) a[i] *= a[i];   // guard against threads past the end of the array
    }

    extern "C" void do_square(float* h_a, int N)
    {
        // the do_square function can be invoked by C++ code
        size_t size = N * sizeof(float);
        float* d_a;
        cudaMalloc((void**)&d_a, size);
        cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
        int block_size = 40;
        int n_blocks = (N + block_size - 1) / block_size;   // round up so every element is covered
        square_array<<<n_blocks, block_size>>>(d_a, N);
        cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
        cudaFree(d_a);
    }

This program consists of two files: the C++ file main.cpp and the CUDA file do_square.cu, where the kernel function is defined. The main function simply sets up the parameters and passes them to the C function do_square, where the real work is done. Note that do_square has C linkage: it is defined with extern "C" in do_square.cu, and NVCC treats .cu files as C code by default at the time of writing. The declaration of do_square in main.cpp therefore needs extern "C" as well. This is the basic technique for invoking CUDA code from C++ programs. In fact, you can invoke CUDA functions from programs written in other languages in the same way.

    // header names were stripped from the post; these are the ones the code needs
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #define blocksize 16

    __global__ void updata(float* data)
    {
        int j = threadIdx.y + blocksize * blockIdx.y;
        int i = threadIdx.x + blocksize * blockIdx.x;
        data[j * 100 + i] = data[j * 100 + i] + 123;
        __syncthreads();
    }

    /* HelloCUDA */
    int main(int argc, char* argv[])
    {
        float *d_data, *h_data;
        cudaMalloc((void**)&d_data, sizeof(float) * 100 * 40);
        h_data = (float*)malloc(sizeof(float) * 100 * 40);
        cudaMemset(d_data, 0, sizeof(float) * 100 * 40);
        dim3 Grid((100 + blocksize - 1) / blocksize, (40 + blocksize - 1) / blocksize);
        dim3 Block(blocksize, blocksize);
        updata<<<Grid, Block>>>(d_data);
        cudaMemcpy(h_data, d_data, sizeof(float) * 100 * 40, cudaMemcpyDeviceToHost);
        FILE* fp1;
        fp1 = fopen("Hz.txt", "w+");
        for (int i = 0; i < 40; i++)
            for (int j = 0; j < 100; j++)
                fprintf(fp1, "%d %d %f\n", i, j, h_data[i * 100 + j]);   // the start of the format string was lost in the post
        cudaFree(d_data);
        free(h_data);
    }

This small program adds 123 to every array element and copies the result back. Why is the result correct when BLOCKSIZE is 8, but every value comes out as -0.001327 when it is 16? Where is the problem? The graphics card is a 9400GT. I am really confused.

3. Help: my CUDA program has a problem

Below is my program; the init CUDA part is copied from an SDK example. The kernel stores the binary form of each thread's index in an array (for example, for thread index 8 the binary form is 00001000, which is stored as P[8] = {0,0,0,0,1,0,0,0}). Each thread then sums the products of P with the corresponding entries of the arrays V and G. The results are stored in the thread's own slots, AllTotal[i] and AllProfit[i]. Finally, these two arrays have to be copied back to host memory.
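The per-thread computation described above can be sketched on the host in plain C++, which makes it easy to check individual thread indices before moving the logic into a kernel. This is only an illustration under stated assumptions: the function name eachSumHost and the use of std::vector are hypothetical and not part of the post, and the most significant bit is assumed to map to P[0], as in the 00001000 example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side version of the per-thread work.
// Bit j of `id` (most significant of V.size() bits first) plays the role
// of P[j]; it selects whether item j contributes to the two sums.
// Assumes V.size() does not exceed the bit width of unsigned long.
std::pair<float, float> eachSumHost(unsigned long id,
                                    const std::vector<float>& V,
                                    const std::vector<float>& G) {
    assert(V.size() == G.size());
    const std::size_t n = V.size();
    float total = 0.0f;
    float profit = 0.0f;
    for (std::size_t j = 0; j < n; ++j) {
        const unsigned long bit = (id >> (n - 1 - j)) & 1UL;  // P[j]
        total += static_cast<float>(bit) * V[j];
        profit += static_cast<float>(bit) * G[j];
    }
    return {total, profit};
}
```

With eight items, calling eachSumHost(8, V, G) selects only item 4 (counting from zero), which matches the 00001000 example in the description.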

勇哥 (Brother Yong), please take a look for me.

    // header names were stripped from the post; these are the ones the code needs
    #include <stdio.h>
    #include <stdlib.h>
    using namespace std;
    #include <cuda_runtime.h>
    #include <cutil.h>

    #define NumPro 14
    #define BLOCK_SIZE 16
    #define GRID_SIZE 8
    #define AllCondition BLOCK_SIZE*BLOCK_SIZE*GRID_SIZE*GRID_SIZE

    /* Init CUDA */
    #if __DEVICE_EMULATION__
    bool InitCUDA(void) { return true; }
    #else
    bool InitCUDA(void)
    {
        int count = 0;
        int i = 0;
        cudaGetDeviceCount(&count);
        if (count == 0) {
            fprintf(stderr, "There is no device.\n");
            return false;
        }
        // pick the first device that reports compute capability 1.0 or higher
        for (i = 0; i < count; i++) {
            cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
                if (prop.major >= 1) break;
            }
        }
        if (i == count) {
            fprintf(stderr, "There is no device supporting CUDA.\n");
            return false;
        }
        cudaSetDevice(i);
        printf("CUDA initialized.\n");
        return true;
    }
    #endif

    /* kernel */
    __global__ void eachSum(float* AllTotal, float* AllProfit)
    {
        // these two arrays can be set as you like (values omitted in the post)
        float V[NumPro] = { }, G[NumPro] = { };
        int P[NumPro];
        float eachTotal = 0, eachProfit = 0;
        unsigned long int i = blockIdx.y * BLOCK_SIZE * GRID_SIZE + blockIdx.x * GRID_SIZE
                            + threadIdx.y * BLOCK_SIZE + threadIdx.x;
        for (int j = 0; j < NumPro; j++) {
            // this step was garbled in the post; reconstructed from the description above
            unsigned long int k = (1 << (NumPro - 1 - j));
            P[j] = (i & k) ? 1 : 0;
            eachTotal += P[j] * V[j];
            eachProfit += P[j] * G[j];
        }
        __syncthreads();
        AllTotal[i] = eachTotal;
        AllProfit[i] = eachProfit;
    }

    /* main() */
    int main(int argc, char* argv[])
    {
        if (!InitCUDA()) return 0;
        float *h_AllTotal = (float*)malloc(AllCondition * sizeof(float)),
              *h_AllProfit = (float*)malloc(AllCondition * sizeof(float));
        float *d_AllTotal, *d_AllProfit;
        CUDA_SAFE_CALL(cudaMalloc((void**)d_AllTotal, AllCondition * sizeof(float)));
        CUDA_SAFE_CALL(cudaMalloc((void**)d_AllProfit, AllCondition * sizeof(float)));
        unsigned int timer = 0;
        CUT_SAFE_CALL(cutCreateTimer(&timer));
        CUT_SAFE_CALL(cutStartTimer(timer));
        dim3 grid(GRID_SIZE, GRID_SIZE);
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        eachSum<<<grid, block>>>(d_AllTotal, d_AllProfit);
        CUT_CHECK_ERROR("Kernel execution failed\n");
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUT_SAFE_CALL(cutStopTimer(timer));
        printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
        CUT_SAFE_CALL(cutDeleteTimer(timer));
        CUDA_SAFE_CALL(cudaMemcpy(h_AllTotal, d_AllTotal, sizeof(float) * AllCondition, cudaMemcpyDeviceToHost));
        CUDA_SAFE_CALL(cudaMemcpy(h_AllProfit, d_AllProfit, sizeof(float) * AllCondition, cudaMemcpyDeviceToHost));
        free(h_AllProfit);

[The listing is cut off here in the source.]
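Both questions above seem to come down to how the launch geometry maps onto the arrays, and that can be checked on the host without a GPU. The following is a rough sketch, not a confirmed diagnosis: the helper names outOfBoundsThreads and distinctSlots are made up for illustration, and the "corrected" index formula is one common way to flatten a 2D grid of 2D blocks, assumed here rather than taken from the posts.

```cpp
#include <cassert>
#include <set>

// For a width x height array indexed as j*width + i, count the launched
// threads whose flat index lands past the end of the allocation when the
// grid is rounded up to whole block x block tiles (as in the updata program).
int outOfBoundsThreads(int width, int height, int block) {
    assert(width > 0 && height > 0 && block > 0);
    const int threadsX = ((width + block - 1) / block) * block;
    const int threadsY = ((height + block - 1) / block) * block;
    int count = 0;
    for (int j = 0; j < threadsY; ++j)
        for (int i = 0; i < threadsX; ++i)
            if (j * width + i >= width * height) ++count;
    return count;
}

// Count how many distinct output slots a grid x grid launch of
// block x block threads reaches.
// posted == true  : index formula as it appears in the eachSum listing
// posted == false : assumed fix, (blockId * threadsPerBlock + threadId)
int distinctSlots(int block, int grid, bool posted) {
    assert(block > 0 && grid > 0);
    std::set<long> slots;
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < block; ++ty)
                for (int tx = 0; tx < block; ++tx) {
                    const long idx = posted
                        ? static_cast<long>(by) * block * grid + bx * grid
                              + ty * block + tx
                        : (static_cast<long>(by) * grid + bx) * block * block
                              + ty * block + tx;
                    slots.insert(idx);
                }
    return static_cast<int>(slots.size());
}
```

For the 100 x 40 array in the updata program, outOfBoundsThreads(100, 40, 8) gives 4 and outOfBoundsThreads(100, 40, 16) gives 908. Block size 16 therefore writes far past the end of the allocation, which may be why the kernel fails and the host buffer comes back as garbage; guarding the write with if (i < 100 && j < 40) would be the usual remedy. For eachSum, distinctSlots(16, 8, true) gives 1208 while distinctSlots(16, 8, false) gives 16384, which suggests the posted index formula makes many threads share a slot. Separately, the listing as shown passes d_AllTotal rather than &d_AllTotal to cudaMalloc, which would leave the device pointers unset.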
