
Date: 2025/9/8 12:28:56  Source: https://blog.csdn.net/weixin_44231807/article/details/145709888

CUDA C Programming Notes

Chapter 4: Global Memory
- 4.4 Bandwidth a Kernel Can Achieve
- - - 4.4.2.2 Naive Transpose (the most basic transpose): Reading Rows vs. Reading Columns

Open questions:

Chapter 4: Global Memory

4.4 Bandwidth a Kernel Can Achieve

4.4.2.2 Naive Transpose (the most basic transpose): Reading Rows vs. Reading Columns

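Row-based naive transpose: same indexing as the host implementation [load by row + store by column].
Each row of the input becomes a column of the output (column = row).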

// Naive transpose, row-based: load by row, store by column
__global__ void transposeNaiveRow(float *out, float *in, const int nx, const int ny)
{
    unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int iy = blockIdx.y * blockDim.y + threadIdx.y;

    if (ix < nx && iy < ny)
    {
        out[ix * ny + iy] = in[iy * nx + ix]; // column = row
    }
}

// Kernel selection in the host-side switch:
case 2: // naive transpose, row-based
    kernel = &transposeNaiveRow;
    kernelName = "NaiveRow       ";
    break;
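The index mapping used by the row-based kernel can be checked against a plain host loop. The following is a minimal sketch, not the code of the original program: the names `transposeHost` and `sameResult`, the use of `std::vector`, and the tolerance `eps` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host reference with the same index mapping as the row-based kernel:
// element (iy, ix) of the nx-wide input goes to element (ix, iy) of the
// ny-wide output. (Hypothetical helper, for illustration only.)
void transposeHost(std::vector<float> &out, const std::vector<float> &in,
                   int nx, int ny)
{
    assert(in.size() == static_cast<std::size_t>(nx) * ny);
    assert(out.size() == in.size());

    for (int iy = 0; iy < ny; ++iy)
    {
        for (int ix = 0; ix < nx; ++ix)
        {
            out[ix * ny + iy] = in[iy * nx + ix]; // column = row
        }
    }
}

// Returns true when both arrays have the same length and agree
// element-wise within eps. (Hypothetical helper; eps is an assumed tolerance.)
bool sameResult(const std::vector<float> &a, const std::vector<float> &b,
                float eps = 1e-6f)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        if (std::fabs(a[i] - b[i]) > eps)
            return false;
    }
    return true;
}
```

A typical use would be to copy the device output back to the host and pass it to `sameResult` together with the output of `transposeHost`. Because the transpose only moves values, the two results should match exactly.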

Column-based naive transpose: row = column (swap the read and write indices used above).
[load by column + store by row]

// Naive transpose, column-based: load by column, store by row
__global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny)
{
    unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int iy = blockIdx.y * blockDim.y + threadIdx.y;

    if (ix < nx && iy < ny)
    {
        out[iy * nx + ix] = in[ix * ny + iy]; // row = column
    }
}

// Kernel selection in the host-side switch:
case 3: // naive transpose, column-based
    kernel = &transposeNaiveCol;
    kernelName = "NaiveCol       ";
    break;
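The difference between "by row" and "by column" access can be made concrete by counting how many 32-byte memory sectors one warp touches. The sketch below is a simplified host-side model, not a measurement. It assumes a 16x16 thread block (so one warp covers two rows of 16 threads), a 32-byte sector size, a base pointer aligned to a sector boundary, and it looks only at the first warp of block (0,0). The function name `sectorsTouched` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct 32-byte sectors addressed by the first warp of
// block (0,0), for a 16x16 block. (Simplified model; caching is ignored.)
//   byColumn == false : index = iy * nx + ix  (row-wise access)
//   byColumn == true  : index = ix * ny + iy  (column-wise access)
std::size_t sectorsTouched(bool byColumn, std::size_t nx, std::size_t ny)
{
    const std::size_t blockDimX = 16;   // assumed block width
    const std::size_t warpSize = 32;    // threads per warp
    const std::size_t sectorBytes = 32; // assumed sector size

    std::set<std::size_t> sectors;
    for (std::size_t t = 0; t < warpSize; ++t)
    {
        std::size_t ix = t % blockDimX; // threadIdx.x
        std::size_t iy = t / blockDimX; // threadIdx.y
        std::size_t idx = byColumn ? ix * ny + iy : iy * nx + ix;
        sectors.insert(idx * sizeof(float) / sectorBytes);
    }
    return sectors.size();
}
```

Under these assumptions, with nx = ny = 2048, the row-wise pattern touches 4 sectors per warp and the column-wise pattern touches 16. In NaiveRow the load is the row-wise access and the store is the column-wise one; in NaiveCol the roles are swapped. The model says nothing about caching, which is why it cannot by itself predict which kernel is faster.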

The output is shown below. The column-based version outperforms the row-based one.

~/cudaC/unit4$ ./4-6transpose 2
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 2
warmup         elapsed 0.000559 sec
NaiveRow        elapsed 0.000078 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 429.077698 GB

~/cudaC/unit4$ ./4-6transpose 3
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 3
warmup         elapsed 0.000550 sec
NaiveCol        elapsed 0.000061 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 549.755798 GB

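According to the published specification, the theoretical peak bandwidth of this GPU is 936 GB/s.
The row-based kernel reaches 46% of the theoretical peak (the book reports 36%).
The column-based kernel reaches 59% of the theoretical peak (the book reports 45%).
My guess is that this result comes from the strided accesses being cached: data read into the L1 cache that is not used by the current access may still produce a hit in a later access.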
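The percentages above can be reproduced from the printed numbers. The sketch below assumes the usual effective-bandwidth definition for a transpose, bytes read plus bytes written divided by elapsed time, which is consistent with the output shown (2 x 2048 x 2048 x 4 bytes in about 0.000078 s). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Effective bandwidth in GB/s: every element is read once and written once.
// (Assumed definition; matches the figures printed by the test program.)
double effectiveBandwidthGB(std::size_t nx, std::size_t ny, double elapsedSec)
{
    assert(elapsedSec > 0.0);
    const double bytesMoved = 2.0 * static_cast<double>(nx) *
                              static_cast<double>(ny) * sizeof(float);
    return bytesMoved / 1e9 / elapsedSec;
}

// Share of the theoretical peak, in percent.
double percentOfPeak(double bandwidthGB, double peakGB)
{
    assert(peakGB > 0.0);
    return 100.0 * bandwidthGB / peakGB;
}
```

For example, `percentOfPeak(429.077698, 936.0)` is about 45.8 and `percentOfPeak(549.755798, 936.0)` is about 58.7, which round to the 46% and 59% quoted above. Feeding the rounded time 0.000078 s into `effectiveBandwidthGB` gives roughly 430 GB/s rather than 429.08, because the program prints the elapsed time with only six decimal places.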

Trying again with the L1 cache disabled

-Xptxas -dlcm=ca    enable L1 caching of global loads
-Xptxas -dlcm=cg    disable L1 caching of global loads (cache in L2 only)

~/cudaC/unit4$ nvcc -arch=sm_86 -Xptxas -dlcm=cg 4-6transpose.cu -o 4-6transpose

~/cudaC/unit4$ ./4-6transpose 0
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 0
warmup         elapsed 0.000549 sec
CopyRow        elapsed 0.000056 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 601.442261 GB   (> 586.406189 GB in the earlier run)

~/cudaC/unit4$ ./4-6transpose 1
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 1
warmup         elapsed 0.000549 sec
CopyCol        elapsed 0.000082 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 409.120605 GB   (< 515.521912 GB in the earlier run)

~/cudaC/unit4$ ./4-6transpose 2
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 2
warmup         elapsed 0.000552 sec
NaiveRow        elapsed 0.000075 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 448.208557 GB   (> 429.077698 GB in the earlier run)

~/cudaC/unit4$ ./4-6transpose 3
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 3
warmup         elapsed 0.000556 sec
NaiveCol        elapsed 0.000069 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 486.980927 GB   (< 549.755798 GB in the earlier run)