CUDA C Programming Notes
Chapter 4 Global Memory
4.4 What Bandwidth Can a Kernel Achieve?
4.4.2.2 Naive Transpose (the most basic transpose): Reading Rows vs. Reading Columns
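- Row-based naive transpose: a direct port of the host implementation [load by rows + store by columns]
column = row (each row of the input becomes a column of the output)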
// Naive transpose, row-based: load by rows, store by columns
__global__ void transposeNaiveRow(float *out, float *in, const int nx, const int ny)
{
    unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int iy = blockIdx.y * blockDim.y + threadIdx.y;

    if (ix < nx && iy < ny)
    {
        out[ix * ny + iy] = in[iy * nx + ix];  // column = row
    }
}

// Matching case in the host-side kernel selection
case 2:  // naive transpose, row-based
    kernel = &transposeNaiveRow;
    kernelName = "NaiveRow ";
    break;
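To make the index mapping concrete, the following is a minimal host-side sketch that applies the same mapping as transposeNaiveRow on the CPU and can serve as a reference when checking the device result. The names transposeHostRef and countMismatches are hypothetical and do not come from the original program.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of the original notes).
// Same mapping as transposeNaiveRow: read in[iy*nx + ix] row by row,
// write out[ix*ny + iy], so each input row becomes an output column.
void transposeHostRef(std::vector<float> &out, const std::vector<float> &in,
                      int nx, int ny)
{
    for (int iy = 0; iy < ny; ++iy)
        for (int ix = 0; ix < nx; ++ix)
            out[ix * ny + iy] = in[iy * nx + ix];
}

// Count the elements where a result copied back from the device differs
// from the reference; 0 means the two buffers match exactly.
int countMismatches(const std::vector<float> &ref, const std::vector<float> &gpu)
{
    int bad = 0;
    for (std::size_t i = 0; i < ref.size(); ++i)
        if (ref[i] != gpu[i]) ++bad;
    return bad;
}
```

Exact comparison is enough here because a transpose only moves values and never does arithmetic on them.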
- Column-based naive transpose: row = column (swap the read and write indices of the kernel above)
[load by columns + store by rows]
// Naive transpose, column-based: load by columns, store by rows
__global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny)
{
    unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int iy = blockIdx.y * blockDim.y + threadIdx.y;

    if (ix < nx && iy < ny)
    {
        out[iy * nx + ix] = in[ix * ny + iy];  // row = column
    }
}

// Matching case in the host-side kernel selection
case 3:  // naive transpose, column-based
    kernel = &transposeNaiveCol;
    kernelName = "NaiveCol ";
    break;
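The difference between the two kernels is which side of the assignment gets the strided access. The sketch below evaluates the index expressions of both kernels for two threads that are adjacent in x (consecutive ix, same iy) and reports the distance, in elements, between their load and store addresses. The Access struct and the function names are hypothetical helpers introduced only for this illustration.

```cpp
// Hypothetical helper type: element indices touched by one thread.
struct Access { long load; long store; };

// Index expressions copied from transposeNaiveRow.
Access naiveRowIndex(long ix, long iy, long nx, long ny)
{
    return Access{ iy * nx + ix, ix * ny + iy };
}

// Index expressions copied from transposeNaiveCol.
Access naiveColIndex(long ix, long iy, long nx, long ny)
{
    return Access{ ix * ny + iy, iy * nx + ix };
}

// Distance in elements between the accesses of threads ix = 0 and ix = 1
// (same iy), i.e. the stride seen by neighbouring threads in x.
Access strideAlongX(Access (*indexOf)(long, long, long, long), long nx, long ny)
{
    Access a = indexOf(0, 0, nx, ny);
    Access b = indexOf(1, 0, nx, ny);
    return Access{ b.load - a.load, b.store - a.store };
}
```

For the 2048 x 2048 matrix used in these notes, strideAlongX(naiveRowIndex, 2048, 2048) gives a load stride of 1 and a store stride of 2048, while strideAlongX(naiveColIndex, 2048, 2048) gives the opposite: the row-based kernel has coalesced loads and strided stores, and the column-based kernel has strided loads and coalesced stores.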
The output is shown below: the column-based kernel outperforms the row-based kernel.
~/cudaC/unit4$ ./4-6transpose 2
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 2
warmup elapsed 0.000559 sec
NaiveRow elapsed 0.000078 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 429.077698 GB
~/cudaC/unit4$ ./4-6transpose 3
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 3
warmup elapsed 0.000550 sec
NaiveCol elapsed 0.000061 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 549.755798 GB
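According to the published specifications, the theoretical peak memory bandwidth of the RTX 3090 is 936 GB/s.
The row-based kernel reaches 46% of the theoretical peak (36% in the book).
The column-based kernel reaches 59% of the theoretical peak (45% in the book).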
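The percentages above can be reproduced with a few lines of arithmetic. This sketch assumes the program computes effective bandwidth as bytes read plus bytes written divided by the elapsed time, which is consistent with the logged values; the function names are hypothetical.

```cpp
#include <cstddef>

// Effective bandwidth in GB/s, assuming every float element is read once
// and written once (hence the factor 2). Assumed formula, not copied from
// the original program.
double effectiveBandwidthGBs(int nx, int ny, double elapsedSec)
{
    double bytes = 2.0 * nx * ny * sizeof(float);
    return bytes / 1e9 / elapsedSec;
}

// Share of the theoretical peak bandwidth, in percent.
double percentOfPeak(double bandwidthGBs, double peakGBs)
{
    return 100.0 * bandwidthGBs / peakGBs;
}
```

With nx = ny = 2048 and the printed time of 0.000078 sec, effectiveBandwidthGBs gives about 430 GB/s; the small gap to the logged 429.08 comes from the elapsed time being rounded to six decimals in the output. percentOfPeak(429.077698, 936.0) and percentOfPeak(549.755798, 936.0) round to the 46% and 59% quoted above.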
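My guess is that this result comes from the strided reads being cached: data brought into the L1 cache by one access may not be used by that access, but it can still produce cache hits in later accesses.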
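Try disabling the L1 cache:
-Xptxas -dlcm=ca enables the L1 cache for global loads
-Xptxas -dlcm=cg disables the L1 cache for global loads (they are cached in L2 only)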
In the output below, the value after > or < at the end of a result line is the earlier measurement for the same kernel, added for comparison.
~/cudaC/unit4$ nvcc -arch=sm_86 -Xptxas -dlcm=cg 4-6transpose.cu -o 4-6transpose
~/cudaC/unit4$ ./4-6transpose 0
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 0
warmup elapsed 0.000549 sec
CopyRow elapsed 0.000056 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 601.442261 GB > 586.406189 GB
~/cudaC/unit4$ ./4-6transpose 1
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 1
warmup elapsed 0.000549 sec
CopyCol elapsed 0.000082 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 409.120605 GB < 515.521912 GB
~/cudaC/unit4$ ./4-6transpose 2
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 2
warmup elapsed 0.000552 sec
NaiveRow elapsed 0.000075 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 448.208557 GB > 429.077698 GB
~/cudaC/unit4$ ./4-6transpose 3
./4-6transpose starting transpose at device 0: NVIDIA GeForce RTX 3090 with matrix nx 2048 ny 2048 with kernel 3
warmup elapsed 0.000556 sec
NaiveCol elapsed 0.000069 sec <<< grid (128,128) block (16,16)>>> effective bandwidth 486.980927 GB < 549.755798 GB
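Result comparison (effective bandwidth in GB/s, taken from the runs above):
Kernel      Earlier run    With -dlcm=cg (L1 disabled)
CopyRow     586.41         601.44
CopyCol     515.52         409.12
NaiveRow    429.08         448.21
NaiveCol    549.76         486.98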
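Cached strided reads achieve the highest load throughput: once the data has been brought into the L1 cache, global memory reads benefit from better latency hiding and a higher L1 cache hit rate.
Caching loads in L1 can therefore limit the negative performance impact of strided loads.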
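The effect of disabling the L1 cache can be expressed as a relative change per kernel. The helper below is a hypothetical illustration; the inputs are the bandwidth pairs printed in the logs above.

```cpp
// Relative change in percent when going from `before` to `after`.
// Negative values mean the bandwidth dropped.
double relativeChangePercent(double before, double after)
{
    return 100.0 * (after - before) / before;
}
```

Applied to the logged pairs, this gives roughly +2.6% for CopyRow, -20.6% for CopyCol, +4.5% for NaiveRow and -11.4% for NaiveCol. The two kernels that read by columns lose bandwidth once loads bypass L1, while the two that read by rows change only slightly.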