CUDA Programming: GPU Matrix Operations

Kernels cannot explicitly return a value; all result data must be written to an array passed to the function (if computing a scalar, you will probably pass a one-element array);
Kernels explicitly declare their thread hierarchy when called: that is, the number of thread blocks and the number of threads per block (note that while a kernel is compiled once, it can be called multiple times with different block sizes or grid sizes).
from numba import cuda

@cuda.jit
def my_kernel(io_array):
    """
    Code for kernel.
    """
    # code here
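As noted above, a kernel cannot return a value, so a scalar result is typically written into a one-element output array passed to the kernel. A minimal sketch of that pattern, assuming a hypothetical add_two_kernel launched with a single thread:

from numba import cuda
import numpy

@cuda.jit
def add_two_kernel(x, result):
    # The single launched thread writes the scalar result into result[0]
    result[0] = x[0] + x[1]

x = numpy.array([1.0, 2.0])
result = numpy.zeros(1)
add_two_kernel[1, 1](x, result)  # one block, one thread
print(result[0])  # 3.0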
Kernel invocation
A kernel is typically launched in the following way:
import numpy
# Create the data array - usually initialized some other way
data = numpy.ones(256)
# Set the number of threads in a block
threadsperblock = 32
# Calculate the number of thread blocks in the grid
blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock
# Now start the kernel
my_kernel[blockspergrid, threadsperblock](data)
# Print the result
print(data)
There are two main steps:
Instantiate the kernel, by specifying the number of blocks per grid and the number of threads per block. The product of the two gives the total number of threads launched. Kernel instantiation is done by taking the compiled kernel function and indexing it with a tuple of integers.
Run the kernel, by passing it the input array (and any separate output arrays if necessary). By default, running a kernel is synchronous: the function returns when the kernel has finished executing and the data has been synchronized back.
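Because the launch configuration is supplied at call time, the same compiled kernel can be re-launched with a different grid. A small sketch continuing the host code above (the larger array size is an illustrative assumption):

# Re-launch the same compiled kernel with a different grid size
more_data = numpy.ones(1024)
blockspergrid = (more_data.size + (threadsperblock - 1)) // threadsperblock
my_kernel[blockspergrid, threadsperblock](more_data)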
Inside the kernel, each thread works out which array element it is responsible for from its thread and block indices:
@cuda.jit
def my_kernel(io_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < io_array.size:  # Check array boundaries
        io_array[pos] *= 2  # do the computation
numba.cuda.threadIdx - The thread indices in the current thread block. For 1-dimensional blocks, the index (given by the x attribute) is an integer spanning the range from 0 to numba.cuda.blockDim - 1. A similar rule exists for each dimension when more than one dimension is used.
numba.cuda.blockDim - The shape of the block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is "full").
numba.cuda.blockIdx - The block indices in the grid of threads launched for the kernel. For a 1-dimensional grid, the index (given by the x attribute) is an integer spanning the range from 0 to numba.cuda.gridDim - 1. A similar rule exists for each dimension when more than one dimension is used.
numba.cuda.gridDim - The shape of the grid of blocks, i.e. the total number of blocks launched by this kernel invocation, as declared when instantiating the kernel.
These objects can be 1-, 2- or 3-dimensional, depending on how the kernel was invoked. To access the value at each dimension, use the x, y and z attributes of these objects, respectively.
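For example, a 2D kernel reads the x and y attributes of these objects to locate its element. A hedged sketch (the kernel name double_2d, the array shape and the block shape are illustrative assumptions):

import math
import numpy
from numba import cuda

@cuda.jit
def double_2d(io_array):
    # Row and column handled by this thread, computed from the 2D indices
    row = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    col = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y
    if row < io_array.shape[0] and col < io_array.shape[1]:
        io_array[row, col] *= 2

data2d = numpy.ones((48, 36))
threadsperblock = (16, 16)  # block and grid sizes can be given as tuples
blockspergrid = (math.ceil(data2d.shape[0] / threadsperblock[0]),
                 math.ceil(data2d.shape[1] / threadsperblock[1]))
double_2d[blockspergrid, threadsperblock](data2d)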
Absolute positions
Simple algorithms will tend to always use thread indices in the same way as shown in the example above. Numba provides additional facilities to automate such calculations:
numba.cuda.grid(ndim) - Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.
numba.cuda.gridsize(ndim) - Return the absolute size (or shape) in threads of the entire grid of blocks. ndim has the same meaning as in grid() above.
Using these functions, our example can become:
from __future__ import division
from numba import cuda
import numpy
import math
# CUDA kernel
@cuda.jit
def my_kernel(io_array):
? ? pos = cuda.grid(1)
? ? if pos < io_array.size:
? ? ? ? io_array[pos] *= 2 # do the computation
# Host code
data = numpy.ones(256)
threadsperblock = 256
blockspergrid = math.ceil(data.shape[0] / threadsperblock)
my_kernel[blockspergrid, threadsperblock](data)
print(data)
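cuda.gridsize() is useful when the data may be larger than the launched grid: each thread then strides through the array by the total number of threads, the common CUDA grid-stride idiom. A sketch continuing the code above (the kernel name and launch sizes are illustrative assumptions):

@cuda.jit
def my_kernel_strided(io_array):
    start = cuda.grid(1)       # absolute position of this thread
    stride = cuda.gridsize(1)  # total number of threads in the grid
    for pos in range(start, io_array.size, stride):
        io_array[pos] *= 2     # each thread handles several elements

big_data = numpy.ones(100000)
my_kernel_strided[32, 256](big_data)  # far fewer threads than elements
print(big_data)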
Memory management
Numba has been automatically transferring the NumPy arrays to the device when you invoke the kernel. However, it can only do so conservatively by always transferring the device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, it is possible to manually control the transfer.
device_array = cuda.device_array(shape)
Allocate an empty device ndarray. Similar to numpy.empty().
device_array = cuda.to_device(array)
Allocate and transfer a NumPy ndarray to the device.
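A sketch of the manual pattern, reusing data, my_kernel, blockspergrid and threadsperblock from the earlier example: the input is copied to the device once, and only the final result is transferred back with copy_to_host().

# Copy the input to the device explicitly; Numba will not copy it back automatically
device_data = cuda.to_device(data)
my_kernel[blockspergrid, threadsperblock](device_data)
# Transfer the result back to the host only when it is needed
result = device_data.copy_to_host()
print(result)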