Last updated: 2022-11-17

The server has three identical GPUs (NVIDIA GeForce GTX 1070). The first one is used by default, although it is possible to select another card either programmatically (`cudaSetDevice(0)` uses the first GPU, `cudaSetDevice(1)` uses the second one, and so on), or using the environment variable `CUDA_VISIBLE_DEVICES`.
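For instance, a host program can enumerate the available GPUs and select one programmatically. This is a minimal sketch (error checking omitted; both runtime calls are standard CUDA Runtime API functions):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main( void )
{
    int ndev;
    cudaGetDeviceCount(&ndev);  /* number of GPUs visible to this process */
    printf("%d CUDA device(s) found\n", ndev);
    cudaSetDevice(1);           /* select the second GPU for subsequent calls */
    return 0;
}
```

Note that `CUDA_VISIBLE_DEVICES` is applied first: if it is set, `cudaSetDevice()` indices refer only to the devices that remain visible.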

For example

`CUDA_VISIBLE_DEVICES=0 ./cuda-stencil1d`

runs `cuda-stencil1d` on the first GPU (default), while

`CUDA_VISIBLE_DEVICES=1 ./cuda-stencil1d`

runs the program on the second GPU.

Run `deviceQuery` from the command line to display the hardware features of the GPUs.

The program `cuda-dot.cu` computes the dot product of two arrays `x[]` and `y[]` of length \(n\). Modify the program to use the GPU, by transforming the `dot()` function into a kernel. The dot product \(s\) of two arrays `x[]` and `y[]` is defined as

\[ s = \sum_{i=0}^{n-1} x[i] \times y[i] \]
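As a reference for checking the GPU version, the formula above can be computed serially. This is a minimal sketch: the function name matches the `dot()` function mentioned in the text, but the element type (`double`) and exact signature are assumptions and may differ from `cuda-dot.cu`:

```c
/* Serial dot product: s = x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1] */
double dot( const double *x, const double *y, int n )
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        s += x[i] * y[i];
    }
    return s;
}
```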

Some modifications of the `dot()` function are required to use the GPU. In this exercise we implement a simple (although not efficient) approach where we use a *single* block of *BLKDIM* threads. The algorithm works as follows:

1. The CPU allocates a `tmp[]` array of *BLKDIM* elements on the GPU, in addition to a copy of `x[]` and `y[]`.

2. The CPU launches a single 1D thread block containing *BLKDIM* threads; use the maximum number of threads per block supported by the hardware, which is *BLKDIM = 1024*.

3. Thread \(t\) (\(t = 0, \ldots, \mathit{BLKDIM}-1\)) computes the value of the expression \((x[t] \times y[t] + x[t + \mathit{BLKDIM}] \times y[t + \mathit{BLKDIM}] + x[t + 2 \times \mathit{BLKDIM}] \times y[t + 2 \times \mathit{BLKDIM}] + \ldots)\) and stores the result in `tmp[t]` (see Figure 1).

4. When the kernel terminates, the CPU transfers `tmp[]` back to host memory and performs a sum-reduction to compute the final result.
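The per-thread computation described above can be sketched as a CUDA kernel. The names `tmp` and `BLKDIM` come from the text; the kernel name `dot_kernel`, the `double` element type, and the pointer parameters are illustrative assumptions, not taken from `cuda-dot.cu`:

```cuda
#define BLKDIM 1024

/* Single-block dot product: thread t accumulates the products
   x[i]*y[i] for i = t, t+BLKDIM, t+2*BLKDIM, ... and writes its
   partial sum to tmp[t]. The loop bound i < n makes the kernel
   correct for any n, including n not a multiple of BLKDIM. */
__global__ void dot_kernel( const double *x, const double *y,
                            double *tmp, int n )
{
    const int t = threadIdx.x;
    double s = 0.0;
    for (int i = t; i < n; i += BLKDIM) {
        s += x[i] * y[i];
    }
    tmp[t] = s;
}
```

The host side would launch it as `dot_kernel<<<1, BLKDIM>>>(d_x, d_y, d_tmp, n)` on device copies (here hypothetically named `d_x`, `d_y`, `d_tmp`), then copy `tmp[]` back with `cudaMemcpy()` and sum its *BLKDIM* elements.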

Your program must work correctly for any value of \(n\), even if it is not a multiple of *BLKDIM*.
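Why this works for any \(n\) can be checked with a plain-C simulation of the two-phase scheme (a hypothetical helper for illustration, not part of `cuda-dot.cu`; *BLKDIM* is reduced to 8 to keep the example small):

```c
#define BLKDIM 8   /* small value for illustration; the exercise uses 1024 */

/* CPU simulation of the GPU algorithm: each "thread" t computes a
   strided partial sum into tmp[t]; the partial sums are then reduced.
   Threads with t >= n contribute 0, so any n is handled correctly. */
double dot_two_phase( const double *x, const double *y, int n )
{
    double tmp[BLKDIM];
    for (int t = 0; t < BLKDIM; t++) {
        double s = 0.0;
        for (int i = t; i < n; i += BLKDIM) {
            s += x[i] * y[i];
        }
        tmp[t] = s;
    }
    double s = 0.0;             /* final sum-reduction, done by the CPU */
    for (int t = 0; t < BLKDIM; t++) {
        s += tmp[t];
    }
    return s;
}
```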

A better way to compute a reduction will be shown in future lectures.

To compile:

`nvcc cuda-dot.cu -o cuda-dot -lm`

To execute:

`./cuda-dot [len]`

Example:

`./cuda-dot`