HPC - Dot product

Moreno Marzolla

Last updated: 2022-11-17

Getting familiar with the environment

The server has three identical GPUs (NVIDIA GeForce GTX 1070). The first one is used by default, but another card can be selected either programmatically (cudaSetDevice(0) selects the first GPU, cudaSetDevice(1) the second one, and so on) or through the environment variable CUDA_VISIBLE_DEVICES.

For example,

    CUDA_VISIBLE_DEVICES=0 ./cuda-stencil1d

runs cuda-stencil1d on the first GPU (default), while

    CUDA_VISIBLE_DEVICES=1 ./cuda-stencil1d

runs the program on the second GPU.
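The programmatic route could look roughly like the sketch below. The helper name select_device is made up for illustration and is not part of the exercise code; it only relies on the CUDA runtime calls cudaGetDeviceCount() and cudaSetDevice(). Note that device numbers are relative to the devices left visible by CUDA_VISIBLE_DEVICES, so with CUDA_VISIBLE_DEVICES=1 the second physical GPU appears to the program as device 0.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper (not part of the exercise code): make GPU number
   `requested` the current device for subsequent CUDA calls.
   Returns 0 on success, -1 on failure. */
int select_device(int requested)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }
    /* Valid device numbers are 0 .. count-1 */
    if (requested < 0 || requested >= count) {
        fprintf(stderr, "Device %d not available (%d device(s) visible)\n",
                requested, count);
        return -1;
    }
    err = cudaSetDevice(requested);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return 0;
}
```

Calling select_device(1) at the beginning of main(), before any allocation or kernel launch, would direct the following CUDA calls to the second visible GPU.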

Run deviceQuery from the command line to display the hardware features of the GPUs.
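Part of the same information can also be read from within a program. The following sketch assumes a hypothetical helper called max_threads_per_block, which is not part of the exercise code; it uses cudaGetDeviceProperties() to print the name of a device and its maximum number of threads per block, the value that matters for the exercise below.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper (not part of the exercise code): print the name of
   device `dev` and return its maximum number of threads per block,
   or -1 if the device cannot be queried. */
int max_threads_per_block(int dev)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    printf("Device %d: %s, max threads per block: %d\n",
           dev, prop.name, prop.maxThreadsPerBlock);
    return prop.maxThreadsPerBlock;
}
```

The returned value can be compared with the block size chosen for a kernel launch, to make sure the launch configuration is supported by the hardware.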

Scalar product

The program cuda-dot.cu computes the dot product of two arrays x[] and y[] of length \(n\). Modify the program to use the GPU by transforming the dot() function into a kernel. The dot product \(s\) of two arrays x[] and y[] is defined as

\[ s = \sum_{i=0}^{n-1} x[i] \times y[i] \]
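For reference, the definition above translates directly into a serial loop on the CPU. The sketch below is not necessarily the code found in cuda-dot.cu: the name dot_serial and the float element type are assumptions made for illustration.

```cuda
/* Serial (host) dot product, following the definition of s above.
   The name dot_serial and the float element type are assumptions;
   the provided cuda-dot.cu may differ. */
float dot_serial(const float *x, const float *y, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        s += x[i] * y[i];
    }
    return s;
}
```

A function like this is handy for checking the result computed by the GPU version on small inputs.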

Some modifications to the dot() function are required to use the GPU. In this exercise we implement a simple (although inefficient) approach that uses a single block of BLKDIM threads. The algorithm works as follows:

  1. The CPU allocates a tmp[] array of BLKDIM elements on the GPU, in addition to copies of x[] and y[].

  2. The CPU launches the kernel on a single 1D thread block of BLKDIM threads; use the maximum number of threads per block supported by the hardware, which is BLKDIM = 1024.

  3. Thread \(t\) (\(t = 0, \ldots, \mathit{BLKDIM}-1\)) computes the value of the expression \((x[t] \times y[t] + x[t + \mathit{BLKDIM}] \times y[t + \mathit{BLKDIM}] + x[t + 2 \times \mathit{BLKDIM}] \times y[t + 2 \times \mathit{BLKDIM}] + \ldots)\), where only indices smaller than \(n\) are included, and stores the result in tmp[t] (see Figure 1).

  4. When the kernel terminates, the CPU transfers tmp[] back to host memory and performs a sum-reduction to compute the final result.
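The four steps above could be put together roughly as follows. This is one possible sketch rather than the reference solution: it assumes that the arrays hold float values, and the names dot_kernel, d_x, d_y and d_tmp are made up for illustration. Checking the return value of the CUDA calls is omitted to keep the example short.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define BLKDIM 1024

/* Step 3: thread t accumulates x[i] * y[i] for i = t, t + BLKDIM,
   t + 2*BLKDIM, ... as long as i < n, then stores the partial sum. */
__global__ void dot_kernel(const float *x, const float *y, int n, float *tmp)
{
    const int t = threadIdx.x;
    float s = 0.0f;
    for (int i = t; i < n; i += blockDim.x) {
        s += x[i] * y[i];
    }
    tmp[t] = s; /* stays 0 if this thread had no element to process */
}

/* Host-side wrapper (float element type is an assumption). */
float dot(const float *x, const float *y, int n)
{
    float tmp[BLKDIM];
    float *d_x, *d_y, *d_tmp;
    const size_t size = (size_t)n * sizeof(float);

    /* Step 1: allocate device memory and copy the inputs */
    cudaMalloc((void **)&d_x, size);
    cudaMalloc((void **)&d_y, size);
    cudaMalloc((void **)&d_tmp, sizeof(tmp));
    cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);

    /* Step 2: launch one 1D block of BLKDIM threads */
    dot_kernel<<<1, BLKDIM>>>(d_x, d_y, n, d_tmp);

    /* Step 4: copy the partial sums back (this cudaMemcpy waits for the
       kernel to finish) and reduce them on the CPU */
    cudaMemcpy(tmp, d_tmp, sizeof(tmp), cudaMemcpyDeviceToHost);
    float result = 0.0f;
    for (int i = 0; i < BLKDIM; i++) {
        result += tmp[i];
    }

    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_tmp);
    return result;
}
```

The loop condition i < n inside the kernel is what lets this sketch handle lengths that are not a multiple of BLKDIM, including lengths smaller than BLKDIM.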

Figure 1

Your program must work correctly for any value of \(n\), even if it is not a multiple of BLKDIM.
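One way to see what this requirement implies is to write the partial sum of thread \(t\) with an explicit bound on the index:

\[ \mathit{tmp}[t] = \sum_{\substack{k \geq 0 \\ t + k \times \mathit{BLKDIM} < n}} x[t + k \times \mathit{BLKDIM}] \times y[t + k \times \mathit{BLKDIM}] \]

As a small worked example, take \(n = 10\) and, for illustration only, a tiny block of \(\mathit{BLKDIM} = 4\) threads:

\[ \begin{aligned} \mathit{tmp}[0] &= x[0] \times y[0] + x[4] \times y[4] + x[8] \times y[8] \\ \mathit{tmp}[1] &= x[1] \times y[1] + x[5] \times y[5] + x[9] \times y[9] \\ \mathit{tmp}[2] &= x[2] \times y[2] + x[6] \times y[6] \\ \mathit{tmp}[3] &= x[3] \times y[3] + x[7] \times y[7] \end{aligned} \]

Threads 2 and 3 process one element fewer than threads 0 and 1. When \(n < \mathit{BLKDIM}\), the sum is empty for every thread with \(t \geq n\); those threads must still store 0 in tmp[t], otherwise the final reduction on the CPU would add uninitialized values.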

A better way to compute a reduction will be shown in future lectures.

To compile:

    nvcc cuda-dot.cu -o cuda-dot -lm

To execute:

    ./cuda-dot [len]