HPC - Dot product

Moreno Marzolla

Last updated: 2024-01-20

Familiarize yourself with the environment

The server has three identical GPUs (NVIDIA GeForce GTX 1070). The first one is used by default; another GPU can be selected either programmatically (cudaSetDevice(0) selects the first GPU, cudaSetDevice(1) the second one, and so on) or by setting the environment variable CUDA_VISIBLE_DEVICES when launching the program.

For example,

    CUDA_VISIBLE_DEVICES=0 ./cuda-stencil1d

runs cuda-stencil1d on the first GPU (default), while

    CUDA_VISIBLE_DEVICES=1 ./cuda-stencil1d

runs the program on the second GPU.

Run deviceQuery from the command line to display the hardware features of the GPUs.
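Some of the same information can also be read from inside a program. The following is a minimal sketch, not part of the course material: list_devices() is a hypothetical helper name, and only a handful of the fields of cudaDeviceProp are printed. It enumerates the visible GPUs and then selects the last one with cudaSetDevice().

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper (not part of the course material): prints a few
   properties of every visible GPU and returns the number of devices,
   or -1 if the CUDA runtime reports an error. */
int list_devices(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            return -1;
        }
        printf("Device %d: %s, compute capability %d.%d, "
               "max threads per block %d\n",
               dev, prop.name, prop.major, prop.minor,
               prop.maxThreadsPerBlock);
    }
    return count;
}

int main(void)
{
    const int count = list_devices();
    if (count < 1) {
        return 1;
    }
    /* Select the last visible GPU; device 0 is the default. */
    cudaSetDevice(count - 1);
    int current = -1;
    cudaGetDevice(&current);
    printf("Now using device %d\n", current);
    return 0;
}
```

Note that devices are numbered among the visible ones only: when the program is started with CUDA_VISIBLE_DEVICES=1, it sees a single GPU, which it addresses as device 0.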

Scalar product

The program cuda-dot.cu computes the dot product of two arrays x[] and y[] of length \(n\). Modify the program to use the GPU, by defining a suitable kernel and modifying the dot() function to use it. The dot product \(s\) of two arrays x[] and y[] is defined as

\[ s = \sum_{i=0}^{n-1} x[i] \times y[i] \]
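For reference, the definition translates directly into a serial loop. The sketch below is an assumption about what a CPU version might look like (the supplied cuda-dot.cu is not reproduced here, and dot_serial() is a hypothetical name); it can be handy for checking the result computed by the GPU.

```cuda
#include <stdio.h>

/* Serial reference (hypothetical helper): returns the sum of
   x[i] * y[i] for i = 0, ..., n-1. */
float dot_serial(const float *x, const float *y, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        s += x[i] * y[i];
    }
    return s;
}

int main(void)
{
    const float x[] = {1.0f, 2.0f, 3.0f};
    const float y[] = {4.0f, 5.0f, 6.0f};
    /* 1*4 + 2*5 + 3*6 = 32 */
    printf("%f\n", dot_serial(x, y, 3));
    return 0;
}
```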

In this exercise we implement a simple (although not efficient) approach where we use a single block of BLKDIM threads. The algorithm works as follows:

  1. The CPU allocates a float array d_tmp[] of length BLKDIM on the GPU, in addition to a copy of x[] and y[].

  2. The CPU launches a kernel with a single 1D thread block containing BLKDIM threads; use the maximum number of threads per block supported by the hardware, which is BLKDIM = 1024.

  3. Thread \(t\) (\(t = 0, \ldots, \mathit{BLKDIM}-1\)) computes the value of the expression \(x[t] \times y[t] + x[t + \mathit{BLKDIM}] \times y[t + \mathit{BLKDIM}] + x[t + 2 \times \mathit{BLKDIM}] \times y[t + 2 \times \mathit{BLKDIM}] + \ldots\), taking only the terms whose index is less than \(n\), and stores the result in d_tmp[t] (see Figure 1).

  4. When the kernel terminates, the CPU transfers d_tmp[] back to host memory and performs a sum-reduction to compute the final result.

Figure 1

Your program must work correctly for any value of \(n\), even if it is not a multiple of BLKDIM.
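The four steps above can be sketched as follows. This is one possible way to organize the code, not the reference solution: the kernel name dot_kernel, the signature of dot() and the test values in main() are assumptions, and error checking of the CUDA calls is omitted for brevity. The loop condition i < n is what makes the kernel work when \(n\) is not a multiple of BLKDIM; threads with \(t \geq n\) simply store 0.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLKDIM 1024

/* Step 3: thread t accumulates x[t]*y[t] + x[t+BLKDIM]*y[t+BLKDIM] + ...
   The bound i < n skips indices past the end of the arrays. */
__global__ void dot_kernel(const float *x, const float *y, int n, float *tmp)
{
    const int t = threadIdx.x;
    float s = 0.0f;
    for (int i = t; i < n; i += blockDim.x) {
        s += x[i] * y[i];
    }
    tmp[t] = s;
}

/* Assumed signature; the dot() function in cuda-dot.cu may differ. */
float dot(const float *x, const float *y, int n)
{
    float tmp[BLKDIM];
    float *d_x, *d_y, *d_tmp;
    const size_t size = n * sizeof(float);

    /* Step 1: allocate device memory and copy the inputs. */
    cudaMalloc((void **)&d_x, size);
    cudaMalloc((void **)&d_y, size);
    cudaMalloc((void **)&d_tmp, sizeof(tmp));
    cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);

    /* Step 2: launch a single 1D block of BLKDIM threads. */
    dot_kernel<<<1, BLKDIM>>>(d_x, d_y, n, d_tmp);

    /* Step 4: copy the partial sums back (cudaMemcpy waits for the
       kernel to finish) and reduce them on the CPU. */
    cudaMemcpy(tmp, d_tmp, sizeof(tmp), cudaMemcpyDeviceToHost);
    float result = 0.0f;
    for (int i = 0; i < BLKDIM; i++) {
        result += tmp[i];
    }

    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_tmp);
    return result;
}

int main(void)
{
    const int n = 3000; /* deliberately not a multiple of BLKDIM */
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    /* Every product is 2, so the exact result is 2 * 3000 = 6000. */
    printf("%f\n", dot(x, y, n));
    free(x);
    free(y);
    return 0;
}
```

Using a stride of blockDim.x means that, at each iteration, consecutive threads read consecutive array elements. The final sum over BLKDIM partial results is done on the CPU here only because the approach is deliberately simple.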

A better way to compute a reduction will be shown in future lectures.

To compile:

    nvcc cuda-dot.cu -o cuda-dot -lm

To execute:

    ./cuda-dot [len]

Example:

    ./cuda-dot

Files