HPC - Inclusive Scan

Moreno Marzolla

Last updated: 2022-10-12

The file omp-inclusive-scan.c contains a serial program that computes the prefix sum of an array v[] of length \(n\) by applying the inclusive scan operator; the result is stored in s[]. At the end of the operation we therefore have \(s[i] = v[0] + v[1] + \ldots + v[i]\) for each \(i = 0, \ldots, n-1\).

Since in shared-memory programming it is often assumed that the number \(P\) of threads is much lower than the problem size (\(P \ll n\)), it is not possible to impelment the tree-based computation shown during the class, which is more appropriate for GPUs. Therefore, we use a different strategy, called blocked scan (refer to Figure 1, where \(P = 4\) OpenMP threads are assumed).

Figure 1: Blocked inclusive scan
Figure 1: Blocked inclusive scan

Each thread operates on portions of v[] and s[] whose endpoints are determined appropriately.

  1. Each thread performs an inclusive scan of its own portion of v[] using the serial algorithm. The result is stored in the corresponding portion of s[].

  2. The value of the last element of each block of s[] is copied to a temporary shared array blksum[] of length \(P\).

  3. The master computes the inclusive scan blksum[], storing the result on a new array blksum_s[]; it would also be possible to store the result inside blksum [] without use a separate array.

  4. Each thread \(p\) adds blksum_[p] to all elements of their own portion of s[].

At the end, s[] contains the sum-inclusive scan v[]

Stages 1 and 4 can be done in parallel by OpenMP threads since they are embarrassingly parallel.

Compile with:

    gcc -std=c99 -Wall -Wpedantic -fopenmp omp-inclusive-scan.c -o omp-inclusive-scan

Run with:

    OMP_NUM_THREADS=2 ./omp-inclusive-scan

Files