Last updated: 2022-10-12
The file omp-inclusive-scan.c contains a serial program that computes the prefix sum of an array v[]
of length \(n\) by applying the inclusive scan operator; the result is stored in s[]
. At the end of the operation we therefore have \(s[i] = v[0] + v[1] + \ldots + v[i]\) for each \(i = 0, \ldots, n-1\).
Since in shared-memory programming it is often assumed that the number \(P\) of threads is much lower than the problem size (\(P \ll n\)), it is not possible to impelment the tree-based computation shown during the class, which is more appropriate for GPUs. Therefore, we use a different strategy, called blocked scan (refer to Figure 1, where \(P = 4\) OpenMP threads are assumed).
Each thread operates on portions of v[]
and s[]
whose endpoints are determined appropriately.
Each thread performs an inclusive scan of its own portion of v[]
using the serial algorithm. The result is stored in the corresponding portion of s[]
.
The value of the last element of each block of s[]
is copied to a temporary shared array blksum[]
of length \(P\).
The master computes the inclusive scan blksum[]
, storing the result on a new array blksum_s[]
; it would also be possible to store the result inside blksum []
without use a separate array.
Each thread \(p\) adds blksum_[p]
to all elements of their own portion of s[]
.
At the end, s[]
contains the sum-inclusive scan v[]
Stages 1 and 4 can be done in parallel by OpenMP threads since they are embarrassingly parallel.
Compile with:
gcc -std=c99 -Wall -Wpedantic -fopenmp omp-inclusive-scan.c -o omp-inclusive-scan
Run with:
OMP_NUM_THREADS=2 ./omp-inclusive-scan