Last updated: 2023-11-29
Write a program that reverses an array v[] of length \(n\), i.e., exchanges v[0] and v[n-1], v[1] and v[n-2], and so on. You should write the following functions:
reverse() reverses an array in[] into a different array out[] (the input is not modified). Assume that in[] and out[] reside on non-overlapping memory blocks.
inplace_reverse() reverses the array in[] "in place", i.e., exchanges elements using \(O(1)\) additional storage; therefore, you are not allowed to allocate a temporary output array.
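The actual contents of cuda-reverse.cu are not shown here, but the serial semantics of the two functions can be sketched as follows (a minimal sketch; the parameter names and the use of int elements are assumptions):

```c
#include <stdlib.h>

/* Copy in[] into out[] in reverse order; in[] is not modified. */
void reverse(const int *in, int *out, int n)
{
    for (int i = 0; i < n; i++) {
        out[n - 1 - i] = in[i];
    }
}

/* Reverse in[] in place: swap element i with element n-1-i,
   for i = 0 .. n/2 - 1, using O(1) extra storage. */
void inplace_reverse(int *in, int n)
{
    for (int i = 0; i < n / 2; i++) {
        const int tmp = in[i];
        in[i] = in[n - 1 - i];
        in[n - 1 - i] = tmp;
    }
}
```

Note that when \(n\) is odd the middle element stays where it is, which is why the in-place loop runs only \(\lfloor n/2 \rfloor\) times.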
The file cuda-reverse.cu provides a serial implementation of reverse() and inplace_reverse(). Your goal is to modify the functions to make use of the GPU, defining any additional kernels that are required.
reverse() can be parallelized by launching \(n\) CUDA threads; each thread copies a single element from the input to the output array. Since the array size \(n\) can be large, you should create as many one-dimensional thread blocks as needed to have at least \(n\) threads. Have a look at the lecture notes on how to do this.
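A possible kernel for this step might look like the following sketch (the kernel name, the block size BLKDIM, and the int element type are assumptions, not part of the provided source):

```cuda
#define BLKDIM 1024

/* Each thread copies one input element to its mirrored position
   in the output array. Threads with idx >= n do nothing. */
__global__ void reverse_kernel(const int *in, int *out, int n)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[n - 1 - idx] = in[idx];
    }
}

/* Possible launch from the host, with enough blocks to cover n threads:
   reverse_kernel<<<(n + BLKDIM - 1) / BLKDIM, BLKDIM>>>(d_in, d_out, n); */
```

Remember that the kernel operates on device copies of the arrays (d_in, d_out above), so the host code must also allocate device memory and copy the data back and forth with cudaMalloc() and cudaMemcpy().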
inplace_reverse() can be parallelized by launching \(\lfloor n/2 \rfloor\) CUDA threads (note the rounding): each thread swaps an element in the first half of in[] with the corresponding element in the second half.
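The swap step can be sketched as the following kernel (again, the name and element type are assumptions). Each thread touches a disjoint pair of elements, so no synchronization is needed:

```cuda
/* Thread idx swaps in[idx] with in[n - 1 - idx]; only the first
   floor(n/2) threads do useful work, the rest return immediately. */
__global__ void inplace_reverse_kernel(int *in, int n)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n / 2) {
        const int tmp = in[idx];
        in[idx] = in[n - 1 - idx];
        in[n - 1 - idx] = tmp;
    }
}
```

When \(n\) is odd, the middle element is not assigned to any thread and correctly stays in place.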
To map threads to array elements it is possible to use the standard global-index expression blockIdx.x * blockDim.x + threadIdx.x.
In both cases the program might create more threads than actually needed; special care must be taken to ensure that the extra threads do nothing, e.g., by guarding the kernel body with a bounds check such as if (idx < n) for reverse(), and if (idx < n/2) for inplace_reverse().
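The number of blocks needed to have "at least \(n\) threads" is the integer ceiling of \(n\) divided by the block size, which in C can be computed without floating point. A small sketch (the helper name is hypothetical):

```c
/* Number of one-dimensional blocks of size blkdim needed to cover
   at least n threads: the integer ceiling of n / blkdim. */
int num_blocks(int n, int blkdim)
{
    return (n + blkdim - 1) / blkdim;
}
```

With a block size of 1024, an array of 1024 elements needs one block, while 1025 elements need two; the guard inside the kernel then disables the surplus threads of the last block.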
To compile:
nvcc cuda-reverse.cu -o cuda-reverse
To execute:
./cuda-reverse [n]
Example:
./cuda-reverse