The file cuda-matsum.cu computes the sum of two square matrices of size \(N \times N\) using the CPU. Modify the program to use the GPU; in particular, you must modify the function
matsum() in such a way that the new version is transparent to the caller, i.e., the caller is not aware whether the computation happens on the CPU or the GPU. To this end, the function must:
allocate memory on the device to store copies of \(p, q, r\);
copy \(p, q\) from the host to the device;
execute a kernel that computes the sum \(p + q\);
copy the result from the device back to the host;
free up device memory.
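
The five steps above can be sketched as follows. This is only one possible outline, not the reference solution: the signature of matsum(), the kernel name matsum_kernel, the block size BLKDIM and the d_ prefix for device pointers are all assumptions, and error checking on the CUDA calls is omitted for brevity. The sketch treats the matrices as flat arrays of \(N \times N\) floats and assigns one thread to each element.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLKDIM 256  /* assumed 1D block size; not taken from the original file */

/* One thread per element. The guard skips the surplus threads of the
   last block when len is not a multiple of the block size. */
__global__ void matsum_kernel(const float *p, const float *q, float *r, int len)
{
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < len) {
        r[idx] = p[idx] + q[idx];
    }
}

/* Assumed signature: p, q, r are host pointers to n*n floats, so the
   caller cannot tell that the work is done on the GPU. */
void matsum(const float *p, const float *q, float *r, int n)
{
    const int len = n * n;
    const size_t size = len * sizeof(float);
    float *d_p, *d_q, *d_r;

    /* 1. allocate device memory for copies of p, q, r */
    cudaMalloc((void **)&d_p, size);
    cudaMalloc((void **)&d_q, size);
    cudaMalloc((void **)&d_r, size);

    /* 2. copy the inputs from the host to the device */
    cudaMemcpy(d_p, p, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_q, q, size, cudaMemcpyHostToDevice);

    /* 3. launch enough blocks to cover all len elements (rounding up) */
    matsum_kernel<<<(len + BLKDIM - 1) / BLKDIM, BLKDIM>>>(d_p, d_q, d_r, len);

    /* 4. copy the result back; this cudaMemcpy waits for the kernel to finish */
    cudaMemcpy(r, d_r, size, cudaMemcpyDeviceToHost);

    /* 5. free device memory */
    cudaFree(d_p);
    cudaFree(d_q);
    cudaFree(d_r);
}
```

Because all device pointers are local to matsum(), the caller only ever sees host memory, which is what makes the change transparent.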
The program must work with any value of the matrix size \(N\), even if it is not an integer multiple of the CUDA block size. Note that there is no need to use shared memory: why?
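
One common way to handle an arbitrary \(N\) is to round the number of blocks up and let surplus threads do nothing. The sketch below illustrates this with a two-dimensional grid; the names blocks_for, matsum_kernel_2d and launch_matsum_2d, as well as the value of BLKDIM, are hypothetical and chosen only for illustration.

```cuda
#include <cuda_runtime.h>

#define BLKDIM 32  /* assumed: 32 x 32 = 1024 threads per block */

/* Number of blocks needed to cover n elements along one dimension,
   rounded up so that a partial last block is included. */
static int blocks_for(int n)
{
    return (n + BLKDIM - 1) / BLKDIM;
}

/* One thread per matrix element; threads that fall outside the
   n x n matrix fail the guard and return without touching memory. */
__global__ void matsum_kernel_2d(const float *p, const float *q, float *r, int n)
{
    const int i = threadIdx.y + blockIdx.y * blockDim.y; /* row */
    const int j = threadIdx.x + blockIdx.x * blockDim.x; /* column */
    if (i < n && j < n) {
        r[i * n + j] = p[i * n + j] + q[i * n + j];
    }
}

/* Launch helper; d_p, d_q, d_r must already be device pointers. */
void launch_matsum_2d(const float *d_p, const float *d_q, float *d_r, int n)
{
    const dim3 block(BLKDIM, BLKDIM);
    const dim3 grid(blocks_for(n), blocks_for(n));
    matsum_kernel_2d<<<grid, block>>>(d_p, d_q, d_r, n);
}
```

For example, with \(N = 1000\) and BLKDIM equal to 32, blocks_for() returns 32, so the grid spans 1024 threads along each dimension and the 24 surplus threads per row and per column are filtered out by the guard.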
To compile:

nvcc cuda-matsum.cu -o cuda-matsum -lm