Last updated: 2022-10-27

The file mpi-sum.c contains a serial implementation of an MPI program that computes the sum of an array of length \(N\); indeed, the program performsa a *sum-reduction* of the array. In the version provided, process 0 performs all computations, and therefore is not a true parallel program. Modify the program so that all processes contribute to the computation according to the following steps (see Figure 1).

The master process (rank 0) creates and initializes the input array

`master_array[]`

.The master distributes

`master_array[]`

array among the \(P\) processes (including itself) using`MPI_Scatter()`

. You may initially assume that \(N\) is an integer multiple of \(P\).Each process computes the sum-reduction of its portion.

Each process \(p > 0\) sends its own local sum to the master using

`MPI_Send()`

; the master receives the local sums using`MPI_Recv()`

and accumulates them.

We will see in the next lexture how step 4 can be realized more efficiently with the MPI reduction operation.

If the array length \(N\) is not a multiple of \(P\), there are several possible solutions:

*Padding*: add extra elements to the array so that the new length \(N'\) is multiple of \(P\). The extra elements must be initialized to zero, so that the sum does not change. This solution requires that the procedure has some control on the generation of the input array, so that the length can be changed. This is not always possible nor desirable, e.g., if the sum-reduction must be implemented as a subroutine that receives the input as a parameter over which the subroutine has no control.*Use Scatterv*: MPI provides the`MPI_Scatterv`

function which works like`MPI_Scatter`

but allows different block sizes. The downside is that`MPI_Scatterv`

is cumbersome to use because it requires array parameters that must be allocated/deallocated and properly initialized.*Let the master handle the leftover*: if \(N\) is not an integer multiple of \(P\), the master (rank 0) takes care of the leading or trailing`N % P`

elements, in addition to its own block of length \(N/P\) like any other process. The limitation of this approach is that it introduces some load imbalance: by definition,`N % P`

is a number between \(0\) and \(P-1\) inclusive, which may be significant if the number of processes \(P\) is large and/or the computation time is heavily influenced by the chunk sizes.

For this exercise I suggest option c: in our setting, \(P\) is small due to hardware limitations and the computation is trivial. Hence, the execution time is likely to be dominated by the communication operations anyway.

To compile:

` mpicc -std=c99 -Wall -Wpedantic mpi-sum.c -o mpi-sum`

To execute:

` mpirun -n P ./mpi-sun N`

Example:

` mpirun -n 4 ./mpi-sum 10000`