HPC - Sum-reduction of an array

Moreno Marzolla

Last updated: 2022-10-27

The file mpi-sum.c contains a serial implementation of an MPI program that computes the sum of an array of length \(N\); indeed, the program performsa a sum-reduction of the array. In the version provided, process 0 performs all computations, and therefore is not a true parallel program. Modify the program so that all processes contribute to the computation according to the following steps (see Figure 1).

Figure 1: Computing the sum-reduction using MPI
Figure 1: Computing the sum-reduction using MPI
  1. The master process (rank 0) creates and initializes the input array master_array[].

  2. The master distributes master_array[] array among the \(P\) processes (including itself) using MPI_Scatter(). You may initially assume that \(N\) is an integer multiple of \(P\).

  3. Each process computes the sum-reduction of its portion.

  4. Each process \(p > 0\) sends its own local sum to the master using MPI_Send(); the master receives the local sums using MPI_Recv() and accumulates them.

We will see in the next lexture how step 4 can be realized more efficiently with the MPI reduction operation.

If the array length \(N\) is not a multiple of \(P\), there are several possible solutions:

  1. Padding: add extra elements to the array so that the new length \(N'\) is multiple of \(P\). The extra elements must be initialized to zero, so that the sum does not change. This solution requires that the procedure has some control on the generation of the input array, so that the length can be changed. This is not always possible nor desirable, e.g., if the sum-reduction must be implemented as a subroutine that receives the input as a parameter over which the subroutine has no control.

  2. Use Scatterv: MPI provides the MPI_Scatterv function which works like MPI_Scatter but allows different block sizes. The downside is that MPI_Scatterv is cumbersome to use because it requires array parameters that must be allocated/deallocated and properly initialized.

  3. Let the master handle the leftover: if \(N\) is not an integer multiple of \(P\), the master (rank 0) takes care of the leading or trailing N % P elements, in addition to its own block of length \(N/P\) like any other process. The limitation of this approach is that it introduces some load imbalance: by definition, N % P is a number between \(0\) and \(P-1\) inclusive, which may be significant if the number of processes \(P\) is large and/or the computation time is heavily influenced by the chunk sizes.

For this exercise I suggest option c: in our setting, \(P\) is small due to hardware limitations and the computation is trivial. Hence, the execution time is likely to be dominated by the communication operations anyway.

To compile:

    mpicc -std=c99 -Wall -Wpedantic mpi-sum.c -o mpi-sum

To execute:

    mpirun -n P ./mpi-sun N


    mpirun -n 4 ./mpi-sum 10000