HPC - SIMD Dot product

Moreno Marzolla

Last updated: 2023-11-09

Environment setup

To see which SIMD extensions the CPU supports, examine the output of cat /proc/cpuinfo or lscpu. Look in the flags field for the abbreviations mmx, sse, sse2, sse3, sse4_1, sse4_2, avx, avx2 (on Linux, SSE3 is usually listed as pni).

Compile SIMD programs with:

    gcc -std=c99 -Wall -Wpedantic -O2 -march=native -g -ggdb prog.c -o prog

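where -march=native enables all the instruction-set extensions supported by the machine you are compiling on, and -g -ggdb generate debugging information.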

It is sometimes useful to analyze the assembly code produced by the compiler, e.g., to see if SIMD instructions have actually been emitted. This can be done with the command:

    objdump -dS executable_name

Use the following command to see which compiler flags are enabled by -march=native:

    gcc -march=native -Q --help=target

Scalar product

simd-dot.c contains a function that computes the scalar product of two arrays. The program prints the mean execution times of the serial and SIMD versions; the goal of this exercise is to develop the SIMD version. The dot product requires little time even with large arrays; therefore, you might not observe a significant speedup.
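For reference, the scalar (dot) product of two arrays \(x\) and \(y\) of length \(n\), written with zero-based indices as in C, is:

\[
x \cdot y = \sum_{i=0}^{n-1} x_i \, y_i
\]

so the whole computation is a sum-reduction over the element-wise products.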

1. Auto-vectorization. Check the effectiveness of compiler auto-vectorization of scalar_dot(). Compile as follows:

    gcc -O2 -march=native -ftree-vectorize -fopt-info-vec-all \
      simd-dot.c -o simd-dot -lm 2>&1 | grep "loop vectorized"

The -ftree-vectorize flag enables auto-vectorization; the -fopt-info-vec-all flag prints some “informative” messages (so to speak) on standard error, showing which loops have been vectorized.

Recent versions of GCC correctly vectorize the serial_dot() function. Older versions vectorize the loop in the fill() function, but not that in serial_dot().

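2. Auto-vectorization (second attempt). Examine the assembly code to verify that SIMD instructions have indeed been emitted; look for packed single-precision instructions, e.g., mulps and addps or their AVX forms vmulps and vaddps: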

    gcc -S -c -march=native -O2 -ftree-vectorize simd-dot.c -o simd-dot.s

If you have an older version of GCC, examine the diagnostic messages of the compiler (remove everything from 2>&1 onwards from the command in step 1); you should see something like:

    simd-dot.c:157:5: note: reduction: unsafe fp math optimization: r_17 = _9 + r_20;

that refers to the “for” loop of the scalar_dot() function. The message reports that the statement:

    r += x[i] * y[i];

is part of a reduction operation involving operands of type float. Since floating-point addition is not associative, vectorizing the loop would change the order of the sums and possibly the result, so the compiler did not vectorize it. To allow the reordering anyway, recompile the program with the -funsafe-math-optimizations flag:

    gcc -O2 -march=native -ftree-vectorize -fopt-info-vec-all \
      -funsafe-math-optimizations \
      simd-dot.c -o simd-dot -lm 2>&1 | grep "loop vectorized"

The following message should now appear:

    simd-dot.c:165:5: optimized: loop vectorized using 32 byte vectors

3. Vectorize the code manually. Implement simd_dot() using the vector datatypes of the GCC compiler. The function should be very similar to the one computing the sum-reduction (refer to simd-vsum-vector.c in the examples archive). The function simd_dot() should work correctly for any length \(n\) of the input arrays; \(n\) is not required to be a multiple of the SIMD vector length. Input arrays are always correctly aligned.

Compile with:

    gcc -std=c99 -Wall -Wpedantic -O2 -march=native -g -ggdb simd-dot.c -o simd-dot -lm

(do not use -ftree-vectorize, since we want to compare the execution time of the pure scalar version with the hand-tuned SIMD implementation).

Run with:

    ./simd-dot [n]

Example:

    ./simd-dot 20000000

Files