Last updated: 2022-11-27
To see which SIMD extensions are supported by the CPU, examine the output of cat /proc/cpuinfo or lscpu. Look at the flags field for the abbreviations mmx, sse, sse2, sse3, sse4_1, sse4_2, avx, avx2 (note that Linux lists SSE3 under the name pni).
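The same check can also be done from inside a C program. The following is a minimal sketch, assuming GCC (or a compatible compiler) on an x86 machine; it relies on the GCC builtins __builtin_cpu_init() and __builtin_cpu_supports(). The function names report_simd_support() and show() are hypothetical and are not part of the exercise files.

```c
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
/* Hypothetical helper: prints one line for a feature and returns 1 if
   the feature is supported, 0 otherwise. */
static int show(const char *name, int supported)
{
    printf("%-7s %s\n", name, supported ? "yes" : "no");
    return supported != 0;
}
#endif

/* Prints the SIMD extensions supported by the CPU the program is running
   on, and returns how many of them are available. Returns -1 if the
   program was not compiled for x86.
   The builtin uses the names "sse4.1" and "sse4.2", whereas the flags
   field of /proc/cpuinfo uses sse4_1 and sse4_2. */
int report_simd_support(void)
{
#if defined(__x86_64__) || defined(__i386__)
    int count = 0;
    __builtin_cpu_init();
    count += show("mmx",    __builtin_cpu_supports("mmx"));
    count += show("sse",    __builtin_cpu_supports("sse"));
    count += show("sse2",   __builtin_cpu_supports("sse2"));
    count += show("sse3",   __builtin_cpu_supports("sse3"));
    count += show("sse4.1", __builtin_cpu_supports("sse4.1"));
    count += show("sse4.2", __builtin_cpu_supports("sse4.2"));
    count += show("avx",    __builtin_cpu_supports("avx"));
    count += show("avx2",   __builtin_cpu_supports("avx2"));
    return count;
#else
    return -1;
#endif
}
```

Calling report_simd_support() from main() prints one line per extension. Unlike the flags enabled at compile time, this check describes the machine on which the program runs.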
Compile SIMD programs with:
gcc -std=c99 -Wall -Wpedantic -O2 -march=native -g -ggdb prog.c -o prog
where:
-march=native enables all instructions supported by the machine on which you are compiling;
-g -ggdb generates debugging information; this is useful for showing the source code along with the corresponding assembly code (see below).
It is sometimes useful to analyze the assembly code produced by the compiler, e.g., to see if SIMD instructions have actually been emitted. This can be done with the command:
objdump -dS executable_name
Use the following command to see which compiler flags are enabled by -march=native:
gcc -march=native -Q --help=target
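The effect of -march=native is also visible from C code, because GCC defines a preprocessor macro for each instruction set extension that is enabled (for example __SSE2__, __AVX__, __AVX2__). The sketch below uses these macros to report the widest extension the program was compiled for; the function name compiled_simd_level() is hypothetical and does not appear in the exercise files.

```c
/* Returns the name of the widest x86 SIMD extension that was enabled
   when this file was compiled (hypothetical helper). The result depends
   on the compiler flags, not on the CPU that runs the program. */
const char *compiled_simd_level(void)
{
#if defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE4_2__)
    return "sse4_2";
#elif defined(__SSE4_1__)
    return "sse4_1";
#elif defined(__SSE3__)
    return "sse3";
#elif defined(__SSE2__)
    return "sse2";
#elif defined(__SSE__)
    return "sse";
#elif defined(__MMX__)
    return "mmx";
#else
    return "none";
#endif
}
```

Printing the returned string once with and once without -march=native shows which extensions the flag turns on for a given machine.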
simd-dot.c contains a function that computes the scalar product of two arrays. The program prints the mean execution times of the serial and SIMD versions (the goal of this exercise is to develop the SIMD version). The dot product is a very simple computation that requires little time even with large arrays. Therefore, you might not observe a significant speedup of the SIMD program.
1. Auto-vectorization. Check the effectiveness of compiler auto-vectorization of the scalar_dot() function. Compile the program as follows:
gcc -O2 -march=native -ftree-vectorize -fopt-info-vec-all \
simd-dot.c -o simd-dot -lm 2>&1 | grep "loop vectorized"
The -ftree-vectorize flag enables auto-vectorization; the -fopt-info-vec-all flag prints “informative” messages (so to speak) on standard error to show which loops have been vectorized.
Recent versions of GCC (e.g., 9.4.0) correctly vectorize the serial_dot() function. Older versions vectorized the loop in the fill() function, but not the one in serial_dot().
2. Auto-vectorization (second attempt). If you have a recent version of GCC, you can examine the assembly code to verify that SIMD instructions have indeed been emitted:
gcc -S -c -march=native -O2 -ftree-vectorize simd-dot.c -o simd-dot.s
If you have an older version of GCC, examine the diagnostic messages of the compiler (remove everything from 2>&1 onwards from the previous command); you should see something like:
simd-dot.c:157:5: note: reduction: unsafe fp math optimization: r_17 = _9 + r_20;
which refers to the “for” loop of the scalar_dot() function. The message reports that the statement:
r += x[i] * y[i];
is part of a reduction operation involving operands of type float. Since floating-point addition is not associative, the compiler did not vectorize the loop, so as not to change the order of the sums (and therefore, possibly, the result). To allow the compiler to reorder the sums anyway, recompile the program with the -funsafe-math-optimizations flag:
gcc -O2 -march=native -ftree-vectorize -fopt-info-vec-all \
-funsafe-math-optimizations \
simd-dot.c -o simd-dot -lm 2>&1 | grep "loop vectorized"
The following message should now appear:
simd-dot.c:165:5: optimized: loop vectorized using 32 byte vectors
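The reason why the compiler is cautious about reordering floating-point sums can be seen in a small example. The following is a minimal sketch, assuming IEEE 754 single-precision floats with the default rounding mode; the function names sum_left() and sum_right() are hypothetical. Both functions add the same three values, but in a different order.

```c
/* Computes (a + b) + c. The volatile temporary forces the partial sum
   to be rounded to single precision before it is used again. */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;
    return t + c;
}

/* Computes a + (b + c), i.e., the same sum with a different grouping. */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;
    return a + t;
}

/* With a = 1e8f, b = -1e8f, c = 1.0f:
     sum_left  returns 1.0f, since a + b is exactly 0;
     sum_right returns 0.0f, since b + c rounds back to -1e8f
               (adjacent floats near 1e8 are 8 apart, so adding 1.0f
               has no effect). */
```

A vectorized reduction accumulates several partial sums in parallel and combines them at the end, which changes the grouping in the same way; this is why the result of the vectorized loop may differ slightly from that of the scalar loop.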
3. Vectorize the code manually. Implement the function simd_dot() using the vector datatypes of the GCC compiler. The function should be very similar to the one computing the sum-reduction (refer to simd-vsum-vector.c in the examples archive). The function simd_dot() should work correctly for any length \(n\) of the input arrays; \(n\) is therefore not required to be a multiple of the SIMD vector length. Input arrays are always correctly aligned.
Compile with:
gcc -std=c99 -Wall -Wpedantic -O2 -march=native -g -ggdb simd-dot.c -o simd-dot -lm
(do not use -ftree-vectorize, since we want to compare the execution time of the pure scalar version with the hand-tuned SIMD implementation).
Run with:
./simd-dot [n]
Example:
./simd-dot 20000000
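As a reminder of the pattern that step 3 builds on, here is a minimal sketch of a sum-reduction written with GCC vector datatypes. It is not the content of simd-vsum-vector.c: the names v8f, VLEN and simd_sum() are assumptions made for this illustration, and the vector is loaded with memcpy() so that the sketch also works on arrays that are not aligned (with aligned input, as in the exercise, a pointer cast to the vector type is a common alternative).

```c
#include <string.h>

/* GCC vector datatype holding 8 floats (32 bytes). */
typedef float v8f __attribute__((vector_size(32)));
enum { VLEN = sizeof(v8f) / sizeof(float) };

/* Returns the sum of the n elements of x; n can be any non-negative
   value, not necessarily a multiple of VLEN. */
float simd_sum(const float *x, int n)
{
    v8f acc = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    int i;

    /* Main loop: process VLEN elements per iteration; each lane of acc
       accumulates its own partial sum. */
    for (i = 0; i + VLEN <= n; i += VLEN) {
        v8f v;
        memcpy(&v, x + i, sizeof(v)); /* load that is safe for any alignment */
        acc += v;                     /* element-wise SIMD addition */
    }

    /* Horizontal reduction: add up the VLEN partial sums. */
    float r = 0.0f;
    for (int k = 0; k < VLEN; k++) {
        r += acc[k];
    }

    /* Leftover elements, present when n is not a multiple of VLEN. */
    for (; i < n; i++) {
        r += x[i];
    }
    return r;
}
```

A dot product follows the same three phases (vector loop, horizontal reduction, scalar loop for the leftovers); only the operation performed in each phase changes.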