Last updated: 2023-03-07

The file simd-matmul.c contains a serial version of the dense matrix-matrix product of two square matrices \(p, q\), \(r=p \times q\). Both a “plain” and cache-efficient version are provided; the cache-efficient program transposes \(q\) so that the product can be computed by accessing both \(p\) and \(q^T\) row-wise.

The cache-efficient matrix-matrix product can be modified to take advantage of SIMD instructions, since it essentially computes a number of dot products between rows of \(p\) and rows of \(q^T\). Indeed, the body of function `scalar_matmul_tr()`

```
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
double s = 0.0;
for (k=0; k<n; k++) {
s += p[i*n + k] * qT[j*n + k];
}
r[i*n + j] = s;
}
}
```

computes the dot product of two arrays of length \(n\) that are stored at memory addresses \((p + i \times n)\) and \((\mathit{qT} + j \times n)\), respectively.

Your goal is to use SIMD instructions to compute the dot product above using *vector datatypes* provided by the GCC compiler. The program guarantees that the array length \(n\) is an integer multiple of the SIMD register length, and that all data are suitably aligned in memory.

This exercise uses the `double`

data type; it is therefore necessary to define a vector datatype `v2d`

of length 16 Bytes containing two doubles, using the declaration:

The server `isi-raptor03`

supports the AVX2 instruction set, and therefore has SIMD registers of width 256 bits = 32 Bytes. You might want to make use of a wider datatype `v4d`

containing 4 doubles instead of two.

To compile:

` gcc -march=native -O2 -std=c99 -Wall -Wpedantic -D_XOPEN_SOURCE=600 simd-matmul.c -o simd-matmul`

To execute:

` ./simd-matmul [matrix size]`

Example:

` ./simd-matmul 1024`