# HPC - Dense matrix-matric product using vector datatypes

Last updated: 2023-03-07

The file simd-matmul.c contains a serial version of the dense matrix-matrix product of two square matrices $$p, q$$, $$r=p \times q$$. Both a “plain” and cache-efficient version are provided; the cache-efficient program transposes $$q$$ so that the product can be computed by accessing both $$p$$ and $$q^T$$ row-wise.

The cache-efficient matrix-matrix product can be modified to take advantage of SIMD instructions, since it essentially computes a number of dot products between rows of $$p$$ and rows of $$q^T$$. Indeed, the body of function scalar_matmul_tr()

    for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
double s = 0.0;
for (k=0; k<n; k++) {
s += p[i*n + k] * qT[j*n + k];
}
r[i*n + j] = s;
}
}

computes the dot product of two arrays of length $$n$$ that are stored at memory addresses $$(p + i \times n)$$ and $$(\mathit{qT} + j \times n)$$, respectively.

Your goal is to use SIMD instructions to compute the dot product above using vector datatypes provided by the GCC compiler. The program guarantees that the array length $$n$$ is an integer multiple of the SIMD register length, and that all data are suitably aligned in memory.

This exercise uses the double data type; it is therefore necessary to define a vector datatype v2d of length 16 Bytes, containing two doubles, using the declaration:

    typedef double v2d __attribute__((vector_size(16)));
#define VLEN (sizeof(v2d)/sizeof(double))

The server isi-raptor03 supports the AVX2 instruction set, and therefore has SIMD registers of width 256 bits = 32 Bytes. You might want to make use of a wider datatype v4d containing 4 doubles instead of two.

To compile:

    gcc -march=native -O2 -std=c99 -Wall -Wpedantic -D_XOPEN_SOURCE=600 simd-matmul.c -o simd-matmul

To execute:

    ./simd-matmul [matrix size]

Example:

    ./simd-matmul 1024