Last updated: 2024-01-04

The goal of this exercise is to write a SIMD version of a program to compute the iterate of *Arnold’s cat map*. We have already seen this problem in other lab sessions; to make this exercise self-contained, we report here the problem specification.

Arnold’s cat map is a continuous chaotic function that has been studied in the ’60s by the Russian mathematician Vladimir Igorevich Arnold. In its discrete version, the function can be understood as a transformation of a bitmap image \(P\) of size \(N \times N\) into a new image \(P'\) of the same size. For each \(0 \leq x, y < N\), the pixel of coordinates \((x,y)\) in \(P\) is mapped into a new position \(C(x, y) = (x', y')\) in \(P'\) where

\[ x' = (2x + y) \bmod N, \qquad y' = (x + y) \bmod N \]

(“mod” is the integer remainder operator, i.e., operator `%`

of the C language). We may assume that \((0, 0)\) is top left and \((N-1, N-1)\) bottom right, so that the bitmap can be encoded as a regular two-dimensional C matrix.

The transformation corresponds to a linear “stretching” of the image, that is then broken down into triangles that are rearranged as shown in Figure 1.

Arnold’s cat map has interesting properties. Let \(C^k(x, y)\) be the result of iterating \(k\) times the function \(C\), i.e.:

\[ C^k(x, y) = \begin{cases} (x, y) & \mbox{if $k=0$}\\ C(C^{k-1}(x,y)) & \mbox{if $k>0$} \end{cases} \]

Therefore, \(C^2(x,y) = C(C(x,y))\), \(C^3(x,y) = C(C(C(x,y)))\), and so on.

If we take an image and apply \(C\) once, we get a severely distorted version of the input. If we apply \(C\) on the resulting image, we get an even more distorted image. As we keep applying \(C\), the original image is no longer discernible. However, after a certain number of iterations that depends on \(N\) and has been proved to never exceed \(3N\), we get back the original image! (Figure 2).

The *minimum recurrence time* for an image is the minimum positive integer \(k \geq 1\) such that \(C^k(x, y) = (x, y)\) for all \((x, y)\). In simple terms, the minimum recurrence time is the minimum number of iterations of the cat map that produce the starting image.

For example, the minimum recurrence time for cat1368.pgm of size \(1368 \times 1368\) is \(36\). As said before, the minimum recurrence time depends on the image size \(N\). Unfortunately, no closed formula is known to compute the minimum recurrence time as a function of \(N\), although there are results and bounds that apply to specific cases.

You are given sequential program that computes the \(k\)-th iterate of the cat map using the CPU. The number of iterations \(k\) to compute is passed on the command line. The program reads an image in PGM format from standard input, and produces to standard output the image that is produced after \(k\) iterations of the cat map. You should redirect the standard output to a file, as shown in the comments to the source code.

The structure of the function that calculates the \(k\)-th iterate of the cat map is very simple:

```
for (y=0; y<N; y++) {
for (x=0; x<N; x++) {
/* compute the coordinates (xnew, ynew) of point (x, y)
after k iterations of the cat map */
next[xnew + ynew*N] = cur[x+y*N];
}
}
```

To make use of SIMD parallelism we proceed as follows: instead of computing the new coordinates of a single point at a time, we compute the new coordinates of all adjacent points \((x, y)\), \((x+1,y)\), \((x+2,y)\), \((x+3,y)\) using the compiler’s *vector datatype*. To this aim, we define the following variables of type `v4i`

(i.e., SIMD array of four integers):

`vx`

,`vy`

: \(x\) and \(y\) coordinates of four adjacent points, before the application of the cat map; the coordinates of point \(i=0, \ldots 3\) are`(vx[i], vy[i])`

.`vxnew`

,`vynew`

: new coordinates of four adjacent points, after application of the cat map.

Type `v4i`

is defined as:

Let \(vx = \{x, x+1, x+2, x+3\}\), \(vy = \{y, y, y, y\}\); we can apply the cat map to all \(vx\), \(vy\) using the usual C operators, i.e., you should not need to change the instructions that actually compute the new coordinates.

There is, however, one exception. The following scalar instruction:

can not be parallelized automatically by the compiler if `x`

, `y`

, `xnew`

, `ynew`

are SIMD vectors. Instead, you must expand the instruction into the following four lines of code:

```
next[vxnew[0] + vynew[0]*N] = cur[vx[0] + vy[0]*N];
next[vxnew[1] + vynew[1]*N] = cur[vx[1] + vy[1]*N];
next[vxnew[2] + vynew[2]*N] = cur[vx[2] + vy[2]*N];
next[vxnew[3] + vynew[3]*N] = cur[vx[3] + vy[3]*n];
```

You can assume that the size \(N\) of the image is always an integer multiple of the SIMD vector length, i.e., an integer multiple of 4.

To compile:

` gcc -std=c99 -Wall -Wpedantic -O2 -march=native simd-cat-map.c -o simd-cat-map`

To execute:

` ./simd-cat-map k < input_file > output_file`

Example:

` ./simd-cat-map 100 < cat368.pgm > cat1368-100.pgm`

The performance of the SIMD version should be only marginally better than the version scale (actually, the SIMD version could even be *worse* than the serial version). Analyzing the assembly code produced by the compiler, it turns out that the computation of the modulus operator in the expressions

is done using scalar operations. By consulting the list of *SIMD intrinsics* on Intel’s web site, we see that there is no SIMD instruction for integer division. Therefore, to get better performance it is essential to compute the modulus without using division at all.

Analyzing the scalar code, we realize that if \(0 \leq xold < N\) and \(0 \leq yold < N\), then we have \(0 \leq 2 \times xold + yold < 3N\) and \(0 \leq xold+yold < 2N\). Therefore, the scalar code can be rewritten as:

```
xnew = (2*xold + yold);
if (xnew >= N) { xnew = xnew - N; }
if (xnew >= N) { xnew = xnew - N; }
ynew = (xold + yold);
if (ynew >= N) { ynew = ynew - N; }
```

The code above is certainly more verbose and less readable than the original version that uses the modulus operator, but it has the advantage of not requiring the modulus operator nor integer division. Furthermore, we can use the “selection and masking” technique to parallelize the conditional statements. Indeed, the instruction

can be rewritten as

that tan be further simplified as:

The SIMD program becomes more complex, but at the same time more efficient than the serial program.

To compile:

` gcc -std=c99 -Wall -Wpedantic -march=native -O2 simd-cat-map.c -o simd-cat-map`

To execute:

` ./simd-cat-map [niter] < in.pgm > out.pgm`

Example:

` ./simd-cat-map 1024 < cat1368.pgm > cat1368-1024.pgm`

- simd-cat-map.c
- hpc.h
- cat1368.pgm (the minimum recurrence time of this image is 36)