/****************************************************************************
*
* simd-cat-map.c - Arnold's cat map
*
* Copyright (C) 2016--2024 by Moreno Marzolla
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
****************************************************************************/
/***
% HPC - Ardnold's cat map
% Moreno Marzolla
% Last updated: 2024-01-04
![](cat-map.png)
The goal of this exercise is to write a SIMD version of a program to
compute the iterate of _Arnold's cat map_. We have already seen this
problem in other lab sessions; to make this exercise self-contained,
we report here the problem specification.
[Arnold's cat map](https://en.wikipedia.org/wiki/Arnold%27s_cat_map)
is a continuous chaotic function that has been studied in the '60s by
the Russian mathematician [Vladimir Igorevich
Arnold](https://en.wikipedia.org/wiki/Vladimir_Arnold). In its
discrete version, the function can be understood as a transformation
of a bitmap image $P$ of size $N \times N$ into a new image $P'$ of
the same size. For each $0 \leq x, y < N$, the pixel of coordinates
$(x,y)$ in $P$ is mapped into a new position $C(x, y) = (x', y')$ in
$P'$ where
$$
x' = (2x + y) \bmod N, \qquad y' = (x + y) \bmod N
$$
("mod" is the integer remainder operator, i.e., operator `%` of the C
language). We may assume that $(0, 0)$ is top left and $(N-1, N-1)$
bottom right, so that the bitmap can be encoded as a regular
two-dimensional C matrix.
The transformation corresponds to a linear "stretching" of the image,
that is then broken down into triangles that are rearranged as shown
in Figure 1.
![Figure 1: Arnold's cat map](cat-map.svg)
Arnold's cat map has interesting properties. Let $C^k(x, y)$ be the
result of iterating $k$ times the function $C$, i.e.:
$$
C^k(x, y) = \begin{cases}
(x, y) & \mbox{if $k=0$}\\
C(C^{k-1}(x,y)) & \mbox{if $k>0$}
\end{cases}
$$
Therefore, $C^2(x,y) = C(C(x,y))$, $C^3(x,y) = C(C(C(x,y)))$, and so
on.
If we take an image and apply $C$ once, we get a severely distorted
version of the input. If we apply $C$ on the resulting image, we get
an even more distorted image. As we keep applying $C$, the original
image is no longer discernible. However, after a certain number of
iterations that depends on $N$ and has been proved to never exceed
$3N$, we get back the original image! (Figure 2).
![Figure 2: Some iterations of the cat map](cat-map-demo.png)
The _minimum recurrence time_ for an image is the minimum positive
integer $k \geq 1$ such that $C^k(x, y) = (x, y)$ for all $(x, y)$. In
simple terms, the minimum recurrence time is the minimum number of
iterations of the cat map that produce the starting image.
For example, the minimum recurrence time for
[cat1368.pgm](cat1368.pgm) of size $1368 \times 1368$ is $36$. As said
before, the minimum recurrence time depends on the image size $N$.
Unfortunately, no closed formula is known to compute the minimum
recurrence time as a function of $N$, although there are results and
bounds that apply to specific cases.
You are given sequential program that computes the $k$-th iterate of
the cat map using the CPU. The number of iterations $k$ to compute is
passed on the command line. The program reads an image in PGM format
from standard input, and produces to standard output the image that is
produced after $k$ iterations of the cat map. You should redirect the
standard output to a file, as shown in the comments to the source
code.
The structure of the function that calculates the $k$-th iterate of
the cat map is very simple:
```C
for (y=0; y output_file
Example:
./simd-cat-map 100 < cat368.pgm > cat1368-100.pgm
## Extension
The performance of the SIMD version should be only marginally better
than the version scale (actually, the SIMD version could even be
_worse_ than the serial version). Analyzing the assembly code produced
by the compiler, it turns out that the computation of the modulus
operator in the expressions
```C
vxnew = (2*vxold+vyold) % N;
vynew = (vxold + vyold) % N;
```
is done using scalar operations. By consulting the list of _SIMD
intrinsics_ on [Intel's web
site](https://software.intel.com/sites/landingpage/IntrinsicsGuide/),
we see that there is no SIMD instruction for integer division.
Therefore, to get better performance it is essential to compute the
modulus without using division at all.
Analyzing the scalar code, we realize that if $0 \leq xold < N$ and $0
\leq yold < N$, then we have $0 \leq 2 \times xold + yold < 3N$ and $0
\leq xold+yold < 2N$. Therefore, the scalar code can be rewritten as:
```C
xnew = (2*xold + yold);
if (xnew >= N) { xnew = xnew - N; }
if (xnew >= N) { xnew = xnew - N; }
ynew = (xold + yold);
if (ynew >= N) { ynew = ynew - N; }
```
The code above is certainly more verbose and less readable than the
original version that uses the modulus operator, but it has the
advantage of not requiring the modulus operator nor integer division.
Furthermore, we can use the "selection and masking" technique to
parallelize the conditional statements. Indeed, the instruction
```C
if (xnew >= N) { xnew = xnew - N; }
```
can be rewritten as
```C
const v4i mask = (xnew >= N);
xnew = (mask & (xnew - N)) | (mask & xnew);
```
that tan be further simplified as:
```C
const v4i mask = (xnew >= N);
xnew = xnew - (mask & N);
```
The SIMD program becomes more complex, but at the same time more
efficient than the serial program.
To compile:
gcc -std=c99 -Wall -Wpedantic -march=native -O2 simd-cat-map.c -o simd-cat-map
To execute:
./simd-cat-map [niter] < in.pgm > out.pgm
Example:
./simd-cat-map 1024 < cat1368.pgm > cat1368-1024.pgm
## Files
- [simd-cat-map.c](simd-cat-map.c)
- [hpc.h](hpc.h)
- [cat1368.pgm](cat1368.pgm) (the minimum recurrence time of this image is 36)
***/
/* The following #define is required by posix_memalign() and
clock_gettime(). It MUST be defined before including any other
file. */
#define _XOPEN_SOURCE 600
#include
#include
#include
#include
#include "hpc.h"
typedef int v4i __attribute__((vector_size(16)));
#define VLEN (sizeof(v4i)/sizeof(int))
typedef struct {
int width; /* Width of the image (in pixels) */
int height; /* Height of the image (in pixels) */
int maxgrey; /* Don't care (used only by the PGM read/write routines) */
unsigned char *bmap; /* buffer of width*height bytes; each element represents the gray level of a pixel (0-255) */
} PGM_image;
const unsigned char WHITE = 255;
const unsigned char BLACK = 0;
/**
* Initialize a PGM_image object: allocate space for a bitmap of size
* `width` x `height`, and set all pixels to color `col`
*/
void init_pgm( PGM_image *img, int width, int height, unsigned char col )
{
int i, j;
assert(img != NULL);
img->width = width;
img->height = height;
img->maxgrey = 255;
img->bmap = (unsigned char*)malloc(width*height);
assert(img->bmap != NULL);
for (i=0; ibmap[i*width + j] = col;
}
}
}
/**
* Read a PGM file from file `f`. Warning: this function is not
* robust: it may fail on legal PGM images, and may crash on invalid
* files since no proper error checking is done.
*/
void read_pgm( FILE *f, PGM_image* img )
{
char buf[1024];
const size_t BUFSIZE = sizeof(buf);
char *s;
int nread;
assert(f != NULL);
assert(img != NULL);
/* Get the file type (must be "P5") */
s = fgets(buf, BUFSIZE, f);
if (0 != strcmp(s, "P5\n")) {
fprintf(stderr, "Wrong file type %s\n", buf);
exit(EXIT_FAILURE);
}
/* Get any comment and ignore it; does not work if there are
leading spaces in the comment line */
do {
s = fgets(buf, BUFSIZE, f);
} while (s[0] == '#');
/* Get width, height */
sscanf(s, "%d %d", &(img->width), &(img->height));
/* get maxgrey; must be less than or equal to 255 */
s = fgets(buf, BUFSIZE, f);
sscanf(s, "%d", &(img->maxgrey));
if ( img->maxgrey > 255 ) {
fprintf(stderr, "FATAL: maxgray=%d > 255\n", img->maxgrey);
exit(EXIT_FAILURE);
}
#if _XOPEN_SOURCE < 600
img->bmap = (unsigned char*)malloc((img->width)*(img->height)*sizeof(unsigned char));
#else
/* The pointer img->bmap must be properly aligned to allow aligned
SIMD load/stores to work. */
int ret = posix_memalign((void**)&(img->bmap), __BIGGEST_ALIGNMENT__, (img->width)*(img->height));
assert( 0 == ret );
#endif
assert(img->bmap != NULL);
/* Get the binary data from the file */
nread = fread(img->bmap, 1, (img->width)*(img->height), f);
if ( (img->width)*(img->height) != nread ) {
fprintf(stderr, "FATAL: error reading input: expecting %d bytes, got %d\n", (img->width)*(img->height), nread);
exit(EXIT_FAILURE);
}
}
/**
* Write the image `img` to file `f`; if not NULL, use the string
* `comment` as metadata.
*/
void write_pgm( FILE *f, const PGM_image* img, const char *comment )
{
assert(f != NULL);
assert(img != NULL);
fprintf(f, "P5\n");
fprintf(f, "# %s\n", comment != NULL ? comment : "");
fprintf(f, "%d %d\n", img->width, img->height);
fprintf(f, "%d\n", img->maxgrey);
fwrite(img->bmap, 1, (img->width)*(img->height), f);
}
/**
* Free the bitmap associated with image `img`; note that the
* structure pointed to by `img` is NOT deallocated; only `img->bmap`
* is.
*/
void free_pgm( PGM_image *img )
{
assert(img != NULL);
free(img->bmap);
img->bmap = NULL; /* not necessary */
img->width = img->height = img->maxgrey = -1;
}
/**
* Compute the |k|-th iterate of the cat map for image |img|. You must
* implement this function, starting with a serial version, and then
* adding OpenMP pragmas.
*/
void cat_map( PGM_image* img, int k )
{
const int N = img->width;
unsigned char *cur = img->bmap;
unsigned char *next;
int x, y, i, ret;
assert( img->width == img->height );
assert( img->width % VLEN == 0);
ret = posix_memalign((void**)&next, __BIGGEST_ALIGNMENT__, N*N*sizeof(*next));
assert( 0 == ret );
/* SIMD version of the cat map iteration. The idea is to compute
the new coordinates of four adjacent points (x, y), (x+1,y),
(x+2,y) and (x+3,y) using SIMD instructions. To do so, we pack
the coordinates into two SIMD registers vx = {x, x+1, x+2,
x+3}, vy = {y, y, y, y}.
Note that initially vx = {0, 1, 2, 3}, and at each iteration of
the inner loop, all elements of vx are incremented by
VLEN. Therefore, assuming VLEN == 4, at the second iteration we
have vx = {4, 5, 6, 7}, then {8, 9, 10, 11} and so on. */
for (y=0; y= N);
xnew = (mask & (xnew - N)) | (~mask & xnew); // or: xnew = xnew - (mask & N);
mask = (xnew >= N);
xnew = (mask & (xnew - N)) | (~mask & xnew); // or: xnew = xnew - (mask & N);
/* assuming 0 <= xold < N and 0 <= yold < N, we have
that 0 <= ynew < 2N; therefore, we might need to
subtract N at most once. */
ynew = (xold + yold);
mask = (ynew >= N);
ynew = (mask & (ynew - N)) | (~mask & ynew); // or: ynew = ynew - (mask & N);
#endif
xold = xnew;
yold = ynew;
}
next[xnew[0] + ynew[0]*N] = cur[vx[0]+y*N];
next[xnew[1] + ynew[1]*N] = cur[vx[1]+y*N];
next[xnew[2] + ynew[2]*N] = cur[vx[2]+y*N];
next[xnew[3] + ynew[3]*N] = cur[vx[3]+y*N];
vx += VLEN;
}
}
img->bmap = next;
free(cur);
}
int main( int argc, char* argv[] )
{
PGM_image img;
int niter;
if ( argc != 2 ) {
fprintf(stderr, "Usage: %s niter < in.pgm > out.pgm\n\nExample: %s 684 < cat1368.pgm > out1368.pgm\n", argv[0], argv[0]);
return EXIT_FAILURE;
}
niter = atoi(argv[1]);
read_pgm(stdin, &img);
if ( img.width != img.height ) {
fprintf(stderr, "FATAL: width (%d) and height (%d) of the input image must be equal\n", img.width, img.height);
return EXIT_FAILURE;
}
if ( img.width % VLEN ) {
fprintf(stderr, "FATAL: this program expects the image width (%d) to be a multiple of %d\n", img.width, (int)VLEN);
return EXIT_FAILURE;
}
const double tstart = hpc_gettime();
cat_map(&img, niter);
const double elapsed = hpc_gettime() - tstart;
fprintf(stderr, " SIMD width : %d bytes\n", (int)VLEN);
fprintf(stderr, " Iterations : %d\n", niter);
fprintf(stderr, " width,height : %d,%d\n", img.width, img.height);
fprintf(stderr, " Mops/sec : %f\n", 1.0e-6 * img.width * img.height * niter / elapsed);
fprintf(stderr, "Elapsed time (s) : %f\n", elapsed);
write_pgm(stdout, &img, "produced by simd-cat-map.c");
free_pgm(&img);
return EXIT_SUCCESS;
}