HPC - Arnold’s cat map

Arnold’s cat map is a continuous chaotic function that has been studied in the ’60s by the Russian mathematician Vladimir Igorevich Arnold. In its discrete version, the function can be understood as a transformation of a bitmapped image $P$ of size $N \times N$ into a new image $P'$ of the same size. For each $0 \leq x, y < N$, the pixel of coordinates $(x,y)$ in $P$ is mapped into a new position $C(x, y) = (x', y')$ in $P'$ where

(“mod” is the integer remainder operator, i.e., operator % of the C language). We may assume that $(0, 0)$ is top left and $(N-1, N-1)$ bottom right, so that the bitmap can be encoded as a regular two-dimensional C matrix.

The transformation corresponds to a linear “stretching” of the image, that is then broken down into triangles that are rearranged as shown in Figure 1.

Arnold’s cat map has interesting properties. Let $C^k(x, y)$ be the result of iterating $k$ times the function $C$, i.e.:

\[ C^k(x, y) = \begin{cases} (x, y) & \mbox{if $k=0$}\\ C(C^{k-1}(x,y)) & \mbox{if $k>0$} \end{cases} \]

If we take an image and apply $C$ once, we get a severely distorted version of the input. If we apply $C$ on the resulting image, we get an even more distorted image. As we keep applying $C$, the original image is no longer discernible. However, after a certain number of iterations that depends on $N$ and has been proved to never exceed $3N$, we get back the original image! (Figure 2).

The minimum recurrence time for an image is the minimum positive integer $k \geq 1$ such that $C^k(x, y) = (x, y)$ for all $(x, y)$. In simple terms, the minimum recurrence time is the minimum number of iterations of the cat map that produce the starting image.

For example, the minimum recurrence time for cat1368.pgm of size $1368 \times 1368$ is $36$. As said before, the minimum recurrence time depends on the image size $N$. Unfortunately, no closed formula is known to compute the minimum recurrence time as a function of $N$, although there are results and bounds that apply to specific cases.

You are provided with a serial program that computes the $k$-th iterate of Arnold’s cat map on a square image. The program reads the input from standard input in PGM (Portable GrayMap) format. The results is printed to standard output in PGM format. For example:

applies the cat map $k=100$ times on cat1368.phm and saves the result to cat1368-100.pgm.

To display a PGM image you might need to convert it to a different format, e.g., JPEG. Under Linux you can use convert from the ImageMagick package:

To make use of CUDA parallelism, define a 2D grid of 2D blocks that covers the input image. The block size is $\mathit{BLKDIM} \times \mathit{BLKDIM}$, with BLKDIM = 32, and the grid size is:

\[ (N + \mathit{BLKDIM} – 1) / \mathit{BLKDIM} \times (N + \mathit{BLKDIM} – 1) / \mathit{BLKDIM} \]

Each thread applies a single iteration of the cat map and copies one pixel from the input image to the correct position of the output image. The kernel has the following signature:

__global__ void cat_map_iter( unsigned char *cur, unsigned char *next, int N )

where $N$ is the height/width of the image. The program must work correctly even if $N$ is not an integer multiple of BLKDIM. Each thread is mapped to the coordinates $(x, y)$ of a pixel using the usual formulas:

        const int x = threadIdx.x + blockIdx.x * blockDim.x;
        const int y = threadIdx.y + blockIdx.y * blockDim.y;

Therefore, to compute the $k$-th iteration of the cat map we need to execute the kernel $k$ times.

__global__ void cat_map_iter_k( unsigned char *cur, unsigned char *next, int N, int k )

that applies $k$ iterations of the cat map to the current image. This kernel needs to be executed only once, and this saves some significant overhead associated to kernel calls. The new kernel can be defined as follows:

const int x = ...;
const int y = ...;
int xcur = x, ycur = y, xnext, ynext;

if ( x < N && y < N ) {
    while (k--) {
        xnext = (2*xcur + ycur) % N;
        ynext = (xcur + ycur) % N;
        xcur = xnext;
        ycur = ynext;
    }
    /* copy the pixel (x, y) from the current image to
    the position (xnext, ynext) of the new image */
}

I suggest to implement both solutions (the one where the kernel is executed $k$ times, and the one where the kernel is executed only once) and measure the execution times to see the difference.

Files