HPC - Character counts

Moreno Marzolla

Last updated: 2023-10-19

By Willi Heidelbach, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1181525
By Willi Heidelbach, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1181525

The file omp-letters.c contains a serial program that computes the number of occurrences of each lowercase letter in an ASCII file read from standard input. The program is case-insensitive, meaning that uppercase characters are treated as if they were lowercase; non-letter characters are ignored. We provide some substantial ASCII documents to experiment with, that have been made available by the Project Gutenberg; despite the fact that these documents have been written by different authors, the frequencies of characters are quite similar. Indeed, it is well known that the relative frequencies of characters are language-dependent and more or less author-independent. You may experiment with other free books in other languages that are available on Project Gutenberg Web site.

The goal of this exercise is to modify the function make_hist(text, hist) to make use of OpenMP parallelism. The function takes as parameter a pointer text to the whole text, represented as a zero-terminated C string, and an array hist[26] of counts. The array hist is not initialized. At the end, hist[0] contains the occurrences of the letter a in the text, hist[1] the occurrences of the letter b, up to hist[25] that represents the occurrences of the letter z.

A reasonable approach is to partition the text among the OpenMP threads, so that each thread computes the histogram for part of the text. Then, all partial histograms needs to be combined to get the final result.

You might want to start by doing the partitioning manually, i.e., without using the omp for directive. This is not the most efficient solution, but is nevertheless instructive; a better approach is discussed below.

Since the text is a character array of some length \(n\), thread \(p\) can compute the extremes of its chunk as:

const int from = (n * p) / num_threads;
const int to = (n * (p+1)) / num_threads;

where num_threads is the size of OpenMP thread pool. Thread \(p\) will compute the frequencies of the characters in text[from .. (to-1)].

You need to create a shared, two-dimensional array local_hist[num_threads][26] initialized to zero. Thread \(p\) operates on local_hist[p][] so that no race conditions are possible. If thread \(p\) sees character \(x\), \(x \in \{\texttt{'a'}, \ldots, \texttt{'z'}\}\), it will increase the value local_hist[p][x - 'a']. When all threads are done, the master computes the result as the column-wise sum of local_hist. In other words, the number of occurrences of character a is

\[ \sum_{p=0}^{\texttt{num_threads}-1} \texttt{local_hist}[p][0] \]

and so on. Also, don’t forget that there is a reduction on the counter nlet that reports the number of letters; this might be done using the reduction() clause of the omp for directive.

A better and simpler solution can be realized using the omp parallel for directive, and employing array reductions that are available with OpenMP 4.5 and later (and that we did not discuss during the class). To perform the sum-reduction on each element of the array hist[] we can use the following syntax:

    #pragma omp parallel for ... reduction(+:hist[:ALPHA_SIZE])

This works as the normal scalar reductions, with the differences that the compiler actually computes ALPHA_SIZE sum-reductions on hist[0], … hist[ALPHA_SIZE - 1].

Compile with:

    gcc -std=c99 -Wall -Wpedantic -fopenmp omp-letters.c -o omp-letters

Run with:

    ./omp-letters < the-war-of-the-worlds.txt

Files