HPC - Character counts

Moreno Marzolla

Last updated: 2022-08-12

By Willi Heidelbach, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1181525
By Willi Heidelbach, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1181525

The file omp-letters.c contains a serial program that computes the number of occurrences of each lowercase letter in an ASCII file read from standard input. The program is case-insensitive, meaning that uppercase characters are treated as if they were lowercase; non-letter characters are ignored. We provide some substantial ASCII documents to experiment with, that have been made available by the Project Gutenberg; despite the fact that these documents have been written by different authors, the frequencies of characters are quite similar. Indeed, it is well known that the relative frequencies of characters are language-dependent and more or less author-independent. You may experiment with other free books in other languages that are available on Project Gutenberg Web site.

In this exercise you are required to modify the function make_hist(text, hist) to make use of shared-memory parallelism using OpenMP. The function takes as parameter a pointer text to the whole text, represented as a zero-terminated C string, and an array hist[26] of counts. The array hist is not initialized. At the end, hist[0] contains the occurrences of the letter a in the text, hist[1] the occurrences of the letter b, up to hist[25] that represents the occurrences of the letter z.

A shared-memory parallel version can be developed as follows. The text is partitioned into num_threads chunks, where num_threads is the number of OpenMP threads; since the text is a character array of some length \(n\), thread \(p\) can compute the extremes of its chunk as:

const int from = (n * p) / num_threads;
const int to = (n * (p+1)) / num_threads;

Thread \(p\) will then examine the block text[from .. (to-1)].

You also need create a shared, two-dimensional array local_hist[num_threads][26], initially containing all zeros. Thread \(p\) operates on a different portion of the text and updates the occurrences on the slice local_hist[p][] of the shared array. Therefore, if thread \(p\) sees character \(x\), \(x \in \{\texttt{'a'}, \ldots, \texttt{'z'}\}\), it will increment local_hist[p][x - 'a'].

When all threads are done, the master computes the results as the sums of the columns of local_hist. In other words, the number of occurrences of a is

    local_hist[0][0] + local_hist[1][0] + ... + local_hist[num_threads-1][0]

and so on.

Compile with:

    gcc -std=c99 -Wall -Wpedantic -fopenmp omp-letters.c -o omp-letters

Run with:

    ./omp-letters < the-war-of-the-worlds.txt