Lab 7: Performance Optimization Project

Overview

In this lab, you are going to optimize a program for performance and fix errors in its use of dynamic memory.

Lab - Getting Started

To begin this lab, start by obtaining the necessary boilerplate code. Enter the class repository:

unix>  cd ~/bitbucket/2014_fall_ecpe170_boilerplate/

Pull the latest version of the repository, and update your local copy of it:

unix>  hg pull
unix>  hg update

Copy the files you want from the class repository to your private repository:

unix>  cp -R ~/bitbucket/2014_fall_ecpe170_boilerplate/lab07/* ~/bitbucket/2014_fall_ecpe170/lab07

Enter your private repository now, specifically the lab07 folder:

unix>  cd ~/bitbucket/2014_fall_ecpe170/lab07

Add the new files to version control in your private repository:
(With the *, we are asking hg to add all files/folders in the current directory)

unix>  hg add *

Commit the new files in your personal repository, so you can easily go back to the original starter code if necessary

unix>  hg commit -m "Starting Lab 7 with boilerplate code"

Push the new commit to the bitbucket.org website

unix>  hg push

You should see the following code:

Makefile that can compile the program
main() function and key algorithms in analysis.c

Functions to process command-line arguments in options.c
Input data (text) to process in corpus.tar.gz

Uncompress the corpus.tar.gz file. A "corpus" is a large collection of text for statistical analysis.
Note: A .tar file is a collection of files/folders combined into a single archive. A .gz file is a compressed file. Thus, a .tar.gz file is a compressed archive.

unix>  tar -xzf corpus.tar.gz

Compile the program via the included Makefile

unix>  make

Run the program to analyze bigrams in a given text file (such as moby.txt):

unix>  time ./analysis_program -ngram 2 < moby.txt

Lab - Tasks

This program analyzes the n-gram statistics of a text document. A n-gram is a sequence of n words occurring in a sequence. If n=1, the program analyzes individual words, while if n=2, the program analyzes pairs of words (referred to as bigrams), etc...

The analysis program performs the following operations when run:

Read each word from the file and convert to lowercase. Combine adjacent words into n-gram strings.
Apply a hash function to each n-gram string to create a number between 0 and s-1. This determines which of s hash table buckets this string will be placed into.
Place the n-gram string in the hash bucket. Each bucket is organized as a linked list. If a matching string already exists in the linked list, its frequency counter is incremented. Otherwise, a new list element is added at the end with its frequency counter set to 1.
Once all words have been read from the file, all elements in the hash table are sorted according to frequency. This process is destructive to the hash table. All of the linked lists in the hash table are destroyed, and a single new linked list of all elements (in sorted order) is created.
Print statistics and exit

You will use this program to analyze two text files.

Full text of Moby Dick by Herman Melville
Full text of the collected works of William Shakespeare.

Lab Report (for unoptimized program):
(1) Measure the execution time for moby.txt (for n=2)
(2) Copy the program output for moby.txt (for n=2) into your report.
(3) Measure the execution time for shakespeare.txt (for n=2)
(4) Copy the program output for shakespeare.txt (for n=2) into your report.
(5) Take one screenshot of the Valgrind kcachegrind tool for the unoptimized program.
(This screenshot should show the functions used in the program and their computation cost. In question (7) below, you will take another kcachegrind screenshot of the optimized program. Comparing the two should clearly show where your performance optimizations were made.)

As you noticed, this program is quite slow, particularly on the larger Shakespeare text file. Further, upon reviewing the source code, errors in the use of dynamic memory are present. While they don't currently result in a program crash or error, they could in the future if the code is modified or extended. Thus, your tasks on this program are the following:

Task 1: Fix program memory usage so that the Valgrind memory report is 100% clean with zero errors reported

Dynamic memory is allocated via malloc() or calloc() function calls. This memory, however, is not consistently returned to the operating system via a call to free(). While not currently fatal, this sloppy programming has the potential to cause errors later if the program is modified. You must fix this!

Task 2: Improve program perform by at least 80x (when compared to the un-optimized program you measured above). The rough target (which will be relaxed for older computers) is an execution time significantly less than 1 second for the moby.txt file. You can use any techniques you want, provided the program output remains correct (i.e. unchanged from the original version).
Task 3: Document your changes via version control. I suggest checking in the original code immediately before making any changes. The lab report requires you to provide a 'diff' between the original program and the final optimized program, and this will be easy if all your changes are in version control.

Lab Report (for final optimized program):
(6) Copy the Valgrind memcheck report for your optimized program, clearly showing no errors.
(7) Take one screenshot of the Valgrind kcachegrind tool for the final optimized program
(circle the appropriate area of the screenshot that shows where at least part of your total performance improvement comes from, when compared to the previous screenshot)
(8) Measure the execution time for moby.txt (for n=2) for your optimized program. What is the speedup compared to the original program?
Include the command used to run your program, since you will (almost certainly) want to modify some of the arguments.
Copy the program output into your report.
(9) Measure the execution time for shakespeare.txt (for n=2) for your optimized program. What is the speedup compared to the original program?
Include the command used to run your program, since you will (almost certainly) want to modify some of the arguments.
Copy the program output into your report.
(10) Provide a diff between the original program and your final optimized program, so I can clearly see all changes that were made to all files. Version control can be helpful here, because Mercurial can show the difference between two arbitrary revisions in your repository. Use the BitBucket website to find the revision numbers of commit "changesets" you wish to compare. They should be a hexadecimal tag like 7a72e1f.
Example command: hg diff -r <OLD-version-number> -r <NEW-version-number> analysis.c analysis.h
See hg diff for more details.
(11) Write a half-page narrative describing the optimization process you followed. Be sure to mention any false starts and unsuccessful changes too!

Lab Report - Wrapup:
(1) What was the best aspect of this lab?
(2) What was the worst aspect of this lab?
(3) How would you suggest improving this lab in future semesters?

Task 1 - Instructions and Tips

For the first task, you should run Valgrind in memcheck mode to track memory usage. You must let the program finish completely - if you terminate the program with a CTRL-C it before it ends normally, then (obviously) the program will not have run any cleanup code, and Valgrind will (incorrectly, but reasonably) report memory leaks. I recommend using the shorter moby.txt file to make this faster. The report will be quite long, so concentrate on fixing the loss records that report hundreds of thousands of reachable bytes at the end, as fixing them will clean up many of the other reported problems. You should be able to fix this with only a few small blocks of code.

Tip 1: The following Valgrind command will run the desired memcheck report for the analysis program:

unix>  valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 ./analysis_program -ngram 2 < moby.txt

Tip 2: The program has a few global variables that store statistics. Some of these variables are pointers to dynamically allocated strings. It would be bad (and will produce a "Still Reachable" warning in Valgrind) if the pointer was left intact after the corresponding memory was freed. Setting the pointer to NULL is the correct behavior.

Task 2 - Instructions and Tips

For the second task, you should run Valgrind in callgrind mode to track instructions and their frequency. Again, let the program finish completely to get an accurate picture of where execution time is being spent. This will guide your optimizations.

Tip 1: The following Valgrind command will run the desired callgrind report for the analysis program, and display the results:

unix>  valgrind --tool=callgrind --dump-instr=yes --callgrind-out-file=callgrind.out ./analysis_program -ngram 2 < moby.txt
unix>  kcachegrind callgrind.out &

Tip 2: The C standard library has a very helpful function already implemented for you. Type man qsort at the command line to (gasp!) read the documentation. Feel free to use this routine instead of writing your own.

Tip 3: Which design would provide better search performance given a text document with a large numbers of strings?

A small hash table with a few buckets and long linked lists?
A large hash table with many buckets and short linked lists?

What does the original analysis program do?

Tip 4: The -hash_table_size XXX argument can be used to modify the hash table size in your program without the need to recompile. Experiment with a range of values.

Tip 5: Look closely at how the hash function works. What bucket numbers does it commonly produce? Will it evenly distribute the strings across all buckets in the table, or will some buckets get far more strings than others? How would that impact performance?

This page discusses the pros/cons of the original hash function used in the analysis program: http://research.cs.vt.edu/AVresearch/hashing/strings.php (While this page has Java code, the algorithm is the same!) Once you understand why the existing hash function is of poor quality, google for other C hash functions that might perform better.

Example Program Output

Output for Moby Dick input:

unix>  ./analysis_program -ngram 2 -hash-table-size <<REDACTED>> < moby.txt 
Running analysis program...

Options used when running program:
ngram	2
details	10
hash-table-size	<<REDACTED>>
N-gram size 2

Running analysis... (This can take several minutes or more!)
 Initializing hash table...
 Inserting all n-grams into hash table in lowercase form...
 Sorting all hash table elements according to frequency...

Analysis Details:
(Top 10 list of n-grams)
1840	'of the'
1142	'in the'
714	'to the'
435	'from the'
375	'the whale'
367	'of his'
362	'and the'
350	'on the'
328	'at the'
323	'to be'

Analysis Summary:
214365 total n-grams
114421 unique n-grams
91775 singleton n-grams (occur only once)
Most common n-gram (with 1840 occurrences) is 'of the' 
Longest n-gram (4 have length 29) is 'phrenological characteristics'
Total time = 0.200000 seconds

Output for Shakespeare input:

unix> ./analysis_program -ngram 2 -hash-table-size <<REDACTED>> < shakespeare.txt 
Running analysis program...

Options used when running program:
ngram	2
details	10
hash-table-size <<REDACTED>>
N-gram size 2

Running analysis... (This can take several minutes or more!)
 Initializing hash table...
 Inserting all n-grams into hash table in lowercase form...
 Sorting all hash table elements according to frequency...

Analysis Details:
(Top 10 list of n-grams)
1892	'i am'
1804	'i ll'
1753	'in the'
1709	'my lord'
1681	'to the'
1660	'i have'
1603	'i will'
1554	'of the'
1118	'it is'
1020	'to be'

Analysis Summary:
965028 total n-grams
363039 unique n-grams
266018 singleton n-grams (occur only once)
Most common n-gram (with 1892 occurrences) is 'i am' 
Longest n-gram (1 have length 32) is 'honorificabilitudinitatibus thou'
Total time = 1.000000 seconds

This lab is based on Chapter 5 of "Computer Systems: A Programmer's Perspective", Second Edition, by Randal E. Bryant and David R. O'Hallaron