Lab 5: Performance Optimization

Overview

This lab is concerned with performance optimization. Specifically,

How can the compiler improve application performance?
How can the programmer do a better job than the compiler at improving application performance? (And why is this?)

Pre-Lab

Motivation: This lab is memory intensive. We want to create a large vector in memory, process it using different algorithms, and compare algorithm performance. The larger the vector is, the easier it will be to distinguish between the performance of the different algorithms. But, if it is too large, it will overflow RAM and start swapping to the hard drive, severely degrading performance and making the test useless.

In a pre-lab report, answer the following questions about the computer system you will use for the in-class lab. (If you are booting Linux directly instead of inside a virtual machine, state that, and skip questions 1-5)

Pre-Lab Report:
(1) What is the total physical RAM installed in the system? (In MB)
(2) With no applications running (beyond the web browser with this page), how much RAM is used by the native operating system? (e.g. Windows)
(3) With no applications running (beyond the web browser with this page), how much RAM is available?
(4) Check the virtual machine configuration. How much RAM is currently allocated to Linux in your virtual machine?
Note: Your answer to question 4 must be less than your answer to question 3! Otherwise, your system will use slow virtual memory (i.e. swapping data to the hard disk) when running this lab.
(5) Try to increase your virtual machine memory allocation, if possible, to the maximum allowed based on your free RAM. Leave ~256MB free for the virtual machine program itself. Now how much RAM is allocated to Linux in your virtual machine?
(6) Boot Linux. With no applications running in Linux, how much RAM is available inside the virtual machine? The "System Monitor" program should report that information. This is the space that is actually available for our test application.

To begin this lab, start by obtaining the necessary boilerplate code for the prelab and postlab. Enter the class repository:

unix>  cd ~/bitbucket/2012_fall_ecpe170_boilerplate/

Pull the latest version of the repository, and update your local copy of it:

unix>  hg pull
unix>  hg update

Copy the files you want from the class repository to your private repository:
(In this case, it is two folders you want)

unix>  cp -R ~/bitbucket/2012_fall_ecpe170_boilerplate/lab05/prelab ~/bitbucket/2012_fall_ecpe170/lab05
unix>  cp -R ~/bitbucket/2012_fall_ecpe170_boilerplate/lab05/postlab ~/bitbucket/2012_fall_ecpe170/lab05

Enter your private repository now, specifically the lab05 folder:

unix>  cd ~/bitbucket/2012_fall_ecpe170/lab05

Add the new files to version control in your private repository:
(Technically, prelab and postlab are directories, but Mercurial is smart enough to just add all the files in these directories with this command)

unix>  hg add prelab postlab

Commit the new files in your personal repository, so you can easily go back to the original starter code if necessary

unix>  hg commit -m "Starting Lab 5 with boilerplate code"

Push the new commit to the bitbucket.org website

unix>  hg push

Enter the prelab folder

unix>  cd prelab

You should see the following code:

Makefile that can compile the program
main() function in main.c

Key processing algorithm in combine.c (including empty functions for your work):
Vector management functions in vec.c

Examine the boilerplate code for this lab, and answer the following questions:

Pre-Lab Report:
(7) What is the code doing? (Describe the algorithm in a paragraph, focusing on the combine1() function.)
(8) What is the largest number of elements that the vector can hold WITHOUT using swap storage (virtual memory), and how much memory does it take?
Tip: The program reports memory usage when run. If you can compile and run it, you can experiment with the size of the vector to confirm your answer. If you see your disk activity light grinding away (either the physical LED, or the on-screen indicator in your virtual machine), you have chosen poorly...
Tip 2: Run the System Monitor program in Linux in the background, then run your program. You can watch the Memory Usage indicator in the monitor increase when the program is running. The goal is to reach 90% of memory, but not to start using swap memory.

Checkpoint to obtain pre-lab credit - due by start of class:
(1) Show me your in-progress lab report answering questions 1-8 above.

Lab Part 1 - Compiler Optimizations

First, we will explore what GCC can do for us to improve program performance without changing the source code. GCC has several optimization levels:

No optimization (the default)
O1 level: Reduce code size and execution time without lengthy compile time
O2 level: Spend more time compiling, and use all techniques that do not increase program size in order to decrease execution time
O3 level: Spend the most time compiling, and use techniques that increase program size to decrease execution time

** Note that this is a capital letter 'O' (for optimization), not the number zero. This argument is CASE SENSITIVE. **

The optimization level is set at the command-line when invoking GCC. Because we use a Makefile to run GCC, the correct place to add this flag is in the CFLAGS line. For example, this line settings GCC to use the O1 optimization level, in addition to several other common settings:

CFLAGS=-c -Wall -std=c99 -O1

Run the "combine1" algorithm for a vector of containing the number of elements you determined in the pre-lab. Use the time utility to record the compilation time and program execution time, specifically the "real time" result, which represents the total time.

unix>  make clean
unix>  time make
unix>  time ./combine_program 1 <YOUR-VECTOR-SIZE>

Note: We will all have different vector sizes for this lab (due to differing amounts of physical memory available). The absolute time measurements don't matter. What is important is the relative time difference between experiments, and also the software implementation of each algorithm.

Note 2: Repeat these measurements at least 2 times. I found that the first time I ran the program, it was much slower than subsequent times.
(Hypothesis: Perhaps the Virtual Machine didn't really allocate all the memory to Linux that it said it did? If so, when this program goes to actually *use* the memory, the virtual machine has to scramble to allocate the resources.)

Lab Report:
(1) What vector size are you using for all experiments in this lab?
(2) How much time does the compiler take to finish with (a) no optimization, (b) with -O1 optimization, (c) with -O2 optimization, and (d) with -O3 optimization? Report the Real time, which is the "wall clock" time. Create both a table and a graph in LibreOffice Calc.
(3) How much time does the program take to finish with (a) no optimization, (b) with -O1 optimization, (c) with -O2 optimization, and (d) with -O3 optimization? Report the Real time, which is the "wall clock" time. Create both a table and a graph in LibreOffice Calc.
Note: No credit will be given for sloppy graphs that lack X and Y axis labels, a legend, and a title.

Lab Part 2 - Code Optimizations

Now that you've measured the application performance with a variety of compiler optimizations, it's time to see what tricks a programmer can employ to make it easier for the compiler to produce efficient code. (And, by making the compiler's job easier, end up with better application performance than before.)

In this part of the lab, you will write new code for the combine2() - combine6() functions, where each function adds a new optimization technique. Note that this lab is carefully designed to step you through an optimization process where the performance improves with each step. On a real program, though, you might have to use trial-and-error to determine which techniques are most effective for your specific problem.

Lab Report:
(4) After implementing each function, benchmark it for a variety of data types and mathematical operations. Fill in the table below as you write each function.
(5) Using LibreOffice Calc, create a single graph that shows the data in the table created, specifically the four time columns. (You don't need to plot vector size)
Note: No credit will be given for a sloppy graph that lack X and Y axis labels, a legend, and a title.
(6) As a reminder, you should be using version control to track your code, and ensure that the final code is checked in along with your report PDF.

Configuration	Vector Size (elements)	Vector Size (MB)	Time for Integer Add	Time for Integer Multiply	Time for FP (float) Add	Time for FP (float) Multiply
combine1()
combine2()
combine3()
combine4()
combine5x2()
combine5x3()
combine6()

Important notes for a consistent experiment:

Change your Makefile to use the -O1 optimization level (as a good compromise between compile time and application performance).
Always process vectors of the same size (and document that size in the lab in your report)
Do **not** use the time utility to benchmark your new functions. If you inspect the source code, you'll see that the program already records and prints the execution time of *just* the key algorithm we are optimizing, ignoring all initialization and cleanup code. Use that time for your table/graph.

combine2() Algorithm - Code Motion

Problem: When looking at the combine1() function, the compiler must assume that the array size could (potentially) change as the loop proceeds. But, as the programmer, you know that is impossible for this specific program.
Solution: For combine2(), move the function call to vec_length() out of the body of the for loop, and save it in a variable instead. Use the variable in your for loop.
This is called "code motion" - identifying code that does not need to be re-executed on each iteration, and moving it elsewhere.

combine3() Algorithm - Reducing Procedure Calls

Problem: In the combine2() function, get_vec_element() is called on every iteration (to obtain a data element). Each function call is expensive. Further, that function checks every time to see if the element requested is within the overall vector length.
Solution: You can improve performance (at the expense of program modularity) by removing the function call to get_vec_element(). Instead, create a new function called get_vec_start() that returns the starting address of the data array. Call this function once at the beginning of combine3(), save the address it returns, and use it to directly access array elements during processing.
Note: The performance improvement of this (by itself) is minimal, but it will unlock future optimizations
Tip: Use this function declaration: data_t *get_vec_start(vec_ptr v);

combine4() Algorithm - Eliminate Unneeded Memory Accesses

Problem: In the combine3() function, the result of the accumulate operation is written to memory at the location designated by the pointer dest. Then, on the next iteration, that value is retrieved from memory and used in the next calculation. While the compiler does not know if that specific memory location will change between iterations, the programmer does, and the answer is no!
Solution: Eliminate these unnecessary trips to memory in combine4(). Create a temporary variable (called accumulate) and use it to store the result of each operation (and retrieve it the next time). At the end of the function, copy the value in accumulate to the desired final memory address.

Note: Isn't accumulate just a memory location too? No! Any minimally-competent compiler will put an important local variable like that in a processor register stored directly on the CPU. No trips to the memory are required to access it, and we can retrieve data from the registers in 1-2 clock cycles. (We'll explore processor registers further when we learn assembly programming)

combine5x2() Algorithm - Loop Unrolling with a Factor of 2
combine5x3() Algorithm - Loop Unrolling with a Factor of 3

Problem: The for-loop itself incurs overhead. At the beginning of each iteration, the current position must be compared against the length. Only if this conditional is true will the for-loop body be executed. And, at the end of each iteration, the loop counter must be incremented by one.

Note: The conditional branch (loop comparison) is worse than it might seem, due to the way modern microprocessors execute programs. (Ask me about pipelining!)

Solution: In the combine5x2() algorithm, instead of processing the vector one element at a time (per iteration), process the vector two elements at a time (per iteration).

In your loop, increment the counter 'i' by two each time.
In the loop body, do two combine operations for consecutive items. It should look like this: accumulate = (accumulate OP data[i]) OP data[i+1];
Be careful not to go past the end of the array with your memory references.
Your solution must still work if there is an odd number of items in the vector! You'll need to add an if-claus after your loop to catch the last item (if any).

This technique (loop unrolling) can be generalized. You need to implement two different examples:

Loop unrolling with a factor of 2 - combine5x2() - Process two vector items per iteration
Loop unrolling with a factor of 3 - combine5x3() - Process three vector items per iteration

combine6() Algorithm - Two Parallel Accumulators

Problem: The combine5x2() (and 5x3) algorithms are pretty good, with a low factor of overhead on the primary task of reading the vector and performing a calculation on it. But, this algorithm is extremely sequential. The math operation is performed on the first vector element, and that result is used in a calculation with the second vector element, and so on, until the entire vector is processed. This leaves unused a key capability of modern microprocessors: the ability to perform more than one mathematical operation (i.e. instruction) at the same time.

The compiler can do nothing to help us here! But, as a human, you know that you could easily divide this vector processing into smaller pieces, and then combine then at the end to get the same result.

Solution: In combine6(), start with the previous combine5x2() function. You will still be processing two elements per iteration. But, do that processing using two completely separate sets of variables! In other words, combine the odd elements into one variable, the even elements into another variable, and then after the loop has finished, combine both temporary variables together.

The body of your loop should look like this:
accumulate0 = accumulate0 OP data[i];
accumulate1 = accumulate1 OP data[i+1];
By dividing work in this manner, one part of the CPU can process the odd elements, and at the same time, a different part of the CPU can process the even elements. This enables instruction-level parallelism.

Note: This is not multi-core. We still have one program running on one processor core. We are just processing more than one instruction in that one program at the same time.

As a reminder, be sure to complete the table and graph requested above, and include all source code with your final Lab Report PDF when submitting this lab.

Post-Lab

In this post-lab assignment, you are going to optimize a program for performance and correct its use of dynamic memory. This program analyzes the n-gram statistics of a text document. A n-gram is a sequence of n words occurring in a sequence. If n=1, the program analyzes individual words, while if n=2, the program analyzes pairs of words (referred to as bigrams), etc...

The analysis program performs the following operations when run:

Read each word from the file and convert to lowercase. Combine adjacent words into n-gram strings.
Apply a hash function to each n-gram string to create a number between 0 and s-1. This determines which of s hash table buckets this string will be placed into.
Place the n-gram string in the hash bucket. Each bucket is organized as a linked list. If a matching string already exists in the linked list, its frequency counter is incremented. Otherwise, a new list element is added at the end with its frequency counter set to 1.
Once all words have been read from the file, all elements in the hash table are sorted according to frequency. This process is destructive to the hash table. All of the linked lists in the hash table are destroyed, and a single new linked list of all elements (in sorted order) is created.
Print statistics and exit

You will use this program to analyze two text files.

Full text of Moby Dick by Herman Melville
Full text of the collected works of William Shakespeare.

You already obtained the postlab code at the same time you retrieved the prelab code. Enter your private repository now, specifically the lab05/postlab folder:

unix>  cd ~/bitbucket/2012_fall_ecpe170/lab05/postlab

Uncompress the corpus.tar.gz file. (A "corpus" is a large collection of text for statistical analysis)

unix>  tar -xzf corpus.tar.gz

Compile the program via the included Makefile

unix>  make

Run the program to analyze bigrams in a given text file (such as moby.txt):

unix>  time ./analysis_program -ngram 2 < moby.txt

Post-Lab Report:
(1) Measure the execution time for moby.txt (for n=2)
(2) Copy the program output for moby.txt (for n=2) into your report.
(3) Measure the execution time for shakespeare.txt (for n=2)
(4) Copy the program output for shakespeare.txt (for n=2) into your report.

As you noticed, this program is quite slow, particularly on the larger Shakespeare text file. Further, upon reviewing the source code, errors in the use of dynamic memory are present. While they don't currently result in a program crash or error, they could in the future if the code is modified or extended. Thus, your tasks on this program are the following:

Task 1: Fix program memory usage so that the Valgrind memory report is 100% clean with zero errors reported

Dynamic memory is allocated via malloc() or calloc() function calls. This memory, however, is not consistently returned to the operating system via a call to free(). While not currently fatal, this sloppy programming has the potential to cause errors later if the program is modified. You must fix this!

Task 2: Improve program perform by at least 80x (when compared to the un-optimized program you measured above). The rough target (which will be relaxed for older computers) is an execution time significantly less than 1 second for the moby.txt file. You can use any techniques you want, provided the program output remains correct (i.e. unchanged from the original version).
Task 3: Document your changes via version control. I suggest checking in the original code immediately before making any changes. The lab report requires you to provide a 'diff' between the original program and the final optimized program, and this will be easy if all your changes are in version control.

Post-Lab Report:
(5) Copy the Valgrind memcheck report for your optimized program, clearly showing no errors.
(6) Take two screenshots of the Valgrind kcachegrind tool: one for the unoptimized program, and one for the final optimized program.
(7) Measure the execution time for moby.txt (for n=2) for your optimized program. What is the speedup compared to the original program?
Include the command used to run your program, since you will (almost certainly) want to modify some of the arguments.
Copy the program output into your report.
(8) Measure the execution time for shakespeare.txt (for n=2) for your optimized program. What is the speedup compared to the original program?
Include the command used to run your program, since you will (almost certainly) want to modify some of the arguments.
Copy the program output into your report.
(9) Provide a diff between the original program and your final optimized program, so I can clearly see all changes that were made to all files. Version control can be helpful here, because Mercurial can show the difference between two arbitrary revisions in your repository. Use the BitBucket website to find the revision numbers of commit "changesets" you wish to compare. They should be a hexadecimal tag like 7a72e1f.
Example command: hg diff analysis.c analysis.h -r <OLD-version-number> -r <NEW-version-number>
See hg diff for more details.
(10) Write a half-page narrative describing the optimization process you followed. Be sure to mention any false starts and unsuccessful changes too!

Post-Lab Report - Wrapup:
(1) What was the best aspect of this lab?
(2) What was the worst aspect of this lab?
(3) How would you suggest improving this lab in future semesters?

Task 1 - Instructions and Tips:

For the first task, you should run Valgrind in memcheck mode to track memory usage. You must let the program finish completely - if you terminate the program with a CTRL-C it before it ends normally, then (obviously) the program will not have run any cleanup code, and Valgrind will (incorrectly, but reasonably) report memory leaks. I recommend using the shorter moby.txt file to make this faster. The report will be quite long, so concentrate on fixing the loss records that report hundreds of thousands of reachable bytes at the end, as fixing them will clean up many of the other reported problems. You should be able to fix this with only a few small blocks of code.

Tip 1: The following Valgrind command will run the desired memcheck report for the analysis program:

unix>  valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 ./analysis_program -ngram 2 < moby.txt

Tip 2: The program has a few global variables that store statistics. Some of these variables are pointers to dynamically allocated strings. It would be bad (and will produce a "Still Reachable" warning in Valgrind) if the pointer was left intact after the corresponding memory was freed. Setting the pointer to NULL is the correct behavior.

Task 2 - Instructions and Tips:

For the second task, you should run Valgrind in callgrind mode to track instructions and their frequency. Again, let the program finish completely to get an accurate picture of where execution time is being spent. This will guide your optimizations.

Tip 1: The following Valgrind command will run the desired callgrind report for the analysis program, and display the results:

unix>  valgrind --tool=callgrind --dump-instr=yes --callgrind-out-file=callgrind.out ./analysis_program -ngram 2 < moby.txt
unix>  kcachegrind callgrind.out &

Tip 2: The C standard library has a very helpful function already implemented for you. Type man qsort at the command line to (gasp!) read the documentation. Feel free to use this routine instead of writing your own.

Tip 3: Which design would provide better search performance better given a text document with a large numbers of strings:

A small hash table with a few buckets and long linked lists?
A large hash table with many buckets and short linked lists?

What does the original analysis program do?

Tip 4: The -hash_table_size XXX argument can be used to modify the hash table size in your program without the need to recompile. Experiment with a range of values.

Tip 5: Look closely at how the hash function works. What bucket numbers does it commonly produce? Will it evenly distribute the strings across all buckets in the table, or will some buckets get far more strings than others? How would that impact performance?

This page discusses the pros/cons of the original hash function used in the analysis program: http://research.cs.vt.edu/AVresearch/hashing/strings.php (While this page has Java code, the algorithm is the same!)

Example program output:

Moby Dick input:

unix>  ./analysis_program -ngram 2 -hash-table-size <<REDACTED>> < moby.txt 
Running analysis program...

Options used when running program:
ngram	2
details	10
hash-table-size	<<REDACTED>>
N-gram size 2

Running analysis... (This can take several minutes or more!)
 Initializing hash table...
 Inserting all n-grams into hash table in lowercase form...
 Sorting all hash table elements according to frequency...

Analysis Details:
(Top 10 list of n-grams)
1840	'of the'
1142	'in the'
714	'to the'
435	'from the'
375	'the whale'
367	'of his'
362	'and the'
350	'on the'
328	'at the'
323	'to be'

Analysis Summary:
214365 total n-grams
114421 unique n-grams
91775 singleton n-grams (occur only once)
Most common n-gram (with 1840 occurrences) is 'of the' 
Longest n-gram (4 have length 29) is 'phrenological characteristics'
Total time = 0.200000 seconds

Shakespeare input:

unix> ./analysis_program -ngram 2 -hash-table-size <<REDACTED>> < shakespeare.txt 
Running analysis program...

Options used when running program:
ngram	2
details	10
hash-table-size <<REDACTED>>
N-gram size 2

Running analysis... (This can take several minutes or more!)
 Initializing hash table...
 Inserting all n-grams into hash table in lowercase form...
 Sorting all hash table elements according to frequency...

Analysis Details:
(Top 10 list of n-grams)
1892	'i am'
1804	'i ll'
1753	'in the'
1709	'my lord'
1681	'to the'
1660	'i have'
1603	'i will'
1554	'of the'
1118	'it is'
1020	'to be'

Analysis Summary:
965028 total n-grams
363039 unique n-grams
266018 singleton n-grams (occur only once)
Most common n-gram (with 1892 occurrences) is 'i am' 
Longest n-gram (1 have length 32) is 'honorificabilitudinitatibus thou'
Total time = 1.000000 seconds

This lab is based on Chapter 5 of "Computer Systems: A Programmer's Perspective", Second Edition, by Randal E. Bryant and David R. O'Hallaron