Lab 6: Performance Optimization

Overview

This lab is concerned with performance optimization. Specifically,

How can the compiler improve application performance?
How can the programmer do a better job than the compiler at improving application performance? (And why is this?)

Lab - Getting Started

To begin this lab, start by obtaining the necessary boilerplate code. Enter the class repository:

unix>  cd ~/bitbucket/2013_spring_ecpe170_boilerplate/

Pull the latest version of the repository, and update your local copy of it:

unix>  hg pull
unix>  hg update

Copy the files you want from the class repository to your private repository:
(In this case, we want all the files in lab06)

unix>  cp -R ~/bitbucket/2013_spring_ecpe170_boilerplate/lab06/* ~/bitbucket/2013_spring_ecpe170/lab06

Enter your private repository now, specifically the lab06 folder:

unix>  cd ~/bitbucket/2013_spring_ecpe170/lab06

Add the new files to version control in your private repository:
(With the *, we are asking hg to add all files/folders in the current directory)

unix>  hg add *

Commit the new files in your personal repository, so you can easily go back to the original starter code if necessary

unix>  hg commit -m "Starting Lab 6 with boilerplate code"

Push the new commit to the bitbucket.org website

unix>  hg push

In your current folder, you should see the following code:

Makefile that can compile the program
main() function in main.c

Key processing algorithm in combine.c (including empty functions for your work):
Vector management functions in vec.c

Lab Part 1 - Program Operation

Motivation: This lab is memory intensive. We want to create a large vector in memory, process it using different algorithms, and compare algorithm performance. The larger the vector is, the easier it will be to distinguish between the performance of the different algorithms. But, if it is too large, it will overflow RAM and start swapping to the hard drive, severely degrading performance and making the test useless.

Answer the following questions about the computer system you will use for the in-class lab. (If you are booting Linux directly instead of inside a virtual machine, state that, and skip questions 1-5)

Lab Report:
(1) What is the total physical RAM installed in the system? (In MB)
(2) With no applications running (beyond the web browser with this page), how much RAM is used by the native operating system? (e.g. Windows)
(3) With no applications running (beyond the web browser with this page), how much RAM is available?
(4) Check the virtual machine configuration. How much RAM is currently allocated to Linux in your virtual machine?
Note: Your answer to question 4 must be less than your answer to question 3! Otherwise, your system will use slow virtual memory (i.e. swapping data to the hard disk) when running this lab.
(5) Try to increase your virtual machine memory allocation, if possible, to the maximum allowed based on your free RAM. Leave ~256MB free for the virtual machine program itself. Now how much RAM is allocated to Linux in your virtual machine?
(6) Boot Linux. With no applications running in Linux, how much RAM is available inside the virtual machine? The "System Monitor" program should report that information. This is the space that is actually available for our test application.

Examine the boilerplate code for this lab, and answer the following questions:

Lab Report:
(7) What is the code doing? (Describe the algorithm in a paragraph, focusing on the combine1() function.)
(8) What is the largest number of elements that the vector can hold WITHOUT using swap storage (virtual memory), and how much memory does it take? Be sure to leave enough memory for Firefox and LibreOffice, since you'll need those when running this lab as well.
Tip: The program reports memory usage when run. If you can compile and run it, you can experiment with the size of the vector to confirm your answer. If you see your disk activity light grinding away (either the physical LED, or the on-screen indicator in your virtual machine), you have chosen poorly... :-(
Tip 2: Run the System Monitor program in Linux in the background, then run your program. You can watch the Memory Usage indicator in the monitor increase when the program is running. The goal is to reach approximately 85% of memory, but not to start using swap memory.

Lab Part 2 - Compiler Optimizations

First, we will explore what GCC can do for us to improve program performance without changing the source code. GCC has several optimization levels:

No optimization (the default)
O1 level: Reduce code size and execution time without lengthy compile time
O2 level: Spend more time compiling, and use all techniques that do not increase program size in order to decrease execution time
O3 level: Spend the most time compiling, and use techniques that increase program size to decrease execution time

** Note that this is a capital letter 'O' (for optimization), not the number zero. This argument is CASE SENSITIVE. **

The optimization level is set at the command-line when invoking GCC. Because we use a Makefile to run GCC, the correct place to add this flag is in the CFLAGS line. For example, this line settings GCC to use the O1 optimization level, in addition to several other common settings:

CFLAGS=-c -Wall -std=c99 -O1

Run the "combine1" algorithm for a vector of containing the number of elements you determined earlier in this lab. Use the time utility to record the compilation time and program execution time, specifically the "real time" result, which represents the total time.

unix>  make clean
unix>  time make
unix>  time ./combine_program 1 <YOUR-VECTOR-SIZE>

Note: We will all have different vector sizes for this lab (due to differing amounts of physical memory available). The absolute time measurements don't matter. What is important is the relative time difference between experiments, and also the software implementation of each algorithm.

Note 2: Repeat these measurements at least 3 times. I found that the first time I ran the program, it was much slower than subsequent times.
(Hypothesis: Perhaps the Virtual Machine didn't really allocate all the memory to Linux that it said it did? If so, when this program goes to actually *use* the memory, the virtual machine has to scramble to allocate the resources.)

Lab Report:
(9) What vector size are you using for all experiments in this lab?
(10) How much time does the compiler take to finish with (a) no optimization, (b) with -O1 optimization, (c) with -O2 optimization, and (d) with -O3 optimization? Report the Real time, which is the "wall clock" time. Create both a table and a graph in LibreOffice Calc.
(11) How much time does the program take to finish with (a) no optimization, (b) with -O1 optimization, (c) with -O2 optimization, and (d) with -O3 optimization? Report the Real time, which is the "wall clock" time. Create both a table and a graph in LibreOffice Calc.
Note: No credit will be given for sloppy graphs that lack X and Y axis labels, a legend, and a title.

Lab Part 3 - Code Optimizations

Now that you've measured the application performance with a variety of compiler optimizations, it's time to see what tricks a programmer can employ to make it easier for the compiler to produce efficient code. (And, by making the compiler's job easier, end up with better application performance than before.)

In this part of the lab, you will write new code for the combine2() - combine6() functions, where each function adds a new optimization technique. Note that this lab is carefully designed to step you through an optimization process where the performance improves with each step. On a real program, though, you might have to use trial-and-error to determine which techniques are most effective for your specific problem.

Lab Report:
(12) After implementing each function, benchmark it for a variety of data types and mathematical operations. Fill in the table below as you write each function.
(13) Using LibreOffice Calc, create a single graph that shows the data in the table created, specifically the four time columns. (You don't need to plot vector size)
Note: No credit will be given for a sloppy graph that lack X and Y axis labels, a legend, and a title.
(14) As a reminder, you should be using version control to track your code, and ensure that the final code is checked in along with your report PDF.

Configuration	Vector Size (elements)	Vector Size (MB)	Time for Integer Add	Time for Integer Multiply	Time for FP (float) Add	Time for FP (float) Multiply
combine1()
combine2()
combine3()
combine4()
combine5x2()
combine5x3()
combine6()

Important notes for a consistent experiment:

Change your Makefile to use the -O1 optimization level (as a good compromise between compile time and application performance).
Always process vectors of the same size (and document that size in the lab in your report)
Do **not** use the time utility to benchmark your new functions. If you inspect the source code, you'll see that the program already records and prints the execution time of *just* the key algorithm we are optimizing, ignoring all initialization and cleanup code. Use that time for your table/graph.

combine2() Algorithm - Code Motion

Problem: When looking at the combine1() function, the compiler must assume that the array size could (potentially) change as the loop proceeds. But, as the programmer, you know that is impossible for this specific program.
Solution: For combine2(), move the function call to vec_length() out of the body of the for loop, and save it in a variable instead. Use the variable in your for loop.
This is called "code motion" - identifying code that does not need to be re-executed on each iteration, and moving it elsewhere.

combine3() Algorithm - Reducing Procedure Calls

Problem: In the combine2() function, get_vec_element() is called on every iteration (to obtain a data element). Each function call is expensive. Further, that function checks every time to see if the element requested is within the overall vector length.
Solution: You can improve performance (at the expense of program modularity) by removing the function call to get_vec_element(). Instead, create a new function called get_vec_start() that returns the starting address of the data array. Call this function once at the beginning of combine3(), save the address it returns, and use it to directly access array elements during processing.
Note: The performance improvement of this (by itself) is minimal, but it will unlock future optimizations
Tip: Use this function declaration: data_t *get_vec_start(vec_ptr v);

combine4() Algorithm - Eliminate Unneeded Memory Accesses

Problem: In the combine3() function, the result of the accumulate operation is written to memory at the location designated by the pointer dest. Then, on the next iteration, that value is retrieved from memory and used in the next calculation. While the compiler does not know if that specific memory location will change between iterations, the programmer does, and the answer is no!
Solution: Eliminate these unnecessary trips to memory in combine4(). Create a temporary variable (called accumulate) and use it to store the result of each operation (and retrieve it the next time). At the end of the function, copy the value in accumulate to the desired final memory address.

Note: Isn't accumulate just a memory location too? No! Any minimally-competent compiler will put an important local variable like that in a processor register stored directly on the CPU. No trips to the memory are required to access it, and we can retrieve data from the registers in 1-2 clock cycles. (We'll explore processor registers further when we learn assembly programming)

combine5x2() Algorithm - Loop Unrolling with a Factor of 2
combine5x3() Algorithm - Loop Unrolling with a Factor of 3

Problem: The for-loop itself incurs overhead. At the beginning of each iteration, the current position must be compared against the length. Only if this conditional is true will the for-loop body be executed. And, at the end of each iteration, the loop counter must be incremented by one.

Note: The conditional branch (loop comparison) is worse than it might seem, due to the way modern microprocessors execute programs. (Ask me about pipelining!)

Solution: In the combine5x2() algorithm, instead of processing the vector one element at a time (per iteration), process the vector two elements at a time (per iteration).

In your loop, increment the counter 'i' by two each time.
In the loop body, do two combine operations for consecutive items. It should look like this: accumulate = (accumulate OP data[i]) OP data[i+1];
Be careful not to go past the end of the array with your memory references.
Your solution must still work if there is an odd number of items in the vector! You'll need to add an if-claus after your loop to catch the last item (if any).

This technique (loop unrolling) can be generalized. You need to implement two different examples:

Loop unrolling with a factor of 2 - combine5x2() - Process two vector items per iteration
Loop unrolling with a factor of 3 - combine5x3() - Process three vector items per iteration

combine6() Algorithm - Two Parallel Accumulators

Problem: The combine5x2() (and 5x3) algorithms are pretty good, with a low factor of overhead on the primary task of reading the vector and performing a calculation on it. But, this algorithm is extremely sequential. The math operation is performed on the first vector element, and that result is used in a calculation with the second vector element, and so on, until the entire vector is processed. This leaves unused a key capability of modern microprocessors: the ability to perform more than one mathematical operation (i.e. instruction) at the same time.

The compiler can do nothing to help us here! But, as a human, you know that you could easily divide this vector processing into smaller pieces, and then combine then at the end to get the same result.

Solution: In combine6(), start with the previous combine5x2() function. You will still be processing two elements per iteration. But, do that processing using two completely separate sets of variables! In other words, combine the odd elements into one variable, the even elements into another variable, and then after the loop has finished, combine both temporary variables together.

The body of your loop should look like this:
accumulate0 = accumulate0 OP data[i];
accumulate1 = accumulate1 OP data[i+1];
By dividing work in this manner, one part of the CPU can process the odd elements, and at the same time, a different part of the CPU can process the even elements. This enables instruction-level parallelism.

Note: This is not multi-core. We still have one program running on one processor core. We are just processing more than one instruction in that one program at the same time.

As a reminder, be sure to complete the table and graph requested above, and include all source code with your final Lab Report PDF when submitting this lab.

This lab is based on Chapter 5 of "Computer Systems: A Programmer's Perspective", Second Edition, by Randal E. Bryant and David R. O'Hallaron