Lab 5: Performance Measurement

Overview

This lab will walk you through the tools to measure your program's performance and use of CPU and memory resources, as well as identify specific locations in the code where performance bottlenecks may exist. In future labs, we will work on optimizing program performance through a variety of compiler and human methods.

Lab

A boilerplate array sort program has been written for you to build upon. An example bubble sort function has already been implemented. You need to extend the existing code to add the quick sort algorithm into the empty function that already exists in the source code.

To begin this lab, start by obtaining the necessary boilerplate code. Enter the class repository:

unix>  cd ~/bitbucket/2016_spring_ecpe170_boilerplate/

Pull the latest version of the repository, and update your local copy of it:

unix>  hg pull
unix>  hg update

Copy the files you want from the class repository to your private repository:
(In this case, it is one folder you want)

unix>  cp -R ~/bitbucket/2016_spring_ecpe170_boilerplate/lab05/part1 ~/bitbucket/2016_spring_ecpe170/lab05

Enter your private repository now, specifically the lab05 folder:

unix>  cd ~/bitbucket/2016_spring_ecpe170/lab05

Add the new files to version control in your private repository:
(Technically, part1 is a directory, but Mercurial is smart enough to just add all the files in these directories with this command)

unix>  hg add part1

Commit the new files in your personal repository, so you can easily go back to the original starter code if necessary

unix>  hg commit -m "Starting Lab 5 with boilerplate code"

Push the new commit to the bitbucket.org website

unix>  hg push

Enter the part1 folder

unix>  cd part1

You should see the following code:

Makefile that can compile the complete program, including the correct GCC settings to enable the C99 language standard. (Needed since we use the "bool" type that was not present in the original C standard)
main() function in sorting.c

Helper functions in helper_functions.c:

Initialize array with random data
Verify that array has been properly sorted
Bubble sort algorithm

Empty functions in your_functions.c

Quick sort - Implement your code here!

Lab Part 1 - Sorting Algorithms

Assignment:
(1) Implement the quick sort algorithm in the file your_functions.c.
(2) Run both algorithms: bubble sort and quick sort, and ensure that the verification function reports success for each algorithm
(3) Check your code (including Makefile) into version control, and push to BitBucket.
Warning: ***Zero*** points will be awarded for missing source code files that were never added to version control, broken Makefiles, or other any errors that prevent your program from compiling/running! (You might want to do a test checkout to a different directory and verify that you can still compile your program)

Program usage instructions:

Compile the program:

unix>  make

Run the bubble sort routine.

unix>  ./sorting_program bubble

Note the program output, particularly the self-test at the end that verifies the sorting function worked correctly.

Initializing 100000 elements in array...
Using bubble sort algorithm...
Verifying array sorting...
Congrats! Array is sorted correctly

A note for students with old computers: **30 seconds** is plenty of execution time for the bubble sort. On my laptop, that is about 100,000 array elements. If that array size takes too long on your computer, feel free to experiment with a smaller array size (25,000? 12,500?) and change it in the file sorting.c. I'm interested in the relative performance difference, not the absolute numbers.

Lab Part 2 - CPU Usage with Time

Assuming you can run both sort algorithms correctly, let's decide which is "best". First up - performance! How long does it take to run?

Use the time utility to measure program execution time. This program works as follows: time <command to measure>. To measure your sorting algorithms, use:

unix>  time ./sorting_program bubble
unix>  time ./sorting_program quick

For the first command, the utility you are running is called "time". Everything after that is the argument that you give to the time program, i.e. "./sorting_program bubble". The time program will take this argument and use it as the command being measured.

This utility will report three metrics:

"Real" time is the total time (equivalent to your stopwatch)
"User" time is the time spent running your user application code
"Sys" time is the time the Linux kernel spent servicing your application code

When you use the time command on real programs (beyond this toy sorting example program), the real time may be much larger than the sum of the user and system time. This is OK. The most likely explanation is that the program was waiting for the disk to retrieve data, and thus the CPU was idle. An idle CPU does not count towards user or system time, but the real time (i.e. wall clock) is still ticking away!

Lab Report:
(1) Create a table that shows the real, user, and system times measured for the bubble and quick sort algorithms
(2) In the sorting program, what actions take user time?
(3) In the sorting program, what actions take kernel time?
(4) Which sorting algorithm is fastest? (It's so obvious, you don't even need a timer)

Lab Part 3 - CPU Usage with Valgrind

Although we've measured how much time our program takes to execute, we don't know where in the code that time is used. (For the trivial sorting program, you can make a good guess, but guessing might be less effective on a larger program, and you might guess wrong!) To see what code is taking the most time, you need to profile your program. There are many tools that can do this, but unfortunately some of the more common on Linux (OProfile and Perf) don't work inside a virtual machine because they tap into low-level processor features that are not virtualized. Fortunately, one tool does function correctly: Valgrind. In this section, we will use Valgrind to annotate your source code with the fraction of time each line spends executing while your program runs.

Valgrind operates by converting your program (in real time) to run on a simulated CPU, and then simulates it. This is similar to how Java programs are compiled to the Java virtual machine (a simulated machine), and then executed. Although this process is relatively slow, it allows Valgrind to record cycle-by-cycle information about the behavior of your program.

Valgrind requires that the program to be profiled be built using the -g option to ensure that the program includes debugging information linking specific instructions in the compiled program back to the original lines in the source code. Make the following addition of -g to the Makefile:

CFLAGS=-c -Wall -std=c99 -g

After that, clean out the old program and recompile:

unix>  make clean
unix>  make

The valgrind program works as follows: valgrind <valgrind options> <program to monitor>. We have a lot of options for Valgrind here:

--tool=callgrind : Callgrind is the name of the call-graph module in Valgrind (a call-graph is a list of what function called what other functions). Enable this module.
--dump-instr=yes : Configures Callgrind to perform event counting at the source line granularity (allowing for source code annotation later).
--callgrind-out-file=callgrind.out : By default, Callgrind puts logging information in a different file name each time (based on the program ID number). It's easier for this lab to specify a constant file name.

To process the log file into human readable results, the callgrind_annotate utility is run after valgrind in order to display the output.

--auto=yes : The annotate program will attempt to find the original source code, and label each line with execution time
--threshold=100 : The annotate program will list all functions consuming resources, even if they took minimal time. (Setting a lower threshold, like 90, indicates that you are only interested in the top 90% of lines consuming CPU resources)
The output of this program can then by redirected into a file by using the output redirection operator: >

To put this all together, the commands you would use to run Valgrind on your bubble sort program is:

unix>  valgrind --tool=callgrind --dump-instr=yes --callgrind-out-file=callgrind.out ./sorting_program bubble
<< This will take a long time to finish!  5x slower in Valgrind >>
unix>  callgrind_annotate --auto=yes --threshold=100 callgrind.out > callgrind_results.txt
unix>  gedit callgrind_results.txt &

In the output file, callgrind is reporting activity at the Instruction Read (IR) level. In other words, each time a new instruction is loaded by the processor, callgrind records it. These instructions are *not* lines of code in your program. Rather, they are assembly-level instructions that the compiler produced from your code. (We'll cover assembly programming later in this semester). Even the simplest lines in C are almost always composed of more than one assembly instructions. Thus, you should not expect to see counts of "1" instruction in the output file.

In the first section of the output file, each line shows the cumulative number of Instruction Reads (IRs) for each function of your program, sorted from most popular to least popular.

In the second section of the output file, the source code for your program is annotated on a line-by-line basis, showing the number of instruction reads required to execute that line. In the case of functions, the IR count is the cumulative count of that function and all functions it calls. Note that not every line of (assembly) code executed can be annotated. At the bottom of the report is a list of files that the annotate tool could not locate the source code for. Notice that these are all library functions. (The source code certainly exists somewhere online, but probably not on your computer. And the libraries are likely compiled without debugging support, to improve performance, which makes annotation impossible).

The plain text output file of callgrind_annotate, although useful, is a bit unwieldy to use, especially as the program complexity increases. An alternate approach to viewing the same callgrind data is offered by a GUI tool: kcachegrind. To run this utility:

unix>  kcachegrind callgrind.out &

Explore the GUI on your own to get a feel for its operation. The two key columns on the left-hand side are:

"Self" - The instruction reads needed for just that function (i.e. "Self Cost").
"Incl" - The instruction reads needed for that function and all functions it calls (i.e. "Inclusive Cost").

When ready, your task is:

Run Callgrind separately for each sort algorithms. Using either the plain text output or the GUI tool (as appropriate), answer the following lab report questions below.

Tip 1: If you create a separate output file for each profile, you'll avoid having to re-run the profiler multiple times per algorithm.

Tip 2: The callgrind tool is *very* CPU intensive. I estimate it takes 10 times longer to run a program with profiling than without profiling. For this section, you are permitted to decrease the size of the array to be sorted by 1/2, 1/4, 1/8, etc... by changing the #define in sorting.c provided that you (a) use a consistent size across all three programs, and (b) document the size in your lab report, so I'm not left wondering why your total instruction counts are so much smaller than your classmates!

Tip 3: I expect the bubble sort to take forever in Valgrind (forever = 10 or more minutes). You could start work on the first part of Lab 6 (answering questions about your system configuration) while Valgrind runs. (Yes, it will distort your measurements a bit using the time command, but we already expect bubble sort to yield terrible performance, so it won't make a noticeable difference....)

Lab Report:
(5) Create a table that shows the total Instruction Read (IR) counts for the bubble and quick sort routines. (In the text output format, IR is the default unit used. In the GUI program, un-check the "% Relative" button to have it display absolute counts of instruction reads).
(6) Create a table that shows the top 3 most active functions for the bubble and quick sort programs by IR count (excluding main()) - Sort by the "self" column if you're using kcachegrind, so that you see the IR counts for *just* that function, and not include the IR count for functions that it calls.
(7) Create a table that shows, for the bubble and quick sort programs, the most CPU intensive line that is part of the most CPU intensive function. (i.e. First, take the most active function for a program. Second, find the annotated source code for that function. Third, look inside the whole function code - what is the most active line? If by some chance it happens to be another function, drill down one more level and repeat.)

Lab Part 4 - Memory Leaks with Valgrind

One of the most common uses for the Valgrind tool is to monitor programs while they are running, and detect possible memory-related errors. The Memcheck module in Valgrind can detect the following errors:

Program accesses memory it shouldn't (e.g., areas not yet allocated, areas that have been freed, areas past the end of heap blocks, inaccessible areas of the stack)
Program uses uninitialized values in dangerous ways
Program leaks memory
Program frees memory in improper ways (e.g., double frees, mismatched frees)
Program passes overlapping source and destination memory blocks to memcpy() and related functions

For this simple demo, we are going to create a memory leak in the merge sort algorithm, and then observe Valgrind detect it. A memory leak is when a programmer has dynamically allocated some piece of memory, but forgot to return it to the operating system when finished. This memory cannot be used by another program, and thus is "lost".

To create our memory leak, edit the file sorting.c and comment out the following line that is part of the merge sort algorithm:

// free(temp_array);

Recompile your program to include the memory leak:

unix>  make

The valgrind program works as follows: valgrind <valgrind options> <program to monitor>. We have a lot of options for Valgrind here:

--tool=memcheck : Memcheck is the name of the memory usage module in Valgrind. Enable this module.
--leak-check=yes : Search for memory leaks when the program exits.
--show-reachable=yes : Display all suspicious blocks of memory.
--num-callers=20 : The depth of the stack trace to print out. Increase for very complex programs.
--log-file=memcheck.txt : Save the analysis information to a log file on disk. The program output still goes to the console.

To put this all together, the commands you would use to run Valgrind on your quick sort program is:

unix>  valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --log-file=memcheck.txt ./sorting_program merge
unix>  gedit memcheck.txt &

The output display is described in detail in the program documentation. But briefly, you should (at a minimum) see an entry in the "HEAP SUMMARY" section reading "#### bytes in 1 blocks are definitely lost in loss record 1 of 1", followed by a stack trace highlighting where that memory was originally allocated in your program.

When finished, clean up the memory leak (edit sorting.c and uncomment the following line):

free(temp_array);

Re-compile your program (via make), re-run Valgrind and verify that the output file is now clean. You should see a message like this at the bottom of the output log:

HEAP SUMMARY:
in use at exit: 0 bytes in 0 blocks
total heap usage: 1 allocs, 1 frees, ###### bytes allocated

All heap blocks were freed -- no leaks are possible

For counts of detected and suppressed errors, rerun with: -v
ERROR SUMMARY: 0 errors from 0 contexts

Lab Report:
(8) Show the Valgrind output file for the merge sort with the intentional memory leak. Clearly highlight the line where Valgrind identified where the block was originally allocated.
(9) How many bytes were leaked in the buggy program?
(10) Show the Valgrind output file for the merge sort after you fixed the intentional leak.
(11) Show the Valgrind output file for your quick sort.
If you are unlucky enough to have other errors in your sorting routine, you should fix them now in order to receive full credit. (For example, I didn't realize that my original sort algorithm was accessing the element *1 position before* the start of the array. It didn't crash, and still "correctly" sorted the array, but it was certainly clobbering memory it wasn't supposed to!) "Suppressed errors" reported by Valgrind can be ignored. These are errors in library functions that we don't have control over.

Lab Part 5 - Profiling Canny Edge Detector

For the final part of the lab, you are going to exercise the same time / Valgrind (callgrind) skills on a larger non-trivial software program: the Canny Edge Detector. Canny Edge Detector is an image processing program that detects edges in an image. It involves a number of signal processing steps including: image convolution, image magnitude and direction evaluation, double thresholding, hysteresis and edge-linking. All of these complex steps yield an Edge map for the input image. The source code for Canny Edge Detector is in part5 folder in 2016_spring_ecpe170_boilerplate.

Copy the files from the class repository to your private repository:

unix>  cp -R ~/bitbucket/2016_spring_ecpe170_boilerplate/lab05/part5 ~/bitbucket/2016_spring_ecpe170/lab05/

Enter the lab05/part5 directory. Note that there is one simple version of Makefile in this directory. Modify it to match the advanced "Makefile-4" from Lab 3, but call it "Makefile" here. Make sure to include the -g option needed by Valgrind. Note: Your Makefile 'clean' should clean up any previously produced executable and program output files (images).

Type 'make' to compile the Canny Edge Detector program.

unix>  make

<< This will/may produce a LOT of warnings.. >>

The output of compilation should be an executable file "canny_detector" in the current directory.

Use canny_detector to produce Edge map of an input image:

unix> ./canny_detector IMAGES/cameraman.pgm 5 1

 1. Reading input image...
 Image reading complete...
 2. Performing convolution for IGy and IGx..
 Convolution complete...
 3. Obtaining magnitude image...
<< More output follows >>

Also try the following executions (the famous Lenna image of size 1024 x 1024):

unix> ./canny_detector IMAGES/Lenna_org_1024.pgm 11 1.1

The cameraman:

unix> ./canny_detector IMAGES/cameraman.pgm 5 1

The flower image:

unix> ./canny_detector IMAGES/flower_512.pgm 5 1

Do you notice the parameters that follow the executable name? These are command line arguments for canny_detector. The first argument for your canny_detector is the the input image, the second parameter is the width of Gaussian mask for convolution, and the last parameter is the 'sigma' parameter for the Gaussian mask. These parameters are needed for the Canny Edge Detector to work. Try the following and observe what happens:

unix> ./canny_detector

To view the intermediate and output images, use the Image Viewer program in Ubuntu.

eog Edges.pgm &      # Output file
eog Gxy_img.pgm &    # Intermediate file
eog I_Angle.pgm &    # Intermediate file
eog IMAGES/ &        # Directory of input files (Use previous / next buttons in GUI)

Repeat the same command with the profiling tools, and complete the lab report.

Lab Report:
(12) How many seconds of real, user, and system time were required to complete the canny_detector with Lenna image (Lenna_org_1024.pgm)? Document the command used to measure this.
(13) Why does real time != (user time + system time)?
(14) In your report, put the intermediate and output images for the Lenna image. Specifically, put Edges.pgm, Gxy_img.pgm, and I_Angle.pgm.
(15) Excluding main() and everything before it, what are the top three functions run by canny_detector when you do count sub-functions in the total? Document the commands used to measure this. Tip: Sorting by the "Incl" (Inclusive) column in kcachegrind should be what you want.
(16) Include a screen capture that shows the kcachegrind utility displaying the callgraph for the canny_detector, starting at main(), and going down for several levels.
(17) Which function dominates the execution time? What fraction of the total execution time does this function occupy?
(18) Does this program show any signs of memory leaks?

Mercurial Cleanup:

We have not checked Canny Edge detector into version control, and there is really no point in doing so, as we haven't modified its source code. Unfortunately, now if you run hg status, you will see a huge list of untracked files. What a mess! Let's tell Mercurial to ignore this directory.

Create a special file called .hgignore in your repository root folder.

unix>  cd ~/bitbucket/2016_spring_ecpe170
unix>  gedit .hgignore &

Put the following two regular expressions in the file. Any item matching any expression is ignored by Mercurial.

^lab05/part5/
.*~

The first expression matches starting from the repository root (indicated by ^) anything containing /lab05/part5/
The second expression matches any file ending in the ~ symbol, such as the temporary files produced by gedit. A dot (.) matches a non-newline character, the * indicates that it can repeat 0 or more times, and the ~ is the specific character we want to find.

Now, add this file to Mercurial, and run hg status. The output should be much cleaner now.

unix>  hg add .hgignore
unix>  hg status

(Optional) Lab Report:
(1) How would you suggest improving this lab in future semesters?