Lab 4: Performance Measurement

Overview

This lab will walk you through the tools to measure your program's performance and use of CPU and memory resources, as well as identify specific locations in the code where performance bottlenecks may exist. In future labs, we will work on optimizing program performance through a variety of compiler and human methods.

Lab

A boilerplate array sort program has been written for you to build upon. An example bubble sort function has already been implemented. You need to extend the existing code to add the tree sort algorithm into the empty function(s) that already exist in the source code.

NOTE: A stub has been provided for your convenience, however, you may choose to create your OWN functions and not utilize the stub at all. You may also modify the existing stub functions to your coding needs. It is perfectly OK!

To begin this lab, start by obtaining the necessary boilerplate code. Download the boilerplate files from here extract the ZIP file (part1 and part5) and put them in your lab4 folder (~/2022_sepring_ecpe170/lab04). After adding you should have "~/2022_spring_ecpe170/lab04/part1" and "~/2022_spring_ecpe170/lab04/part5"

unix>  cd ~/2022_spring_ecpe170/lab04

Enter the part1 folder

unix>  cd part1

You should see the following code:

Makefile that can compile the complete program, including the correct GCC settings to enable the C99 language standard. (Needed since we use the "bool" type that was not present in the original C standard)
main() function in sorting.c

Helper functions in helper_functions.c:

Initialize the array with random data
Verify that array has been properly sorted
Bubble sort algorithm

Empty functions in your_functions.c

Tree sort - Implement your functions here!

Lab Part 1 ("prelab") - Sorting Algorithms

Note: There is no separate submission for pre-lab. However, you are encouraged to get started with 1-3 and make as much progress as possible before the next class. Your final submission must have the answers for questions 1-3 in your report.

Assignment:
(1) Implement the tree sort algorithm in the file your_functions.c. Use the BTreeNode structure in your_functions.h to construct the binary tree.
(2) Run both algorithms: bubble sort and tree sort, and ensure that the verification function reports success for each algorithm. An algorithm that runs fast but gives incorrect results is more egregious than a sluggish algorithm that produces correct results!
(3) Put your code (including Makefile) into a ZIP folder and submit.
Warning: ***Zero*** points will be awarded for missing source code files that were not managed correctly, broken Makefiles, or other any errors that prevent your program from compiling/running! (You might want to do a test checkout to a different directory and verify that you can still compile your program)

Program usage instructions:

Compile the program:

unix>  make

Run the bubble sort routine.

unix>  ./sorting_program bubble

Note the program output, particularly the self-test at the end that verifies the sorting function worked correctly.

Initializing 100000 elements in array...
Using bubble sort algorithm...
Verifying array sorting...
Congrats! Array is sorted correctly

A note for students with slower computers: **30 seconds** is plenty of execution time for the bubble sort. On my laptop, that is about 100,000 array elements. If that array size takes too long on your computer, feel free to experiment with a smaller array size (25,000? 12,500?) and change it in the file sorting.c. I'm interested in the relative performance difference, not the absolute numbers.

Lab Part 2 - CPU Usage with Time

Assuming you can run both sort algorithms correctly, let's decide which is "best". First up - performance! How long does it take to run?

Use the time utility to measure program execution time. This program works as follows: time <command to measure>. To measure your sorting algorithms, use:

unix>  time ./sorting_program bubble
unix>  time ./sorting_program tree

For the first command, the utility you are running is called "time". Everything after that is the argument that you give to the time program, i.e. "./sorting_program bubble". The time program will take this argument and use it as the command being measured. CONSULT man pages for time utility and read the documentation thoroughly.

This utility will report three metrics:

"Real" time is the total time (equivalent to your stopwatch)
"User" time is the time spent running your user application code
"Sys" time is the time the Linux kernel spent servicing your application code

When you use the time command on real programs (beyond this toy sorting example program), the real time may be much larger than the sum of the user and system time. This is OK. The most likely explanation is that the program was waiting for the disk to retrieve data, and thus the CPU was idle. An idle CPU does not count towards user or system time, but the real time (i.e. wall clock) is still ticking away!

Lab Report:
(1) Create a table that shows the real, user, and system times measured for the bubble and tree sort algorithms.
(2) In the sorting program, what actions take user time?
(3) In the sorting program, what actions take kernel time?
(4) Which sorting algorithm is fastest? (It's so obvious, you don't even need a timer)
Execute both the bubble sort and tree sort algorithms for array sizes: 100, 1000, 10,000, and 100,000. The speedup achieved by tree sort over bubble sort is S=(T_bubble/T_tree) where T_bubble is the execution time of the bubble sort and T_tree is the execution time of the tree sort. Create a scatter plot for speedup vs. array-size using LibreOffice. Make thorough comments (in 5-6 lines) on any trend(s) that you observe (speed grows linearly? exponentially? something else?).

Lab Part 3 - CPU Usage with Valgrind

Although we've measured how much time our program takes to execute, we don't know where in the code that time is used. (For the trivial sorting program, you can make a good guess, but guessing might be less effective on a larger program, and you might guess wrong!) To see what code is taking the most time, you need to profile your program. There are many tools that can do this, but unfortunately some of the more common on Linux (OProfile and Perf) don't work inside a virtual machine because they tap into low-level processor features that are not virtualized. Fortunately, one tool does function correctly: Valgrind. In this section, we will use Valgrind to annotate your source code with the fraction of time each line spends executing while your program runs.

Valgrind operates by converting your program (in real time) to run on a simulated CPU, and then simulates it. This is similar to how Java programs are compiled to the Java virtual machine (a simulated machine), and then executed. Although this process is relatively slow, it allows Valgrind to record cycle-by-cycle information about the behavior of your program.

Valgrind requires that the program to be profiled be built using the -g option to ensure that the program includes debugging information linking specific instructions in the compiled program back to the original lines in the source code. Make the following addition of -g to the Makefile:

CFLAGS=-c -Wall -std=c99 -g

After that, clean out the old program and recompile:

unix>  make clean
unix>  make

The valgrind program works as follows: valgrind <valgrind options> <program to monitor>. We have a lot of options for Valgrind here:

--tool=callgrind : Callgrind is the name of the call-graph module in Valgrind (a call-graph is a list of what function called what other functions). Enable this module.
--dump-instr=yes : Configures Callgrind to perform event counting at the source line granularity (allowing for source code annotation later).
--callgrind-out-file=callgrind.out : By default, Callgrind puts logging information in a different file name each time (based on the program ID number). It's easier for this lab to specify a constant file name.

To process the log file into human readable results, the callgrind_annotate utility is run after valgrind in order to display the output.

--auto=yes : The annotate program will attempt to find the original source code, and label each line with execution time
--threshold=100 : The annotate program will list all functions consuming resources, even if they took minimal time. (Setting a lower threshold, like 90, indicates that you are only interested in the top 90% of lines consuming CPU resources)
The output of this program can then by redirected into a file by using the output redirection operator: >

To put this all together, the commands you would use to run Valgrind on your bubble sort program is:

unix>  valgrind --tool=callgrind --dump-instr=yes --callgrind-out-file=callgrind.out ./sorting_program bubble
<< This will take a long time to finish!  5x slower in Valgrind >>
unix>  callgrind_annotate --auto=yes --threshold=100 callgrind.out > callgrind_results.txt
unix>  gedit callgrind_results.txt &

In the output file, callgrind is reporting activity at the Instruction Read (IR) level. In other words, each time a new instruction is loaded by the processor, callgrind records it. These instructions are *not* lines of code in your program. Rather, they are assembly-level instructions that the compiler produced from your code. (We'll cover assembly programming later in this semester). Even the simplest lines in C are almost always composed of more than one assembly instructions. Thus, you should not expect to see counts of "1" instruction in the output file.

In the first section of the output file, each line shows the cumulative number of Instruction Reads (IRs) for each function of your program, sorted from most popular to least popular.

In the second section of the output file, the source code for your program is annotated on a line-by-line basis, showing the number of instruction reads required to execute that line. In the case of functions, the IR count is the cumulative count of that function and all functions it calls. Note that not every line of (assembly) code executed can be annotated. At the bottom of the report is a list of files that the annotate tool could not locate the source code for. Notice that these are all library functions. (The source code certainly exists somewhere online, but probably not on your computer. And the libraries are likely compiled without debugging support, to improve performance, which makes annotation impossible).

The plain text output file of callgrind_annotate, although useful, is a bit unwieldy to use, especially as the program complexity increases. An alternate approach to viewing the same callgrind data is offered by a GUI tool: kcachegrind. To run this utility:

unix>  kcachegrind callgrind.out &

Explore the GUI on your own to get a feel for its operation. The two key columns on the left-hand side are:

"Self" - The instruction reads needed for just that function (i.e. "Self Cost").
"Incl" - The instruction reads needed for that function and all functions it calls (i.e. "Inclusive Cost").

When ready, your task is:

Run Callgrind separately for each sort algorithms. Using either the plain text output or the GUI tool (as appropriate), answer the following lab report questions below.

Tip 1: If you create a separate output file for each profile, you'll avoid having to re-run the profiler multiple times per algorithm.

Tip 2: The callgrind tool is *very* CPU intensive. I estimate it takes 10 times longer to run a program with profiling than without profiling. For this section, you are permitted to decrease the size of the array to be sorted by 1/2, 1/4, 1/8, etc... by changing the #define in sorting.c provided that you (a) use a consistent size across all three programs, and (b) document the size in your lab report, so I'm not left wondering why your total instruction counts are so much smaller than your classmates!

Tip 3: I expect the bubble sort to take forever in Valgrind (forever = 10 or more minutes). You could start work on the first part of Lab 6 (answering questions about your system configuration) while Valgrind runs. (Yes, it will distort your measurements a bit using the time command, but we already expect bubble sort to yield terrible performance, so it won't make a noticeable difference....)

Lab Report:
(5) Create a table that shows the total Instruction Read (IR) counts for the bubble and tree sort routines. (In the text output format, IR is the default unit used. In the GUI program, un-check the "% Relative" button to have it display absolute counts of instruction reads).
(6) Create a table that shows the top 3 most active functions for the bubble and tree sort programs by IR count (excluding main()) - Sort by the "self" column if you're using kcachegrind, so that you see the IR counts for *just* that function, and not include the IR count for functions that it calls.
(7) Create a table that shows, for the bubble and tree sort programs, the most CPU intensive line that is part of the most CPU intensive function. (i.e. First, take the most active function for a program. Second, find the annotated source code for that function. Third, look inside the whole function code - what is the most active line? If by some chance it happens to be another function, drill down one more level and repeat.)

Lab Part 4 - Memory Leaks with Valgrind

One of the most common uses for the Valgrind tool is to monitor programs while they are running, and detect possible memory-related errors. The Memcheck module in Valgrind can detect the following errors:

Program accesses memory it shouldn't (e.g., areas not yet allocated, areas that have been freed, areas past the end of heap blocks, inaccessible areas of the stack)
Program uses uninitialized values in dangerous ways
Program leaks memory
Program frees memory in improper ways (e.g., double frees, mismatched frees)
Program passes overlapping source and destination memory blocks to memcpy() and related functions

For this simple demo, we are going to create a memory leak in the merge sort algorithm, and then observe Valgrind detect it. A memory leak is when a programmer has dynamically allocated some piece of memory, but forgot to return it to the operating system when finished. This memory cannot be used by another program, and thus is "lost".

To create our memory leak, edit the file sorting.c and comment out the following line that is part of the merge sort algorithm:

// free(temp_array);

Recompile your program to include the memory leak:

unix>  make

The valgrind program works as follows: valgrind <valgrind options> <program to monitor>. We have a lot of options for Valgrind here:

--tool=memcheck : Memcheck is the name of the memory usage module in Valgrind. Enable this module.
--leak-check=yes : Search for memory leaks when the program exits.
--show-reachable=yes : Display all suspicious blocks of memory.
--num-callers=20 : The depth of the stack trace to print out. Increase for very complex programs.
--log-file=memcheck.txt : Save the analysis information to a log file on disk. The program output still goes to the console.

To put this all together, the commands you would use to run Valgrind on your merge sort program is:

unix>  valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --log-file=memcheck.txt ./sorting_program merge
unix>  gedit memcheck.txt &

The output display is described in detail in the program documentation. But briefly, you should (at a minimum) see an entry in the "HEAP SUMMARY" section reading "#### bytes in 1 blocks are definitely lost in loss record 1 of 1", followed by a stack trace highlighting where that memory was originally allocated in your program.

When finished, clean up the memory leak (edit sorting.c and uncomment the following line):

free(temp_array);

Re-compile your program (via make), re-run Valgrind and verify that the output file is now clean. You should see a message like this at the bottom of the output log:

HEAP SUMMARY:
in use at exit: 0 bytes in 0 blocks
total heap usage: 1 allocs, 1 frees, ###### bytes allocated

All heap blocks were freed -- no leaks are possible

For counts of detected and suppressed errors, rerun with: -v
ERROR SUMMARY: 0 errors from 0 contexts

Lab Report:
(8) Show the Valgrind output file for the merge sort with the intentional memory leak. Clearly highlight the line where Valgrind identified where the block was originally allocated.
(9) How many bytes were leaked in the buggy program?
(10) Show the Valgrind output file for the merge sort after you fixed the intentional leak.
(11) Show the Valgrind output file for your tree sort.
If you are unlucky enough to have other errors in your sorting routine, you should fix them now in order to receive full credit. (For example, I didn't realize that my original sort algorithm was accessing the element *1 position before* the start of the array. It didn't crash, and still "correctly" sorted the array, but it was certainly clobbering memory it wasn't supposed to!) "Suppressed errors" reported by Valgrind can be ignored. These are errors in library functions that we don't have control over.

Lab Part 5 - Profiling Image Amplifier Code

For the final part of the lab, you are going to exercise the same time / Valgrind (callgrind) skills on a larger non-trivial software program: the Image Amplifier. It is an image processing program that detects edges in an image. It involves a number of signal processing steps including: edge detection, edge rod amplification, and pixel interpolation . All of these complex steps yield an amplified image for the input image. The source code for Image Amplifier is in the boilerplate folder under lab04/part5.

unix>  cd  ~/2022_spring_ecpe170/lab04/part5

Enter the lab04/part5 directory. Note that there is a version of Makefile in this directory. Modify it to match the advanced "Makefile-4" from Lab 3, but call it "Makefile" here. Make sure to include the -g option needed by Valgrind. In this Makefile, include 'clean' option to clean up any previously produced executable and program output files (images).

With this modified Makefile, user should be able to type 'make clean' to clear up previous executable, object files, and intermediate output files (images).

unix>  make clean

<< This should remove any previous executables, object files, and image files. >>

Type 'make' to compile the Image Amplifier program.

unix>  make

<< This will/may produce a LOT of warnings.. >>

The output of compilation should be an executable file "amplify" in the current directory.

Use amplify executable to produce amplified image of an input image:

unix> ./amplify IMAGES/Lenna_org_1024.pgm 11 1.1 2

 1. Executing the Canny Edge Detector...
 1a. Producing magnitude and angle maps...
 1b. Running non-maximal suppression and edge-linking...
 1c. Edge map ready...
 2. Replicating vertical edge rods...

<< More output follows >>

Also try the following executions (the famous Lenna image of size 1024 x 1024):

unix> ./amplify IMAGES/Lenna_org_1024.pgm 11 1.1 2

The cameraman:

unix> ./amplify IMAGES/cameraman.pgm 11 1.1 2

The flower image:

unix> ./amplify IMAGES/flower_512.pgm 11 1.1 2

Do you notice the parameters that follow the executable name? These are command line arguments for the amplify executable. The first argument for your amplify executable is the input image, the second argument is the width of Gaussian mask for convolution, the third parameter is the 'sigma' parameter for the Gaussian mask, and the final parameter is amplification factor set to 2. These parameters are needed for the Image Amplifier to work. Try the following and observe what happens:

unix> ./amplify

To view the intermediate and output images, use the Image Viewer program in Ubuntu.

eog output2048.pgm &      # A sample Output file
eog IMAGES/ &        # Directory of input files (Use previous / next buttons in GUI)

Repeat the same command with the profiling tools, and complete the lab report.

Lab Report:
(12) How many seconds of real, user, and system time were required to complete the amplify program execution with Lenna image (Lenna_org_1024.pgm)? Document the command used to measure this.
(13) Research and answer: why does real time != (user time + system time)?
(14) In your report, put the output image for the Lenna image titled output<somenumber>.pgm.
(15) Excluding main() and everything before it, what are the top three functions run by amplify executable when you do count sub-functions in the total? Document the commands used to measure this. Tip: Sorting by the "Incl" (Inclusive) column in kcachegrind should be what you want.
(16) Include a screen capture that shows the kcachegrind utility displaying the callgraph for the amplify executable, starting at main(), and going down for several levels.
(17) Which function dominates the execution time? What fraction of the total execution time does this function occupy?
(18) Does this program show any signs of memory leaks? Optional: Remove as many memory leaks as possible for 5 extra credits. Document the specific leak(s) you found and the specific code change(s) you made in your lab report.

(Optional) Lab Report: (1) How would you suggest improving this lab in future semesters?